* [PATCH 0/3] Flush G2H handler during a GT reset
@ 2022-01-18 21:43 ` Matthew Brost
  0 siblings, 0 replies; 33+ messages in thread
From: Matthew Brost @ 2022-01-18 21:43 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: thomas.hellstrom, john.c.harrison

After a small fix to the error capture code, we can now flush the G2H
handler during a GT reset, which simplifies the code and closes some
extreme corner-case races.

v2:
 (CI)
  - Don't trigger GT reset from G2H handler

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Matthew Brost (3):
  drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL
  drm/i915/guc: Add work queue to trigger a GT reset
  drm/i915/guc: Flush G2H handler during a GT reset

 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  5 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 41 +++++++++----------
 drivers/gpu/drm/i915/i915_gpu_error.c         |  2 +-
 3 files changed, 26 insertions(+), 22 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL
  2022-01-18 21:43 ` [Intel-gfx] " Matthew Brost
@ 2022-01-18 21:43   ` Matthew Brost
  -1 siblings, 0 replies; 33+ messages in thread
From: Matthew Brost @ 2022-01-18 21:43 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: thomas.hellstrom, john.c.harrison

Call intel_engine_coredump_alloc() with ALLOW_FAIL rather than
GFP_KERNEL to fully decouple the error capture from fence signalling.

Fixes: 8b91cdd4f8649 ("drm/i915: Use __GFP_KSWAPD_RECLAIM in the capture code")

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 67f3515f07e7a..aee42eae4729f 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1516,7 +1516,7 @@ capture_engine(struct intel_engine_cs *engine,
 	struct i915_request *rq = NULL;
 	unsigned long flags;
 
-	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
+	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL);
 	if (!ee)
 		return NULL;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread
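
The difference between the two allocation policies in patch 1 can be
sketched in user space. This is only a toy model, not i915 code: the pool,
the policy names and capture_engine_state() are hypothetical, and it assumes
ALLOW_FAIL denotes an allocation that gives up rather than waiting for memory
to be freed (its definition is not shown in this thread).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical policies standing in for GFP_KERNEL and ALLOW_FAIL. */
enum alloc_policy {
	POLICY_MAY_RECLAIM,	/* may wait for memory to be freed */
	POLICY_ALLOW_FAIL,	/* gives up instead of waiting */
};

/* Toy memory pool: tracks free bytes and how often a caller had to wait. */
struct pool {
	size_t free_bytes;
	unsigned int reclaim_waits;
};

void *pool_alloc(struct pool *p, size_t size, enum alloc_policy policy)
{
	assert(size > 0);

	if (size > p->free_bytes) {
		if (policy == POLICY_ALLOW_FAIL)
			return NULL;	/* never wait on anything else */

		/* Models reclaim that waits for requests to retire. */
		p->reclaim_waits++;
		p->free_bytes = size;
	}

	p->free_bytes -= size;
	return malloc(size);
}

/* Capture path: losing the dump is acceptable, stalling the reset is not. */
void *capture_engine_state(struct pool *p, size_t size)
{
	return pool_alloc(p, size, POLICY_ALLOW_FAIL);
}
```

With an exhausted pool, capture_engine_state() returns NULL without ever
counting a reclaim wait, which matches the "if (!ee) return NULL;" handling
already present in capture_engine().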

* [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-18 21:43 ` [Intel-gfx] " Matthew Brost
@ 2022-01-18 21:43   ` Matthew Brost
  -1 siblings, 0 replies; 33+ messages in thread
From: Matthew Brost @ 2022-01-18 21:43 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: thomas.hellstrom, john.c.harrison

The G2H handler needs to be flushed during a GT reset, but a G2H
indicating an engine reset failure can itself trigger a GT reset. Add a
worker that triggers the GT reset when an engine reset failure is
received, breaking this circular dependency.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  5 ++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 9d26a86fe557a..60ea8deef5392 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -119,6 +119,11 @@ struct intel_guc {
 		 * function as it might be in an atomic context (no sleeping)
 		 */
 		struct work_struct destroyed_worker;
+		/**
+		 * @reset_worker: worker to trigger a GT reset after an engine
+		 * reset fails
+		 */
+		struct work_struct reset_worker;
 	} submission_state;
 
 	/**
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 23a40f10d376d..cdd8d691251ff 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 }
 
 static void destroyed_worker_func(struct work_struct *w);
+static void reset_worker_func(struct work_struct *w);
 
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
@@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
 	INIT_WORK(&guc->submission_state.destroyed_worker,
 		  destroyed_worker_func);
+	INIT_WORK(&guc->submission_state.reset_worker,
+		  reset_worker_func);
 
 	guc->submission_state.guc_ids_bitmap =
 		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
@@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
 	return gt->engine_class[engine_class][instance];
 }
 
+static void reset_worker_func(struct work_struct *w)
+{
+	struct intel_guc *guc = container_of(w, struct intel_guc,
+					     submission_state.reset_worker);
+	struct intel_gt *gt = guc_to_gt(guc);
+
+	intel_gt_handle_error(gt, ALL_ENGINES,
+			      I915_ERROR_CAPTURE,
+			      "GuC failed to reset an engine\n");
+}
+
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 					 const u32 *msg, u32 len)
 {
@@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
 		guc_class, instance, engine->name, reason);
 
-	intel_gt_handle_error(gt, engine->mask,
-			      I915_ERROR_CAPTURE,
-			      "GuC failed to reset %s (reason=0x%08x)\n",
-			      engine->name, reason);
+	/*
+	 * A GT reset flushes this worker queue (G2H handler) so we must use
+	 * another worker to trigger a GT reset.
+	 */
+	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread
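
reset_worker_func() in patch 2 receives only a work_struct pointer and
recovers the owning intel_guc with container_of() on a nested member. A
minimal user-space sketch of that pattern follows. The macro is a simplified
stand-in for the kernel's, and struct work_item, struct device_state,
device_init() and run_work() are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct work_item {
	void (*func)(struct work_item *w);
};

/* Hypothetical device structure; only the nesting mirrors the patch. */
struct device_state {
	int id;
	int resets_triggered;
	struct {
		struct work_item destroyed_worker;
		struct work_item reset_worker;
	} submission_state;
};

/* The callback only receives the work item and must find its owner. */
static void reset_worker_func(struct work_item *w)
{
	struct device_state *dev =
		container_of(w, struct device_state,
			     submission_state.reset_worker);

	dev->resets_triggered++;
}

/* Wire up the callback, in the spirit of INIT_WORK() in the patch. */
void device_init(struct device_state *dev, int id)
{
	assert(dev != NULL);
	dev->id = id;
	dev->resets_triggered = 0;
	dev->submission_state.destroyed_worker.func = NULL;
	dev->submission_state.reset_worker.func = reset_worker_func;
}

/* Run a work item the way a worker would: without any owner pointer. */
void run_work(struct work_item *w)
{
	if (w->func)
		w->func(w);
}
```

Calling run_work(&dev.submission_state.reset_worker) on one device bumps only
that device's counter, because the owner is derived from the address of the
embedded member rather than from any global state.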

* [PATCH 3/3] drm/i915/guc: Flush G2H handler during a GT reset
  2022-01-18 21:43 ` [Intel-gfx] " Matthew Brost
@ 2022-01-18 21:43   ` Matthew Brost
  -1 siblings, 0 replies; 33+ messages in thread
From: Matthew Brost @ 2022-01-18 21:43 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: thomas.hellstrom, john.c.harrison

Now that error capture is fully decoupled from fence signalling
(request retirement to free memory, which in turn depends on resets), we
can safely flush the G2H handler during a GT reset. This eliminates
corner cases where GuC-generated G2Hs (e.g. engine resets) race with a GT
reset.

Signed-off-by: Matthew Brost <mattthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index cdd8d691251ff..1a11e8986948b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 	guc_flush_destroyed_contexts(guc);
-
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread
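
Why patch 3 depends on patch 2 can be illustrated with a single-threaded
model. It is a sketch under stated assumptions, not the kernel workqueue
implementation: queue_work() and flush_work() here are simplified user-space
stand-ins, struct model and its fields are hypothetical, and a flush issued
from inside the work item being flushed is reported as a deadlock instead of
actually hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model;

struct work {
	void (*func)(struct model *m);
	bool pending;	/* queued, not yet started */
	bool running;	/* callback currently executing */
};

/* Hypothetical driver state for the model. */
struct model {
	struct work g2h_worker;		/* processes GuC-to-host messages */
	struct work reset_worker;	/* only triggers a full reset */
	bool defer_reset;		/* true: behave like the patched driver */
	bool engine_reset_failed;	/* message waiting for the G2H handler */
	bool deadlocked;
	int resets_done;
};

static void queue_work(struct work *w)
{
	w->pending = true;
}

static void run_work(struct model *m, struct work *w)
{
	if (!w->pending)
		return;
	w->pending = false;
	w->running = true;
	w->func(m);
	w->running = false;
}

/* A real flush from inside the flushed item would wait forever. */
static bool flush_work(struct model *m, struct work *w)
{
	if (w->running) {
		m->deadlocked = true;
		return false;
	}
	run_work(m, w);
	return true;
}

/* Full reset: flush the G2H handler first, as patch 3 does. */
static void gt_reset(struct model *m)
{
	if (!flush_work(m, &m->g2h_worker))
		return;
	m->resets_done++;
}

static void reset_worker_func(struct model *m)
{
	gt_reset(m);
}

static void g2h_worker_func(struct model *m)
{
	if (!m->engine_reset_failed)
		return;
	m->engine_reset_failed = false;

	if (m->defer_reset)
		queue_work(&m->reset_worker);	/* patch 2: hand off */
	else
		gt_reset(m);			/* old: reset from the handler */
}

void model_init(struct model *m, bool defer_reset)
{
	assert(m != NULL);
	*m = (struct model){
		.g2h_worker = { .func = g2h_worker_func },
		.reset_worker = { .func = reset_worker_func },
		.defer_reset = defer_reset,
	};
}

/* Deliver an engine reset failure and let both workers run. */
void report_engine_reset_failure(struct model *m)
{
	m->engine_reset_failed = true;
	queue_work(&m->g2h_worker);
	run_work(m, &m->g2h_worker);
	run_work(m, &m->reset_worker);
}
```

With defer_reset set to false the model flags a deadlock and completes no
reset; with it set to true the G2H handler returns first and the separate
worker performs the reset, which is the ordering the series relies on.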

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Flush G2H handler during a GT reset (rev2)
  2022-01-18 21:43 ` [Intel-gfx] " Matthew Brost
                   ` (3 preceding siblings ...)
  (?)
@ 2022-01-18 22:01 ` Patchwork
  -1 siblings, 0 replies; 33+ messages in thread
From: Patchwork @ 2022-01-18 22:01 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Flush G2H handler during a GT reset (rev2)
URL   : https://patchwork.freedesktop.org/series/98855/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
49cfdd902b91 drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL
ac359b31efe8 drm/i915/guc: Add work queue to trigger a GT reset
7ccc4752e12a drm/i915/guc: Flush G2H handler during a GT reset
-:49: WARNING:FROM_SIGN_OFF_MISMATCH: From:/Signed-off-by: email address mismatch: 'From: Matthew Brost <matthew.brost@intel.com>' != 'Signed-off-by: Matthew Brost <mattthew.brost@intel.com>'

total: 0 errors, 1 warnings, 0 checks, 30 lines checked



^ permalink raw reply	[flat|nested] 33+ messages in thread
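
The FROM_SIGN_OFF_MISMATCH warning above comes down to comparing the address
in the From: line with the one in the Signed-off-by: line. The sketch below
shows that comparison in C. It is not how checkpatch itself is implemented;
extract_email() and sign_off_mismatch() are hypothetical helpers, and the
comparison is a plain case-sensitive string match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy the address between '<' and '>' into out; false if none is found. */
bool extract_email(const char *line, char *out, size_t out_len)
{
	const char *start = strchr(line, '<');
	const char *end = start ? strchr(start, '>') : NULL;
	size_t len;

	assert(out != NULL && out_len > 0);
	if (!start || !end)
		return false;

	len = (size_t)(end - start - 1);
	if (len >= out_len)
		return false;

	memcpy(out, start + 1, len);
	out[len] = '\0';
	return true;
}

/* True when both lines carry an address and the two addresses differ. */
bool sign_off_mismatch(const char *from, const char *signed_off_by)
{
	char a[256], b[256];

	if (!extract_email(from, a, sizeof(a)) ||
	    !extract_email(signed_off_by, b, sizeof(b)))
		return false;

	return strcmp(a, b) != 0;
}
```

Fed the two lines quoted in the warning, sign_off_mismatch() reports a
mismatch because of the extra "t" in the Signed-off-by address of patch 3.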

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Flush G2H handler during a GT reset (rev2)
  2022-01-18 21:43 ` [Intel-gfx] " Matthew Brost
                   ` (4 preceding siblings ...)
  (?)
@ 2022-01-18 22:02 ` Patchwork
  -1 siblings, 0 replies; 33+ messages in thread
From: Patchwork @ 2022-01-18 22:02 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Flush G2H handler during a GT reset (rev2)
URL   : https://patchwork.freedesktop.org/series/98855/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for Flush G2H handler during a GT reset (rev2)
  2022-01-18 21:43 ` [Intel-gfx] " Matthew Brost
                   ` (5 preceding siblings ...)
  (?)
@ 2022-01-18 22:32 ` Patchwork
  -1 siblings, 0 replies; 33+ messages in thread
From: Patchwork @ 2022-01-18 22:32 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Flush G2H handler during a GT reset (rev2)
URL   : https://patchwork.freedesktop.org/series/98855/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_11094 -> Patchwork_22019
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/index.html

Participating hosts (46 -> 41)
------------------------------

  Additional (1): fi-kbl-soraka 
  Missing    (6): fi-bdw-samus shard-tglu fi-bsw-cyan shard-rkl shard-dg1 fi-skl-6600u 

Known issues
------------

  Here are the changes found in Patchwork_22019 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@semaphore:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][1] ([fdo#109271]) +31 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-bdw-5557u/igt@amdgpu/amd_basic@semaphore.html

  * igt@gem_exec_fence@basic-busy@bcs0:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][2] ([fdo#109271]) +8 similar issues
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-kbl-soraka/igt@gem_exec_fence@basic-busy@bcs0.html

  * igt@gem_huc_copy@huc-copy:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][3] ([fdo#109271] / [i915#2190])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-kbl-soraka/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@parallel-random-engines:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][4] ([fdo#109271] / [i915#4613]) +3 similar issues
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-kbl-soraka/igt@gem_lmem_swapping@parallel-random-engines.html

  * igt@i915_selftest@live@gt_pm:
    - fi-kbl-soraka:      NOTRUN -> [DMESG-FAIL][5] ([i915#1886] / [i915#2291])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-kbl-soraka/igt@i915_selftest@live@gt_pm.html

  * igt@kms_chamelium@dp-edid-read:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][6] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-kbl-soraka/igt@kms_chamelium@dp-edid-read.html

  * igt@kms_chamelium@vga-edid-read:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][7] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-bdw-5557u/igt@kms_chamelium@vga-edid-read.html

  * igt@kms_frontbuffer_tracking@basic:
    - fi-cfl-8109u:       [PASS][8] -> [DMESG-FAIL][9] ([i915#295])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/fi-cfl-8109u/igt@kms_frontbuffer_tracking@basic.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-cfl-8109u/igt@kms_frontbuffer_tracking@basic.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][10] ([fdo#109271] / [i915#533])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-kbl-soraka/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html

  * igt@kms_pipe_crc_basic@read-crc-pipe-b:
    - fi-cfl-8109u:       [PASS][11] -> [DMESG-WARN][12] ([i915#295]) +10 similar issues
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/fi-cfl-8109u/igt@kms_pipe_crc_basic@read-crc-pipe-b.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-cfl-8109u/igt@kms_pipe_crc_basic@read-crc-pipe-b.html

  
#### Possible fixes ####

  * igt@gem_exec_suspend@basic-s3@smem:
    - fi-bdw-5557u:       [INCOMPLETE][13] ([i915#146]) -> [PASS][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/fi-bdw-5557u/igt@gem_exec_suspend@basic-s3@smem.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-bdw-5557u/igt@gem_exec_suspend@basic-s3@smem.html

  * igt@i915_selftest@live@gt_heartbeat:
    - {fi-tgl-dsi}:       [INCOMPLETE][15] -> [PASS][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/fi-tgl-dsi/igt@i915_selftest@live@gt_heartbeat.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/fi-tgl-dsi/igt@i915_selftest@live@gt_heartbeat.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#146]: https://gitlab.freedesktop.org/drm/intel/issues/146
  [i915#1886]: https://gitlab.freedesktop.org/drm/intel/issues/1886
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2291]: https://gitlab.freedesktop.org/drm/intel/issues/2291
  [i915#2575]: https://gitlab.freedesktop.org/drm/intel/issues/2575
  [i915#295]: https://gitlab.freedesktop.org/drm/intel/issues/295
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533


Build changes
-------------

  * Linux: CI_DRM_11094 -> Patchwork_22019

  CI-20190529: 20190529
  CI_DRM_11094: 6ce31c986ee8beaa0f98fd4e200b7a421fd4adf9 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6327: 0d559158c2d3b5723abbfc2cb4b04532e28663b2 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_22019: 7ccc4752e12a008e995d16dc22f9a057e8268cfb @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

7ccc4752e12a drm/i915/guc: Flush G2H handler during a GT reset
ac359b31efe8 drm/i915/guc: Add work queue to trigger a GT reset
49cfdd902b91 drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/index.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Intel-gfx] ✓ Fi.CI.IGT: success for Flush G2H handler during a GT reset (rev2)
  2022-01-18 21:43 ` [Intel-gfx] " Matthew Brost
                   ` (6 preceding siblings ...)
  (?)
@ 2022-01-19  1:02 ` Patchwork
  -1 siblings, 0 replies; 33+ messages in thread
From: Patchwork @ 2022-01-19  1:02 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Flush G2H handler during a GT reset (rev2)
URL   : https://patchwork.freedesktop.org/series/98855/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_11094_full -> Patchwork_22019_full
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

Participating hosts (13 -> 13)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_22019_full:

### IGT changes ###

#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@gem_flink_race@flink_name:
    - {shard-rkl}:        [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-2/igt@gem_flink_race@flink_name.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-5/igt@gem_flink_race@flink_name.html

  * igt@kms_frontbuffer_tracking@psr-2p-scndscrn-cur-indfb-draw-mmap-wc:
    - {shard-rkl}:        NOTRUN -> [FAIL][3]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-5/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-cur-indfb-draw-mmap-wc.html

  
Known issues
------------

  Here are the changes found in Patchwork_22019_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_eio@in-flight-suspend:
    - shard-kbl:          NOTRUN -> [DMESG-WARN][4] ([i915#180])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl6/igt@gem_eio@in-flight-suspend.html

  * igt@gem_exec_balancer@parallel-keep-in-fence:
    - shard-iclb:         [PASS][5] -> [SKIP][6] ([i915#4525])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-iclb1/igt@gem_exec_balancer@parallel-keep-in-fence.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-iclb5/igt@gem_exec_balancer@parallel-keep-in-fence.html

  * igt@gem_exec_capture@pi@bcs0:
    - shard-skl:          NOTRUN -> [INCOMPLETE][7] ([i915#4547])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl10/igt@gem_exec_capture@pi@bcs0.html

  * igt@gem_exec_fair@basic-deadline:
    - shard-skl:          NOTRUN -> [FAIL][8] ([i915#2846])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl6/igt@gem_exec_fair@basic-deadline.html

  * igt@gem_exec_fair@basic-none-solo@rcs0:
    - shard-apl:          [PASS][9] -> [FAIL][10] ([i915#2842])
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-apl3/igt@gem_exec_fair@basic-none-solo@rcs0.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-apl4/igt@gem_exec_fair@basic-none-solo@rcs0.html

  * igt@gem_exec_fair@basic-none@vcs1:
    - shard-iclb:         NOTRUN -> [FAIL][11] ([i915#2842])
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-iclb4/igt@gem_exec_fair@basic-none@vcs1.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-kbl:          [PASS][12] -> [FAIL][13] ([i915#2842]) +1 similar issue
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-kbl1/igt@gem_exec_fair@basic-pace@vecs0.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl6/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@gem_huc_copy@huc-copy:
    - shard-kbl:          NOTRUN -> [SKIP][14] ([fdo#109271] / [i915#2190])
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl4/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@heavy-verify-random:
    - shard-skl:          NOTRUN -> [SKIP][15] ([fdo#109271] / [i915#4613]) +2 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl2/igt@gem_lmem_swapping@heavy-verify-random.html

  * igt@gem_softpin@allocator-evict-all-engines:
    - shard-glk:          [PASS][16] -> [DMESG-WARN][17] ([i915#118])
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-glk6/igt@gem_softpin@allocator-evict-all-engines.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-glk3/igt@gem_softpin@allocator-evict-all-engines.html

  * igt@gem_userptr_blits@dmabuf-sync:
    - shard-kbl:          NOTRUN -> [SKIP][18] ([fdo#109271] / [i915#3323])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl6/igt@gem_userptr_blits@dmabuf-sync.html
    - shard-skl:          NOTRUN -> [SKIP][19] ([fdo#109271] / [i915#3323])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl8/igt@gem_userptr_blits@dmabuf-sync.html

  * igt@gem_userptr_blits@vma-merge:
    - shard-skl:          NOTRUN -> [FAIL][20] ([i915#3318])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl9/igt@gem_userptr_blits@vma-merge.html

  * igt@i915_pm_dc@dc6-psr:
    - shard-skl:          NOTRUN -> [FAIL][21] ([i915#454])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl4/igt@i915_pm_dc@dc6-psr.html

  * igt@i915_pm_dc@dc9-dpms:
    - shard-iclb:         [PASS][22] -> [SKIP][23] ([i915#4281])
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-iclb4/igt@i915_pm_dc@dc9-dpms.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-iclb3/igt@i915_pm_dc@dc9-dpms.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-0-hflip:
    - shard-kbl:          NOTRUN -> [SKIP][24] ([fdo#109271] / [i915#3777])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl4/igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-0-hflip.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-hflip:
    - shard-skl:          NOTRUN -> [SKIP][25] ([fdo#109271] / [i915#3777]) +2 similar issues
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl4/igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-hflip.html

  * igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-180-async-flip:
    - shard-skl:          NOTRUN -> [FAIL][26] ([i915#3743]) +1 similar issue
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl9/igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-180-async-flip.html

  * igt@kms_ccs@pipe-a-crc-sprite-planes-basic-y_tiled_gen12_rc_ccs:
    - shard-apl:          NOTRUN -> [SKIP][27] ([fdo#109271]) +18 similar issues
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-apl6/igt@kms_ccs@pipe-a-crc-sprite-planes-basic-y_tiled_gen12_rc_ccs.html

  * igt@kms_ccs@pipe-a-missing-ccs-buffer-y_tiled_gen12_mc_ccs:
    - shard-kbl:          NOTRUN -> [SKIP][28] ([fdo#109271] / [i915#3886]) +4 similar issues
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl4/igt@kms_ccs@pipe-a-missing-ccs-buffer-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-b-ccs-on-another-bo-y_tiled_gen12_mc_ccs:
    - shard-skl:          NOTRUN -> [SKIP][29] ([fdo#109271] / [i915#3886]) +11 similar issues
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl7/igt@kms_ccs@pipe-b-ccs-on-another-bo-y_tiled_gen12_mc_ccs.html

  * igt@kms_color_chamelium@pipe-b-ctm-limited-range:
    - shard-apl:          NOTRUN -> [SKIP][30] ([fdo#109271] / [fdo#111827]) +1 similar issue
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-apl8/igt@kms_color_chamelium@pipe-b-ctm-limited-range.html

  * igt@kms_color_chamelium@pipe-b-ctm-max:
    - shard-skl:          NOTRUN -> [SKIP][31] ([fdo#109271] / [fdo#111827]) +29 similar issues
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl9/igt@kms_color_chamelium@pipe-b-ctm-max.html

  * igt@kms_color_chamelium@pipe-c-ctm-max:
    - shard-kbl:          NOTRUN -> [SKIP][32] ([fdo#109271] / [fdo#111827]) +4 similar issues
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl1/igt@kms_color_chamelium@pipe-c-ctm-max.html

  * igt@kms_color_chamelium@pipe-d-ctm-blue-to-red:
    - shard-snb:          NOTRUN -> [SKIP][33] ([fdo#109271] / [fdo#111827])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-snb4/igt@kms_color_chamelium@pipe-d-ctm-blue-to-red.html

  * igt@kms_cursor_crc@pipe-b-cursor-32x32-onscreen:
    - shard-skl:          NOTRUN -> [SKIP][34] ([fdo#109271]) +384 similar issues
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl2/igt@kms_cursor_crc@pipe-b-cursor-32x32-onscreen.html

  * igt@kms_cursor_crc@pipe-c-cursor-suspend:
    - shard-kbl:          [PASS][35] -> [DMESG-WARN][36] ([i915#180]) +1 similar issue
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-kbl1/igt@kms_cursor_crc@pipe-c-cursor-suspend.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl6/igt@kms_cursor_crc@pipe-c-cursor-suspend.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size:
    - shard-iclb:         [PASS][37] -> [FAIL][38] ([i915#2346])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-iclb7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-iclb7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html

  * igt@kms_flip@flip-vs-expired-vblank@b-hdmi-a2:
    - shard-glk:          [PASS][39] -> [FAIL][40] ([i915#79])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-glk3/igt@kms_flip@flip-vs-expired-vblank@b-hdmi-a2.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-glk4/igt@kms_flip@flip-vs-expired-vblank@b-hdmi-a2.html

  * igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1:
    - shard-skl:          NOTRUN -> [FAIL][41] ([i915#2122])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl10/igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-blt:
    - shard-kbl:          NOTRUN -> [SKIP][42] ([fdo#109271]) +72 similar issues
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl1/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-blt.html

  * igt@kms_frontbuffer_tracking@fbcpsr-rgb101010-draw-blt:
    - shard-snb:          NOTRUN -> [SKIP][43] ([fdo#109271]) +35 similar issues
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-snb4/igt@kms_frontbuffer_tracking@fbcpsr-rgb101010-draw-blt.html

  * igt@kms_hdr@bpc-switch-dpms:
    - shard-skl:          NOTRUN -> [FAIL][44] ([i915#1188])
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl6/igt@kms_hdr@bpc-switch-dpms.html

  * igt@kms_hdr@bpc-switch-suspend:
    - shard-skl:          [PASS][45] -> [FAIL][46] ([i915#1188])
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-skl8/igt@kms_hdr@bpc-switch-suspend.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl9/igt@kms_hdr@bpc-switch-suspend.html

  * igt@kms_pipe_crc_basic@hang-read-crc-pipe-d:
    - shard-skl:          NOTRUN -> [SKIP][47] ([fdo#109271] / [i915#533]) +2 similar issues
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl6/igt@kms_pipe_crc_basic@hang-read-crc-pipe-d.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a:
    - shard-kbl:          [PASS][48] -> [INCOMPLETE][49] ([i915#794])
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-kbl3/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl4/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-d:
    - shard-kbl:          NOTRUN -> [SKIP][50] ([fdo#109271] / [i915#533]) +1 similar issue
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl4/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-d.html

  * igt@kms_plane@plane-panning-bottom-right-suspend@pipe-a-planes:
    - shard-apl:          [PASS][51] -> [DMESG-WARN][52] ([i915#180]) +8 similar issues
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-apl7/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-a-planes.html
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-apl1/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-a-planes.html

  * igt@kms_plane_alpha_blend@pipe-a-alpha-transparent-fb:
    - shard-skl:          NOTRUN -> [FAIL][53] ([i915#265]) +1 similar issue
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl7/igt@kms_plane_alpha_blend@pipe-a-alpha-transparent-fb.html

  * igt@kms_plane_alpha_blend@pipe-c-constant-alpha-min:
    - shard-skl:          NOTRUN -> [FAIL][54] ([fdo#108145] / [i915#265]) +6 similar issues
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl7/igt@kms_plane_alpha_blend@pipe-c-constant-alpha-min.html

  * igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area:
    - shard-skl:          NOTRUN -> [SKIP][55] ([fdo#109271] / [i915#658]) +4 similar issues
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl7/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area.html

  * igt@kms_psr@psr2_cursor_plane_onoff:
    - shard-iclb:         [PASS][56] -> [SKIP][57] ([fdo#109441])
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-iclb2/igt@kms_psr@psr2_cursor_plane_onoff.html
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-iclb1/igt@kms_psr@psr2_cursor_plane_onoff.html

  * igt@perf@polling-parameterized:
    - shard-glk:          [PASS][58] -> [FAIL][59] ([i915#1542])
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-glk7/igt@perf@polling-parameterized.html
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-glk5/igt@perf@polling-parameterized.html

  * igt@sysfs_clients@fair-0:
    - shard-skl:          NOTRUN -> [SKIP][60] ([fdo#109271] / [i915#2994]) +4 similar issues
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl10/igt@sysfs_clients@fair-0.html

  
#### Possible fixes ####

  * igt@feature_discovery@psr2:
    - {shard-rkl}:        [SKIP][61] ([i915#658]) -> [PASS][62]
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@feature_discovery@psr2.html
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@feature_discovery@psr2.html

  * igt@gem_ctx_persistence@smoketest:
    - {shard-dg1}:        [DMESG-WARN][63] ([i915#4892]) -> [PASS][64]
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-dg1-19/igt@gem_ctx_persistence@smoketest.html
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-dg1-13/igt@gem_ctx_persistence@smoketest.html

  * igt@gem_ctx_shared@q-smoketest-all:
    - {shard-rkl}:        ([INCOMPLETE][65], [PASS][66]) -> [PASS][67]
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@gem_ctx_shared@q-smoketest-all.html
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@gem_ctx_shared@q-smoketest-all.html
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-2/igt@gem_ctx_shared@q-smoketest-all.html

  * igt@gem_eio@in-flight-contexts-1us:
    - shard-tglb:         [TIMEOUT][68] ([i915#3063]) -> [PASS][69]
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-tglb8/igt@gem_eio@in-flight-contexts-1us.html
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-tglb3/igt@gem_eio@in-flight-contexts-1us.html

  * igt@gem_exec_balancer@parallel:
    - shard-iclb:         [SKIP][70] ([i915#4525]) -> [PASS][71]
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-iclb8/igt@gem_exec_balancer@parallel.html
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-iclb1/igt@gem_exec_balancer@parallel.html

  * igt@gem_exec_create@forked@smem:
    - {shard-tglu}:       [INCOMPLETE][72] -> [PASS][73]
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-tglu-7/igt@gem_exec_create@forked@smem.html
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-tglu-8/igt@gem_exec_create@forked@smem.html

  * igt@gem_exec_create@legacy@smem:
    - {shard-rkl}:        [DMESG-WARN][74] -> [PASS][75]
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@gem_exec_create@legacy@smem.html
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-1/igt@gem_exec_create@legacy@smem.html

  * igt@gem_exec_fair@basic-deadline:
    - shard-glk:          [FAIL][76] ([i915#2846]) -> [PASS][77]
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-glk8/igt@gem_exec_fair@basic-deadline.html
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-glk3/igt@gem_exec_fair@basic-deadline.html

  * igt@gem_exec_fair@basic-flow@rcs0:
    - shard-tglb:         [FAIL][78] ([i915#2842]) -> [PASS][79]
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-tglb2/igt@gem_exec_fair@basic-flow@rcs0.html
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-tglb7/igt@gem_exec_fair@basic-flow@rcs0.html

  * igt@gem_exec_fair@basic-none-vip@rcs0:
    - shard-kbl:          [FAIL][80] ([i915#2842]) -> [PASS][81] +1 similar issue
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-kbl4/igt@gem_exec_fair@basic-none-vip@rcs0.html
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl6/igt@gem_exec_fair@basic-none-vip@rcs0.html

  * igt@gem_exec_suspend@basic-s3@smem:
    - shard-apl:          [DMESG-WARN][82] ([i915#180]) -> [PASS][83] +2 similar issues
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-apl1/igt@gem_exec_suspend@basic-s3@smem.html
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-apl8/igt@gem_exec_suspend@basic-s3@smem.html

  * igt@gem_linear_blits@interruptible:
    - {shard-tglu}:       [FAIL][84] -> [PASS][85]
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-tglu-7/igt@gem_linear_blits@interruptible.html
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-tglu-6/igt@gem_linear_blits@interruptible.html

  * igt@gem_mmap_offset@close-race:
    - {shard-rkl}:        [INCOMPLETE][86] -> [PASS][87]
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@gem_mmap_offset@close-race.html
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-5/igt@gem_mmap_offset@close-race.html

  * igt@gen9_exec_parse@allowed-all:
    - shard-skl:          [DMESG-WARN][88] ([i915#1436] / [i915#716]) -> [PASS][89]
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-skl7/igt@gen9_exec_parse@allowed-all.html
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl2/igt@gen9_exec_parse@allowed-all.html

  * igt@i915_pm_backlight@fade:
    - {shard-rkl}:        ([SKIP][90], [SKIP][91]) ([i915#3012]) -> [PASS][92]
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@i915_pm_backlight@fade.html
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@i915_pm_backlight@fade.html
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@i915_pm_backlight@fade.html

  * igt@i915_pm_dc@dc6-dpms:
    - shard-iclb:         [FAIL][93] ([i915#454]) -> [PASS][94]
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-iclb3/igt@i915_pm_dc@dc6-dpms.html
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-iclb4/igt@i915_pm_dc@dc6-dpms.html

  * igt@i915_pm_rpm@dpms-mode-unset-lpsp:
    - {shard-rkl}:        [SKIP][95] ([i915#1397]) -> [PASS][96]
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@i915_pm_rpm@dpms-mode-unset-lpsp.html
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@i915_pm_rpm@dpms-mode-unset-lpsp.html

  * igt@i915_pm_rpm@drm-resources-equal:
    - {shard-rkl}:        ([SKIP][97], [SKIP][98]) ([fdo#109308]) -> [PASS][99]
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-1/igt@i915_pm_rpm@drm-resources-equal.html
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@i915_pm_rpm@drm-resources-equal.html
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@i915_pm_rpm@drm-resources-equal.html

  * igt@i915_selftest@live@hangcheck:
    - shard-snb:          [INCOMPLETE][100] ([i915#3921]) -> [PASS][101]
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-snb6/igt@i915_selftest@live@hangcheck.html
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-snb4/igt@i915_selftest@live@hangcheck.html

  * igt@kms_atomic@atomic_plane_damage:
    - {shard-rkl}:        [SKIP][102] ([i915#4098]) -> [PASS][103] +3 similar issues
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-1/igt@kms_atomic@atomic_plane_damage.html
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_atomic@atomic_plane_damage.html

  * igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-0:
    - {shard-rkl}:        [SKIP][104] ([i915#1845]) -> [PASS][105] +8 similar issues
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-0.html
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-0.html

  * igt@kms_ccs@pipe-a-missing-ccs-buffer-y_tiled_gen12_rc_ccs:
    - {shard-rkl}:        [SKIP][106] ([i915#1845] / [i915#4098]) -> [PASS][107]
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-1/igt@kms_ccs@pipe-a-missing-ccs-buffer-y_tiled_gen12_rc_ccs.html
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_ccs@pipe-a-missing-ccs-buffer-y_tiled_gen12_rc_ccs.html

  * igt@kms_cursor_crc@pipe-a-cursor-128x42-rapid-movement:
    - {shard-rkl}:        [SKIP][108] ([fdo#112022]) -> [PASS][109]
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@kms_cursor_crc@pipe-a-cursor-128x42-rapid-movement.html
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_cursor_crc@pipe-a-cursor-128x42-rapid-movement.html

  * igt@kms_cursor_crc@pipe-a-cursor-64x21-rapid-movement:
    - {shard-rkl}:        ([SKIP][110], [SKIP][111]) ([fdo#112022]) -> [PASS][112]
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@kms_cursor_crc@pipe-a-cursor-64x21-rapid-movement.html
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@kms_cursor_crc@pipe-a-cursor-64x21-rapid-movement.html
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_cursor_crc@pipe-a-cursor-64x21-rapid-movement.html

  * igt@kms_cursor_crc@pipe-a-cursor-alpha-transparent:
    - {shard-rkl}:        ([SKIP][113], [SKIP][114]) ([fdo#112022] / [i915#4070]) -> [PASS][115]
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-1/igt@kms_cursor_crc@pipe-a-cursor-alpha-transparent.html
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@kms_cursor_crc@pipe-a-cursor-alpha-transparent.html
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_cursor_crc@pipe-a-cursor-alpha-transparent.html

  * igt@kms_cursor_crc@pipe-b-cursor-256x85-onscreen:
    - {shard-rkl}:        [SKIP][116] ([fdo#112022] / [i915#4070]) -> [PASS][117]
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-1/igt@kms_cursor_crc@pipe-b-cursor-256x85-onscreen.html
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_cursor_crc@pipe-b-cursor-256x85-onscreen.html

  * igt@kms_cursor_edge_walk@pipe-a-128x128-top-edge:
    - {shard-rkl}:        [SKIP][118] ([i915#1849] / [i915#4070]) -> [PASS][119] +2 similar issues
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-1/igt@kms_cursor_edge_walk@pipe-a-128x128-top-edge.html
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_cursor_edge_walk@pipe-a-128x128-top-edge.html

  * igt@kms_cursor_legacy@cursora-vs-flipa-atomic-transitions-varying-size:
    - {shard-rkl}:        ([SKIP][120], [SKIP][121]) ([fdo#111825] / [i915#4070]) -> [PASS][122]
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-1/igt@kms_cursor_legacy@cursora-vs-flipa-atomic-transitions-varying-size.html
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@kms_cursor_legacy@cursora-vs-flipa-atomic-transitions-varying-size.html
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_cursor_legacy@cursora-vs-flipa-atomic-transitions-varying-size.html

  * igt@kms_cursor_legacy@cursora-vs-flipa-varying-size:
    - shard-tglb:         [DMESG-WARN][123] ([i915#1982]) -> [PASS][124]
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-tglb2/igt@kms_cursor_legacy@cursora-vs-flipa-varying-size.html
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-tglb8/igt@kms_cursor_legacy@cursora-vs-flipa-varying-size.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions:
    - shard-iclb:         [FAIL][125] ([i915#2346]) -> [PASS][126]
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-iclb7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-iclb7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size:
    - shard-skl:          [FAIL][127] ([i915#2346] / [i915#533]) -> [PASS][128]
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-skl2/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html

  * igt@kms_cursor_legacy@pipe-c-single-bo:
    - {shard-rkl}:        [SKIP][129] ([i915#4070]) -> [PASS][130]
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-1/igt@kms_cursor_legacy@pipe-c-single-bo.html
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-4/igt@kms_cursor_legacy@pipe-c-single-bo.html

  * igt@kms_cursor_legacy@short-flip-after-cursor-atomic-transitions-varying-size:
    - shard-skl:          [FAIL][131] ([i915#2346]) -> [PASS][132]
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-skl2/igt@kms_cursor_legacy@short-flip-after-cursor-atomic-transitions-varying-size.html
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl10/igt@kms_cursor_legacy@short-flip-after-cursor-atomic-transitions-varying-size.html

  * igt@kms_draw_crc@draw-method-rgb565-mmap-wc-untiled:
    - {shard-rkl}:        [SKIP][133] ([fdo#111314]) -> [PASS][134] +2 similar issues
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-1/igt@kms_draw_crc@draw-method-rgb565-mmap-wc-untiled.html
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_draw_crc@draw-method-rgb565-mmap-wc-untiled.html

  * igt@kms_draw_crc@draw-method-xrgb8888-pwrite-ytiled:
    - {shard-rkl}:        ([SKIP][135], [SKIP][136]) ([fdo#111314] / [i915#4098]) -> [PASS][137]
   [135]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@kms_draw_crc@draw-method-xrgb8888-pwrite-ytiled.html
   [136]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@kms_draw_crc@draw-method-xrgb8888-pwrite-ytiled.html
   [137]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_draw_crc@draw-method-xrgb8888-pwrite-ytiled.html

  * igt@kms_fbcon_fbt@fbc-suspend:
    - shard-apl:          [INCOMPLETE][138] ([i915#180]) -> [PASS][139]
   [138]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-apl3/igt@kms_fbcon_fbt@fbc-suspend.html
   [139]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-apl4/igt@kms_fbcon_fbt@fbc-suspend.html

  * igt@kms_flip@plain-flip-fb-recreate@b-edp1:
    - shard-skl:          [FAIL][140] ([i915#2122]) -> [PASS][141] +1 similar issue
   [140]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-skl2/igt@kms_flip@plain-flip-fb-recreate@b-edp1.html
   [141]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-skl10/igt@kms_flip@plain-flip-fb-recreate@b-edp1.html

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-shrfb-plflip-blt:
    - {shard-rkl}:        ([SKIP][142], [SKIP][143]) ([i915#1849] / [i915#4098]) -> [PASS][144] +2 similar issues
   [142]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@kms_frontbuffer_tracking@fbc-1p-primscrn-shrfb-plflip-blt.html
   [143]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-4/igt@kms_frontbuffer_tracking@fbc-1p-primscrn-shrfb-plflip-blt.html
   [144]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-rkl-6/igt@kms_frontbuffer_tracking@fbc-1p-primscrn-shrfb-plflip-blt.html

  * igt@kms_frontbuffer_tracking@fbc-suspend:
    - shard-kbl:          [DMESG-WARN][145] ([i915#180]) -> [PASS][146] +3 similar issues
   [145]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-kbl6/igt@kms_frontbuffer_tracking@fbc-suspend.html
   [146]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shard-kbl6/igt@kms_frontbuffer_tracking@fbc-suspend.html

  * igt@kms_frontbuffer_tracking@psr-rgb565-draw-pwrite:
    - {shard-rkl}:        [SKIP][147] ([i915#1849]) -> [PASS][148] +10 similar issues
   [147]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11094/shard-rkl-5/igt@kms_frontbuffer_tracking@psr-rgb565-draw-pwrite.html
   [148]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/shar

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22019/index.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL
  2022-01-18 21:43   ` [Intel-gfx] " Matthew Brost
@ 2022-01-19  1:29     ` John Harrison
  -1 siblings, 0 replies; 33+ messages in thread
From: John Harrison @ 2022-01-19  1:29 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: thomas.hellstrom

On 1/18/2022 13:43, Matthew Brost wrote:
> Allocate intel_engine_coredump_alloc with ALLOW_FAIL rather than
> GFP_KERNEL do fully decouple the error capture from fence signalling.
s/do/to/

>
> Fixes: 8b91cdd4f8649 ("drm/i915: Use __GFP_KSWAPD_RECLAIM in the capture code")
Does this really count as a bug fix over that patch? Isn't it more of a
change in policy now that we do DRM fence signalling and that other
changes related to error capture behaviour have been implemented?

>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 67f3515f07e7a..aee42eae4729f 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1516,7 +1516,7 @@ capture_engine(struct intel_engine_cs *engine,
>   	struct i915_request *rq = NULL;
>   	unsigned long flags;
>   
> -	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
> +	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL);
This still makes me nervous that we will fail to allocate engine 
captures in stress test scenarios, which are exactly the kind of 
situations where we need valid error captures.

There is also still a GFP_KERNEL in __i915_error_grow(). Doesn't that 
need updating as well?

John.

>   	if (!ee)
>   		return NULL;
>   


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-18 21:43   ` [Intel-gfx] " Matthew Brost
@ 2022-01-19  1:37     ` John Harrison
  -1 siblings, 0 replies; 33+ messages in thread
From: John Harrison @ 2022-01-19  1:37 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: thomas.hellstrom

On 1/18/2022 13:43, Matthew Brost wrote:
> The G2H handler needs to be flushed during a GT reset but a G2H
> indicating engine reset failure can trigger a GT reset. Add a worker to
> trigger the GT when a engine reset failure is received to break this
s/a/an/

> circular dependency.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  5 ++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++----
>   2 files changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 9d26a86fe557a..60ea8deef5392 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -119,6 +119,11 @@ struct intel_guc {
>   		 * function as it might be in an atomic context (no sleeping)
>   		 */
>   		struct work_struct destroyed_worker;
> +		/**
> +		 * @reset_worker: worker to trigger a GT reset after an engine
> +		 * reset fails
> +		 */
> +		struct work_struct reset_worker;
>   	} submission_state;
>   
>   	/**
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 23a40f10d376d..cdd8d691251ff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   }
>   
>   static void destroyed_worker_func(struct work_struct *w);
> +static void reset_worker_func(struct work_struct *w);
>   
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
> @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
>   	INIT_WORK(&guc->submission_state.destroyed_worker,
>   		  destroyed_worker_func);
> +	INIT_WORK(&guc->submission_state.reset_worker,
> +		  reset_worker_func);
>   
>   	guc->submission_state.guc_ids_bitmap =
>   		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> @@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
>   	return gt->engine_class[engine_class][instance];
>   }
>   
> +static void reset_worker_func(struct work_struct *w)
> +{
> +	struct intel_guc *guc = container_of(w, struct intel_guc,
> +					     submission_state.reset_worker);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +
> +	intel_gt_handle_error(gt, ALL_ENGINES,
> +			      I915_ERROR_CAPTURE,
> +			      "GuC failed to reset a engine\n");
s/a/an/

> +}
> +
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   					 const u32 *msg, u32 len)
>   {
> @@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
>   		guc_class, instance, engine->name, reason);
>   
> -	intel_gt_handle_error(gt, engine->mask,
> -			      I915_ERROR_CAPTURE,
> -			      "GuC failed to reset %s (reason=0x%08x)\n",
> -			      engine->name, reason);
The engine name and reason code are lost from the error capture? I guess 
we still get it in the drm_err above, though. So probably not an issue. 
We shouldn't be getting these from end users and any internal CI run is 
only likely to give us the dmesg, not the error capture anyway! However, 
it still seems worth saving engine->mask in the submission_state
structure (ORing in, in case there are multiple resets). Clearing it 
should be safe because once a GT reset has happened, we aren't getting 
any more G2Hs. And we can't have multiple message handlers running 
concurrently, right? So no need to protect the OR either.
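
To make the suggestion concrete, here is a rough userspace sketch of the
idea, not a patch against the driver. The struct, the reset_fail_mask
field and both function names are hypothetical and exist only for
illustration. It assumes, as asked above, that failure handlers never
run concurrently with each other or with the worker, so the mask is
updated with a plain OR and no locking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of submission_state. */
struct submission_state_sketch {
	uint32_t reset_fail_mask;	/* engines whose reset has failed */
};

/*
 * Models the engine failure G2H handler: remember which engine failed
 * instead of resetting directly. Real code would queue the reset worker
 * right after this. A failure with no engine set is treated as a bug.
 */
void record_engine_failure(struct submission_state_sketch *s,
			   uint32_t engine_mask)
{
	assert(engine_mask != 0);
	s->reset_fail_mask |= engine_mask;	/* OR in: failures accumulate */
}

/*
 * Models the reset worker: take the accumulated mask and clear it, so
 * the caller can pass the specific engines to the reset/capture path.
 */
uint32_t take_reset_mask(struct submission_state_sketch *s)
{
	uint32_t mask = s->reset_fail_mask;

	s->reset_fail_mask = 0;
	return mask;
}
```

Recording failures for engine bits 0 and 3 and then calling
take_reset_mask() returns 0x9 and leaves the stored mask at zero. If the
worker could ever race with a handler, the plain OR and clear would need
an atomic exchange or a lock instead.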

John.


> +	/*
> +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> +	 * another worker to trigger a GT reset.
> +	 */
> +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
>   
>   	return 0;
>   }


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/3] drm/i915/guc: Flush G2H handler during a GT reset
  2022-01-18 21:43   ` [Intel-gfx] " Matthew Brost
@ 2022-01-19  1:38     ` John Harrison
  -1 siblings, 0 replies; 33+ messages in thread
From: John Harrison @ 2022-01-19  1:38 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: thomas.hellstrom

On 1/18/2022 13:43, Matthew Brost wrote:
> Now that the error capture is fully decoupled from fence signalling
> (request retirement to free memory, which is turn depends on resets) we
s/is/in/

With that fixed:
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

John.

> can safely flush the G2H handler during a GT reset. This is eliminates
> corner cases where GuC generated G2H (e.g. engine resets) race with a GT
> reset.
>
> Signed-off-by: Matthew Brost <mattthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +-----------------
>   1 file changed, 1 insertion(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index cdd8d691251ff..1a11e8986948b 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
>   
>   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   {
> -	int i;
> -
>   	if (unlikely(!guc_submission_initialized(guc))) {
>   		/* Reset called during driver load? GuC not yet initialised! */
>   		return;
> @@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   
>   	guc_flush_submissions(guc);
>   	guc_flush_destroyed_contexts(guc);
> -
> -	/*
> -	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> -	 * each pass as interrupt have been disabled. We always scrub for
> -	 * outstanding G2H as it is possible for outstanding_submission_g2h to
> -	 * be incremented after the context state update.
> -	 */
> -	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
> -		intel_guc_to_host_event_handler(guc);
> -#define wait_for_reset(guc, wait_var) \
> -		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
> -		do {
> -			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> -		} while (!list_empty(&guc->ct.requests.incoming));
> -	}
> +	flush_work(&guc->ct.requests.worker);
>   
>   	scrub_guc_desc_for_outstanding_g2h(guc);
>   }


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL
  2022-01-19  1:29     ` [Intel-gfx] " John Harrison
@ 2022-01-19 20:47       ` Matthew Brost
  -1 siblings, 0 replies; 33+ messages in thread
From: Matthew Brost @ 2022-01-19 20:47 UTC (permalink / raw)
  To: John Harrison; +Cc: thomas.hellstrom, intel-gfx, dri-devel

On Tue, Jan 18, 2022 at 05:29:54PM -0800, John Harrison wrote:
> On 1/18/2022 13:43, Matthew Brost wrote:
> > Allocate intel_engine_coredump_alloc with ALLOW_FAIL rather than
> > GFP_KERNEL do fully decouple the error capture from fence signalling.
> s/do/to/
> 

Yep.

> > 
> > Fixes: 8b91cdd4f8649 ("drm/i915: Use __GFP_KSWAPD_RECLAIM in the capture code")
> Does this really count as a bug fix over that patch? Isn't it more of a
> change in policy now that we do DRM fence signalling and that other
> changes related to error capture behaviour have been implemented?
>

That patch was supposed to allow signalling annotations to be added;
without this change I think those annotations would be broken. So I
think the Fixes tag is correct.
 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gpu_error.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > index 67f3515f07e7a..aee42eae4729f 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > @@ -1516,7 +1516,7 @@ capture_engine(struct intel_engine_cs *engine,
> >   	struct i915_request *rq = NULL;
> >   	unsigned long flags;
> > -	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
> > +	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL);
> This still makes me nervous that we will fail to allocate engine captures in
> stress test scenarios, which are exactly the kind of situations where we
> need valid error captures.
> 

Me too, but this whole file has been changed to use ALLOW_FAIL. Thomas
and Daniel seem to think this is correct. For what it's worth, this
allocation is less than a page, so it should be pretty safe to do with
ALLOW_FAIL.

> There is also still a GFP_KERNEL in __i915_error_grow(). Doesn't that need
> updating as well?
>

It should probably just be deleted. If you look, it tries with ALLOW_FAIL
first and then falls back to GFP_KERNEL. I didn't want to make that update
in this series yet, but it is something to keep an eye on.

Matt
 
> John.
> 
> >   	if (!ee)
> >   		return NULL;
> 
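
For illustration, a minimal userspace sketch of the two allocation
policies discussed in this message: a single attempt that is allowed to
fail, and the "try first, then fall back to a blocking allocation"
pattern described for __i915_error_grow(). This is not the driver code.
The enum, the alloc_fn hook and all function names are hypothetical, and
malloc() merely stands in for the kernel allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical modes standing in for ALLOW_FAIL and GFP_KERNEL. */
enum alloc_mode { MODE_ALLOW_FAIL, MODE_MAY_BLOCK };

/* Hypothetical allocator hook so the sketch can simulate memory pressure. */
typedef void *(*alloc_fn)(size_t size, enum alloc_mode mode);

/* Two-stage pattern: opportunistic attempt, then a blocking fallback. */
static void *alloc_with_fallback(alloc_fn alloc, size_t size)
{
	void *p = alloc(size, MODE_ALLOW_FAIL);

	if (!p)
		p = alloc(size, MODE_MAY_BLOCK);
	return p;
}

/* Single-stage pattern: report failure instead of blocking. */
static void *alloc_no_fallback(alloc_fn alloc, size_t size)
{
	return alloc(size, MODE_ALLOW_FAIL);
}

/* Simulated allocator under pressure: only the blocking mode succeeds. */
static void *alloc_under_pressure(size_t size, enum alloc_mode mode)
{
	return mode == MODE_MAY_BLOCK ? malloc(size) : NULL;
}
```

Under the simulated pressure, alloc_with_fallback() still returns memory
because it is willing to block, while alloc_no_fallback() returns NULL
and the caller has to cope with a missing capture. That is the trade-off
being weighed in the review.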

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-19  1:37     ` [Intel-gfx] " John Harrison
@ 2022-01-19 20:54       ` Matthew Brost
  -1 siblings, 0 replies; 33+ messages in thread
From: Matthew Brost @ 2022-01-19 20:54 UTC (permalink / raw)
  To: John Harrison; +Cc: thomas.hellstrom, intel-gfx, dri-devel

On Tue, Jan 18, 2022 at 05:37:01PM -0800, John Harrison wrote:
> On 1/18/2022 13:43, Matthew Brost wrote:
> > The G2H handler needs to be flushed during a GT reset but a G2H
> > indicating engine reset failure can trigger a GT reset. Add a worker to
> > trigger the GT when a engine reset failure is received to break this
> s/a/an/
> 

Yep.

> > circular dependency.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  5 ++++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++----
> >   2 files changed, 24 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 9d26a86fe557a..60ea8deef5392 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -119,6 +119,11 @@ struct intel_guc {
> >   		 * function as it might be in an atomic context (no sleeping)
> >   		 */
> >   		struct work_struct destroyed_worker;
> > +		/**
> > +		 * @reset_worker: worker to trigger a GT reset after an engine
> > +		 * reset fails
> > +		 */
> > +		struct work_struct reset_worker;
> >   	} submission_state;
> >   	/**
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 23a40f10d376d..cdd8d691251ff 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> >   }
> >   static void destroyed_worker_func(struct work_struct *w);
> > +static void reset_worker_func(struct work_struct *w);
> >   /*
> >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> > @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> >   	INIT_WORK(&guc->submission_state.destroyed_worker,
> >   		  destroyed_worker_func);
> > +	INIT_WORK(&guc->submission_state.reset_worker,
> > +		  reset_worker_func);
> >   	guc->submission_state.guc_ids_bitmap =
> >   		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> > @@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
> >   	return gt->engine_class[engine_class][instance];
> >   }
> > +static void reset_worker_func(struct work_struct *w)
> > +{
> > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +					     submission_state.reset_worker);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +
> > +	intel_gt_handle_error(gt, ALL_ENGINES,
> > +			      I915_ERROR_CAPTURE,
> > +			      "GuC failed to reset a engine\n");
> s/a/an/
> 

Yep.

> > +}
> > +
> >   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   					 const u32 *msg, u32 len)
> >   {
> > @@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
> >   		guc_class, instance, engine->name, reason);
> > -	intel_gt_handle_error(gt, engine->mask,
> > -			      I915_ERROR_CAPTURE,
> > -			      "GuC failed to reset %s (reason=0x%08x)\n",
> > -			      engine->name, reason);
> The engine name and reason code are lost from the error capture? I guess we
> still get it in the drm_err above, though. So probably not an issue. We
> shouldn't be getting these from end users and any internal CI run is only
> likely to give us the dmesg, not the error capture anyway! However, still

That was my reasoning on the msg too.

> seems like it is worth saving engine->mask in the submission_state structure
> (ORing in, in case there are multiple resets). Clearing it should be safe
> because once a GT reset has happened, we aren't getting any more G2Hs. And
> we can't have multiple message handlers running concurrently, right? So no
> need to protect the OR either.
> 

I could do that, but with GuC submission the engine->mask is really only
used for the error capture, as any i915-based reset with GuC submission
is a GT reset. Going from engine->mask to ALL_ENGINES will just capture
all engine state before doing a GT reset, which probably isn't a bad
thing, right?

I can update the commit message explaining this if that helps.

Matt 

> John.
> 
> 
> > +	/*
> > +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> > +	 * another worker to trigger a GT reset.
> > +	 */
> > +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
> >   	return 0;
> >   }
> 
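
For illustration, a single-threaded toy model of the circular dependency
this patch breaks. It is a sketch under simplifying assumptions, not
kernel code: work_model, guc_model and every function below are
hypothetical, and "flushing" is reduced to a flag check. The point it
shows is that a reset which flushes the G2H handler cannot be started
from inside that handler, whereas queueing a separate work item lets the
handler return first.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy work item; a stand-in for the kernel's struct work_struct. */
struct work_model {
	bool pending;
	void (*fn)(struct work_model *w);
};

/* Hypothetical, heavily reduced model of the GuC state involved. */
struct guc_model {
	struct work_model reset_worker;
	bool in_g2h_handler; /* true while the G2H handler is running */
	bool would_deadlock; /* a reset tried to flush its own caller */
	int gt_resets;       /* number of GT resets that completed */
};

/* A GT reset flushes the G2H handler, so it must not run inside it. */
static void gt_reset_model(struct guc_model *guc)
{
	if (guc->in_g2h_handler) {
		guc->would_deadlock = true;
		return;
	}
	guc->gt_resets++;
}

static void reset_worker_fn(struct work_model *w)
{
	/* Same idea as container_of(): recover the parent from the member. */
	struct guc_model *guc = (struct guc_model *)
		((char *)w - offsetof(struct guc_model, reset_worker));

	gt_reset_model(guc);
}

/* Mark the work pending; queueing again while pending is a no-op. */
static bool queue_work_model(struct work_model *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	return true;
}

/* Run the work later, outside the handler, as a workqueue thread would. */
static void run_pending_work(struct work_model *w)
{
	if (w->pending) {
		w->pending = false;
		w->fn(w);
	}
}

/* Direct call: a problem once the GT reset flushes the handler. */
static void g2h_handler_direct(struct guc_model *guc)
{
	guc->in_g2h_handler = true;
	gt_reset_model(guc);
	guc->in_g2h_handler = false;
}

/* Shape used by this patch: the handler only queues the reset worker. */
static void g2h_handler_deferred(struct guc_model *guc)
{
	guc->in_g2h_handler = true;
	queue_work_model(&guc->reset_worker);
	guc->in_g2h_handler = false;
}
```

In this model, two engine-reset failures handled back to back through
g2h_handler_deferred() collapse into one pending work item and therefore
one GT reset, which fits the argument above that a single ALL_ENGINES
reset is sufficient.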

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL
  2022-01-19 20:47       ` [Intel-gfx] " Matthew Brost
@ 2022-01-19 20:56         ` John Harrison
  -1 siblings, 0 replies; 33+ messages in thread
From: John Harrison @ 2022-01-19 20:56 UTC (permalink / raw)
  To: Matthew Brost; +Cc: thomas.hellstrom, intel-gfx, dri-devel

On 1/19/2022 12:47, Matthew Brost wrote:
> On Tue, Jan 18, 2022 at 05:29:54PM -0800, John Harrison wrote:
>> On 1/18/2022 13:43, Matthew Brost wrote:
>>> Allocate intel_engine_coredump_alloc with ALLOW_FAIL rather than
>>> GFP_KERNEL do fully decouple the error capture from fence signalling.
>> s/do/to/
>>
> Yep.
>
>>> Fixes: 8b91cdd4f8649 ("drm/i915: Use __GFP_KSWAPD_RECLAIM in the capture code")
>> Does this really count as a bug fix over that patch? Isn't it more of a
>> change in policy now that we do DRM fence signalling and that other
>> changes related to error capture behaviour have been implemented?
>>
> That patch was supposed to allow signalling annotations to be added,
> without this change I think these annotations would be broken. So I
> think the Fixes is correct.
>   
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_gpu_error.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> index 67f3515f07e7a..aee42eae4729f 100644
>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> @@ -1516,7 +1516,7 @@ capture_engine(struct intel_engine_cs *engine,
>>>    	struct i915_request *rq = NULL;
>>>    	unsigned long flags;
>>> -	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
>>> +	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL);
>> This still makes me nervous that we will fail to allocate engine captures in
>> stress test scenarios, which are exactly the kind of situations where we
>> need valid error captures.
>>
> Me too, but this whole file has been changed to the ALLOW_FAIL. Thomas
> and Daniel seem to think this is correct. For what it's worth this
> allocation is less than a page, so it should be pretty safe to do with
> ALLOW_FAIL.
>
>> There is also still a GFP_KERNEL in __i915_error_grow(). Doesn't that need
>> updating as well?
>>
> It should probably just be deleted. If you look, it tries with ALLOW_FAIL
> first and then falls back to GFP_KERNEL. I didn't want to make that update
> in this series yet, but it is something to keep an eye on.
>
> Matt
>   
Okay. Makes sense. With the description typo fixed:
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

>> John.
>>
>>>    	if (!ee)
>>>    		return NULL;


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-19 21:07         ` [Intel-gfx] " John Harrison
@ 2022-01-19 21:05           ` Matthew Brost
  -1 siblings, 0 replies; 33+ messages in thread
From: Matthew Brost @ 2022-01-19 21:05 UTC (permalink / raw)
  To: John Harrison; +Cc: thomas.hellstrom, intel-gfx, dri-devel

On Wed, Jan 19, 2022 at 01:07:22PM -0800, John Harrison wrote:
> On 1/19/2022 12:54, Matthew Brost wrote:
> > On Tue, Jan 18, 2022 at 05:37:01PM -0800, John Harrison wrote:
> > > On 1/18/2022 13:43, Matthew Brost wrote:
> > > > The G2H handler needs to be flushed during a GT reset but a G2H
> > > > indicating engine reset failure can trigger a GT reset. Add a worker to
> > > > trigger the GT when a engine reset failure is received to break this
> > > s/a/an/
> > > 
> > Yep.
> > 
> > > > circular dependency.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  5 ++++
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++----
> > > >    2 files changed, 24 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > index 9d26a86fe557a..60ea8deef5392 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > @@ -119,6 +119,11 @@ struct intel_guc {
> > > >    		 * function as it might be in an atomic context (no sleeping)
> > > >    		 */
> > > >    		struct work_struct destroyed_worker;
> > > > +		/**
> > > > +		 * @reset_worker: worker to trigger a GT reset after an engine
> > > > +		 * reset fails
> > > > +		 */
> > > > +		struct work_struct reset_worker;
> > > >    	} submission_state;
> > > >    	/**
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 23a40f10d376d..cdd8d691251ff 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> > > >    }
> > > >    static void destroyed_worker_func(struct work_struct *w);
> > > > +static void reset_worker_func(struct work_struct *w);
> > > >    /*
> > > >     * Set up the memory resources to be shared with the GuC (via the GGTT)
> > > > @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > > >    	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> > > >    	INIT_WORK(&guc->submission_state.destroyed_worker,
> > > >    		  destroyed_worker_func);
> > > > +	INIT_WORK(&guc->submission_state.reset_worker,
> > > > +		  reset_worker_func);
> > > >    	guc->submission_state.guc_ids_bitmap =
> > > >    		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> > > > @@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
> > > >    	return gt->engine_class[engine_class][instance];
> > > >    }
> > > > +static void reset_worker_func(struct work_struct *w)
> > > > +{
> > > > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > > > +					     submission_state.reset_worker);
> > > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > > +
> > > > +	intel_gt_handle_error(gt, ALL_ENGINES,
> > > > +			      I915_ERROR_CAPTURE,
> > > > +			      "GuC failed to reset a engine\n");
> > > s/a/an/
> > > 
> > Yep.
> > 
> > > > +}
> > > > +
> > > >    int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> > > >    					 const u32 *msg, u32 len)
> > > >    {
> > > > @@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> > > >    	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
> > > >    		guc_class, instance, engine->name, reason);
> > > > -	intel_gt_handle_error(gt, engine->mask,
> > > > -			      I915_ERROR_CAPTURE,
> > > > -			      "GuC failed to reset %s (reason=0x%08x)\n",
> > > > -			      engine->name, reason);
> > > The engine name and reason code are lost from the error capture? I guess we
> > > still get it in the drm_err above, though. So probably not an issue. We
> > > shouldn't be getting these from end users and any internal CI run is only
> > > likely to give us the dmesg, not the error capture anyway! However, still
> > That was my reasoning on the msg too.
> > 
> > > seems like it is worth saving engine->mask in the submission_state structure
> > > (ORing in, in case there are multiple resets). Clearing it should be safe
> > > because once a GT reset has happened, we aren't getting any more G2Hs. And
> > > we can't have multiple message handlers running concurrently, right? So no
> > > need to protect the OR either.
> > > 
> > I could do that but the engine->mask is really only used for the error
> > capture with GuC submission as any i915 based reset with GuC submission
> > is a GT reset. Going from engine->mask to ALL_ENGINES will just capture
> > all engine state before doing a GT reset which probably isn't a bad
> > thing, right?
> > 
> > I can update the commit message explaining this if that helps.
> Except that a failure to reset is notionally a hardware bug. As recently
> demonstrated, it could be a software bug due to timeouts being broken. But
> officially, it is something that should never happen. So in the rare case
> where one does show up, we would want to know as much as possible about the
> issue. Most especially - which engine it was that failed. And if all we get
> is a customer bug report with an error capture but no dmesg then we will
> have no idea which. It just seems wrong to be throwing away potentially
> important information for no real reason.
> 

Ok, will add an engine->mask that gets OR'd on every engine reset failure
and cleared on every GT reset in the worker. To be really safe, I should
probably protect this field with the submission state lock too.
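
Roughly, as a user-space sketch of that accumulate-and-drain scheme (for
illustration only, not the patch itself): the type and function names below
are hypothetical, and a pthread mutex stands in for the submission state
spinlock.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for intel_engine_mask_t. */
typedef uint32_t engine_mask_t;

/* Hypothetical container for the failure mask and the lock guarding it. */
struct reset_fail_state {
	pthread_mutex_t lock;	/* stands in for the submission state lock */
	engine_mask_t mask;	/* engines that have failed to reset */
};

void reset_fail_state_init(struct reset_fail_state *s)
{
	pthread_mutex_init(&s->lock, NULL);
	s->mask = 0;
}

/* Handler side: OR the failing engine into the pending mask. */
void reset_fail_record(struct reset_fail_state *s, engine_mask_t engine)
{
	pthread_mutex_lock(&s->lock);
	s->mask |= engine;
	pthread_mutex_unlock(&s->lock);
}

/* Worker side: take a snapshot of the mask and clear it in one step. */
engine_mask_t reset_fail_drain(struct reset_fail_state *s)
{
	engine_mask_t mask;

	pthread_mutex_lock(&s->lock);
	mask = s->mask;
	s->mask = 0;
	pthread_mutex_unlock(&s->lock);

	return mask;
}
```

Reading and clearing under the same lock means a failure recorded while the
worker runs is either part of the snapshot or left in the mask for the next
run; it cannot be lost in between.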

Matt 

> John.
> 
> 
> > 
> > Matt
> > 
> > > John.
> > > 
> > > 
> > > > +	/*
> > > > +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> > > > +	 * another worker to trigger a GT reset.
> > > > +	 */
> > > > +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
> > > >    	return 0;
> > > >    }
> 

* Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-19 20:54       ` [Intel-gfx] " Matthew Brost
@ 2022-01-19 21:07         ` John Harrison
  -1 siblings, 0 replies; 33+ messages in thread
From: John Harrison @ 2022-01-19 21:07 UTC (permalink / raw)
  To: Matthew Brost; +Cc: thomas.hellstrom, intel-gfx, dri-devel

On 1/19/2022 12:54, Matthew Brost wrote:
> On Tue, Jan 18, 2022 at 05:37:01PM -0800, John Harrison wrote:
>> On 1/18/2022 13:43, Matthew Brost wrote:
>>> The G2H handler needs to be flushed during a GT reset but a G2H
>>> indicating engine reset failure can trigger a GT reset. Add a worker to
>>> trigger the GT when a engine reset failure is received to break this
>> s/a/an/
>>
> Yep.
>
>>> circular dependency.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  5 ++++
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++----
>>>    2 files changed, 24 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> index 9d26a86fe557a..60ea8deef5392 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> @@ -119,6 +119,11 @@ struct intel_guc {
>>>    		 * function as it might be in an atomic context (no sleeping)
>>>    		 */
>>>    		struct work_struct destroyed_worker;
>>> +		/**
>>> +		 * @reset_worker: worker to trigger a GT reset after an engine
>>> +		 * reset fails
>>> +		 */
>>> +		struct work_struct reset_worker;
>>>    	} submission_state;
>>>    	/**
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 23a40f10d376d..cdd8d691251ff 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>>>    }
>>>    static void destroyed_worker_func(struct work_struct *w);
>>> +static void reset_worker_func(struct work_struct *w);
>>>    /*
>>>     * Set up the memory resources to be shared with the GuC (via the GGTT)
>>> @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>>    	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
>>>    	INIT_WORK(&guc->submission_state.destroyed_worker,
>>>    		  destroyed_worker_func);
>>> +	INIT_WORK(&guc->submission_state.reset_worker,
>>> +		  reset_worker_func);
>>>    	guc->submission_state.guc_ids_bitmap =
>>>    		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
>>> @@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
>>>    	return gt->engine_class[engine_class][instance];
>>>    }
>>> +static void reset_worker_func(struct work_struct *w)
>>> +{
>>> +	struct intel_guc *guc = container_of(w, struct intel_guc,
>>> +					     submission_state.reset_worker);
>>> +	struct intel_gt *gt = guc_to_gt(guc);
>>> +
>>> +	intel_gt_handle_error(gt, ALL_ENGINES,
>>> +			      I915_ERROR_CAPTURE,
>>> +			      "GuC failed to reset a engine\n");
>> s/a/an/
>>
> Yep.
>
>>> +}
>>> +
>>>    int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>    					 const u32 *msg, u32 len)
>>>    {
>>> @@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>    	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
>>>    		guc_class, instance, engine->name, reason);
>>> -	intel_gt_handle_error(gt, engine->mask,
>>> -			      I915_ERROR_CAPTURE,
>>> -			      "GuC failed to reset %s (reason=0x%08x)\n",
>>> -			      engine->name, reason);
>> The engine name and reason code are lost from the error capture? I guess we
>> still get it in the drm_err above, though. So probably not an issue. We
>> shouldn't be getting these from end users and any internal CI run is only
>> likely to give us the dmesg, not the error capture anyway! However, still
> That was my reasoning on the msg too.
>
>> seems like it is worth saving engine->mask in the submission_state structure
>> (ORing in, in case there are multiple resets). Clearing it should be safe
>> because once a GT reset has happened, we aren't getting any more G2Hs. And
>> we can't have multiple message handlers running concurrently, right? So no
>> need to protect the OR either.
>>
> I could do that but the engine->mask is really only used for the error
> capture with GuC submission as any i915 based reset with GuC submission
> is a GT reset. Going from engine->mask to ALL_ENGINES will just capture
> all engine state before doing a GT reset which probably isn't a bad
> thing, right?
>
> I can update the commit message explaining this if that helps.
Except that a failure to reset is notionally a hardware bug. As recently 
demonstrated, it could be a software bug due to timeouts being broken. 
But officially, it is something that should never happen. So in the rare 
case where one does show up, we would want to know as much as possible 
about the issue. Most especially - which engine it was that failed. And 
if all we get is a customer bug report with an error capture but no 
dmesg then we will have no idea which. It just seems wrong to be 
throwing away potentially important information for no real reason.

John.


>
> Matt
>
>> John.
>>
>>
>>> +	/*
>>> +	 * A GT reset flushes this worker queue (G2H handler) so we must use
>>> +	 * another worker to trigger a GT reset.
>>> +	 */
>>> +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
>>>    	return 0;
>>>    }



* Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-21  4:31 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger " Matthew Brost
@ 2022-01-21 18:53   ` John Harrison
  0 siblings, 0 replies; 33+ messages in thread
From: John Harrison @ 2022-01-21 18:53 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel

On 1/20/2022 20:31, Matthew Brost wrote:
> The G2H handler needs to be flushed during a GT reset but a G2H
> indicating engine reset failure can trigger a GT reset. Add a worker to
> trigger the GT reset when an engine reset failure is received to break
> this circular dependency.
>
> v2:
>   (John Harrison)
>    - Store engine reset mask
>    - Fix typo in commit message
> v3:
>   (John Harrison)
>    - Fix another typo in commit message
>    - s/reset_*/reset_fail_*/
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  9 +++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 37 +++++++++++++++++--
>   2 files changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 9d26a86fe557a..d59bbf49d1c2b 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -119,6 +119,15 @@ struct intel_guc {
>   		 * function as it might be in an atomic context (no sleeping)
>   		 */
>   		struct work_struct destroyed_worker;
> +		/**
> +		 * @reset_fail_worker: worker to trigger a GT reset after an
> +		 * engine reset fails
> +		 */
> +		struct work_struct reset_fail_worker;
> +		/**
> +		 * @reset_fail_mask: mask of engines that failed to reset
> +		 */
> +		intel_engine_mask_t reset_fail_mask;
>   	} submission_state;
>   
>   	/**
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3918f1be114fa..9a3f503d201aa 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1731,6 +1731,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   }
>   
>   static void destroyed_worker_func(struct work_struct *w);
> +static void reset_fail_worker_func(struct work_struct *w);
>   
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
> @@ -1761,6 +1762,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
>   	INIT_WORK(&guc->submission_state.destroyed_worker,
>   		  destroyed_worker_func);
> +	INIT_WORK(&guc->submission_state.reset_fail_worker,
> +		  reset_fail_worker_func);
>   
>   	guc->submission_state.guc_ids_bitmap =
>   		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> @@ -4026,6 +4029,26 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
>   	return gt->engine_class[engine_class][instance];
>   }
>   
> +static void reset_fail_worker_func(struct work_struct *w)
> +{
> +	struct intel_guc *guc = container_of(w, struct intel_guc,
> +					     submission_state.reset_fail_worker);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	intel_engine_mask_t reset_fail_mask;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	reset_fail_mask = guc->submission_state.reset_fail_mask;
> +	guc->submission_state.reset_fail_mask = 0;
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +
> +	if (likely(reset_fail_mask))
> +		intel_gt_handle_error(gt, reset_fail_mask,
> +				      I915_ERROR_CAPTURE,
> +				      "GuC failed to reset engine mask=0x%x\n",
> +				      reset_fail_mask);
> +}
> +
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   					 const u32 *msg, u32 len)
>   {
> @@ -4033,6 +4056,7 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   	struct intel_gt *gt = guc_to_gt(guc);
>   	u8 guc_class, instance;
>   	u32 reason;
> +	unsigned long flags;
>   
>   	if (unlikely(len != 3)) {
>   		drm_err(&gt->i915->drm, "Invalid length %u", len);
> @@ -4057,10 +4081,15 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
>   		guc_class, instance, engine->name, reason);
>   
> -	intel_gt_handle_error(gt, engine->mask,
> -			      I915_ERROR_CAPTURE,
> -			      "GuC failed to reset %s (reason=0x%08x)\n",
> -			      engine->name, reason);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	guc->submission_state.reset_fail_mask |= engine->mask;
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +
> +	/*
> +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> +	 * another worker to trigger a GT reset.
> +	 */
> +	queue_work(system_unbound_wq, &guc->submission_state.reset_fail_worker);
>   
>   	return 0;
>   }


* [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-21  4:31 [PATCH 0/3] Flush G2H handler during " Matthew Brost
@ 2022-01-21  4:31 ` Matthew Brost
  2022-01-21 18:53   ` John Harrison
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Brost @ 2022-01-21  4:31 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison

The G2H handler needs to be flushed during a GT reset but a G2H
indicating engine reset failure can trigger a GT reset. Add a worker to
trigger the GT reset when an engine reset failure is received to break
this circular dependency.

v2:
 (John Harrison)
  - Store engine reset mask
  - Fix typo in commit message
v3:
 (John Harrison)
  - Fix another typo in commit message
  - s/reset_*/reset_fail_*/

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  9 +++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 37 +++++++++++++++++--
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 9d26a86fe557a..d59bbf49d1c2b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -119,6 +119,15 @@ struct intel_guc {
 		 * function as it might be in an atomic context (no sleeping)
 		 */
 		struct work_struct destroyed_worker;
+		/**
+		 * @reset_fail_worker: worker to trigger a GT reset after an
+		 * engine reset fails
+		 */
+		struct work_struct reset_fail_worker;
+		/**
+		 * @reset_fail_mask: mask of engines that failed to reset
+		 */
+		intel_engine_mask_t reset_fail_mask;
 	} submission_state;
 
 	/**
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3918f1be114fa..9a3f503d201aa 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1731,6 +1731,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 }
 
 static void destroyed_worker_func(struct work_struct *w);
+static void reset_fail_worker_func(struct work_struct *w);
 
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
@@ -1761,6 +1762,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
 	INIT_WORK(&guc->submission_state.destroyed_worker,
 		  destroyed_worker_func);
+	INIT_WORK(&guc->submission_state.reset_fail_worker,
+		  reset_fail_worker_func);
 
 	guc->submission_state.guc_ids_bitmap =
 		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
@@ -4026,6 +4029,26 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
 	return gt->engine_class[engine_class][instance];
 }
 
+static void reset_fail_worker_func(struct work_struct *w)
+{
+	struct intel_guc *guc = container_of(w, struct intel_guc,
+					     submission_state.reset_fail_worker);
+	struct intel_gt *gt = guc_to_gt(guc);
+	intel_engine_mask_t reset_fail_mask;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	reset_fail_mask = guc->submission_state.reset_fail_mask;
+	guc->submission_state.reset_fail_mask = 0;
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	if (likely(reset_fail_mask))
+		intel_gt_handle_error(gt, reset_fail_mask,
+				      I915_ERROR_CAPTURE,
+				      "GuC failed to reset engine mask=0x%x\n",
+				      reset_fail_mask);
+}
+
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 					 const u32 *msg, u32 len)
 {
@@ -4033,6 +4056,7 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 	struct intel_gt *gt = guc_to_gt(guc);
 	u8 guc_class, instance;
 	u32 reason;
+	unsigned long flags;
 
 	if (unlikely(len != 3)) {
 		drm_err(&gt->i915->drm, "Invalid length %u", len);
@@ -4057,10 +4081,15 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
 		guc_class, instance, engine->name, reason);
 
-	intel_gt_handle_error(gt, engine->mask,
-			      I915_ERROR_CAPTURE,
-			      "GuC failed to reset %s (reason=0x%08x)\n",
-			      engine->name, reason);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	guc->submission_state.reset_fail_mask |= engine->mask;
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	/*
+	 * A GT reset flushes this worker queue (G2H handler) so we must use
+	 * another worker to trigger a GT reset.
+	 */
+	queue_work(system_unbound_wq, &guc->submission_state.reset_fail_worker);
 
 	return 0;
 }
-- 
2.34.1

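For readers following the thread, the deferral described in the commit
message can be modelled in plain C as below. This is a hypothetical,
single-threaded sketch and not the i915 code: struct deferred_reset and both
functions are made-up names, the "pending" flag only imitates a work item
that is already queued, and no real workqueue or locking is involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a deferred GT reset request. */
struct deferred_reset {
	bool pending;			/* imitates "work item already queued" */
	uint32_t fail_mask;		/* engines reported as failed so far */
	unsigned int resets_run;	/* number of modelled GT resets */
	uint32_t last_reset_mask;	/* mask handed to the last modelled reset */
};

/*
 * Message-handler side: record the failure and queue the work. It never
 * performs the reset itself, since in the scenario from the commit message
 * the reset would wait for this very handler to finish.
 * Returns true if the work was newly queued, false if it was already pending.
 */
bool handler_report_failure(struct deferred_reset *d, uint32_t engine_mask)
{
	bool newly_queued = !d->pending;

	d->fail_mask |= engine_mask;
	d->pending = true;

	return newly_queued;
}

/* Separate-worker side: runs later and consumes everything recorded so far. */
void run_deferred_reset(struct deferred_reset *d)
{
	uint32_t mask;

	if (!d->pending)
		return;

	d->pending = false;
	mask = d->fail_mask;
	d->fail_mask = 0;

	if (mask) {
		d->last_reset_mask = mask;	/* stands in for the GT reset call */
		d->resets_run++;
	}
}
```

In this model, two failures reported before the worker runs collapse into a
single reset whose mask covers both engines.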

* Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-21  1:34   ` John Harrison
@ 2022-01-21  4:04     ` Matthew Brost
  0 siblings, 0 replies; 33+ messages in thread
From: Matthew Brost @ 2022-01-21  4:04 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel

On Thu, Jan 20, 2022 at 05:34:54PM -0800, John Harrison wrote:
> On 1/19/2022 13:24, Matthew Brost wrote:
> > The G2H handler needs to be flushed during a GT reset but a G2H
> > indicating engine reset failure can trigger a GT reset. Add a worker to
> > trigger the GT when an engine reset failure is received to break this
> trigger the GT reset?
> 

Yes.

> > circular dependency.
> > 
> > v2:
> >   (John Harrison)
> >    - Store engine reset mask
> >    - Fix typo in commit message
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  9 +++++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 37 +++++++++++++++++--
> >   2 files changed, 42 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 9d26a86fe557..c4a9fc7dd246 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -119,6 +119,15 @@ struct intel_guc {
> >   		 * function as it might be in an atomic context (no sleeping)
> >   		 */
> >   		struct work_struct destroyed_worker;
> > +		/**
> > +		 * @reset_worker: worker to trigger a GT reset after an engine
> > +		 * reset fails
> > +		 */
> > +		struct work_struct reset_worker;
> > +		/**
> > +		 * @reset_mask: mask of engines that failed to reset
> > +		 */
> > +		intel_engine_mask_t reset_mask;
> reset_fail_mask might be a less ambiguous name? Same for the worker struct
> and function.
> 

How about:

struct {
	worker;
	mask;
} engine_reset_fail;

Matt
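
Spelled out, the grouping proposed above might look like the sketch below. This is only a guess at the shape: the member types are assumed from the patch, `struct guc_sketch`, `struct work_item` and `guc_from_worker()` are hypothetical stand-ins, and `container_of` is redefined locally so the example builds outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct work_struct. */
struct work_item {
	void (*func)(struct work_item *w);
};

typedef uint32_t engine_mask_t;	/* stand-in for intel_engine_mask_t */

struct guc_sketch {
	/* Worker and mask grouped under one name, as proposed above. */
	struct {
		struct work_item worker;	/* triggers the GT reset */
		engine_mask_t mask;		/* engines that failed to reset */
	} engine_reset_fail;
};

/* Local equivalent of the kernel macro, without its type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the owning structure from the embedded, nested work item. */
static struct guc_sketch *guc_from_worker(struct work_item *w)
{
	assert(w != NULL);
	return container_of(w, struct guc_sketch, engine_reset_fail.worker);
}
```

The nested member designator in `container_of` is the same form the patch already uses with `submission_state.reset_worker`, so grouping the fields would only change the member path.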

> John.
> 
> >   	} submission_state;
> >   	/**
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 3918f1be114f..514b3060b141 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1731,6 +1731,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> >   }
> >   static void destroyed_worker_func(struct work_struct *w);
> > +static void reset_worker_func(struct work_struct *w);
> >   /*
> >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> > @@ -1761,6 +1762,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> >   	INIT_WORK(&guc->submission_state.destroyed_worker,
> >   		  destroyed_worker_func);
> > +	INIT_WORK(&guc->submission_state.reset_worker,
> > +		  reset_worker_func);
> >   	guc->submission_state.guc_ids_bitmap =
> >   		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> > @@ -4026,6 +4029,26 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
> >   	return gt->engine_class[engine_class][instance];
> >   }
> > +static void reset_worker_func(struct work_struct *w)
> > +{
> > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +					     submission_state.reset_worker);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	intel_engine_mask_t reset_mask;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	reset_mask = guc->submission_state.reset_mask;
> > +	guc->submission_state.reset_mask = 0;
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +
> > +	if (likely(reset_mask))
> > +		intel_gt_handle_error(gt, reset_mask,
> > +				      I915_ERROR_CAPTURE,
> > +				      "GuC failed to reset engine mask=0x%x\n",
> > +				      reset_mask);
> > +}
> > +
> >   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   					 const u32 *msg, u32 len)
> >   {
> > @@ -4033,6 +4056,7 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   	struct intel_gt *gt = guc_to_gt(guc);
> >   	u8 guc_class, instance;
> >   	u32 reason;
> > +	unsigned long flags;
> >   	if (unlikely(len != 3)) {
> >   		drm_err(&gt->i915->drm, "Invalid length %u", len);
> > @@ -4057,10 +4081,15 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
> >   	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
> >   		guc_class, instance, engine->name, reason);
> > -	intel_gt_handle_error(gt, engine->mask,
> > -			      I915_ERROR_CAPTURE,
> > -			      "GuC failed to reset %s (reason=0x%08x)\n",
> > -			      engine->name, reason);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	guc->submission_state.reset_mask |= engine->mask;
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +
> > +	/*
> > +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> > +	 * another worker to trigger a GT reset.
> > +	 */
> > +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
> >   	return 0;
> >   }
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-19 21:24 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger " Matthew Brost
@ 2022-01-21  1:34   ` John Harrison
  2022-01-21  4:04     ` Matthew Brost
  0 siblings, 1 reply; 33+ messages in thread
From: John Harrison @ 2022-01-21  1:34 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-gfx

On 1/19/2022 13:24, Matthew Brost wrote:
> The G2H handler needs to be flushed during a GT reset but a G2H
> indicating engine reset failure can trigger a GT reset. Add a worker to
> trigger the GT when an engine reset failure is received to break this
trigger the GT reset?

> circular dependency.
>
> v2:
>   (John Harrison)
>    - Store engine reset mask
>    - Fix typo in commit message
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  9 +++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 37 +++++++++++++++++--
>   2 files changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 9d26a86fe557..c4a9fc7dd246 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -119,6 +119,15 @@ struct intel_guc {
>   		 * function as it might be in an atomic context (no sleeping)
>   		 */
>   		struct work_struct destroyed_worker;
> +		/**
> +		 * @reset_worker: worker to trigger a GT reset after an engine
> +		 * reset fails
> +		 */
> +		struct work_struct reset_worker;
> +		/**
> +		 * @reset_mask: mask of engines that failed to reset
> +		 */
> +		intel_engine_mask_t reset_mask;
reset_fail_mask might be a less ambiguous name? Same for the worker 
struct and function.

John.

>   	} submission_state;
>   
>   	/**
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3918f1be114f..514b3060b141 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1731,6 +1731,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   }
>   
>   static void destroyed_worker_func(struct work_struct *w);
> +static void reset_worker_func(struct work_struct *w);
>   
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
> @@ -1761,6 +1762,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
>   	INIT_WORK(&guc->submission_state.destroyed_worker,
>   		  destroyed_worker_func);
> +	INIT_WORK(&guc->submission_state.reset_worker,
> +		  reset_worker_func);
>   
>   	guc->submission_state.guc_ids_bitmap =
>   		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> @@ -4026,6 +4029,26 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
>   	return gt->engine_class[engine_class][instance];
>   }
>   
> +static void reset_worker_func(struct work_struct *w)
> +{
> +	struct intel_guc *guc = container_of(w, struct intel_guc,
> +					     submission_state.reset_worker);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	intel_engine_mask_t reset_mask;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	reset_mask = guc->submission_state.reset_mask;
> +	guc->submission_state.reset_mask = 0;
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +
> +	if (likely(reset_mask))
> +		intel_gt_handle_error(gt, reset_mask,
> +				      I915_ERROR_CAPTURE,
> +				      "GuC failed to reset engine mask=0x%x\n",
> +				      reset_mask);
> +}
> +
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   					 const u32 *msg, u32 len)
>   {
> @@ -4033,6 +4056,7 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   	struct intel_gt *gt = guc_to_gt(guc);
>   	u8 guc_class, instance;
>   	u32 reason;
> +	unsigned long flags;
>   
>   	if (unlikely(len != 3)) {
>   		drm_err(&gt->i915->drm, "Invalid length %u", len);
> @@ -4057,10 +4081,15 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
>   		guc_class, instance, engine->name, reason);
>   
> -	intel_gt_handle_error(gt, engine->mask,
> -			      I915_ERROR_CAPTURE,
> -			      "GuC failed to reset %s (reason=0x%08x)\n",
> -			      engine->name, reason);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	guc->submission_state.reset_mask |= engine->mask;
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +
> +	/*
> +	 * A GT reset flushes this worker queue (G2H handler) so we must use
> +	 * another worker to trigger a GT reset.
> +	 */
> +	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
>   
>   	return 0;
>   }


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset
  2022-01-19 21:24 [PATCH 0/3] Flush G2H handler during a GT reset Matthew Brost
@ 2022-01-19 21:24 ` Matthew Brost
  2022-01-21  1:34   ` John Harrison
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Brost @ 2022-01-19 21:24 UTC (permalink / raw)
  To: dri-devel, intel-gfx; +Cc: john.c.harrison

The G2H handler needs to be flushed during a GT reset but a G2H
indicating engine reset failure can trigger a GT reset. Add a worker to
trigger the GT when an engine reset failure is received to break this
circular dependency.

v2:
 (John Harrison)
  - Store engine reset mask
  - Fix typo in commit message

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  9 +++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 37 +++++++++++++++++--
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 9d26a86fe557..c4a9fc7dd246 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -119,6 +119,15 @@ struct intel_guc {
 		 * function as it might be in an atomic context (no sleeping)
 		 */
 		struct work_struct destroyed_worker;
+		/**
+		 * @reset_worker: worker to trigger a GT reset after an engine
+		 * reset fails
+		 */
+		struct work_struct reset_worker;
+		/**
+		 * @reset_mask: mask of engines that failed to reset
+		 */
+		intel_engine_mask_t reset_mask;
 	} submission_state;
 
 	/**
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3918f1be114f..514b3060b141 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1731,6 +1731,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 }
 
 static void destroyed_worker_func(struct work_struct *w);
+static void reset_worker_func(struct work_struct *w);
 
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
@@ -1761,6 +1762,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
 	INIT_WORK(&guc->submission_state.destroyed_worker,
 		  destroyed_worker_func);
+	INIT_WORK(&guc->submission_state.reset_worker,
+		  reset_worker_func);
 
 	guc->submission_state.guc_ids_bitmap =
 		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
@@ -4026,6 +4029,26 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
 	return gt->engine_class[engine_class][instance];
 }
 
+static void reset_worker_func(struct work_struct *w)
+{
+	struct intel_guc *guc = container_of(w, struct intel_guc,
+					     submission_state.reset_worker);
+	struct intel_gt *gt = guc_to_gt(guc);
+	intel_engine_mask_t reset_mask;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	reset_mask = guc->submission_state.reset_mask;
+	guc->submission_state.reset_mask = 0;
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	if (likely(reset_mask))
+		intel_gt_handle_error(gt, reset_mask,
+				      I915_ERROR_CAPTURE,
+				      "GuC failed to reset engine mask=0x%x\n",
+				      reset_mask);
+}
+
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 					 const u32 *msg, u32 len)
 {
@@ -4033,6 +4056,7 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 	struct intel_gt *gt = guc_to_gt(guc);
 	u8 guc_class, instance;
 	u32 reason;
+	unsigned long flags;
 
 	if (unlikely(len != 3)) {
 		drm_err(&gt->i915->drm, "Invalid length %u", len);
@@ -4057,10 +4081,15 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X",
 		guc_class, instance, engine->name, reason);
 
-	intel_gt_handle_error(gt, engine->mask,
-			      I915_ERROR_CAPTURE,
-			      "GuC failed to reset %s (reason=0x%08x)\n",
-			      engine->name, reason);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	guc->submission_state.reset_mask |= engine->mask;
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+	/*
+	 * A GT reset flushes this worker queue (G2H handler) so we must use
+	 * another worker to trigger a GT reset.
+	 */
+	queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread
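
The circular dependency described in the commit message can be modelled with a toy, single-threaded queue. The sketch below is not the kernel workqueue API: `struct toy_wq`, `toy_queue()`, `toy_flush()` and the handler names are all hypothetical, and a flush issued from one of the queue's own items simply returns false where a real flush would wait on itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_WQ_DEPTH 8

/* Toy stand-in for a work queue; items run only when it is flushed. */
struct toy_wq {
	void (*items[TOY_WQ_DEPTH])(void);
	size_t head, tail;
	bool running;	/* true while one of this queue's items executes */
};

static struct toy_wq g2h_wq, reset_wq;
static int resets_done;

static bool toy_queue(struct toy_wq *wq, void (*fn)(void))
{
	assert(fn != NULL);
	if (wq->tail - wq->head == TOY_WQ_DEPTH)
		return false;	/* queue full */
	wq->items[wq->tail++ % TOY_WQ_DEPTH] = fn;
	return true;
}

/*
 * Run everything pending on @wq. Fails when called from one of the
 * queue's own items, where a real flush would never complete.
 */
static bool toy_flush(struct toy_wq *wq)
{
	if (wq->running)
		return false;
	while (wq->head != wq->tail) {
		void (*fn)(void) = wq->items[wq->head++ % TOY_WQ_DEPTH];

		wq->running = true;
		fn();
		wq->running = false;
	}
	return true;
}

/* The reset must flush the G2H queue before it can proceed. */
static void toy_gt_reset(void)
{
	if (toy_flush(&g2h_wq))
		resets_done++;
}

/* Resetting from inside the G2H handler: the flush cannot succeed. */
static void g2h_handler_direct(void)
{
	toy_gt_reset();
}

/* Deferring to a second queue breaks the cycle. */
static void g2h_handler_deferred(void)
{
	toy_queue(&reset_wq, toy_gt_reset);
}
```

Queuing `g2h_handler_direct` on `g2h_wq` and flushing it completes no reset, whereas `g2h_handler_deferred` lets the reset run later from `reset_wq`, by which time the G2H queue is idle and can be flushed.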

end of thread, other threads:[~2022-01-21 18:54 UTC | newest]

Thread overview: 33+ messages
2022-01-18 21:43 [PATCH 0/3] Flush G2H handler during a GT reset Matthew Brost
2022-01-18 21:43 ` [Intel-gfx] " Matthew Brost
2022-01-18 21:43 ` [PATCH 1/3] drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL Matthew Brost
2022-01-18 21:43   ` [Intel-gfx] " Matthew Brost
2022-01-19  1:29   ` John Harrison
2022-01-19  1:29     ` [Intel-gfx] " John Harrison
2022-01-19 20:47     ` Matthew Brost
2022-01-19 20:47       ` [Intel-gfx] " Matthew Brost
2022-01-19 20:56       ` John Harrison
2022-01-19 20:56         ` [Intel-gfx] " John Harrison
2022-01-18 21:43 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger a GT reset Matthew Brost
2022-01-18 21:43   ` [Intel-gfx] " Matthew Brost
2022-01-19  1:37   ` John Harrison
2022-01-19  1:37     ` [Intel-gfx] " John Harrison
2022-01-19 20:54     ` Matthew Brost
2022-01-19 20:54       ` [Intel-gfx] " Matthew Brost
2022-01-19 21:07       ` John Harrison
2022-01-19 21:07         ` [Intel-gfx] " John Harrison
2022-01-19 21:05         ` Matthew Brost
2022-01-19 21:05           ` [Intel-gfx] " Matthew Brost
2022-01-18 21:43 ` [PATCH 3/3] drm/i915/guc: Flush G2H handler during " Matthew Brost
2022-01-18 21:43   ` [Intel-gfx] " Matthew Brost
2022-01-19  1:38   ` John Harrison
2022-01-19  1:38     ` [Intel-gfx] " John Harrison
2022-01-18 22:01 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Flush G2H handler during a GT reset (rev2) Patchwork
2022-01-18 22:02 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-01-18 22:32 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-01-19  1:02 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
2022-01-19 21:24 [PATCH 0/3] Flush G2H handler during a GT reset Matthew Brost
2022-01-19 21:24 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger " Matthew Brost
2022-01-21  1:34   ` John Harrison
2022-01-21  4:04     ` Matthew Brost
2022-01-21  4:31 [PATCH 0/3] Flush G2H handler during " Matthew Brost
2022-01-21  4:31 ` [PATCH 2/3] drm/i915/guc: Add work queue to trigger " Matthew Brost
2022-01-21 18:53   ` John Harrison
