[PATCH 0/6] Random assortment of (mostly) GuC related patches

dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] Random assortment of (mostly) GuC related patches
@ 2022-07-28  2:42 John.C.Harrison
  2022-07-28  2:42 ` [PATCH 1/6] drm/i915/guc: Route semaphores to GuC for Gen12+ John.C.Harrison
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: John.C.Harrison @ 2022-07-28  2:42 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Pushing a bunch of patches which had gotten forgotten about.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (2):
  drm/i915/selftest: Cope with not having an RCS engine
  drm/i915/guc: Don't abort on CTB_UNUSED status

Matthew Brost (2):
  drm/i915/guc: Fix issues with live_preempt_cancel
  drm/i915/guc: Support larger contexts on newer hardware

Michał Winiarski (1):
  drm/i915/guc: Route semaphores to GuC for Gen12+

Rahul Kumar Singh (1):
  drm/i915/guc: Add selftest for a hung GuC

 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  16 +-
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |  12 +-
 .../gt/uc/abi/guc_communication_ctb_abi.h     |   8 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  10 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  18 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_reg.h    |   4 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  15 ++
 .../drm/i915/gt/uc/selftest_guc_hangcheck.c   | 159 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 9 files changed, 227 insertions(+), 16 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c

-- 
2.37.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/6] drm/i915/guc: Route semaphores to GuC for Gen12+
  2022-07-28  2:42 [PATCH 0/6] Random assortment of (mostly) GuC related patches John.C.Harrison
@ 2022-07-28  2:42 ` John.C.Harrison
  2022-07-28  2:42 ` [PATCH 2/6] drm/i915/guc: Fix issues with live_preempt_cancel John.C.Harrison
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: John.C.Harrison @ 2022-07-28  2:42 UTC (permalink / raw)
  To: Intel-GFX
  Cc: Matthew Brost, Daniele Ceraolo Spurio, Michał Winiarski,
	John Harrison, DRI-Devel

From: Michał Winiarski <michal.winiarski@intel.com>

In GuC submission mode, there is an option to use auto-switch out
semaphores and have GuC auto-switch in a waiting context. This
requires routing the semaphore interrupt to GuC.

Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_reg.h        |  4 ++++
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 ++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_reg.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_reg.h
index 8dc063f087eb1..a7092f711e9cd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_reg.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_reg.h
@@ -102,6 +102,10 @@
 #define   GUC_SEND_TRIGGER		  (1<<0)
 #define GEN11_GUC_HOST_INTERRUPT	_MMIO(0x1901f0)
 
+#define GEN12_GUC_SEM_INTR_ENABLES	_MMIO(0xc71c)
+#define   GUC_SEM_INTR_ROUTE_TO_GUC	BIT(31)
+#define   GUC_SEM_INTR_ENABLE_ALL	(0xff)
+
 #define GUC_NUM_DOORBELLS		256
 
 /* format of the HW-monitored doorbell cacheline */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 76916aed897ad..0b8c6450fa344 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4191,13 +4191,27 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
 
 void intel_guc_submission_enable(struct intel_guc *guc)
 {
+	struct intel_gt *gt = guc_to_gt(guc);
+
+	/* Enable and route to GuC */
+	if (GRAPHICS_VER(gt->i915) >= 12)
+		intel_uncore_write(gt->uncore, GEN12_GUC_SEM_INTR_ENABLES,
+				   GUC_SEM_INTR_ROUTE_TO_GUC |
+				   GUC_SEM_INTR_ENABLE_ALL);
+
 	guc_init_lrc_mapping(guc);
 	guc_init_engine_stats(guc);
 }
 
 void intel_guc_submission_disable(struct intel_guc *guc)
 {
+	struct intel_gt *gt = guc_to_gt(guc);
+
 	/* Note: By the time we're here, GuC may have already been reset */
+
+	/* Disable and route to host */
+	if (GRAPHICS_VER(gt->i915) >= 12)
+		intel_uncore_write(gt->uncore, GEN12_GUC_SEM_INTR_ENABLES, 0x0);
 }
 
 static bool __guc_submission_supported(struct intel_guc *guc)
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/6] drm/i915/guc: Fix issues with live_preempt_cancel
  2022-07-28  2:42 [PATCH 0/6] Random assortment of (mostly) GuC related patches John.C.Harrison
  2022-07-28  2:42 ` [PATCH 1/6] drm/i915/guc: Route semaphores to GuC for Gen12+ John.C.Harrison
@ 2022-07-28  2:42 ` John.C.Harrison
  2022-07-28  2:45   ` John Harrison
  2022-07-28  2:42 ` [PATCH 3/6] drm/i915/guc: Add selftest for a hung GuC John.C.Harrison
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: John.C.Harrison @ 2022-07-28  2:42 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Matthew Brost, John Harrison, DRI-Devel

From: Matthew Brost <matthew.brost@intel.com>

Having semaphores results in different behavior when a dependent request
is cancelled. In the case of semaphores the request could be on the HW
and complete successfully while without the request is held in the
driver and the error from the dependent request is propagated. Fix
live_preempt_cancel to take this behavior into account.

Also update live_preempt_cancel to use new function intel_context_ban
rather than intel_context_set_banned.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/selftest_execlists.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index 02fc97a0ab502..015f8cd3463e2 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -2087,7 +2087,7 @@ static int __cancel_active0(struct live_preempt_cancel *arg)
 		goto out;
 	}
 
-	intel_context_set_banned(rq->context);
+	intel_context_ban(rq->context, rq);
 	err = intel_engine_pulse(arg->engine);
 	if (err)
 		goto out;
@@ -2146,7 +2146,7 @@ static int __cancel_active1(struct live_preempt_cancel *arg)
 	if (err)
 		goto out;
 
-	intel_context_set_banned(rq[1]->context);
+	intel_context_ban(rq[1]->context, rq[1]);
 	err = intel_engine_pulse(arg->engine);
 	if (err)
 		goto out;
@@ -2229,7 +2229,7 @@ static int __cancel_queued(struct live_preempt_cancel *arg)
 	if (err)
 		goto out;
 
-	intel_context_set_banned(rq[2]->context);
+	intel_context_ban(rq[2]->context, rq[2]);
 	err = intel_engine_pulse(arg->engine);
 	if (err)
 		goto out;
@@ -2244,7 +2244,13 @@ static int __cancel_queued(struct live_preempt_cancel *arg)
 		goto out;
 	}
 
-	if (rq[1]->fence.error != 0) {
+	/*
+	 * The behavior between having semaphores and not is different. With
+	 * semaphores the subsequent request is on the hardware and not cancelled
+	 * while without the request is held in the driver and cancelled.
+	 */
+	if (intel_engine_has_semaphores(rq[1]->engine) &&
+	    rq[1]->fence.error != 0) {
 		pr_err("Normal inflight1 request did not complete\n");
 		err = -EINVAL;
 		goto out;
@@ -2292,7 +2298,7 @@ static int __cancel_hostile(struct live_preempt_cancel *arg)
 		goto out;
 	}
 
-	intel_context_set_banned(rq->context);
+	intel_context_ban(rq->context, rq);
 	err = intel_engine_pulse(arg->engine); /* force reset */
 	if (err)
 		goto out;
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/6] drm/i915/guc: Add selftest for a hung GuC
  2022-07-28  2:42 [PATCH 0/6] Random assortment of (mostly) GuC related patches John.C.Harrison
  2022-07-28  2:42 ` [PATCH 1/6] drm/i915/guc: Route semaphores to GuC for Gen12+ John.C.Harrison
  2022-07-28  2:42 ` [PATCH 2/6] drm/i915/guc: Fix issues with live_preempt_cancel John.C.Harrison
@ 2022-07-28  2:42 ` John.C.Harrison
  2022-07-28 18:21   ` John Harrison
  2022-07-28  2:42 ` [PATCH 4/6] drm/i915/selftest: Cope with not having an RCS engine John.C.Harrison
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: John.C.Harrison @ 2022-07-28  2:42 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel, Rahul Kumar Singh

From: Rahul Kumar Singh <rahul.kumar.singh@intel.com>

Add a test to check that the hangcheck will recover from a submission
hang in the GuC.

Signed-off-by: Rahul Kumar Singh <rahul.kumar.singh@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   1 +
 .../drm/i915/gt/uc/selftest_guc_hangcheck.c   | 159 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 3 files changed, 161 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0b8c6450fa344..ff205c4125857 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -5177,4 +5177,5 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_guc.c"
 #include "selftest_guc_multi_lrc.c"
+#include "selftest_guc_hangcheck.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
new file mode 100644
index 0000000000000..af913c4b09d37
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright �� 2019 Intel Corporation
+ */
+
+#include "selftests/igt_spinner.h"
+#include "selftests/igt_reset.h"
+#include "selftests/intel_scheduler_helpers.h"
+#include "gt/intel_engine_heartbeat.h"
+#include "gem/selftests/mock_context.h"
+
+#define BEAT_INTERVAL	100
+
+static struct i915_request *nop_request(struct intel_engine_cs *engine)
+{
+	struct i915_request *rq;
+
+	rq = intel_engine_create_kernel_request(engine);
+	if (IS_ERR(rq))
+		return rq;
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static int intel_hang_guc(void *arg)
+{
+	struct intel_gt *gt = arg;
+	int ret = 0;
+	struct i915_gem_context *ctx;
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	intel_wakeref_t wakeref;
+	struct i915_gpu_error *global = &gt->i915->gpu_error;
+	struct intel_engine_cs *engine;
+	unsigned int reset_count;
+	u32 guc_status;
+	u32 old_beat;
+
+	ctx = kernel_context(gt->i915, NULL);
+	if (IS_ERR(ctx)) {
+		pr_err("Failed get kernel context: %ld\n", PTR_ERR(ctx));
+		return PTR_ERR(ctx);
+	}
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+
+	ce = intel_context_create(gt->engine[BCS0]);
+	if (IS_ERR(ce)) {
+		ret = PTR_ERR(ce);
+		pr_err("Failed to create spinner request: %d\n", ret);
+		goto err;
+	}
+
+	engine = ce->engine;
+	reset_count = i915_reset_count(global);
+
+	old_beat = engine->props.heartbeat_interval_ms;
+	ret = intel_engine_set_heartbeat(engine, BEAT_INTERVAL);
+	if (ret) {
+		pr_err("Failed to boost heatbeat interval: %d\n", ret);
+		goto err;
+	}
+
+	ret = igt_spinner_init(&spin, engine->gt);
+	if (ret) {
+		pr_err("Failed to create spinner: %d\n", ret);
+		goto err;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_NOOP);
+	intel_context_put(ce);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		pr_err("Failed to create spinner request: %d\n", ret);
+		goto err_spin;
+	}
+
+	ret = request_add_spin(rq, &spin);
+	if (ret) {
+		i915_request_put(rq);
+		pr_err("Failed to add Spinner request: %d\n", ret);
+		goto err_spin;
+	}
+
+	ret = intel_reset_guc(gt);
+	if (ret) {
+		i915_request_put(rq);
+		pr_err("Failed to reset GuC, ret = %d\n", ret);
+		goto err_spin;
+	}
+
+	guc_status = intel_uncore_read(gt->uncore, GUC_STATUS);
+	if (!(guc_status & GS_MIA_IN_RESET)) {
+		i915_request_put(rq);
+		pr_err("GuC failed to reset: status = 0x%08X\n", guc_status);
+		ret = -EIO;
+		goto err_spin;
+	}
+
+	/* Wait for the heartbeat to cause a reset */
+	ret = intel_selftest_wait_for_rq(rq);
+	i915_request_put(rq);
+	if (ret) {
+		pr_err("Request failed to complete: %d\n", ret);
+		goto err_spin;
+	}
+
+	if (i915_reset_count(global) == reset_count) {
+		pr_err("Failed to record a GPU reset\n");
+		ret = -EINVAL;
+		goto err_spin;
+	}
+
+err_spin:
+	igt_spinner_end(&spin);
+	igt_spinner_fini(&spin);
+	intel_engine_set_heartbeat(engine, old_beat);
+
+	if (ret == 0) {
+		rq = nop_request(engine);
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			goto err;
+		}
+
+		ret = intel_selftest_wait_for_rq(rq);
+		i915_request_put(rq);
+		if (ret) {
+			pr_err("No-op failed to complete: %d\n", ret);
+			goto err;
+		}
+	}
+
+err:
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+	kernel_context_close(ctx);
+
+	return ret;
+}
+
+int intel_guc_hang_check(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_hang_guc),
+	};
+	struct intel_gt *gt = to_gt(i915);
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index bdd290f2bf3cd..aaf8a380e5c78 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -49,5 +49,6 @@ selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
 selftest(guc, intel_guc_live_selftests)
 selftest(guc_multi_lrc, intel_guc_multi_lrc_live_selftests)
+selftest(guc_hang, intel_guc_hang_check)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/6] drm/i915/selftest: Cope with not having an RCS engine
  2022-07-28  2:42 [PATCH 0/6] Random assortment of (mostly) GuC related patches John.C.Harrison
                   ` (2 preceding siblings ...)
  2022-07-28  2:42 ` [PATCH 3/6] drm/i915/guc: Add selftest for a hung GuC John.C.Harrison
@ 2022-07-28  2:42 ` John.C.Harrison
  2022-07-28  2:42 ` [PATCH 5/6] drm/i915/guc: Support larger contexts on newer hardware John.C.Harrison
  2022-07-28  2:42 ` [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status John.C.Harrison
  5 siblings, 0 replies; 16+ messages in thread
From: John.C.Harrison @ 2022-07-28  2:42 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Matthew Brost, John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

It is no longer guaranteed that there will always be an RCS engine.
So, use the helper function for finding the first available engine that
can be used for general purpose selftets.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 6493265d5f642..7f3bb1d34dfbf 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -1302,13 +1302,15 @@ static int igt_reset_wait(void *arg)
 {
 	struct intel_gt *gt = arg;
 	struct i915_gpu_error *global = &gt->i915->gpu_error;
-	struct intel_engine_cs *engine = gt->engine[RCS0];
+	struct intel_engine_cs *engine;
 	struct i915_request *rq;
 	unsigned int reset_count;
 	struct hang h;
 	long timeout;
 	int err;
 
+	engine = intel_selftest_find_any_engine(gt);
+
 	if (!engine || !intel_engine_can_store_dword(engine))
 		return 0;
 
@@ -1432,7 +1434,7 @@ static int __igt_reset_evict_vma(struct intel_gt *gt,
 				 int (*fn)(void *),
 				 unsigned int flags)
 {
-	struct intel_engine_cs *engine = gt->engine[RCS0];
+	struct intel_engine_cs *engine;
 	struct drm_i915_gem_object *obj;
 	struct task_struct *tsk = NULL;
 	struct i915_request *rq;
@@ -1444,6 +1446,8 @@ static int __igt_reset_evict_vma(struct intel_gt *gt,
 	if (!gt->ggtt->num_fences && flags & EXEC_OBJECT_NEEDS_FENCE)
 		return 0;
 
+	engine = intel_selftest_find_any_engine(gt);
+
 	if (!engine || !intel_engine_can_store_dword(engine))
 		return 0;
 
@@ -1819,12 +1823,14 @@ static int igt_handle_error(void *arg)
 {
 	struct intel_gt *gt = arg;
 	struct i915_gpu_error *global = &gt->i915->gpu_error;
-	struct intel_engine_cs *engine = gt->engine[RCS0];
+	struct intel_engine_cs *engine;
 	struct hang h;
 	struct i915_request *rq;
 	struct i915_gpu_coredump *error;
 	int err;
 
+	engine = intel_selftest_find_any_engine(gt);
+
 	/* Check that we can issue a global GPU and engine reset */
 
 	if (!intel_has_reset_engine(gt))
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/6] drm/i915/guc: Support larger contexts on newer hardware
  2022-07-28  2:42 [PATCH 0/6] Random assortment of (mostly) GuC related patches John.C.Harrison
                   ` (3 preceding siblings ...)
  2022-07-28  2:42 ` [PATCH 4/6] drm/i915/selftest: Cope with not having an RCS engine John.C.Harrison
@ 2022-07-28  2:42 ` John.C.Harrison
  2022-07-28  2:46   ` John Harrison
  2022-07-28  2:42 ` [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status John.C.Harrison
  5 siblings, 1 reply; 16+ messages in thread
From: John.C.Harrison @ 2022-07-28  2:42 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Matthew Brost, John Harrison, DRI-Devel

From: Matthew Brost <matthew.brost@intel.com>

The GuC needs a copy of a golden context for implementing watchdog
resets (aka media resets). This context is larger on newer platforms.
So adjust the size being allocated/copied accordingly.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index ba7541f3ca610..74cbe8eaf5318 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -464,7 +464,11 @@ static void fill_engine_enable_masks(struct intel_gt *gt,
 }
 
 #define LR_HW_CONTEXT_SIZE (80 * sizeof(u32))
-#define LRC_SKIP_SIZE (LRC_PPHWSP_SZ * PAGE_SIZE + LR_HW_CONTEXT_SIZE)
+#define XEHP_LR_HW_CONTEXT_SIZE (96 * sizeof(u32))
+#define LR_HW_CONTEXT_SZ(i915) (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50) ? \
+				    XEHP_LR_HW_CONTEXT_SIZE : \
+				    LR_HW_CONTEXT_SIZE)
+#define LRC_SKIP_SIZE(i915) (LRC_PPHWSP_SZ * PAGE_SIZE + LR_HW_CONTEXT_SZ(i915))
 static int guc_prep_golden_context(struct intel_guc *guc)
 {
 	struct intel_gt *gt = guc_to_gt(guc);
@@ -525,7 +529,7 @@ static int guc_prep_golden_context(struct intel_guc *guc)
 		 * on all engines).
 		 */
 		ads_blob_write(guc, ads.eng_state_size[guc_class],
-			       real_size - LRC_SKIP_SIZE);
+			       real_size - LRC_SKIP_SIZE(gt->i915));
 		ads_blob_write(guc, ads.golden_context_lrca[guc_class],
 			       addr_ggtt);
 
@@ -599,7 +603,7 @@ static void guc_init_golden_context(struct intel_guc *guc)
 		}
 
 		GEM_BUG_ON(ads_blob_read(guc, ads.eng_state_size[guc_class]) !=
-			   real_size - LRC_SKIP_SIZE);
+			   real_size - LRC_SKIP_SIZE(gt->i915));
 		GEM_BUG_ON(ads_blob_read(guc, ads.golden_context_lrca[guc_class]) != addr_ggtt);
 
 		addr_ggtt += alloc_size;
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status
  2022-07-28  2:42 [PATCH 0/6] Random assortment of (mostly) GuC related patches John.C.Harrison
                   ` (4 preceding siblings ...)
  2022-07-28  2:42 ` [PATCH 5/6] drm/i915/guc: Support larger contexts on newer hardware John.C.Harrison
@ 2022-07-28  2:42 ` John.C.Harrison
  2022-07-28 19:06   ` [Intel-gfx] " Michal Wajdeczko
  2022-07-29  0:00   ` Ceraolo Spurio, Daniele
  5 siblings, 2 replies; 16+ messages in thread
From: John.C.Harrison @ 2022-07-28  2:42 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

When the KMD sends a CLIENT_RESET request to GuC (as part of the
suspend sequence), GuC will mark the CTB buffer as 'UNUSED'. If the
KMD then checked the CTB queue, it would see a non-zero status value
and report the buffer as corrupted.

Technically, no G2H messages should be received once the CLIENT_RESET
has been sent. However, if a context was outstanding on an engine then
it would get reset and a reset notification would be sent. So, don't
actually treat UNUSED as a catastrophic error. Just flag it up as
unexpected and keep going.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../i915/gt/uc/abi/guc_communication_ctb_abi.h |  8 +++++---
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c      | 18 ++++++++++++++++--
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
index df83c1cc7c7a6..28b8387f97b77 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
@@ -37,6 +37,7 @@
  *  |   |       |   - _`GUC_CTB_STATUS_OVERFLOW` = 1 (head/tail too large)     |
  *  |   |       |   - _`GUC_CTB_STATUS_UNDERFLOW` = 2 (truncated message)      |
  *  |   |       |   - _`GUC_CTB_STATUS_MISMATCH` = 4 (head/tail modified)      |
+ *  |   |       |   - _`GUC_CTB_STATUS_UNUSED` = 8 (CTB is not in use)         |
  *  +---+-------+--------------------------------------------------------------+
  *  |...|       | RESERVED = MBZ                                               |
  *  +---+-------+--------------------------------------------------------------+
@@ -49,9 +50,10 @@ struct guc_ct_buffer_desc {
 	u32 tail;
 	u32 status;
 #define GUC_CTB_STATUS_NO_ERROR				0
-#define GUC_CTB_STATUS_OVERFLOW				(1 << 0)
-#define GUC_CTB_STATUS_UNDERFLOW			(1 << 1)
-#define GUC_CTB_STATUS_MISMATCH				(1 << 2)
+#define GUC_CTB_STATUS_OVERFLOW				BIT(0)
+#define GUC_CTB_STATUS_UNDERFLOW			BIT(1)
+#define GUC_CTB_STATUS_MISMATCH				BIT(2)
+#define GUC_CTB_STATUS_UNUSED				BIT(3)
 	u32 reserved[13];
 } __packed;
 static_assert(sizeof(struct guc_ct_buffer_desc) == 64);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index f01325cd1b625..11b5d4ddb19ce 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -816,8 +816,22 @@ static int ct_read(struct intel_guc_ct *ct, struct ct_incoming_msg **msg)
 	if (unlikely(ctb->broken))
 		return -EPIPE;
 
-	if (unlikely(desc->status))
-		goto corrupted;
+	if (unlikely(desc->status)) {
+		u32 status = desc->status;
+
+		if (status & GUC_CTB_STATUS_UNUSED) {
+			/*
+			 * Potentially valid if a CLIENT_RESET request resulted in
+			 * contexts/engines being reset. But should never happen as
+			 * no contexts should be active when CLIENT_RESET is sent.
+			 */
+			CT_ERROR(ct, "Unexpected G2H after GuC has stopped!\n");
+			status &= ~GUC_CTB_STATUS_UNUSED;
+		}
+
+		if (status)
+			goto corrupted;
+	}
 
 	GEM_BUG_ON(head > size);
 
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/6] drm/i915/guc: Fix issues with live_preempt_cancel
  2022-07-28  2:42 ` [PATCH 2/6] drm/i915/guc: Fix issues with live_preempt_cancel John.C.Harrison
@ 2022-07-28  2:45   ` John Harrison
  0 siblings, 0 replies; 16+ messages in thread
From: John Harrison @ 2022-07-28  2:45 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Matthew Brost, DRI-Devel

On 7/27/2022 19:42, John.C.Harrison@Intel.com wrote:
> From: Matthew Brost <matthew.brost@intel.com>
>
> Having semaphores results in different behavior when a dependent request
> is cancelled. In the case of semaphores the request could be on the HW
> and complete successfully while without the request is held in the
> driver and the error from the dependent request is propagated. Fix
> live_preempt_cancel to take this behavior into account.
>
> Also update live_preempt_cancel to use new function intel_context_ban
> rather than intel_context_set_banned.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   drivers/gpu/drm/i915/gt/selftest_execlists.c | 16 +++++++++++-----
>   1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> index 02fc97a0ab502..015f8cd3463e2 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> @@ -2087,7 +2087,7 @@ static int __cancel_active0(struct live_preempt_cancel *arg)
>   		goto out;
>   	}
>   
> -	intel_context_set_banned(rq->context);
> +	intel_context_ban(rq->context, rq);
>   	err = intel_engine_pulse(arg->engine);
>   	if (err)
>   		goto out;
> @@ -2146,7 +2146,7 @@ static int __cancel_active1(struct live_preempt_cancel *arg)
>   	if (err)
>   		goto out;
>   
> -	intel_context_set_banned(rq[1]->context);
> +	intel_context_ban(rq[1]->context, rq[1]);
>   	err = intel_engine_pulse(arg->engine);
>   	if (err)
>   		goto out;
> @@ -2229,7 +2229,7 @@ static int __cancel_queued(struct live_preempt_cancel *arg)
>   	if (err)
>   		goto out;
>   
> -	intel_context_set_banned(rq[2]->context);
> +	intel_context_ban(rq[2]->context, rq[2]);
>   	err = intel_engine_pulse(arg->engine);
>   	if (err)
>   		goto out;
> @@ -2244,7 +2244,13 @@ static int __cancel_queued(struct live_preempt_cancel *arg)
>   		goto out;
>   	}
>   
> -	if (rq[1]->fence.error != 0) {
> +	/*
> +	 * The behavior between having semaphores and not is different. With
> +	 * semaphores the subsequent request is on the hardware and not cancelled
> +	 * while without the request is held in the driver and cancelled.
> +	 */
> +	if (intel_engine_has_semaphores(rq[1]->engine) &&
> +	    rq[1]->fence.error != 0) {
>   		pr_err("Normal inflight1 request did not complete\n");
>   		err = -EINVAL;
>   		goto out;
> @@ -2292,7 +2298,7 @@ static int __cancel_hostile(struct live_preempt_cancel *arg)
>   		goto out;
>   	}
>   
> -	intel_context_set_banned(rq->context);
> +	intel_context_ban(rq->context, rq);
>   	err = intel_engine_pulse(arg->engine); /* force reset */
>   	if (err)
>   		goto out;


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/6] drm/i915/guc: Support larger contexts on newer hardware
  2022-07-28  2:42 ` [PATCH 5/6] drm/i915/guc: Support larger contexts on newer hardware John.C.Harrison
@ 2022-07-28  2:46   ` John Harrison
  0 siblings, 0 replies; 16+ messages in thread
From: John Harrison @ 2022-07-28  2:46 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Matthew Brost, DRI-Devel

On 7/27/2022 19:42, John.C.Harrison@Intel.com wrote:
> From: Matthew Brost <matthew.brost@intel.com>
>
> The GuC needs a copy of a golden context for implementing watchdog
> resets (aka media resets). This context is larger on newer platforms.
> So adjust the size being allocated/copied accordingly.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c | 10 +++++++---
>   1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index ba7541f3ca610..74cbe8eaf5318 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -464,7 +464,11 @@ static void fill_engine_enable_masks(struct intel_gt *gt,
>   }
>   
>   #define LR_HW_CONTEXT_SIZE (80 * sizeof(u32))
> -#define LRC_SKIP_SIZE (LRC_PPHWSP_SZ * PAGE_SIZE + LR_HW_CONTEXT_SIZE)
> +#define XEHP_LR_HW_CONTEXT_SIZE (96 * sizeof(u32))
> +#define LR_HW_CONTEXT_SZ(i915) (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50) ? \
> +				    XEHP_LR_HW_CONTEXT_SIZE : \
> +				    LR_HW_CONTEXT_SIZE)
> +#define LRC_SKIP_SIZE(i915) (LRC_PPHWSP_SZ * PAGE_SIZE + LR_HW_CONTEXT_SZ(i915))
>   static int guc_prep_golden_context(struct intel_guc *guc)
>   {
>   	struct intel_gt *gt = guc_to_gt(guc);
> @@ -525,7 +529,7 @@ static int guc_prep_golden_context(struct intel_guc *guc)
>   		 * on all engines).
>   		 */
>   		ads_blob_write(guc, ads.eng_state_size[guc_class],
> -			       real_size - LRC_SKIP_SIZE);
> +			       real_size - LRC_SKIP_SIZE(gt->i915));
>   		ads_blob_write(guc, ads.golden_context_lrca[guc_class],
>   			       addr_ggtt);
>   
> @@ -599,7 +603,7 @@ static void guc_init_golden_context(struct intel_guc *guc)
>   		}
>   
>   		GEM_BUG_ON(ads_blob_read(guc, ads.eng_state_size[guc_class]) !=
> -			   real_size - LRC_SKIP_SIZE);
> +			   real_size - LRC_SKIP_SIZE(gt->i915));
>   		GEM_BUG_ON(ads_blob_read(guc, ads.golden_context_lrca[guc_class]) != addr_ggtt);
>   
>   		addr_ggtt += alloc_size;


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] drm/i915/guc: Add selftest for a hung GuC
  2022-07-28  2:42 ` [PATCH 3/6] drm/i915/guc: Add selftest for a hung GuC John.C.Harrison
@ 2022-07-28 18:21   ` John Harrison
  2022-07-28 18:26     ` John.C.Harrison
  0 siblings, 1 reply; 16+ messages in thread
From: John Harrison @ 2022-07-28 18:21 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel, Rahul Kumar Singh

On 7/27/2022 19:42, John.C.Harrison@Intel.com wrote:
> From: Rahul Kumar Singh <rahul.kumar.singh@intel.com>
>
> Add a test to check that the hangcheck will recover from a submission
> hang in the GuC.
>
> Signed-off-by: Rahul Kumar Singh <rahul.kumar.singh@intel.com>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   1 +
>   .../drm/i915/gt/uc/selftest_guc_hangcheck.c   | 159 ++++++++++++++++++
>   .../drm/i915/selftests/i915_live_selftests.h  |   1 +
>   3 files changed, 161 insertions(+)
>   create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 0b8c6450fa344..ff205c4125857 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -5177,4 +5177,5 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
>   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>   #include "selftest_guc.c"
>   #include "selftest_guc_multi_lrc.c"
> +#include "selftest_guc_hangcheck.c"
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
> new file mode 100644
> index 0000000000000..af913c4b09d37
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
> @@ -0,0 +1,159 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright �� 2019 Intel Corporation
Need to update the date.

> + */
> +
> +#include "selftests/igt_spinner.h"
> +#include "selftests/igt_reset.h"
> +#include "selftests/intel_scheduler_helpers.h"
> +#include "gt/intel_engine_heartbeat.h"
> +#include "gem/selftests/mock_context.h"
> +
> +#define BEAT_INTERVAL	100
> +
> +static struct i915_request *nop_request(struct intel_engine_cs *engine)
> +{
> +	struct i915_request *rq;
> +
> +	rq = intel_engine_create_kernel_request(engine);
> +	if (IS_ERR(rq))
> +		return rq;
> +
> +	i915_request_get(rq);
> +	i915_request_add(rq);
> +
> +	return rq;
> +}
> +
> +static int intel_hang_guc(void *arg)
> +{
> +	struct intel_gt *gt = arg;
> +	int ret = 0;
> +	struct i915_gem_context *ctx;
> +	struct intel_context *ce;
> +	struct igt_spinner spin;
> +	struct i915_request *rq;
> +	intel_wakeref_t wakeref;
> +	struct i915_gpu_error *global = &gt->i915->gpu_error;
> +	struct intel_engine_cs *engine;
> +	unsigned int reset_count;
> +	u32 guc_status;
> +	u32 old_beat;
> +
> +	ctx = kernel_context(gt->i915, NULL);
> +	if (IS_ERR(ctx)) {
> +		pr_err("Failed get kernel context: %ld\n", PTR_ERR(ctx));
Should not use pr_err when drm_err is available.

John.

> +		return PTR_ERR(ctx);
> +	}
> +
> +	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
> +
> +	ce = intel_context_create(gt->engine[BCS0]);
> +	if (IS_ERR(ce)) {
> +		ret = PTR_ERR(ce);
> +		pr_err("Failed to create spinner request: %d\n", ret);
> +		goto err;
> +	}
> +
> +	engine = ce->engine;
> +	reset_count = i915_reset_count(global);
> +
> +	old_beat = engine->props.heartbeat_interval_ms;
> +	ret = intel_engine_set_heartbeat(engine, BEAT_INTERVAL);
> +	if (ret) {
> +		pr_err("Failed to boost heatbeat interval: %d\n", ret);
> +		goto err;
> +	}
> +
> +	ret = igt_spinner_init(&spin, engine->gt);
> +	if (ret) {
> +		pr_err("Failed to create spinner: %d\n", ret);
> +		goto err;
> +	}
> +
> +	rq = igt_spinner_create_request(&spin, ce, MI_NOOP);
> +	intel_context_put(ce);
> +	if (IS_ERR(rq)) {
> +		ret = PTR_ERR(rq);
> +		pr_err("Failed to create spinner request: %d\n", ret);
> +		goto err_spin;
> +	}
> +
> +	ret = request_add_spin(rq, &spin);
> +	if (ret) {
> +		i915_request_put(rq);
> +		pr_err("Failed to add Spinner request: %d\n", ret);
> +		goto err_spin;
> +	}
> +
> +	ret = intel_reset_guc(gt);
> +	if (ret) {
> +		i915_request_put(rq);
> +		pr_err("Failed to reset GuC, ret = %d\n", ret);
> +		goto err_spin;
> +	}
> +
> +	guc_status = intel_uncore_read(gt->uncore, GUC_STATUS);
> +	if (!(guc_status & GS_MIA_IN_RESET)) {
> +		i915_request_put(rq);
> +		pr_err("GuC failed to reset: status = 0x%08X\n", guc_status);
> +		ret = -EIO;
> +		goto err_spin;
> +	}
> +
> +	/* Wait for the heartbeat to cause a reset */
> +	ret = intel_selftest_wait_for_rq(rq);
> +	i915_request_put(rq);
> +	if (ret) {
> +		pr_err("Request failed to complete: %d\n", ret);
> +		goto err_spin;
> +	}
> +
> +	if (i915_reset_count(global) == reset_count) {
> +		pr_err("Failed to record a GPU reset\n");
> +		ret = -EINVAL;
> +		goto err_spin;
> +	}
> +
> +err_spin:
> +	igt_spinner_end(&spin);
> +	igt_spinner_fini(&spin);
> +	intel_engine_set_heartbeat(engine, old_beat);
> +
> +	if (ret == 0) {
> +		rq = nop_request(engine);
> +		if (IS_ERR(rq)) {
> +			ret = PTR_ERR(rq);
> +			goto err;
> +		}
> +
> +		ret = intel_selftest_wait_for_rq(rq);
> +		i915_request_put(rq);
> +		if (ret) {
> +			pr_err("No-op failed to complete: %d\n", ret);
> +			goto err;
> +		}
> +	}
> +
> +err:
> +	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
> +	kernel_context_close(ctx);
> +
> +	return ret;
> +}
> +
> +int intel_guc_hang_check(struct drm_i915_private *i915)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(intel_hang_guc),
> +	};
> +	struct intel_gt *gt = to_gt(i915);
> +
> +	if (intel_gt_is_wedged(gt))
> +		return 0;
> +
> +	if (!intel_uc_uses_guc_submission(&gt->uc))
> +		return 0;
> +
> +	return intel_gt_live_subtests(tests, gt);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> index bdd290f2bf3cd..aaf8a380e5c78 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> @@ -49,5 +49,6 @@ selftest(perf, i915_perf_live_selftests)
>   selftest(slpc, intel_slpc_live_selftests)
>   selftest(guc, intel_guc_live_selftests)
>   selftest(guc_multi_lrc, intel_guc_multi_lrc_live_selftests)
> +selftest(guc_hang, intel_guc_hang_check)
>   /* Here be dragons: keep last to run last! */
>   selftest(late_gt_pm, intel_gt_pm_late_selftests)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 3/6] drm/i915/guc: Add selftest for a hung GuC
  2022-07-28 18:21   ` John Harrison
@ 2022-07-28 18:26     ` John.C.Harrison
  2022-07-28 18:33       ` John Harrison
  0 siblings, 1 reply; 16+ messages in thread
From: John.C.Harrison @ 2022-07-28 18:26 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel, Rahul Kumar Singh

From: Rahul Kumar Singh <rahul.kumar.singh@intel.com>

Add a test to check that the hangcheck will recover from a submission
hang in the GuC.

Signed-off-by: Rahul Kumar Singh <rahul.kumar.singh@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   1 +
 .../drm/i915/gt/uc/selftest_guc_hangcheck.c   | 159 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 3 files changed, 161 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0b8c6450fa344..ff205c4125857 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -5177,4 +5177,5 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_guc.c"
 #include "selftest_guc_multi_lrc.c"
+#include "selftest_guc_hangcheck.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
new file mode 100644
index 0000000000000..01f8cd3c31340
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#include "selftests/igt_spinner.h"
+#include "selftests/igt_reset.h"
+#include "selftests/intel_scheduler_helpers.h"
+#include "gt/intel_engine_heartbeat.h"
+#include "gem/selftests/mock_context.h"
+
+#define BEAT_INTERVAL	100
+
+static struct i915_request *nop_request(struct intel_engine_cs *engine)
+{
+	struct i915_request *rq;
+
+	rq = intel_engine_create_kernel_request(engine);
+	if (IS_ERR(rq))
+		return rq;
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static int intel_hang_guc(void *arg)
+{
+	struct intel_gt *gt = arg;
+	int ret = 0;
+	struct i915_gem_context *ctx;
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	intel_wakeref_t wakeref;
+	struct i915_gpu_error *global = &gt->i915->gpu_error;
+	struct intel_engine_cs *engine;
+	unsigned int reset_count;
+	u32 guc_status;
+	u32 old_beat;
+
+	ctx = kernel_context(gt->i915, NULL);
+	if (IS_ERR(ctx)) {
+		drm_err(&gt->i915->drm, "Failed get kernel context: %ld\n", PTR_ERR(ctx));
+		return PTR_ERR(ctx);
+	}
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+
+	ce = intel_context_create(gt->engine[BCS0]);
+	if (IS_ERR(ce)) {
+		ret = PTR_ERR(ce);
+		drm_err(&gt->i915->drm, "Failed to create spinner request: %d\n", ret);
+		goto err;
+	}
+
+	engine = ce->engine;
+	reset_count = i915_reset_count(global);
+
+	old_beat = engine->props.heartbeat_interval_ms;
+	ret = intel_engine_set_heartbeat(engine, BEAT_INTERVAL);
+	if (ret) {
+		drm_err(&gt->i915->drm, "Failed to boost heatbeat interval: %d\n", ret);
+		goto err;
+	}
+
+	ret = igt_spinner_init(&spin, engine->gt);
+	if (ret) {
+		drm_err(&gt->i915->drm, "Failed to create spinner: %d\n", ret);
+		goto err;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_NOOP);
+	intel_context_put(ce);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		drm_err(&gt->i915->drm, "Failed to create spinner request: %d\n", ret);
+		goto err_spin;
+	}
+
+	ret = request_add_spin(rq, &spin);
+	if (ret) {
+		i915_request_put(rq);
+		drm_err(&gt->i915->drm, "Failed to add Spinner request: %d\n", ret);
+		goto err_spin;
+	}
+
+	ret = intel_reset_guc(gt);
+	if (ret) {
+		i915_request_put(rq);
+		drm_err(&gt->i915->drm, "Failed to reset GuC, ret = %d\n", ret);
+		goto err_spin;
+	}
+
+	guc_status = intel_uncore_read(gt->uncore, GUC_STATUS);
+	if (!(guc_status & GS_MIA_IN_RESET)) {
+		i915_request_put(rq);
+		drm_err(&gt->i915->drm, "GuC failed to reset: status = 0x%08X\n", guc_status);
+		ret = -EIO;
+		goto err_spin;
+	}
+
+	/* Wait for the heartbeat to cause a reset */
+	ret = intel_selftest_wait_for_rq(rq);
+	i915_request_put(rq);
+	if (ret) {
+		drm_err(&gt->i915->drm, "Request failed to complete: %d\n", ret);
+		goto err_spin;
+	}
+
+	if (i915_reset_count(global) == reset_count) {
+		drm_err(&gt->i915->drm, "Failed to record a GPU reset\n");
+		ret = -EINVAL;
+		goto err_spin;
+	}
+
+err_spin:
+	igt_spinner_end(&spin);
+	igt_spinner_fini(&spin);
+	intel_engine_set_heartbeat(engine, old_beat);
+
+	if (ret == 0) {
+		rq = nop_request(engine);
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			goto err;
+		}
+
+		ret = intel_selftest_wait_for_rq(rq);
+		i915_request_put(rq);
+		if (ret) {
+			drm_err(&gt->i915->drm, "No-op failed to complete: %d\n", ret);
+			goto err;
+		}
+	}
+
+err:
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+	kernel_context_close(ctx);
+
+	return ret;
+}
+
+int intel_guc_hang_check(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_hang_guc),
+	};
+	struct intel_gt *gt = to_gt(i915);
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index bdd290f2bf3cd..aaf8a380e5c78 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -49,5 +49,6 @@ selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
 selftest(guc, intel_guc_live_selftests)
 selftest(guc_multi_lrc, intel_guc_multi_lrc_live_selftests)
+selftest(guc_hang, intel_guc_hang_check)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] drm/i915/guc: Add selftest for a hung GuC
  2022-07-28 18:26     ` John.C.Harrison
@ 2022-07-28 18:33       ` John Harrison
  0 siblings, 0 replies; 16+ messages in thread
From: John Harrison @ 2022-07-28 18:33 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel, Rahul Kumar Singh

On 7/28/2022 11:26, John.C.Harrison@Intel.com wrote:
> From: Rahul Kumar Singh <rahul.kumar.singh@intel.com>
>
> Add a test to check that the hangcheck will recover from a submission
> hang in the GuC.
>
> Signed-off-by: Rahul Kumar Singh <rahul.kumar.singh@intel.com>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   1 +
>   .../drm/i915/gt/uc/selftest_guc_hangcheck.c   | 159 ++++++++++++++++++
>   .../drm/i915/selftests/i915_live_selftests.h  |   1 +
>   3 files changed, 161 insertions(+)
>   create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 0b8c6450fa344..ff205c4125857 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -5177,4 +5177,5 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
>   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>   #include "selftest_guc.c"
>   #include "selftest_guc_multi_lrc.c"
> +#include "selftest_guc_hangcheck.c"
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
> new file mode 100644
> index 0000000000000..01f8cd3c31340
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_hangcheck.c
> @@ -0,0 +1,159 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +#include "selftests/igt_spinner.h"
> +#include "selftests/igt_reset.h"
> +#include "selftests/intel_scheduler_helpers.h"
> +#include "gt/intel_engine_heartbeat.h"
> +#include "gem/selftests/mock_context.h"
> +
> +#define BEAT_INTERVAL	100
> +
> +static struct i915_request *nop_request(struct intel_engine_cs *engine)
> +{
> +	struct i915_request *rq;
> +
> +	rq = intel_engine_create_kernel_request(engine);
> +	if (IS_ERR(rq))
> +		return rq;
> +
> +	i915_request_get(rq);
> +	i915_request_add(rq);
> +
> +	return rq;
> +}
> +
> +static int intel_hang_guc(void *arg)
> +{
> +	struct intel_gt *gt = arg;
> +	int ret = 0;
> +	struct i915_gem_context *ctx;
> +	struct intel_context *ce;
> +	struct igt_spinner spin;
> +	struct i915_request *rq;
> +	intel_wakeref_t wakeref;
> +	struct i915_gpu_error *global = &gt->i915->gpu_error;
> +	struct intel_engine_cs *engine;
> +	unsigned int reset_count;
> +	u32 guc_status;
> +	u32 old_beat;
> +
> +	ctx = kernel_context(gt->i915, NULL);
> +	if (IS_ERR(ctx)) {
> +		drm_err(&gt->i915->drm, "Failed get kernel context: %ld\n", PTR_ERR(ctx));
> +		return PTR_ERR(ctx);
> +	}
> +
> +	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
> +
> +	ce = intel_context_create(gt->engine[BCS0]);
> +	if (IS_ERR(ce)) {
> +		ret = PTR_ERR(ce);
> +		drm_err(&gt->i915->drm, "Failed to create spinner request: %d\n", ret);
> +		goto err;
> +	}
> +
> +	engine = ce->engine;
> +	reset_count = i915_reset_count(global);
> +
> +	old_beat = engine->props.heartbeat_interval_ms;
> +	ret = intel_engine_set_heartbeat(engine, BEAT_INTERVAL);
> +	if (ret) {
> +		drm_err(&gt->i915->drm, "Failed to boost heatbeat interval: %d\n", ret);
> +		goto err;
> +	}
> +
> +	ret = igt_spinner_init(&spin, engine->gt);
> +	if (ret) {
> +		drm_err(&gt->i915->drm, "Failed to create spinner: %d\n", ret);
> +		goto err;
> +	}
> +
> +	rq = igt_spinner_create_request(&spin, ce, MI_NOOP);
> +	intel_context_put(ce);
> +	if (IS_ERR(rq)) {
> +		ret = PTR_ERR(rq);
> +		drm_err(&gt->i915->drm, "Failed to create spinner request: %d\n", ret);
> +		goto err_spin;
> +	}
> +
> +	ret = request_add_spin(rq, &spin);
> +	if (ret) {
> +		i915_request_put(rq);
> +		drm_err(&gt->i915->drm, "Failed to add Spinner request: %d\n", ret);
> +		goto err_spin;
> +	}
> +
> +	ret = intel_reset_guc(gt);
> +	if (ret) {
> +		i915_request_put(rq);
> +		drm_err(&gt->i915->drm, "Failed to reset GuC, ret = %d\n", ret);
> +		goto err_spin;
> +	}
> +
> +	guc_status = intel_uncore_read(gt->uncore, GUC_STATUS);
> +	if (!(guc_status & GS_MIA_IN_RESET)) {
> +		i915_request_put(rq);
> +		drm_err(&gt->i915->drm, "GuC failed to reset: status = 0x%08X\n", guc_status);
> +		ret = -EIO;
> +		goto err_spin;
> +	}
> +
> +	/* Wait for the heartbeat to cause a reset */
> +	ret = intel_selftest_wait_for_rq(rq);
> +	i915_request_put(rq);
> +	if (ret) {
> +		drm_err(&gt->i915->drm, "Request failed to complete: %d\n", ret);
> +		goto err_spin;
> +	}
> +
> +	if (i915_reset_count(global) == reset_count) {
> +		drm_err(&gt->i915->drm, "Failed to record a GPU reset\n");
> +		ret = -EINVAL;
> +		goto err_spin;
> +	}
> +
> +err_spin:
> +	igt_spinner_end(&spin);
> +	igt_spinner_fini(&spin);
> +	intel_engine_set_heartbeat(engine, old_beat);
> +
> +	if (ret == 0) {
> +		rq = nop_request(engine);
> +		if (IS_ERR(rq)) {
> +			ret = PTR_ERR(rq);
> +			goto err;
> +		}
> +
> +		ret = intel_selftest_wait_for_rq(rq);
> +		i915_request_put(rq);
> +		if (ret) {
> +			drm_err(&gt->i915->drm, "No-op failed to complete: %d\n", ret);
> +			goto err;
> +		}
> +	}
> +
> +err:
> +	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
> +	kernel_context_close(ctx);
> +
> +	return ret;
> +}
> +
> +int intel_guc_hang_check(struct drm_i915_private *i915)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(intel_hang_guc),
> +	};
> +	struct intel_gt *gt = to_gt(i915);
> +
> +	if (intel_gt_is_wedged(gt))
> +		return 0;
> +
> +	if (!intel_uc_uses_guc_submission(&gt->uc))
> +		return 0;
> +
> +	return intel_gt_live_subtests(tests, gt);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> index bdd290f2bf3cd..aaf8a380e5c78 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> @@ -49,5 +49,6 @@ selftest(perf, i915_perf_live_selftests)
>   selftest(slpc, intel_slpc_live_selftests)
>   selftest(guc, intel_guc_live_selftests)
>   selftest(guc_multi_lrc, intel_guc_multi_lrc_live_selftests)
> +selftest(guc_hang, intel_guc_hang_check)
>   /* Here be dragons: keep last to run last! */
>   selftest(late_gt_pm, intel_gt_pm_late_selftests)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-gfx] [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status
  2022-07-28  2:42 ` [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status John.C.Harrison
@ 2022-07-28 19:06   ` Michal Wajdeczko
  2022-07-28 19:38     ` John Harrison
  2022-07-29  0:00   ` Ceraolo Spurio, Daniele
  1 sibling, 1 reply; 16+ messages in thread
From: Michal Wajdeczko @ 2022-07-28 19:06 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel



On 28.07.2022 04:42, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> When the KMD sends a CLIENT_RESET request to GuC (as part of the
> suspend sequence), GuC will mark the CTB buffer as 'UNUSED'. If the

hmm, GuC shouldn't do that on CLIENT_RESET, GuC shall only mark CTB as
UNUSED when we explicitly disable CTB using CONTROL_CTB as only then CTB
descriptors are known to be valid

> KMD then checked the CTB queue, it would see a non-zero status value
> and report the buffer as corrupted.
> 
> Technically, no G2H messages should be received once the CLIENT_RESET
> has been sent. However, if a context was outstanding on an engine then
> it would get reset and a reset notification would be sent. So, don't
> actually treat UNUSED as a catastrophic error. Just flag it up as
> unexpected and keep going.

we should have already marked locally that CTB is disabled, either as
part of the explicit disabling of CTB with CONTROL_CTB, or implicit due
to issued CLIENT_RESET, but in both cases we shouldn't try to read CTB
any more, even it there are any outstanding messages ...

is this due to a race with ct->enabled ?

> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>  .../i915/gt/uc/abi/guc_communication_ctb_abi.h |  8 +++++---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c      | 18 ++++++++++++++++--
>  2 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
> index df83c1cc7c7a6..28b8387f97b77 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
> @@ -37,6 +37,7 @@
>   *  |   |       |   - _`GUC_CTB_STATUS_OVERFLOW` = 1 (head/tail too large)     |
>   *  |   |       |   - _`GUC_CTB_STATUS_UNDERFLOW` = 2 (truncated message)      |
>   *  |   |       |   - _`GUC_CTB_STATUS_MISMATCH` = 4 (head/tail modified)      |
> + *  |   |       |   - _`GUC_CTB_STATUS_UNUSED` = 8 (CTB is not in use)         |
>   *  +---+-------+--------------------------------------------------------------+
>   *  |...|       | RESERVED = MBZ                                               |
>   *  +---+-------+--------------------------------------------------------------+
> @@ -49,9 +50,10 @@ struct guc_ct_buffer_desc {
>  	u32 tail;
>  	u32 status;
>  #define GUC_CTB_STATUS_NO_ERROR				0
> -#define GUC_CTB_STATUS_OVERFLOW				(1 << 0)
> -#define GUC_CTB_STATUS_UNDERFLOW			(1 << 1)
> -#define GUC_CTB_STATUS_MISMATCH				(1 << 2)
> +#define GUC_CTB_STATUS_OVERFLOW				BIT(0)
> +#define GUC_CTB_STATUS_UNDERFLOW			BIT(1)
> +#define GUC_CTB_STATUS_MISMATCH				BIT(2)
> +#define GUC_CTB_STATUS_UNUSED				BIT(3)

nit: our goal was to use plain C definitions in ABI headers as much as
possible without introducing any dependency on external macros

>  	u32 reserved[13];
>  } __packed;
>  static_assert(sizeof(struct guc_ct_buffer_desc) == 64);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index f01325cd1b625..11b5d4ddb19ce 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -816,8 +816,22 @@ static int ct_read(struct intel_guc_ct *ct, struct ct_incoming_msg **msg)
>  	if (unlikely(ctb->broken))
>  		return -EPIPE;
>  
> -	if (unlikely(desc->status))
> -		goto corrupted;
> +	if (unlikely(desc->status)) {
> +		u32 status = desc->status;
> +
> +		if (status & GUC_CTB_STATUS_UNUSED) {
> +			/*
> +			 * Potentially valid if a CLIENT_RESET request resulted in
> +			 * contexts/engines being reset. But should never happen as
> +			 * no contexts should be active when CLIENT_RESET is sent.
> +			 */
> +			CT_ERROR(ct, "Unexpected G2H after GuC has stopped!\n");
> +			status &= ~GUC_CTB_STATUS_UNUSED;

do you really want to continue read messages from already disabled CTB ?
maybe instead of clearing GUC_CTB_STATUS_UNUSED bit we should just return?

Michal

> +		}
> +
> +		if (status)
> +			goto corrupted;
> +	}
>  
>  	GEM_BUG_ON(head > size);
>  

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-gfx] [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status
  2022-07-28 19:06   ` [Intel-gfx] " Michal Wajdeczko
@ 2022-07-28 19:38     ` John Harrison
  0 siblings, 0 replies; 16+ messages in thread
From: John Harrison @ 2022-07-28 19:38 UTC (permalink / raw)
  To: Michal Wajdeczko, Intel-GFX; +Cc: DRI-Devel

On 7/28/2022 12:06, Michal Wajdeczko wrote:
> On 28.07.2022 04:42, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> When the KMD sends a CLIENT_RESET request to GuC (as part of the
>> suspend sequence), GuC will mark the CTB buffer as 'UNUSED'. If the
> hmm, GuC shouldn't do that on CLIENT_RESET, GuC shall only mark CTB as
> UNUSED when we explicitly disable CTB using CONTROL_CTB as only then CTB
> descriptors are known to be valid
GuC very definitely does do that.

>
>> KMD then checked the CTB queue, it would see a non-zero status value
>> and report the buffer as corrupted.
>>
>> Technically, no G2H messages should be received once the CLIENT_RESET
>> has been sent. However, if a context was outstanding on an engine then
>> it would get reset and a reset notification would be sent. So, don't
>> actually treat UNUSED as a catastrophic error. Just flag it up as
>> unexpected and keep going.
> we should have already marked locally that CTB is disabled, either as
> part of the explicit disabling of CTB with CONTROL_CTB, or implicit due
> to issued CLIENT_RESET, but in both cases we shouldn't try to read CTB
> any more, even it there are any outstanding messages ...
>
> is this due to a race with ct->enabled ?
As per review comments on previous revision, it was only hit during POC 
work of a hardware w/a that led to the G2H processing code being called 
even when no G2H message had been sent.

And you can't mark the CTB as disabled before sending a H2G message. 
That would result in not sending the CLIENT_RESET H2G at all. We do mark 
it disabled after having sent the message. But there will always exist a 
potential race condition where GuC sends us a message before we get that 
far.

>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   .../i915/gt/uc/abi/guc_communication_ctb_abi.h |  8 +++++---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c      | 18 ++++++++++++++++--
>>   2 files changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
>> index df83c1cc7c7a6..28b8387f97b77 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
>> @@ -37,6 +37,7 @@
>>    *  |   |       |   - _`GUC_CTB_STATUS_OVERFLOW` = 1 (head/tail too large)     |
>>    *  |   |       |   - _`GUC_CTB_STATUS_UNDERFLOW` = 2 (truncated message)      |
>>    *  |   |       |   - _`GUC_CTB_STATUS_MISMATCH` = 4 (head/tail modified)      |
>> + *  |   |       |   - _`GUC_CTB_STATUS_UNUSED` = 8 (CTB is not in use)         |
>>    *  +---+-------+--------------------------------------------------------------+
>>    *  |...|       | RESERVED = MBZ                                               |
>>    *  +---+-------+--------------------------------------------------------------+
>> @@ -49,9 +50,10 @@ struct guc_ct_buffer_desc {
>>   	u32 tail;
>>   	u32 status;
>>   #define GUC_CTB_STATUS_NO_ERROR				0
>> -#define GUC_CTB_STATUS_OVERFLOW				(1 << 0)
>> -#define GUC_CTB_STATUS_UNDERFLOW			(1 << 1)
>> -#define GUC_CTB_STATUS_MISMATCH				(1 << 2)
>> +#define GUC_CTB_STATUS_OVERFLOW				BIT(0)
>> +#define GUC_CTB_STATUS_UNDERFLOW			BIT(1)
>> +#define GUC_CTB_STATUS_MISMATCH				BIT(2)
>> +#define GUC_CTB_STATUS_UNUSED				BIT(3)
> nit: our goal was to use plain C definitions in ABI headers as much as
> possible without introducing any dependency on external macros
Except that checkpatch complains like a complainy thing.

>
>>   	u32 reserved[13];
>>   } __packed;
>>   static_assert(sizeof(struct guc_ct_buffer_desc) == 64);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> index f01325cd1b625..11b5d4ddb19ce 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> @@ -816,8 +816,22 @@ static int ct_read(struct intel_guc_ct *ct, struct ct_incoming_msg **msg)
>>   	if (unlikely(ctb->broken))
>>   		return -EPIPE;
>>   
>> -	if (unlikely(desc->status))
>> -		goto corrupted;
>> +	if (unlikely(desc->status)) {
>> +		u32 status = desc->status;
>> +
>> +		if (status & GUC_CTB_STATUS_UNUSED) {
>> +			/*
>> +			 * Potentially valid if a CLIENT_RESET request resulted in
>> +			 * contexts/engines being reset. But should never happen as
>> +			 * no contexts should be active when CLIENT_RESET is sent.
>> +			 */
>> +			CT_ERROR(ct, "Unexpected G2H after GuC has stopped!\n");
>> +			status &= ~GUC_CTB_STATUS_UNUSED;
> do you really want to continue read messages from already disabled CTB ?
> maybe instead of clearing GUC_CTB_STATUS_UNUSED bit we should just return?
GuC could have sent us a valid message right before shutting down its 
end of the CTB. We should still process that message. Note that the 
clear is only of the local status variable. The CTB itself is still 
marked as closed by GuC.

John.


>
> Michal
>
>> +		}
>> +
>> +		if (status)
>> +			goto corrupted;
>> +	}
>>   
>>   	GEM_BUG_ON(head > size);
>>   


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-gfx] [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status
  2022-07-28  2:42 ` [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status John.C.Harrison
  2022-07-28 19:06   ` [Intel-gfx] " Michal Wajdeczko
@ 2022-07-29  0:00   ` Ceraolo Spurio, Daniele
  2022-07-29  0:35     ` John Harrison
  1 sibling, 1 reply; 16+ messages in thread
From: Ceraolo Spurio, Daniele @ 2022-07-29  0:00 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel



On 7/27/2022 7:42 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> When the KMD sends a CLIENT_RESET request to GuC (as part of the
> suspend sequence), GuC will mark the CTB buffer as 'UNUSED'. If the
> KMD then checked the CTB queue, it would see a non-zero status value
> and report the buffer as corrupted.
>
> Technically, no G2H messages should be received once the CLIENT_RESET
> has been sent. However, if a context was outstanding on an engine then
> it would get reset and a reset notification would be sent. So, don't
> actually treat UNUSED as a catastrophic error. Just flag it up as
> unexpected and keep going.

Given that we disable CTs right after sending the CLIENT_RESET, there is 
only a small window for the kernel to receive a G2H interrupt before we 
turn everything off. If we want to support catching unexpected G2Hs 
coming at that time, maybe we should instead make sure all CT messages 
(if any) have been processed before the disable. Not a blocker for this 
patch, can be done as a follow-up.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   .../i915/gt/uc/abi/guc_communication_ctb_abi.h |  8 +++++---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c      | 18 ++++++++++++++++--
>   2 files changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
> index df83c1cc7c7a6..28b8387f97b77 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
> @@ -37,6 +37,7 @@
>    *  |   |       |   - _`GUC_CTB_STATUS_OVERFLOW` = 1 (head/tail too large)     |
>    *  |   |       |   - _`GUC_CTB_STATUS_UNDERFLOW` = 2 (truncated message)      |
>    *  |   |       |   - _`GUC_CTB_STATUS_MISMATCH` = 4 (head/tail modified)      |
> + *  |   |       |   - _`GUC_CTB_STATUS_UNUSED` = 8 (CTB is not in use)         |
>    *  +---+-------+--------------------------------------------------------------+
>    *  |...|       | RESERVED = MBZ                                               |
>    *  +---+-------+--------------------------------------------------------------+
> @@ -49,9 +50,10 @@ struct guc_ct_buffer_desc {
>   	u32 tail;
>   	u32 status;
>   #define GUC_CTB_STATUS_NO_ERROR				0
> -#define GUC_CTB_STATUS_OVERFLOW				(1 << 0)
> -#define GUC_CTB_STATUS_UNDERFLOW			(1 << 1)
> -#define GUC_CTB_STATUS_MISMATCH				(1 << 2)
> +#define GUC_CTB_STATUS_OVERFLOW				BIT(0)
> +#define GUC_CTB_STATUS_UNDERFLOW			BIT(1)
> +#define GUC_CTB_STATUS_MISMATCH				BIT(2)
> +#define GUC_CTB_STATUS_UNUSED				BIT(3)
>   	u32 reserved[13];
>   } __packed;
>   static_assert(sizeof(struct guc_ct_buffer_desc) == 64);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index f01325cd1b625..11b5d4ddb19ce 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -816,8 +816,22 @@ static int ct_read(struct intel_guc_ct *ct, struct ct_incoming_msg **msg)
>   	if (unlikely(ctb->broken))
>   		return -EPIPE;
>   
> -	if (unlikely(desc->status))
> -		goto corrupted;
> +	if (unlikely(desc->status)) {
> +		u32 status = desc->status;
> +
> +		if (status & GUC_CTB_STATUS_UNUSED) {
> +			/*
> +			 * Potentially valid if a CLIENT_RESET request resulted in
> +			 * contexts/engines being reset. But should never happen as
> +			 * no contexts should be active when CLIENT_RESET is sent.
> +			 */
> +			CT_ERROR(ct, "Unexpected G2H after GuC has stopped!\n");
> +			status &= ~GUC_CTB_STATUS_UNUSED;
> +		}
> +
> +		if (status)
> +			goto corrupted;
> +	}
>   
>   	GEM_BUG_ON(head > size);
>   


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Intel-gfx] [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status
  2022-07-29  0:00   ` Ceraolo Spurio, Daniele
@ 2022-07-29  0:35     ` John Harrison
  0 siblings, 0 replies; 16+ messages in thread
From: John Harrison @ 2022-07-29  0:35 UTC (permalink / raw)
  To: Ceraolo Spurio, Daniele, Intel-GFX; +Cc: DRI-Devel

On 7/28/2022 17:00, Ceraolo Spurio, Daniele wrote:
> On 7/27/2022 7:42 PM, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> When the KMD sends a CLIENT_RESET request to GuC (as part of the
>> suspend sequence), GuC will mark the CTB buffer as 'UNUSED'. If the
>> KMD then checked the CTB queue, it would see a non-zero status value
>> and report the buffer as corrupted.
>>
>> Technically, no G2H messages should be received once the CLIENT_RESET
>> has been sent. However, if a context was outstanding on an engine then
>> it would get reset and a reset notification would be sent. So, don't
>> actually treat UNUSED as a catastrophic error. Just flag it up as
>> unexpected and keep going.
>
> Given that we disable CTs right after sending the CLIENT_RESET, there 
> is only a small window for the kernel to receive a G2H interrupt 
> before we turn everything off. If we want to support catching 
> unexpected G2Hs coming at that time, maybe we should instead make sure 
> all CT messages (if any) have been processed before the disable. Not a 
> blocker for this patch, can be done as a follow-up.
Yeah, it gets messy. How do you check for messages in a CTB that is 
already marked as 'do not touch me'? The current check for available 
work (head != tail) is after the status check. This is specifically so 
that we don't try to process corrupted messages in a corrupted buffer. 
But by definition, if the send(CLIENT_RESEET) call has returned then the 
status is already 'do not use'. Ideally, we would just want to flush out 
any pending interrupts before turning interrupts off in the sanitise 
code. But then, is there a race where the interrupt hasn't quite made it 
far enough by that time? Do we need to stall for a bit? How long?

As noted, in the case where we actually hit the issue the interrupt 
handler did get to run in the gap between sending the reset message and 
turning off the i915 side of the CTB. So we are basically into windows 
of opportunity and diminishing returns. Given that it is supposedly an 
impossible situation anyway, I'm not sure it is worth putting a complex 
solution in to solve. But yeah, can think more and maybe get some kind 
of extra check in there as a follow up.

John.


>
> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>
> Daniele
>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   .../i915/gt/uc/abi/guc_communication_ctb_abi.h |  8 +++++---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c      | 18 ++++++++++++++++--
>>   2 files changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git 
>> a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h 
>> b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
>> index df83c1cc7c7a6..28b8387f97b77 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_communication_ctb_abi.h
>> @@ -37,6 +37,7 @@
>>    *  |   |       |   - _`GUC_CTB_STATUS_OVERFLOW` = 1 (head/tail too 
>> large)     |
>>    *  |   |       |   - _`GUC_CTB_STATUS_UNDERFLOW` = 2 (truncated 
>> message)      |
>>    *  |   |       |   - _`GUC_CTB_STATUS_MISMATCH` = 4 (head/tail 
>> modified)      |
>> + *  |   |       |   - _`GUC_CTB_STATUS_UNUSED` = 8 (CTB is not in 
>> use)         |
>>    * 
>> +---+-------+--------------------------------------------------------------+
>>    *  |...|       | RESERVED = 
>> MBZ                                               |
>>    * 
>> +---+-------+--------------------------------------------------------------+
>> @@ -49,9 +50,10 @@ struct guc_ct_buffer_desc {
>>       u32 tail;
>>       u32 status;
>>   #define GUC_CTB_STATUS_NO_ERROR                0
>> -#define GUC_CTB_STATUS_OVERFLOW                (1 << 0)
>> -#define GUC_CTB_STATUS_UNDERFLOW            (1 << 1)
>> -#define GUC_CTB_STATUS_MISMATCH                (1 << 2)
>> +#define GUC_CTB_STATUS_OVERFLOW                BIT(0)
>> +#define GUC_CTB_STATUS_UNDERFLOW            BIT(1)
>> +#define GUC_CTB_STATUS_MISMATCH                BIT(2)
>> +#define GUC_CTB_STATUS_UNUSED                BIT(3)
>>       u32 reserved[13];
>>   } __packed;
>>   static_assert(sizeof(struct guc_ct_buffer_desc) == 64);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> index f01325cd1b625..11b5d4ddb19ce 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> @@ -816,8 +816,22 @@ static int ct_read(struct intel_guc_ct *ct, 
>> struct ct_incoming_msg **msg)
>>       if (unlikely(ctb->broken))
>>           return -EPIPE;
>>   -    if (unlikely(desc->status))
>> -        goto corrupted;
>> +    if (unlikely(desc->status)) {
>> +        u32 status = desc->status;
>> +
>> +        if (status & GUC_CTB_STATUS_UNUSED) {
>> +            /*
>> +             * Potentially valid if a CLIENT_RESET request resulted in
>> +             * contexts/engines being reset. But should never happen as
>> +             * no contexts should be active when CLIENT_RESET is sent.
>> +             */
>> +            CT_ERROR(ct, "Unexpected G2H after GuC has stopped!\n");
>> +            status &= ~GUC_CTB_STATUS_UNUSED;
>> +        }
>> +
>> +        if (status)
>> +            goto corrupted;
>> +    }
>>         GEM_BUG_ON(head > size);
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-07-29  0:35 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-28  2:42 [PATCH 0/6] Random assortment of (mostly) GuC related patches John.C.Harrison
2022-07-28  2:42 ` [PATCH 1/6] drm/i915/guc: Route semaphores to GuC for Gen12+ John.C.Harrison
2022-07-28  2:42 ` [PATCH 2/6] drm/i915/guc: Fix issues with live_preempt_cancel John.C.Harrison
2022-07-28  2:45   ` John Harrison
2022-07-28  2:42 ` [PATCH 3/6] drm/i915/guc: Add selftest for a hung GuC John.C.Harrison
2022-07-28 18:21   ` John Harrison
2022-07-28 18:26     ` John.C.Harrison
2022-07-28 18:33       ` John Harrison
2022-07-28  2:42 ` [PATCH 4/6] drm/i915/selftest: Cope with not having an RCS engine John.C.Harrison
2022-07-28  2:42 ` [PATCH 5/6] drm/i915/guc: Support larger contexts on newer hardware John.C.Harrison
2022-07-28  2:46   ` John Harrison
2022-07-28  2:42 ` [PATCH 6/6] drm/i915/guc: Don't abort on CTB_UNUSED status John.C.Harrison
2022-07-28 19:06   ` [Intel-gfx] " Michal Wajdeczko
2022-07-28 19:38     ` John Harrison
2022-07-29  0:00   ` Ceraolo Spurio, Daniele
2022-07-29  0:35     ` John Harrison

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).