* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Clean up some CI failures for GuC submission
  2021-08-08 18:07 ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-08-08 17:59 ` Patchwork
  -1 siblings, 0 replies; 23+ messages in thread
From: Patchwork @ 2021-08-08 17:59 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Clean up some CI failures for GuC submission
URL   : https://patchwork.freedesktop.org/series/93487/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
d416adfba5b9 drm/i915/guc: Fix several issues related to resets / request cancelation
-:84: WARNING:MSLEEP: msleep < 20ms can sleep for up to 20ms; see Documentation/timers/timers-howto.rst
#84: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:733:
+		msleep(1);

total: 0 errors, 1 warnings, 0 checks, 98 lines checked
4b762ba775a1 drm/i915/selftests: Fix memory corruption in live_lrc_isolation
21bec627a015 drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
-:76: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#76: 
new file mode 100644

-:136: ERROR:SPACING: space required before the open parenthesis '('
#136: FILE: drivers/gpu/drm/i915/gt/uc/selftest_guc.c:56:
+		switch(i) {

total: 1 errors, 1 warnings, 0 checks, 211 lines checked



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 0/3] Clean up some CI failures for GuC submission
@ 2021-08-08 18:07 ` Matthew Brost
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew Brost @ 2021-08-08 18:07 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Address a CI failure related to reset [1]. In addition to the public CI
failure, we also fix several reset-related issues uncovered recently in
our internal CI. Patch #1 addresses all of these issues.

Work around an existing memory corruption in the gt_lrc selftest, exposed
by enabling GuC submission [2].

Lastly, add a selftest to give us confidence in some of the reset code
that is rather hard to exercise via IGTs and only hit intermittently.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

[1] https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10456/fi-rkl-guc/igt@i915_selftest@live@workarounds.html
[2] https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20772/fi-rkl-guc/igt@i915_selftest@live@gt_lrc.html

Matthew Brost (3):
  drm/i915/guc: Fix several issues related to resets / request
    cancelation
  drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H

 drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  29 +++-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  61 ++++++---
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 7 files changed, 218 insertions(+), 17 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

-- 
2.28.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/3] drm/i915/guc: Fix several issues related to resets / request cancelation
  2021-08-08 18:07 ` [Intel-gfx] " Matthew Brost
@ 2021-08-08 18:07   ` Matthew Brost
  -1 siblings, 0 replies; 23+ messages in thread
From: Matthew Brost @ 2021-08-08 18:07 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Resets are notoriously hard to get fully working and notoriously racy,
especially with selftests / IGTs that do all sorts of wild things that
would be nearly impossible to hit during normal use. Even though they are
unlikely to be hit in practice, anything the selftests / IGTs uncover
needs to be fixed. This patch addresses 7 such issues.

1. A small race that could result in incorrect accounting of the number
of outstanding G2H. Basically, prior to this patch we did not increment
the number of outstanding G2H if we encountered a GT reset while sending
an H2G. This was incorrect, as the context state had already been updated
to anticipate a G2H response, so the counter should be incremented. (A
distilled sketch of this pattern follows the list below.)

2. When unwinding requests on a reset context, if other requests in the
context are in the priority list the requests could be resubmitted out
of seqno order. Traverse the list of active requests in reverse and
append to the head of the priority list to fix this.

3. Don't drop ce->guc_active.lock when unwinding a context after reset.
At one point we had to drop this because of a lock inversion but that is
no longer the case. It is much safer to hold the lock so let's do that.

4. Prior to this patch the blocked context counter was cleared in
init_sched_state (used when registering a context & during resets), which
is incorrect. This state needs to be persistent or the counter can read
an incorrect value.

5. Flush the work queue for GuC generated G2H messages during a GT reset.

6. Do not clear the enabled state during a context reset if a scheduling
enable is in flight.

7. When unblocking a context, do not enable scheduling if the context is
banned.
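
As a minimal sketch of the accounting pattern behind issue #1 (the helper
and its send stand-in are hypothetical, distilled from the hunk in this
patch rather than the driver's actual code):

  /*
   * Bump the counter before sending: the context state already expects a
   * G2H, so the counter must match even if a GT reset interrupts the send.
   * Mirroring the hunk below, roll back only when the send returns -EBUSY.
   */
  static int send_and_account(struct intel_guc *guc, u32 g2h_len_dw)
  {
          int err;

          if (g2h_len_dw)
                  atomic_inc(&guc->outstanding_submission_g2h);

          err = do_send(guc); /* stand-in for intel_guc_send_busy_loop() */
          if (err == -EBUSY && g2h_len_dw)
                  atomic_dec(&guc->outstanding_submission_g2h);

          return err;
  }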

Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++-------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 87d8dc8f51b9..cd8df078ca87 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
 {
 	/* Only should be called from guc_lrc_desc_pin() */
 	atomic_set(&ce->guc_sched_state_no_lock, 0);
-	ce->guc_state.sched_state = 0;
+	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
 static inline bool
@@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
 {
 	int err;
 
-	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
-
-	if (!err && g2h_len_dw)
+	if (g2h_len_dw)
 		atomic_inc(&guc->outstanding_submission_g2h);
 
+	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
+	if (err == -EBUSY && g2h_len_dw)
+		atomic_dec(&guc->outstanding_submission_g2h);
+
 	return err;
 }
 
@@ -725,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 			wait_for_reset(guc, &guc->outstanding_submission_g2h);
 		} while (!list_empty(&guc->ct.requests.incoming));
 	}
+
+	/* Flush any GuC generated G2H */
+	while (!list_empty(&guc->ct.requests.incoming))
+		msleep(1);
+
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
 
@@ -797,14 +804,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
 	spin_lock(&ce->guc_active.lock);
-	list_for_each_entry_safe(rq, rn,
-				 &ce->guc_active.requests,
-				 sched.link) {
+	list_for_each_entry_safe_reverse(rq, rn,
+					 &ce->guc_active.requests,
+					 sched.link) {
 		if (i915_request_completed(rq))
 			continue;
 
 		list_del_init(&rq->sched.link);
-		spin_unlock(&ce->guc_active.lock);
 
 		__i915_request_unsubmit(rq);
 
@@ -816,10 +822,8 @@ __unwind_incomplete_requests(struct intel_context *ce)
 		}
 		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
 
-		list_add_tail(&rq->sched.link, pl);
+		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
-
-		spin_lock(&ce->guc_active.lock);
 	}
 	spin_unlock(&ce->guc_active.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
@@ -828,17 +832,23 @@ __unwind_incomplete_requests(struct intel_context *ce)
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
 	struct i915_request *rq;
+	unsigned long flags;
 	u32 head;
 
 	intel_context_get(ce);
 
 	/*
-	 * GuC will implicitly mark the context as non-schedulable
-	 * when it sends the reset notification. Make sure our state
-	 * reflects this change. The context will be marked enabled
-	 * on resubmission.
+	 * GuC will implicitly mark the context as non-schedulable when it sends
+	 * the reset notification. Make sure our state reflects this change. The
+	 * context will be marked enabled on resubmission. A small window exists
+	 * where the context could be blocked & unblocked (scheduling enable)
+	 * while this reset was in flight. If a scheduling enable is already
+	 * in flight, do not clear the enable.
 	 */
-	clr_context_enabled(ce);
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	if (!context_pending_enable(ce))
+		clr_context_enabled(ce);
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	rq = intel_context_find_active_request(ce);
 	if (!rq) {
@@ -1562,6 +1572,7 @@ static void guc_context_unblock(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	if (unlikely(submission_disabled(guc) ||
+		     intel_context_is_banned(ce) ||
 		     !intel_context_is_pinned(ce) ||
 		     context_pending_disable(ce) ||
 		     context_blocked(ce) > 1)) {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/3] drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  2021-08-08 18:07 ` [Intel-gfx] " Matthew Brost
@ 2021-08-08 18:07   ` Matthew Brost
  -1 siblings, 0 replies; 23+ messages in thread
From: Matthew Brost @ 2021-08-08 18:07 UTC (permalink / raw)
  To: intel-gfx, dri-devel

GuC submission has exposed an existing memory corruption in
live_lrc_isolation. We believe that some writes to the watchdog offsets
in the LRC (0x178 & 0x17c) can result in trashing of portions of the
address space. With GuC submission there are additional objects which
can move the context redzone into the space that gets trashed. To work
around this, avoid poisoning the watchdog offsets.
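
To make the masking concrete, a hypothetical worked example (the value
0x2178 below is illustrative only, not taken from a real LRC image):

  u32 raw = 0x2178;          /* hypothetical LRC state-image entry       */
  u32 off = raw & 0x07ff;    /* Gen12 render-class offset mask -> 0x178  */
  /*
   * 0x178 is one of the watchdog offsets, so safe_offset() substitutes
   * register offset 0 and the poison value never lands on the watchdog.
   */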

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index b0977a3b699b..6500e9fce8a0 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
 	goto err_after;
 }
 
+static u32 safe_offset(u32 offset, u32 reg)
+{
+	/* XXX skip testing of watchdog */
+	if (offset == 0x178 || offset == 0x17c)
+		reg = 0;
+
+	return reg;
+}
+
+static int get_offset_mask(struct intel_engine_cs *engine)
+{
+	if (GRAPHICS_VER(engine->i915) < 12)
+		return 0xfff;
+
+	switch (engine->class) {
+	default:
+	case RENDER_CLASS:
+		return 0x07ff;
+	case COPY_ENGINE_CLASS:
+		return 0x0fff;
+	case VIDEO_DECODE_CLASS:
+	case VIDEO_ENHANCEMENT_CLASS:
+		return 0x3fff;
+	}
+}
+
 static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 {
 	struct i915_vma *batch;
@@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 		len = (len + 1) / 2;
 		*cs++ = MI_LOAD_REGISTER_IMM(len);
 		while (len--) {
-			*cs++ = hw[dw];
+			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
+					    hw[dw]);
 			*cs++ = poison;
 			dw += 2;
 		}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/3] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  2021-08-08 18:07 ` [Intel-gfx] " Matthew Brost
@ 2021-08-08 18:07   ` Matthew Brost
  -1 siblings, 0 replies; 23+ messages in thread
From: Matthew Brost @ 2021-08-08 18:07 UTC (permalink / raw)
  To: intel-gfx, dri-devel

While debugging an issue with full GT resets I went down a rabbit hole
thinking the scrubbing of lost G2H wasn't working correctly. That proved
to be wrong (the scrubbing was working just fine), but the chase inspired
me to write a selftest to prove that it works. This simple selftest
injects errors that drop various G2H messages and then issues a full GT
reset, proving that the scrubbing of these G2H doesn't blow up.
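
A note on the injection points below: the drop_* flags are wrapped in the
selftest-only macros, which (as I understand i915_selftest.h, worth
double-checking) compile away when CONFIG_DRM_I915_SELFTEST is disabled,
so the hooks cost nothing in production builds. Roughly:

  /* Sketch of the gating idea, not the literal i915_selftest.h text */
  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
  #define I915_SELFTEST_DECLARE(x) x
  #define I915_SELFTEST_ONLY(x)    unlikely(x)
  #else
  #define I915_SELFTEST_DECLARE(x)
  #define I915_SELFTEST_ONLY(x)    0
  #endif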

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  18 +++
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 6 files changed, 163 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index e54351a170e2..fec5ff7ef168 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -198,6 +198,10 @@ struct intel_context {
 	 */
 	u8 guc_prio;
 	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
+
+	I915_SELFTEST_DECLARE(bool drop_schedule_enable);
+	I915_SELFTEST_DECLARE(bool drop_schedule_disable);
+	I915_SELFTEST_DECLARE(bool drop_deregister);
 };
 
 #endif /* __INTEL_CONTEXT_TYPES__ */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index cd8df078ca87..d13dc56bae43 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2618,6 +2618,11 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 
 	trace_intel_context_deregister_done(ce);
 
+	if (I915_SELFTEST_ONLY(ce->drop_deregister)) {
+		I915_SELFTEST_DECLARE(ce->drop_deregister = false;)
+		return 0;
+	}
+
 	if (context_wait_for_deregister_to_register(ce)) {
 		struct intel_runtime_pm *runtime_pm =
 			&ce->engine->gt->i915->runtime_pm;
@@ -2672,10 +2677,19 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 	trace_intel_context_sched_done(ce);
 
 	if (context_pending_enable(ce)) {
+		if (I915_SELFTEST_ONLY(ce->drop_schedule_enable)) {
+			I915_SELFTEST_DECLARE(ce->drop_schedule_enable = false;)
+			return 0;
+		}
 		clr_context_pending_enable(ce);
 	} else if (context_pending_disable(ce)) {
 		bool banned;
 
+		if (I915_SELFTEST_ONLY(ce->drop_schedule_disable)) {
+			I915_SELFTEST_DECLARE(ce->drop_schedule_disable = false;)
+			return 0;
+		}
+
 		/*
 		 * Unpin must be done before __guc_signal_context_fence,
 		 * otherwise a race exists between the requests getting
@@ -3047,3 +3061,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 
 	return false;
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftest_guc.c"
+#endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
new file mode 100644
index 000000000000..46ca6554f65d
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include "selftests/intel_scheduler_helpers.h"
+
+static struct i915_request *nop_user_request(struct intel_context *ce,
+					     struct i915_request *from)
+{
+	struct i915_request *rq;
+	int ret;
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	if (from) {
+		ret = i915_sw_fence_await_dma_fence(&rq->submit,
+						    &from->fence, 0,
+						    I915_FENCE_GFP);
+		if (ret < 0) {
+			i915_request_put(rq);
+			return ERR_PTR(ret);
+		}
+	}
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static int intel_guc_scrub_ctbs(void *arg)
+{
+	struct intel_gt *gt = arg;
+	int ret = 0;
+	int i;
+	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct intel_context *ce;
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+	engine = intel_selftest_find_any_engine(gt);
+
+	/* Submit requests and inject errors forcing G2H to be dropped */
+	for (i = 0; i < 3; ++i) {
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce)) {
+			ret = PTR_ERR(ce);
+			pr_err("Failed to create context, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		switch(i) {
+		case 0:
+			ce->drop_schedule_enable = true;
+			break;
+		case 1:
+			ce->drop_schedule_disable = true;
+			break;
+		case 2:
+			ce->drop_deregister = true;
+			break;
+		}
+
+		rq = nop_user_request(ce, NULL);
+		intel_context_put(ce);
+
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			pr_err("Failed to create request, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		last[i] = rq;
+	}
+
+	for (i = 0; i < 3; ++i) {
+		ret = i915_request_wait(last[i], 0, HZ);
+		if (ret < 0) {
+			pr_err("Last request failed to complete: %d\n", ret);
+			goto err;
+		}
+		i915_request_put(last[i]);
+		last[i] = NULL;
+	}
+
+	/* Force all H2G / G2H to be submitted / processed */
+	intel_gt_retire_requests(gt);
+	msleep(500);
+
+	/* Scrub missing G2H */
+	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
+
+	ret = intel_gt_wait_for_idle(gt, HZ);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err;
+	}
+
+err:
+	for (i = 0; i < 3; ++i)
+		if (last[i])
+			i915_request_put(last[i]);
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+
+	return ret;
+}
+
+int intel_guc_live_selftests(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_scrub_ctbs),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index cfa5c4165a4f..3cf6758931f9 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
 selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
+selftest(guc, intel_guc_live_selftests)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
index 4b328346b48a..310fb83c527e 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
@@ -14,6 +14,18 @@
 #define REDUCED_PREEMPT		10
 #define WAIT_FOR_RESET_TIME	10000
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, gt, id)
+		return engine;
+
+	pr_err("No valid engine found!\n");
+	return NULL;
+}
+
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 u32 modify_type)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
index 35c098601ac0..ae60bb507f45 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
@@ -10,6 +10,7 @@
 
 struct i915_request;
 struct intel_engine_cs;
+struct intel_gt;
 
 struct intel_selftest_saved_policy {
 	u32 flags;
@@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
 	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
 };
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 enum selftest_scheduler_modify modify_type);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for Clean up some CI failures for GuC submission
  2021-08-08 18:07 ` [Intel-gfx] " Matthew Brost
                   ` (4 preceding siblings ...)
  (?)
@ 2021-08-08 18:26 ` Patchwork
  -1 siblings, 0 replies; 23+ messages in thread
From: Patchwork @ 2021-08-08 18:26 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 2959 bytes --]

== Series Details ==

Series: Clean up some CI failures for GuC submission
URL   : https://patchwork.freedesktop.org/series/93487/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_10458 -> Patchwork_20785
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/index.html

New tests
---------

  New tests have been introduced between CI_DRM_10458 and Patchwork_20785:

### New IGT tests (1) ###

  * igt@i915_selftest@live@guc:
    - Statuses : 31 pass(s)
    - Exec time: [0.40, 4.86] s

  

Known issues
------------

  Here are the changes found in Patchwork_20785 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@cs-gfx:
    - fi-rkl-guc:         NOTRUN -> [SKIP][1] ([fdo#109315]) +17 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/fi-rkl-guc/igt@amdgpu/amd_basic@cs-gfx.html

  
#### Possible fixes ####

  * igt@i915_module_load@reload:
    - {fi-tgl-dsi}:       [DMESG-WARN][2] ([i915#1982] / [k.org#205379]) -> [PASS][3]
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/fi-tgl-dsi/igt@i915_module_load@reload.html
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/fi-tgl-dsi/igt@i915_module_load@reload.html

  * igt@i915_selftest@live@workarounds:
    - fi-rkl-guc:         [DMESG-FAIL][4] -> [PASS][5]
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/fi-rkl-guc/igt@i915_selftest@live@workarounds.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/fi-rkl-guc/igt@i915_selftest@live@workarounds.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [k.org#205379]: https://bugzilla.kernel.org/show_bug.cgi?id=205379


Participating hosts (37 -> 34)
------------------------------

  Missing    (3): fi-bdw-samus fi-bsw-cyan bat-jsl-1 


Build changes
-------------

  * Linux: CI_DRM_10458 -> Patchwork_20785

  CI-20190529: 20190529
  CI_DRM_10458: 6841146b2648073c607c7e560176495c01b2ebae @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6162: 2f32b9e0da5f1ac9529318dd5b836c8cf4d3c441 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_20785: 21bec627a0155077d0a0b30a7e0455d38827e00a @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

21bec627a015 drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
4b762ba775a1 drm/i915/selftests: Fix memory corruption in live_lrc_isolation
d416adfba5b9 drm/i915/guc: Fix several issues related to resets / request cancelation

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/index.html

[-- Attachment #2: Type: text/html, Size: 3662 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Intel-gfx] ✓ Fi.CI.IGT: success for Clean up some CI failures for GuC submission
  2021-08-08 18:07 ` [Intel-gfx] " Matthew Brost
                   ` (5 preceding siblings ...)
  (?)
@ 2021-08-08 19:40 ` Patchwork
  -1 siblings, 0 replies; 23+ messages in thread
From: Patchwork @ 2021-08-08 19:40 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 30267 bytes --]

== Series Details ==

Series: Clean up some CI failures for GuC submission
URL   : https://patchwork.freedesktop.org/series/93487/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_10458_full -> Patchwork_20785_full
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_20785_full:

### IGT changes ###

#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * {igt@gem_userptr_blits@map-fixed-invalidate-overlap@fixed}:
    - {shard-rkl}:        NOTRUN -> [SKIP][1]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-2/igt@gem_userptr_blits@map-fixed-invalidate-overlap@fixed.html

  
New tests
---------

  New tests have been introduced between CI_DRM_10458_full and Patchwork_20785_full:

### New IGT tests (1) ###

  * igt@i915_selftest@live@guc:
    - Statuses : 7 pass(s)
    - Exec time: [0.49, 4.94] s

  

Known issues
------------

  Here are the changes found in Patchwork_20785_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_ctx_persistence@engines-mixed:
    - shard-snb:          NOTRUN -> [SKIP][2] ([fdo#109271] / [i915#1099]) +4 similar issues
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-snb2/igt@gem_ctx_persistence@engines-mixed.html

  * igt@gem_exec_fair@basic-pace@rcs0:
    - shard-tglb:         [PASS][3] -> [FAIL][4] ([i915#2842])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-tglb3/igt@gem_exec_fair@basic-pace@rcs0.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-tglb2/igt@gem_exec_fair@basic-pace@rcs0.html

  * igt@gem_exec_fair@basic-pace@vcs0:
    - shard-iclb:         [PASS][5] -> [FAIL][6] ([i915#2842])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-iclb6/igt@gem_exec_fair@basic-pace@vcs0.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-iclb7/igt@gem_exec_fair@basic-pace@vcs0.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-kbl:          [PASS][7] -> [FAIL][8] ([i915#2842])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-kbl4/igt@gem_exec_fair@basic-pace@vecs0.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@gem_huc_copy@huc-copy:
    - shard-apl:          NOTRUN -> [SKIP][9] ([fdo#109271] / [i915#2190])
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl1/igt@gem_huc_copy@huc-copy.html

  * igt@gem_mmap_gtt@cpuset-big-copy:
    - shard-iclb:         [PASS][10] -> [FAIL][11] ([i915#307])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-iclb3/igt@gem_mmap_gtt@cpuset-big-copy.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-iclb2/igt@gem_mmap_gtt@cpuset-big-copy.html

  * igt@gem_pwrite@basic-exhaustion:
    - shard-apl:          NOTRUN -> [WARN][12] ([i915#2658])
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl2/igt@gem_pwrite@basic-exhaustion.html

  * igt@gem_userptr_blits@vma-merge:
    - shard-apl:          NOTRUN -> [FAIL][13] ([i915#3318])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl2/igt@gem_userptr_blits@vma-merge.html

  * igt@gem_workarounds@suspend-resume:
    - shard-apl:          NOTRUN -> [DMESG-WARN][14] ([i915#180])
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl6/igt@gem_workarounds@suspend-resume.html

  * igt@i915_pm_dc@dc6-psr:
    - shard-iclb:         [PASS][15] -> [FAIL][16] ([i915#454])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-iclb8/igt@i915_pm_dc@dc6-psr.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-iclb4/igt@i915_pm_dc@dc6-psr.html

  * igt@i915_pm_lpsp@kms-lpsp@kms-lpsp-dp:
    - shard-apl:          NOTRUN -> [SKIP][17] ([fdo#109271] / [i915#1937])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl6/igt@i915_pm_lpsp@kms-lpsp@kms-lpsp-dp.html

  * igt@i915_pm_rpm@basic-rte:
    - shard-apl:          NOTRUN -> [FAIL][18] ([i915#579])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl7/igt@i915_pm_rpm@basic-rte.html
    - shard-glk:          NOTRUN -> [FAIL][19] ([i915#579])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk3/igt@i915_pm_rpm@basic-rte.html

  * igt@kms_big_fb@x-tiled-32bpp-rotate-0:
    - shard-glk:          [PASS][20] -> [DMESG-WARN][21] ([i915#118] / [i915#95])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-glk6/igt@kms_big_fb@x-tiled-32bpp-rotate-0.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk7/igt@kms_big_fb@x-tiled-32bpp-rotate-0.html

  * igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-hflip:
    - shard-kbl:          NOTRUN -> [SKIP][22] ([fdo#109271] / [i915#3777]) +1 similar issue
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-hflip.html
    - shard-apl:          NOTRUN -> [SKIP][23] ([fdo#109271] / [i915#3777]) +1 similar issue
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl6/igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-hflip.html
    - shard-skl:          NOTRUN -> [SKIP][24] ([fdo#109271] / [i915#3777]) +1 similar issue
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl1/igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-hflip.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-hflip:
    - shard-glk:          NOTRUN -> [SKIP][25] ([fdo#109271] / [i915#3777])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk7/igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-hflip.html

  * igt@kms_big_fb@yf-tiled-addfb:
    - shard-tglb:         NOTRUN -> [SKIP][26] ([fdo#111615])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-tglb3/igt@kms_big_fb@yf-tiled-addfb.html

  * igt@kms_ccs@pipe-b-crc-sprite-planes-basic-y_tiled_gen12_rc_ccs_cc:
    - shard-glk:          NOTRUN -> [SKIP][27] ([fdo#109271] / [i915#3886]) +2 similar issues
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk3/igt@kms_ccs@pipe-b-crc-sprite-planes-basic-y_tiled_gen12_rc_ccs_cc.html
    - shard-apl:          NOTRUN -> [SKIP][28] ([fdo#109271] / [i915#3886]) +6 similar issues
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl7/igt@kms_ccs@pipe-b-crc-sprite-planes-basic-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-random-ccs-data-y_tiled_gen12_mc_ccs:
    - shard-kbl:          NOTRUN -> [SKIP][29] ([fdo#109271] / [i915#3886]) +1 similar issue
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@kms_ccs@pipe-b-random-ccs-data-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-c-crc-sprite-planes-basic-y_tiled_gen12_mc_ccs:
    - shard-skl:          NOTRUN -> [SKIP][30] ([fdo#109271] / [i915#3886]) +3 similar issues
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl4/igt@kms_ccs@pipe-c-crc-sprite-planes-basic-y_tiled_gen12_mc_ccs.html

  * igt@kms_chamelium@vga-hpd-without-ddc:
    - shard-snb:          NOTRUN -> [SKIP][31] ([fdo#109271] / [fdo#111827]) +13 similar issues
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-snb6/igt@kms_chamelium@vga-hpd-without-ddc.html

  * igt@kms_color_chamelium@pipe-b-ctm-0-25:
    - shard-tglb:         NOTRUN -> [SKIP][32] ([fdo#109284] / [fdo#111827])
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-tglb5/igt@kms_color_chamelium@pipe-b-ctm-0-25.html

  * igt@kms_color_chamelium@pipe-c-ctm-0-25:
    - shard-apl:          NOTRUN -> [SKIP][33] ([fdo#109271] / [fdo#111827]) +12 similar issues
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl6/igt@kms_color_chamelium@pipe-c-ctm-0-25.html
    - shard-kbl:          NOTRUN -> [SKIP][34] ([fdo#109271] / [fdo#111827]) +3 similar issues
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@kms_color_chamelium@pipe-c-ctm-0-25.html
    - shard-skl:          NOTRUN -> [SKIP][35] ([fdo#109271] / [fdo#111827]) +5 similar issues
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl1/igt@kms_color_chamelium@pipe-c-ctm-0-25.html

  * igt@kms_color_chamelium@pipe-d-ctm-0-25:
    - shard-glk:          NOTRUN -> [SKIP][36] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk3/igt@kms_color_chamelium@pipe-d-ctm-0-25.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions:
    - shard-skl:          [PASS][37] -> [FAIL][38] ([i915#2346])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-skl6/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl2/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size:
    - shard-skl:          [PASS][39] -> [FAIL][40] ([i915#2346] / [i915#533])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-skl5/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl1/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html

  * igt@kms_fbcon_fbt@fbc-suspend:
    - shard-kbl:          [PASS][41] -> [INCOMPLETE][42] ([i915#155] / [i915#180] / [i915#636])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-kbl1/igt@kms_fbcon_fbt@fbc-suspend.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl7/igt@kms_fbcon_fbt@fbc-suspend.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-blt:
    - shard-kbl:          NOTRUN -> [SKIP][43] ([fdo#109271]) +55 similar issues
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-blt.html

  * igt@kms_frontbuffer_tracking@fbc-suspend:
    - shard-kbl:          [PASS][44] -> [DMESG-WARN][45] ([i915#180]) +6 similar issues
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-kbl1/igt@kms_frontbuffer_tracking@fbc-suspend.html
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl4/igt@kms_frontbuffer_tracking@fbc-suspend.html

  * igt@kms_frontbuffer_tracking@fbcpsr-1p-offscren-pri-indfb-draw-mmap-cpu:
    - shard-glk:          NOTRUN -> [SKIP][46] ([fdo#109271]) +53 similar issues
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk3/igt@kms_frontbuffer_tracking@fbcpsr-1p-offscren-pri-indfb-draw-mmap-cpu.html

  * igt@kms_hdr@static-swap:
    - shard-tglb:         NOTRUN -> [SKIP][47] ([i915#1187])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-tglb3/igt@kms_hdr@static-swap.html

  * igt@kms_pipe_crc_basic@hang-read-crc-pipe-d:
    - shard-apl:          NOTRUN -> [SKIP][48] ([fdo#109271] / [i915#533])
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl1/igt@kms_pipe_crc_basic@hang-read-crc-pipe-d.html

  * igt@kms_plane@plane-panning-bottom-right-suspend@pipe-a-planes:
    - shard-apl:          [PASS][49] -> [DMESG-WARN][50] ([i915#180]) +2 similar issues
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-apl6/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-a-planes.html
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl6/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-a-planes.html

  * igt@kms_plane_alpha_blend@pipe-c-alpha-basic:
    - shard-apl:          NOTRUN -> [FAIL][51] ([fdo#108145] / [i915#265]) +1 similar issue
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl2/igt@kms_plane_alpha_blend@pipe-c-alpha-basic.html

  * igt@kms_plane_alpha_blend@pipe-c-alpha-transparent-fb:
    - shard-apl:          NOTRUN -> [FAIL][52] ([i915#265])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl6/igt@kms_plane_alpha_blend@pipe-c-alpha-transparent-fb.html
    - shard-kbl:          NOTRUN -> [FAIL][53] ([i915#265])
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@kms_plane_alpha_blend@pipe-c-alpha-transparent-fb.html
    - shard-skl:          NOTRUN -> [FAIL][54] ([i915#265])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl1/igt@kms_plane_alpha_blend@pipe-c-alpha-transparent-fb.html

  * igt@kms_plane_alpha_blend@pipe-c-coverage-7efc:
    - shard-skl:          [PASS][55] -> [FAIL][56] ([fdo#108145] / [i915#265]) +1 similar issue
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-skl7/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl9/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html

  * igt@kms_plane_multiple@atomic-pipe-c-tiling-yf:
    - shard-tglb:         NOTRUN -> [SKIP][57] ([fdo#112054])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-tglb5/igt@kms_plane_multiple@atomic-pipe-c-tiling-yf.html

  * igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-3:
    - shard-kbl:          NOTRUN -> [SKIP][58] ([fdo#109271] / [i915#658])
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-3.html
    - shard-apl:          NOTRUN -> [SKIP][59] ([fdo#109271] / [i915#658]) +2 similar issues
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl6/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-3.html

  * igt@kms_psr2_sf@plane-move-sf-dmg-area-2:
    - shard-skl:          NOTRUN -> [SKIP][60] ([fdo#109271] / [i915#658]) +1 similar issue
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl4/igt@kms_psr2_sf@plane-move-sf-dmg-area-2.html

  * igt@kms_psr2_su@page_flip:
    - shard-iclb:         [PASS][61] -> [SKIP][62] ([fdo#109642] / [fdo#111068] / [i915#658])
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-iclb2/igt@kms_psr2_su@page_flip.html
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-iclb8/igt@kms_psr2_su@page_flip.html

  * igt@kms_psr@psr2_primary_mmap_cpu:
    - shard-iclb:         [PASS][63] -> [SKIP][64] ([fdo#109441]) +2 similar issues
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-iclb2/igt@kms_psr@psr2_primary_mmap_cpu.html
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-iclb8/igt@kms_psr@psr2_primary_mmap_cpu.html

  * igt@kms_psr@psr2_sprite_blt:
    - shard-skl:          NOTRUN -> [SKIP][65] ([fdo#109271]) +65 similar issues
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl4/igt@kms_psr@psr2_sprite_blt.html

  * igt@kms_vblank@pipe-d-query-forked-hang:
    - shard-snb:          NOTRUN -> [SKIP][66] ([fdo#109271]) +285 similar issues
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-snb2/igt@kms_vblank@pipe-d-query-forked-hang.html

  * igt@kms_vblank@pipe-d-wait-forked-hang:
    - shard-apl:          NOTRUN -> [SKIP][67] ([fdo#109271]) +145 similar issues
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl2/igt@kms_vblank@pipe-d-wait-forked-hang.html

  * igt@kms_vrr@flip-basic:
    - shard-tglb:         NOTRUN -> [SKIP][68] ([fdo#109502])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-tglb3/igt@kms_vrr@flip-basic.html

  * igt@kms_writeback@writeback-invalid-parameters:
    - shard-apl:          NOTRUN -> [SKIP][69] ([fdo#109271] / [i915#2437])
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl1/igt@kms_writeback@writeback-invalid-parameters.html

  * igt@sysfs_clients@fair-1:
    - shard-skl:          NOTRUN -> [SKIP][70] ([fdo#109271] / [i915#2994])
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl4/igt@sysfs_clients@fair-1.html

  * igt@sysfs_clients@fair-3:
    - shard-tglb:         NOTRUN -> [SKIP][71] ([i915#2994])
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-tglb5/igt@sysfs_clients@fair-3.html

  * igt@sysfs_clients@split-25:
    - shard-kbl:          NOTRUN -> [SKIP][72] ([fdo#109271] / [i915#2994])
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@sysfs_clients@split-25.html
    - shard-apl:          NOTRUN -> [SKIP][73] ([fdo#109271] / [i915#2994]) +2 similar issues
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl2/igt@sysfs_clients@split-25.html

  
#### Possible fixes ####

  * igt@fbdev@write:
    - {shard-rkl}:        [SKIP][74] ([i915#2582]) -> [PASS][75]
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-2/igt@fbdev@write.html
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@fbdev@write.html

  * igt@gem_ctx_isolation@preservation-s3@vcs0:
    - shard-kbl:          [DMESG-WARN][76] ([i915#180]) -> [PASS][77] +3 similar issues
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-kbl6/igt@gem_ctx_isolation@preservation-s3@vcs0.html
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@gem_ctx_isolation@preservation-s3@vcs0.html

  * igt@gem_ctx_isolation@preservation-s3@vecs0:
    - shard-skl:          [INCOMPLETE][78] ([i915#198]) -> [PASS][79]
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-skl10/igt@gem_ctx_isolation@preservation-s3@vecs0.html
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl1/igt@gem_ctx_isolation@preservation-s3@vecs0.html

  * igt@gem_ctx_persistence@many-contexts:
    - {shard-rkl}:        [FAIL][80] ([i915#2410]) -> [PASS][81]
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-1/igt@gem_ctx_persistence@many-contexts.html
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-5/igt@gem_ctx_persistence@many-contexts.html

  * igt@gem_eio@in-flight-contexts-immediate:
    - {shard-rkl}:        [TIMEOUT][82] ([i915#3063]) -> [PASS][83]
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-6/igt@gem_eio@in-flight-contexts-immediate.html
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-1/igt@gem_eio@in-flight-contexts-immediate.html

  * igt@gem_eio@unwedge-stress:
    - shard-tglb:         [TIMEOUT][84] ([i915#2369] / [i915#3063] / [i915#3648]) -> [PASS][85]
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-tglb7/igt@gem_eio@unwedge-stress.html
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-tglb3/igt@gem_eio@unwedge-stress.html
    - shard-iclb:         [TIMEOUT][86] ([i915#2369] / [i915#2481] / [i915#3070]) -> [PASS][87]
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-iclb8/igt@gem_eio@unwedge-stress.html
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-iclb4/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_endless@dispatch@vecs0:
    - {shard-rkl}:        [INCOMPLETE][88] -> [PASS][89]
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-2/igt@gem_exec_endless@dispatch@vecs0.html
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-2/igt@gem_exec_endless@dispatch@vecs0.html

  * igt@gem_exec_fair@basic-deadline:
    - shard-glk:          [FAIL][90] ([i915#2846]) -> [PASS][91]
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-glk7/igt@gem_exec_fair@basic-deadline.html
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk3/igt@gem_exec_fair@basic-deadline.html

  * igt@gem_exec_fair@basic-none-solo@rcs0:
    - {shard-rkl}:        [FAIL][92] ([i915#2842]) -> [PASS][93]
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@gem_exec_fair@basic-none-solo@rcs0.html
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-1/igt@gem_exec_fair@basic-none-solo@rcs0.html

  * igt@gem_exec_fair@basic-none@rcs0:
    - shard-glk:          [FAIL][94] ([i915#2842]) -> [PASS][95] +1 similar issue
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-glk1/igt@gem_exec_fair@basic-none@rcs0.html
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk7/igt@gem_exec_fair@basic-none@rcs0.html

  * igt@gem_exec_fair@basic-pace@rcs0:
    - shard-kbl:          [FAIL][96] ([i915#2842]) -> [PASS][97] +1 similar issue
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-kbl4/igt@gem_exec_fair@basic-pace@rcs0.html
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-kbl2/igt@gem_exec_fair@basic-pace@rcs0.html

  * igt@gem_exec_fair@basic-throttle@rcs0:
    - shard-iclb:         [FAIL][98] ([i915#2849]) -> [PASS][99]
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-iclb4/igt@gem_exec_fair@basic-throttle@rcs0.html
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-iclb3/igt@gem_exec_fair@basic-throttle@rcs0.html

  * igt@gem_huc_copy@huc-copy:
    - shard-tglb:         [SKIP][100] ([i915#2190]) -> [PASS][101]
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-tglb6/igt@gem_huc_copy@huc-copy.html
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-tglb1/igt@gem_huc_copy@huc-copy.html

  * igt@gem_mmap_gtt@cpuset-big-copy-xy:
    - shard-iclb:         [FAIL][102] ([i915#2428]) -> [PASS][103] +1 similar issue
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-iclb8/igt@gem_mmap_gtt@cpuset-big-copy-xy.html
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-iclb2/igt@gem_mmap_gtt@cpuset-big-copy-xy.html

  * igt@gen9_exec_parse@allowed-all:
    - shard-glk:          [DMESG-WARN][104] ([i915#1436] / [i915#716]) -> [PASS][105]
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-glk3/igt@gen9_exec_parse@allowed-all.html
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk3/igt@gen9_exec_parse@allowed-all.html

  * igt@i915_pm_backlight@fade:
    - {shard-rkl}:        [SKIP][106] ([i915#3012]) -> [PASS][107]
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@i915_pm_backlight@fade.html
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@i915_pm_backlight@fade.html

  * igt@i915_pm_rpm@gem-evict-pwrite:
    - {shard-rkl}:        [SKIP][108] ([i915#3844] / [i915#579]) -> [PASS][109] +2 similar issues
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-6/igt@i915_pm_rpm@gem-evict-pwrite.html
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-2/igt@i915_pm_rpm@gem-evict-pwrite.html

  * igt@kms_async_flips@alternate-sync-async-flip:
    - shard-skl:          [FAIL][110] ([i915#2521]) -> [PASS][111]
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-skl9/igt@kms_async_flips@alternate-sync-async-flip.html
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl8/igt@kms_async_flips@alternate-sync-async-flip.html

  * igt@kms_big_fb@linear-32bpp-rotate-0:
    - shard-glk:          [DMESG-WARN][112] ([i915#118] / [i915#95]) -> [PASS][113]
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-glk3/igt@kms_big_fb@linear-32bpp-rotate-0.html
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk3/igt@kms_big_fb@linear-32bpp-rotate-0.html

  * igt@kms_big_fb@linear-8bpp-rotate-180:
    - {shard-rkl}:        [SKIP][114] ([i915#3638]) -> [PASS][115] +3 similar issues
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@kms_big_fb@linear-8bpp-rotate-180.html
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@kms_big_fb@linear-8bpp-rotate-180.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-0:
    - {shard-rkl}:        [SKIP][116] ([i915#3721]) -> [PASS][117] +3 similar issues
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-0.html
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-0.html

  * igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_gen12_rc_ccs_cc:
    - {shard-rkl}:        [FAIL][118] ([i915#3678]) -> [PASS][119] +9 similar issues
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_gen12_rc_ccs_cc.html
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_color@pipe-b-ctm-0-5:
    - {shard-rkl}:        [SKIP][120] ([i915#1149] / [i915#1849]) -> [PASS][121] +3 similar issues
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@kms_color@pipe-b-ctm-0-5.html
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@kms_color@pipe-b-ctm-0-5.html

  * igt@kms_cursor_crc@pipe-a-cursor-256x256-rapid-movement:
    - {shard-rkl}:        [SKIP][122] ([fdo#112022]) -> [PASS][123] +12 similar issues
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@kms_cursor_crc@pipe-a-cursor-256x256-rapid-movement.html
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@kms_cursor_crc@pipe-a-cursor-256x256-rapid-movement.html

  * igt@kms_cursor_legacy@flip-vs-cursor-crc-atomic:
    - {shard-rkl}:        [SKIP][124] ([fdo#111825]) -> [PASS][125] +3 similar issues
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@kms_cursor_legacy@flip-vs-cursor-crc-atomic.html
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@kms_cursor_legacy@flip-vs-cursor-crc-atomic.html

  * igt@kms_draw_crc@draw-method-xrgb2101010-mmap-gtt-xtiled:
    - {shard-rkl}:        [SKIP][126] ([fdo#111314]) -> [PASS][127] +6 similar issues
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@kms_draw_crc@draw-method-xrgb2101010-mmap-gtt-xtiled.html
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@kms_draw_crc@draw-method-xrgb2101010-mmap-gtt-xtiled.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@a-edp1:
    - shard-skl:          [FAIL][128] ([i915#79]) -> [PASS][129]
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-skl4/igt@kms_flip@flip-vs-expired-vblank-interruptible@a-edp1.html
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl7/igt@kms_flip@flip-vs-expired-vblank-interruptible@a-edp1.html

  * igt@kms_flip@flip-vs-expired-vblank@c-hdmi-a2:
    - shard-glk:          [FAIL][130] ([i915#79]) -> [PASS][131]
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-glk9/igt@kms_flip@flip-vs-expired-vblank@c-hdmi-a2.html
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-glk9/igt@kms_flip@flip-vs-expired-vblank@c-hdmi-a2.html

  * igt@kms_flip@flip-vs-suspend@a-dp1:
    - shard-apl:          [DMESG-WARN][132] ([i915#180]) -> [PASS][133] +2 similar issues
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-apl8/igt@kms_flip@flip-vs-suspend@a-dp1.html
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl7/igt@kms_flip@flip-vs-suspend@a-dp1.html

  * igt@kms_flip_event_leak:
    - shard-apl:          [DMESG-WARN][134] ([i915#1982]) -> [PASS][135]
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-apl8/igt@kms_flip_event_leak.html
   [135]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-apl7/igt@kms_flip_event_leak.html

  * igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-render:
    - {shard-rkl}:        [SKIP][136] ([i915#1849]) -> [PASS][137] +40 similar issues
   [136]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-render.html
   [137]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-render.html

  * igt@kms_hdr@static-toggle-dpms:
    - {shard-rkl}:        [SKIP][138] ([i915#1845]) -> [PASS][139] +24 similar issues
   [138]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-rkl-5/igt@kms_hdr@static-toggle-dpms.html
   [139]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-rkl-6/igt@kms_hdr@static-toggle-dpms.html

  * igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes:
    - shard-skl:          [INCOMPLETE][140] ([i915#146] / [i915#198]) -> [PASS][141] +1 similar issue
   [140]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-skl3/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes.html
   [141]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl7/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes.html

  * igt@kms_plane_alpha_blend@pipe-b-coverage-7efc:
    - shard-skl:          [FAIL][142] ([fdo#108145] / [i915#265]) -> [PASS][143] +1 similar issue
   [142]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-skl6/igt@kms_plane_alpha_blend@pipe-b-coverage-7efc.html
   [143]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/shard-skl4/igt@kms_plane_alpha_blend@pipe-b-coverage-7efc.html

  * igt@kms_psr@psr2_sprite_plane_move:
    - shard-iclb:         [SKIP][144] ([fdo#109441]) -> [PASS][145] +1 similar issue
   [144]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10458/shard-iclb8/igt@kms_psr@psr2_sprite

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20785/index.html

[-- Attachment #2: Type: text/html, Size: 33384 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] drm/i915/guc: Fix several issues related to resets / request cancelation
  2021-08-08 18:07   ` [Intel-gfx] " Matthew Brost
@ 2021-08-09 13:35     ` Daniel Vetter
  -1 siblings, 0 replies; 23+ messages in thread
From: Daniel Vetter @ 2021-08-09 13:35 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Sun, Aug 08, 2021 at 11:07:55AM -0700, Matthew Brost wrote:
> Resets are notoriously hard to get fully working and notoriously racy,
> especially with selftests / IGTs that do all sorts of wild things that
> would be near impossible to hit during normal use cases. Even though
> likely impossible to hit, anything selftests / IGTs uncover needs to be
> fixed. This patch addresses 7 such issues.
> 
> 1. A small race that could result in incorrect accounting of the number
> of outstanding G2H. Basically prior to this patch we did not increment
> the number of outstanding G2H if we encountered a GT reset while sending
> a H2G. This was incorrect as the context state had already been updated
> to anticipate a G2H response thus the counter should be incremented.
> 
> 2. When unwinding requests on a reset context, if other requests in the
> context are in the priority list the requests could be resubmitted out
> of seqno order. Traverse the list of active requests in reverse and
> append to the head of the priority list to fix this.
> 
> 3. Don't drop ce->guc_active.lock when unwinding a context after reset.
> At one point we had to drop this because of a lock inversion but that is
> no longer the case. It is much safer to hold the lock so let's do that.
> 
> 4. Prior to this patch the blocked context counter was cleared on
> init_sched_state (used during registering a context & resets) which is
> incorrect. This state needs to be persistent or the counter can read the
> incorrect value.
> 
> 5. Flush the work queue for GuC generated G2H messages during a GT reset.
> 
> 6. Do not clear enable during a context reset if a schedule enable is in
> flight.
> 
> 7. When unblocking a context, do not enable scheduling if the context is
> banned.

I think each of the above should be a separate patch. I think it would
also be good if each fix references the commits that introduced/changed
something.

Most of this stuff is extremely hard to get right, and unfortunately our
current code is way too fond of lockless trickery (which really isn't a
great idea in the reset code). We need to apply as much care as possible
here.

Also expect me to ask a lot of annoying questions about all the atomic_t
you touch :-)
-Daniel


> 
> Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++-------
>  1 file changed, 27 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 87d8dc8f51b9..cd8df078ca87 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
>  {
>  	/* Only should be called from guc_lrc_desc_pin() */
>  	atomic_set(&ce->guc_sched_state_no_lock, 0);
> -	ce->guc_state.sched_state = 0;
> +	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
>  }
>  
>  static inline bool
> @@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
>  {
>  	int err;
>  
> -	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> -
> -	if (!err && g2h_len_dw)
> +	if (g2h_len_dw)
>  		atomic_inc(&guc->outstanding_submission_g2h);
>  
> +	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> +	if (err == -EBUSY && g2h_len_dw)
> +		atomic_dec(&guc->outstanding_submission_g2h);
> +
>  	return err;
>  }
>  
> @@ -725,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
>  		} while (!list_empty(&guc->ct.requests.incoming));
>  	}
> +
> +	/* Flush any GuC generated G2H */
> +	while (!list_empty(&guc->ct.requests.incoming))
> +		msleep(1);
> +
>  	scrub_guc_desc_for_outstanding_g2h(guc);
>  }
>  
> @@ -797,14 +804,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
>  
>  	spin_lock_irqsave(&sched_engine->lock, flags);
>  	spin_lock(&ce->guc_active.lock);
> -	list_for_each_entry_safe(rq, rn,
> -				 &ce->guc_active.requests,
> -				 sched.link) {
> +	list_for_each_entry_safe_reverse(rq, rn,
> +					 &ce->guc_active.requests,
> +					 sched.link) {
>  		if (i915_request_completed(rq))
>  			continue;
>  
>  		list_del_init(&rq->sched.link);
> -		spin_unlock(&ce->guc_active.lock);
>  
>  		__i915_request_unsubmit(rq);
>  
> @@ -816,10 +822,8 @@ __unwind_incomplete_requests(struct intel_context *ce)
>  		}
>  		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
>  
> -		list_add_tail(&rq->sched.link, pl);
> +		list_add(&rq->sched.link, pl);
>  		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> -
> -		spin_lock(&ce->guc_active.lock);
>  	}
>  	spin_unlock(&ce->guc_active.lock);
>  	spin_unlock_irqrestore(&sched_engine->lock, flags);
> @@ -828,17 +832,23 @@ __unwind_incomplete_requests(struct intel_context *ce)
>  static void __guc_reset_context(struct intel_context *ce, bool stalled)
>  {
>  	struct i915_request *rq;
> +	unsigned long flags;
>  	u32 head;
>  
>  	intel_context_get(ce);
>  
>  	/*
> -	 * GuC will implicitly mark the context as non-schedulable
> -	 * when it sends the reset notification. Make sure our state
> -	 * reflects this change. The context will be marked enabled
> -	 * on resubmission.
> +	 * GuC will implicitly mark the context as non-schedulable when it sends
> +	 * the reset notification. Make sure our state reflects this change. The
> +	 * context will be marked enabled on resubmission. A small window exists
> +	 * where the context could be block & unblocked (scheduling enable) while
> +	 * this reset was inflight. If a scheduling enable is already is in
> +	 * flight do not clear the enable.
>  	 */
> -	clr_context_enabled(ce);
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	if (!context_pending_enable(ce))
> +		clr_context_enabled(ce);
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  
>  	rq = intel_context_find_active_request(ce);
>  	if (!rq) {
> @@ -1562,6 +1572,7 @@ static void guc_context_unblock(struct intel_context *ce)
>  	spin_lock_irqsave(&ce->guc_state.lock, flags);
>  
>  	if (unlikely(submission_disabled(guc) ||
> +		     intel_context_is_banned(ce) ||
>  		     !intel_context_is_pinned(ce) ||
>  		     context_pending_disable(ce) ||
>  		     context_blocked(ce) > 1)) {
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Intel-gfx] [PATCH 2/3] drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  2021-08-08 18:07   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-08-09 13:38   ` Daniel Vetter
  2021-08-09 19:37     ` Matthew Brost
  -1 siblings, 1 reply; 23+ messages in thread
From: Daniel Vetter @ 2021-08-09 13:38 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Sun, Aug 08, 2021 at 11:07:56AM -0700, Matthew Brost wrote:
> GuC submission has exposed an existing memory corruption in
> live_lrc_isolation. We believe that some writes to the watchdog offsets
> in the LRC (0x178 & 0x17c) can result in trashing of portions of the
> address space. With GuC submission there are additional objects which
> can move the context redzone into the space that is trashed. To
> work around this, avoid poisoning the watchdog.

A Bspec reference here would be good (we can quote anything that's marked
for public release, so doesn't have one of the IP markers).

Also I think the above should be replicated in condensed form instead of
the XXX comment.
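
E.g. something like this inside safe_offset(), instead of the bare XXX
(wording is mine, the content is just the commit message above condensed):

	/*
	 * Writes to the watchdog offsets in the LRC (0x178 & 0x17c) can
	 * trash portions of the address space; with GuC submission the
	 * context redzone can land in that range. Skip poisoning the
	 * watchdog until this is root caused.
	 */
	if (offset == 0x178 || offset == 0x17c)
		reg = 0;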

With those: Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> since I
definitely don't have enough clue here for a detailed review.
-Daniel

> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> index b0977a3b699b..6500e9fce8a0 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> @@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
>  	goto err_after;
>  }
>  
> +static u32 safe_offset(u32 offset, u32 reg)
> +{
> +	/* XXX skip testing of watchdog */
> +	if (offset == 0x178 || offset == 0x17c)
> +		reg = 0;
> +
> +	return reg;
> +}
> +
> +static int get_offset_mask(struct intel_engine_cs *engine)
> +{
> +	if (GRAPHICS_VER(engine->i915) < 12)
> +		return 0xfff;
> +
> +	switch (engine->class) {
> +	default:
> +	case RENDER_CLASS:
> +		return 0x07ff;
> +	case COPY_ENGINE_CLASS:
> +		return 0x0fff;
> +	case VIDEO_DECODE_CLASS:
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		return 0x3fff;
> +	}
> +}
> +
>  static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
>  {
>  	struct i915_vma *batch;
> @@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
>  		len = (len + 1) / 2;
>  		*cs++ = MI_LOAD_REGISTER_IMM(len);
>  		while (len--) {
> -			*cs++ = hw[dw];
> +			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
> +					    hw[dw]);
>  			*cs++ = poison;
>  			dw += 2;
>  		}
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Intel-gfx] [PATCH 3/3] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  2021-08-08 18:07   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-08-09 14:03   ` Daniel Vetter
  2021-08-09 19:41     ` Matthew Brost
  -1 siblings, 1 reply; 23+ messages in thread
From: Daniel Vetter @ 2021-08-09 14:03 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Sun, Aug 08, 2021 at 11:07:57AM -0700, Matthew Brost wrote:
> While debugging an issue with full GT resets I went down a rabbit hole
> thinking the scrubbing of lost G2H wasn't working correctly. This proved
> to be incorrect as this was working just fine but this chase inspired me
> to write a selftest to prove that this works. This simple selftest
> injects errors dropping various G2H and then issues a full GT reset
> proving that the scrubbing of these G2H doesn't blow up.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  18 +++
>  drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
>  .../drm/i915/selftests/i915_live_selftests.h  |   1 +
>  .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
>  .../i915/selftests/intel_scheduler_helpers.h  |   2 +
>  6 files changed, 163 insertions(+)
>  create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index e54351a170e2..fec5ff7ef168 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -198,6 +198,10 @@ struct intel_context {
>  	 */
>  	u8 guc_prio;
>  	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> +

I know the existing stuff isn't following this at all, but for anything
new we really should put some kerneldoc into structures. This probably
means you need to open-code the #ifdef here, since this macro will likely
upset kerneldoc parsing.
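
Roughly this shape, perhaps (just a sketch; the member descriptions are
my guesses from how the selftest uses these flags):

#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
	/**
	 * @drop_schedule_enable: selftest hook, drop the next schedule
	 * enable G2H to exercise scrubbing on reset
	 */
	bool drop_schedule_enable;

	/** @drop_schedule_disable: as above, for the schedule disable G2H */
	bool drop_schedule_disable;

	/** @drop_deregister: as above, for the deregister done G2H */
	bool drop_deregister;
#endif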

> +	I915_SELFTEST_DECLARE(bool drop_schedule_enable);
> +	I915_SELFTEST_DECLARE(bool drop_schedule_disable);
> +	I915_SELFTEST_DECLARE(bool drop_deregister);
>  };
>  
>  #endif /* __INTEL_CONTEXT_TYPES__ */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index cd8df078ca87..d13dc56bae43 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2618,6 +2618,11 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
>  
>  	trace_intel_context_deregister_done(ce);
>  
> +	if (I915_SELFTEST_ONLY(ce->drop_deregister)) {
> +		I915_SELFTEST_DECLARE(ce->drop_deregister = false;)

This macro wrapping is quite nasty, can't we just #ifdef this? Especially
the _DECLARE name really doesn't expect a statement.
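
I.e. something like this, purely as a sketch (same field name as in the
hunk above):

#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
	if (ce->drop_deregister) {
		ce->drop_deregister = false;
		return 0;
	}
#endif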

Aside from these bikesheds I don't have much to say on the test logic
itself, since I'm far from knowledgeable on GuC stuff ...
-Daniel


> +		return 0;
> +	}
> +
>  	if (context_wait_for_deregister_to_register(ce)) {
>  		struct intel_runtime_pm *runtime_pm =
>  			&ce->engine->gt->i915->runtime_pm;
> @@ -2672,10 +2677,19 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
>  	trace_intel_context_sched_done(ce);
>  
>  	if (context_pending_enable(ce)) {
> +		if (I915_SELFTEST_ONLY(ce->drop_schedule_enable)) {
> +			I915_SELFTEST_DECLARE(ce->drop_schedule_enable = false;)
> +			return 0;
> +		}
>  		clr_context_pending_enable(ce);
>  	} else if (context_pending_disable(ce)) {
>  		bool banned;
>  
> +		if (I915_SELFTEST_ONLY(ce->drop_schedule_disable)) {
> +			I915_SELFTEST_DECLARE(ce->drop_schedule_disable = false;)
> +			return 0;
> +		}
> +
>  		/*
>  		 * Unpin must be done before __guc_signal_context_fence,
>  		 * otherwise a race exists between the requests getting
> @@ -3047,3 +3061,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
>  
>  	return false;
>  }
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +#include "selftest_guc.c"
> +#endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> new file mode 100644
> index 000000000000..46ca6554f65d
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> @@ -0,0 +1,126 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021 Intel Corporation
> + */
> +
> +#include "selftests/intel_scheduler_helpers.h"
> +
> +static struct i915_request *nop_user_request(struct intel_context *ce,
> +					     struct i915_request *from)
> +{
> +	struct i915_request *rq;
> +	int ret;
> +
> +	rq = intel_context_create_request(ce);
> +	if (IS_ERR(rq))
> +		return rq;
> +
> +	if (from) {
> +		ret = i915_sw_fence_await_dma_fence(&rq->submit,
> +						    &from->fence, 0,
> +						    I915_FENCE_GFP);
> +		if (ret < 0) {
> +			i915_request_put(rq);
> +			return ERR_PTR(ret);
> +		}
> +	}
> +
> +	i915_request_get(rq);
> +	i915_request_add(rq);
> +
> +	return rq;
> +}
> +
> +static int intel_guc_scrub_ctbs(void *arg)
> +{
> +	struct intel_gt *gt = arg;
> +	int ret = 0;
> +	int i;
> +	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
> +	intel_wakeref_t wakeref;
> +	struct intel_engine_cs *engine;
> +	struct intel_context *ce;
> +
> +	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
> +	engine = intel_selftest_find_any_engine(gt);
> +
> +	/* Submit requests and inject errors forcing G2H to be dropped */
> +	for (i = 0; i < 3; ++i) {
> +		ce = intel_context_create(engine);
> +		if (IS_ERR(ce)) {
> +			ret = PTR_ERR(ce);
> +			pr_err("Failed to create context, %d: %d\n", i, ret);
> +			goto err;
> +		}
> +
> +		switch(i) {
> +		case 0:
> +			ce->drop_schedule_enable = true;
> +			break;
> +		case 1:
> +			ce->drop_schedule_disable = true;
> +			break;
> +		case 2:
> +			ce->drop_deregister = true;
> +			break;
> +		}
> +
> +		rq = nop_user_request(ce, NULL);
> +		intel_context_put(ce);
> +
> +		if (IS_ERR(rq)) {
> +			ret = PTR_ERR(rq);
> +			pr_err("Failed to create request, %d: %d\n", i, ret);
> +			goto err;
> +		}
> +
> +		last[i] = rq;
> +	}
> +
> +	for (i = 0; i < 3; ++i) {
> +		ret = i915_request_wait(last[i], 0, HZ);
> +		if (ret < 0) {
> +			pr_err("Last request failed to complete: %d\n", ret);
> +			goto err;
> +		}
> +		i915_request_put(last[i]);
> +		last[i] = NULL;
> +	}
> +
> +	/* Force all H2G / G2H to be submitted / processed */
> +	intel_gt_retire_requests(gt);
> +	msleep(500);
> +
> +	/* Scrub missing G2H */
> +	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
> +
> +	ret = intel_gt_wait_for_idle(gt, HZ);
> +	if (ret < 0) {
> +		pr_err("GT failed to idle: %d\n", ret);
> +		goto err;
> +	}
> +
> +err:
> +	for (i = 0; i < 3; ++i)
> +		if (last[i])
> +			i915_request_put(last[i]);
> +	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
> +
> +	return ret;
> +}
> +
> +int intel_guc_live_selftests(struct drm_i915_private *i915)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(intel_guc_scrub_ctbs),
> +	};
> +	struct intel_gt *gt = &i915->gt;
> +
> +	if (intel_gt_is_wedged(gt))
> +		return 0;
> +
> +	if (!intel_uc_uses_guc_submission(&gt->uc))
> +		return 0;
> +
> +	return intel_gt_live_subtests(tests, gt);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> index cfa5c4165a4f..3cf6758931f9 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> @@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
>  selftest(ring_submission, intel_ring_submission_live_selftests)
>  selftest(perf, i915_perf_live_selftests)
>  selftest(slpc, intel_slpc_live_selftests)
> +selftest(guc, intel_guc_live_selftests)
>  /* Here be dragons: keep last to run last! */
>  selftest(late_gt_pm, intel_gt_pm_late_selftests)
> diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> index 4b328346b48a..310fb83c527e 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> @@ -14,6 +14,18 @@
>  #define REDUCED_PREEMPT		10
>  #define WAIT_FOR_RESET_TIME	10000
>  
> +struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
> +{
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +
> +	for_each_engine(engine, gt, id)
> +		return engine;
> +
> +	pr_err("No valid engine found!\n");
> +	return NULL;
> +}
> +
>  int intel_selftest_modify_policy(struct intel_engine_cs *engine,
>  				 struct intel_selftest_saved_policy *saved,
>  				 u32 modify_type)
> diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> index 35c098601ac0..ae60bb507f45 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> +++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> @@ -10,6 +10,7 @@
>  
>  struct i915_request;
>  struct intel_engine_cs;
> +struct intel_gt;
>  
>  struct intel_selftest_saved_policy {
>  	u32 flags;
> @@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
>  	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
>  };
>  
> +struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
>  int intel_selftest_modify_policy(struct intel_engine_cs *engine,
>  				 struct intel_selftest_saved_policy *saved,
>  				 enum selftest_scheduler_modify modify_type);
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] drm/i915/guc: Fix several issues related to resets / request cancelation
  2021-08-09 13:35     ` [Intel-gfx] " Daniel Vetter
@ 2021-08-09 19:35       ` Matthew Brost
  -1 siblings, 0 replies; 23+ messages in thread
From: Matthew Brost @ 2021-08-09 19:35 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 03:35:26PM +0200, Daniel Vetter wrote:
> On Sun, Aug 08, 2021 at 11:07:55AM -0700, Matthew Brost wrote:
> > Resets are notoriously hard to get fully working and notoriously racy,
> > especially with selftests / IGTs that do all sorts of wild things that
> > would be near impossible to hit during normal use cases. Even though
> > likely impossible to hit, anything selftests / IGTs uncover needs to be
> > fixed. This patch addresses 7 such issues.
> > 
> > 1. A small race that could result in incorrect accounting of the number
> > of outstanding G2H. Basically prior to this patch we did not increment
> > the number of outstanding G2H if we encountered a GT reset while sending
> > a H2G. This was incorrect as the context state had already been updated
> > to anticipate a G2H response thus the counter should be incremented.
> > 
> > 2. When unwinding requests on a reset context, if other requests in the
> > context are in the priority list the requests could be resubmitted out
> > of seqno order. Traverse the list of active requests in reverse and
> > append to the head of the priority list to fix this.
> > 
> > 3. Don't drop ce->guc_active.lock when unwinding a context after reset.
> > At one point we had to drop this because of a lock inversion but that is
> > no longer the case. It is much safer to hold the lock so let's do that.
> > 
> > 4. Prior to this patch the blocked context counter was cleared on
> > init_sched_state (used during registering a context & resets) which is
> > incorrect. This state needs to be persistent or the counter can read the
> > incorrect value.
> > 
> > 5. Flush the work queue for GuC generated G2H messages during a GT reset.
> > 
> > 6. Do not clear enable during a context reset if a schedule enable is in
> > flight.
> > 
> > 7. When unblocking a context, do not enable scheduling if the context is
> > banned.
> 
> I think each of the above should be a separate patch. I think it would
> also be good if each fix references the commits that introduced/changed
> something.
>

Sure, I was just trying to cheat and make our lives easier with fewer
patches to backport into DII.
 
> Most of this stuff is extremely hard to get right, and unfortunately our
> current code is way too fond of lockless trickery (which really isn't a
> great idea in the reset code). We need to apply as much care as possible
> here.
>

Yep, resets are hard. They are hard because something like ten other
async things (e.g. a new submission, registering a context, banning a
context, canceling a request, processing a G2H, trying to idle the GPU,
unpinning a context) can all be happening at the same time. Hopefully
when we move to the DRM scheduler we can remove some of these async
operations; perma-pinned contexts would also help. We have a story for
that plus a story to simplify the locking.

> Also expect me to ask a lot of annoying questions about all the atomic_t
> you touch :-)

Looking forward to it.

Matt

> -Daniel
> 
> 
> > 
> > Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++-------
> >  1 file changed, 27 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 87d8dc8f51b9..cd8df078ca87 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
> >  {
> >  	/* Only should be called from guc_lrc_desc_pin() */
> >  	atomic_set(&ce->guc_sched_state_no_lock, 0);
> > -	ce->guc_state.sched_state = 0;
> > +	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
> >  }
> >  
> >  static inline bool
> > @@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
> >  {
> >  	int err;
> >  
> > -	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > -
> > -	if (!err && g2h_len_dw)
> > +	if (g2h_len_dw)
> >  		atomic_inc(&guc->outstanding_submission_g2h);
> >  
> > +	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > +	if (err == -EBUSY && g2h_len_dw)
> > +		atomic_dec(&guc->outstanding_submission_g2h);
> > +
> >  	return err;
> >  }
> >  
> > @@ -725,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> >  		} while (!list_empty(&guc->ct.requests.incoming));
> >  	}
> > +
> > +	/* Flush any GuC generated G2H */
> > +	while (!list_empty(&guc->ct.requests.incoming))
> > +		msleep(1);
> > +
> >  	scrub_guc_desc_for_outstanding_g2h(guc);
> >  }
> >  
> > @@ -797,14 +804,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >  
> >  	spin_lock_irqsave(&sched_engine->lock, flags);
> >  	spin_lock(&ce->guc_active.lock);
> > -	list_for_each_entry_safe(rq, rn,
> > -				 &ce->guc_active.requests,
> > -				 sched.link) {
> > +	list_for_each_entry_safe_reverse(rq, rn,
> > +					 &ce->guc_active.requests,
> > +					 sched.link) {
> >  		if (i915_request_completed(rq))
> >  			continue;
> >  
> >  		list_del_init(&rq->sched.link);
> > -		spin_unlock(&ce->guc_active.lock);
> >  
> >  		__i915_request_unsubmit(rq);
> >  
> > @@ -816,10 +822,8 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >  		}
> >  		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
> >  
> > -		list_add_tail(&rq->sched.link, pl);
> > +		list_add(&rq->sched.link, pl);
> >  		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> > -
> > -		spin_lock(&ce->guc_active.lock);
> >  	}
> >  	spin_unlock(&ce->guc_active.lock);
> >  	spin_unlock_irqrestore(&sched_engine->lock, flags);
> > @@ -828,17 +832,23 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >  static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >  {
> >  	struct i915_request *rq;
> > +	unsigned long flags;
> >  	u32 head;
> >  
> >  	intel_context_get(ce);
> >  
> >  	/*
> > -	 * GuC will implicitly mark the context as non-schedulable
> > -	 * when it sends the reset notification. Make sure our state
> > -	 * reflects this change. The context will be marked enabled
> > -	 * on resubmission.
> > +	 * GuC will implicitly mark the context as non-schedulable when it sends
> > +	 * the reset notification. Make sure our state reflects this change. The
> > +	 * context will be marked enabled on resubmission. A small window exists
> > +	 * where the context could be block & unblocked (scheduling enable) while
> > +	 * this reset was inflight. If a scheduling enable is already is in
> > +	 * flight do not clear the enable.
> >  	 */
> > -	clr_context_enabled(ce);
> > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +	if (!context_pending_enable(ce))
> > +		clr_context_enabled(ce);
> > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >  
> >  	rq = intel_context_find_active_request(ce);
> >  	if (!rq) {
> > @@ -1562,6 +1572,7 @@ static void guc_context_unblock(struct intel_context *ce)
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  
> >  	if (unlikely(submission_disabled(guc) ||
> > +		     intel_context_is_banned(ce) ||
> >  		     !intel_context_is_pinned(ce) ||
> >  		     context_pending_disable(ce) ||
> >  		     context_blocked(ce) > 1)) {
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915/guc: Fix several issues related to resets / request cancelation
@ 2021-08-09 19:35       ` Matthew Brost
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew Brost @ 2021-08-09 19:35 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 03:35:26PM +0200, Daniel Vetter wrote:
> On Sun, Aug 08, 2021 at 11:07:55AM -0700, Matthew Brost wrote:
> > Resets are notoriously hard to get fully working and notoriously racey,
> > especially with selftests / IGTs that do all sorts of wild things that
> > would be near impossible to hit during normal use cases. Even though
> > likely impossible to hit, anything selftests / IGTs uncover needs to be
> > fixed. This patch addresses 7 such issues.
> > 
> > 1. A small race that could result in incorrect accounting of the number
> > of outstanding G2H. Basically prior to this patch we did not increment
> > the number of outstanding G2H if we encoutered a GT reset while sending
> > a H2G. This was incorrect as the context state had already been updated
> > to anticipate a G2H response thus the counter should be incremented.
> > 
> > 2. When unwinding requests on a reset context, if other requests in the
> > context are in the priority list the requests could be resubmitted out
> > of seqno order. Traverse the list of active requests in reserve and
> > append to the head of the priority list to fix this.
> > 
> > 3. Don't drop ce->guc_active.lock when unwinding a context after reset.
> > At one point we had to drop this because of a lock inversion but that is
> > no longer the case. It is much safer to hold the lock so let's do that.
> > 
> > 4. Prior to this patch the blocked context counter was cleared on
> > init_sched_state (used during registering a context & resets) which is
> > incorrect. This state needs to be persistent or the counter can read the
> > incorrect value.
> > 
> > 5. Flush the work queue for GuC generated G2H messages during a GT reset.
> > 
> > 6. Do not clear enable during a context reset if a schedule enable is in
> > flight.
> > 
> > 7. When unblocking a context, do not enable scheduling if the context is
> > banned.
> 
> I think each of the above should be a separate patch. I think it would
> also be good if each fix references the commits that introduced/changed
> something.
>

Sure, just was trying to cheat and make our lives easier with less
patches to backport into DII.
 
> Most of this stuff is extremely hard to get right, and unfortunately our
> current code is way too fond of lockless trickery (which really isn't a
> great idea in the reset code). We need to apply as much care as possible
> here.
>

Yep, resets are hard. It is hard because like ten other async things
(e.g. a new submission, registering a context, banning a context,
canceling a request, processing a G2H, trying to idle the GPU, unpinning
a context) can all be happening at the same time. Hopefully when we move
the DRM scheduler we can remove some of these async operations,
perma-pinned contexts would also help too. Have a story for that + a
story to simplify the locking.

> Also expect me to ask a lot of annoying questions about all the atomic_t
> you touch :-)

Looking forward to it.

Matt

> -Daniel
> 
> 
> > 
> > Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++-------
> >  1 file changed, 27 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 87d8dc8f51b9..cd8df078ca87 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
> >  {
> >  	/* Only should be called from guc_lrc_desc_pin() */
> >  	atomic_set(&ce->guc_sched_state_no_lock, 0);
> > -	ce->guc_state.sched_state = 0;
> > +	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
> >  }
> >  
> >  static inline bool
> > @@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
> >  {
> >  	int err;
> >  
> > -	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > -
> > -	if (!err && g2h_len_dw)
> > +	if (g2h_len_dw)
> >  		atomic_inc(&guc->outstanding_submission_g2h);
> >  
> > +	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > +	if (err == -EBUSY && g2h_len_dw)
> > +		atomic_dec(&guc->outstanding_submission_g2h);
> > +
> >  	return err;
> >  }
> >  
> > @@ -725,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> >  		} while (!list_empty(&guc->ct.requests.incoming));
> >  	}
> > +
> > +	/* Flush any GuC generated G2H */
> > +	while (!list_empty(&guc->ct.requests.incoming))
> > +		msleep(1);
> > +
> >  	scrub_guc_desc_for_outstanding_g2h(guc);
> >  }
> >  
> > @@ -797,14 +804,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >  
> >  	spin_lock_irqsave(&sched_engine->lock, flags);
> >  	spin_lock(&ce->guc_active.lock);
> > -	list_for_each_entry_safe(rq, rn,
> > -				 &ce->guc_active.requests,
> > -				 sched.link) {
> > +	list_for_each_entry_safe_reverse(rq, rn,
> > +					 &ce->guc_active.requests,
> > +					 sched.link) {
> >  		if (i915_request_completed(rq))
> >  			continue;
> >  
> >  		list_del_init(&rq->sched.link);
> > -		spin_unlock(&ce->guc_active.lock);
> >  
> >  		__i915_request_unsubmit(rq);
> >  
> > @@ -816,10 +822,8 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >  		}
> >  		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
> >  
> > -		list_add_tail(&rq->sched.link, pl);
> > +		list_add(&rq->sched.link, pl);
> >  		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> > -
> > -		spin_lock(&ce->guc_active.lock);
> >  	}
> >  	spin_unlock(&ce->guc_active.lock);
> >  	spin_unlock_irqrestore(&sched_engine->lock, flags);
> > @@ -828,17 +832,23 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >  static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >  {
> >  	struct i915_request *rq;
> > +	unsigned long flags;
> >  	u32 head;
> >  
> >  	intel_context_get(ce);
> >  
> >  	/*
> > -	 * GuC will implicitly mark the context as non-schedulable
> > -	 * when it sends the reset notification. Make sure our state
> > -	 * reflects this change. The context will be marked enabled
> > -	 * on resubmission.
> > +	 * GuC will implicitly mark the context as non-schedulable when it sends
> > +	 * the reset notification. Make sure our state reflects this change. The
> > +	 * context will be marked enabled on resubmission. A small window exists
> > +	 * where the context could be blocked & unblocked (scheduling enable) while
> > +	 * this reset was in flight. If a scheduling enable is already in
> > +	 * flight, do not clear the enable.
> >  	 */
> > -	clr_context_enabled(ce);
> > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +	if (!context_pending_enable(ce))
> > +		clr_context_enabled(ce);
> > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >  
> >  	rq = intel_context_find_active_request(ce);
> >  	if (!rq) {
> > @@ -1562,6 +1572,7 @@ static void guc_context_unblock(struct intel_context *ce)
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  
> >  	if (unlikely(submission_disabled(guc) ||
> > +		     intel_context_is_banned(ce) ||
> >  		     !intel_context_is_pinned(ce) ||
> >  		     context_pending_disable(ce) ||
> >  		     context_blocked(ce) > 1)) {
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Intel-gfx] [PATCH 2/3] drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  2021-08-09 13:38   ` Daniel Vetter
@ 2021-08-09 19:37     ` Matthew Brost
  2021-08-10 10:33       ` Daniel Vetter
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Brost @ 2021-08-09 19:37 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 03:38:38PM +0200, Daniel Vetter wrote:
> On Sun, Aug 08, 2021 at 11:07:56AM -0700, Matthew Brost wrote:
> > GuC submission has exposed an existing memory corruption in
> > live_lrc_isolation. We believe that some writes to the watchdog offsets
> > in the LRC (0x178 & 0x17c) can result in trashing of portions of the
> > address space. With GuC submission there are additional objects which
> > can move the context redzone into the space that is trashed. To
> > work around this, avoid poisoning the watchdog.
> 
> A Bspec reference here would be good (we can quote anything that's marked
> for public release, so doesn't have one of the IP markers).
>

Let me see what I can dig up in the bspec.

BTW - Hopefully we can root cause this soon with a proper fix.
 
> Also I think the above should be replicated in condensed form instead of
> the XXX comment.
>

Sure.
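
Maybe something along these lines (the wording is only a placeholder until
we root cause it):

	/*
	 * XXX: Writes to the watchdog offsets in the LRC (0x178 / 0x17c)
	 * appear to trash portions of the address space; with GuC submission
	 * additional objects can move the context redzone into the trashed
	 * space. Skip poisoning these offsets until the underlying issue is
	 * root caused.
	 */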

Matt

> With those: Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> since I
> definitely don't have enough clue here for a detailed review.
> -Daniel
> 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
> >  1 file changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > index b0977a3b699b..6500e9fce8a0 100644
> > --- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > @@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
> >  	goto err_after;
> >  }
> >  
> > +static u32 safe_offset(u32 offset, u32 reg)
> > +{
> > +	/* XXX skip testing of watchdog */
> > +	if (offset == 0x178 || offset == 0x17c)
> > +		reg = 0;
> > +
> > +	return reg;
> > +}
> > +
> > +static int get_offset_mask(struct intel_engine_cs *engine)
> > +{
> > +	if (GRAPHICS_VER(engine->i915) < 12)
> > +		return 0xfff;
> > +
> > +	switch (engine->class) {
> > +	default:
> > +	case RENDER_CLASS:
> > +		return 0x07ff;
> > +	case COPY_ENGINE_CLASS:
> > +		return 0x0fff;
> > +	case VIDEO_DECODE_CLASS:
> > +	case VIDEO_ENHANCEMENT_CLASS:
> > +		return 0x3fff;
> > +	}
> > +}
> > +
> >  static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
> >  {
> >  	struct i915_vma *batch;
> > @@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
> >  		len = (len + 1) / 2;
> >  		*cs++ = MI_LOAD_REGISTER_IMM(len);
> >  		while (len--) {
> > -			*cs++ = hw[dw];
> > +			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
> > +					    hw[dw]);
> >  			*cs++ = poison;
> >  			dw += 2;
> >  		}
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Intel-gfx] [PATCH 3/3] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  2021-08-09 14:03   ` Daniel Vetter
@ 2021-08-09 19:41     ` Matthew Brost
  2021-08-10 10:36       ` Daniel Vetter
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Brost @ 2021-08-09 19:41 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 04:03:28PM +0200, Daniel Vetter wrote:
> On Sun, Aug 08, 2021 at 11:07:57AM -0700, Matthew Brost wrote:
> > While debugging an issue with full GT resets I went down a rabbit hole
> > thinking the scrubbing of lost G2H wasn't working correctly. That proved
> > to be wrong, as the scrubbing was working just fine, but the chase
> > inspired me to write a selftest to prove that it works. This simple
> > selftest injects errors to drop various G2H messages and then issues a
> > full GT reset, proving that the scrubbing of these G2H doesn't blow up.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  18 +++
> >  drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
> >  .../drm/i915/selftests/i915_live_selftests.h  |   1 +
> >  .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
> >  .../i915/selftests/intel_scheduler_helpers.h  |   2 +
> >  6 files changed, 163 insertions(+)
> >  create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index e54351a170e2..fec5ff7ef168 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -198,6 +198,10 @@ struct intel_context {
> >  	 */
> >  	u8 guc_prio;
> >  	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> > +
> 
> I know the existing stuff isn't following this at all, but for anything
> new we really should put some kerneldoc into structures. This probably
> means you need to open-code the #ifdef here, since this macro will likely
> upset kerneldoc parsing.
> 

Ok, got it.

> > +	I915_SELFTEST_DECLARE(bool drop_schedule_enable);
> > +	I915_SELFTEST_DECLARE(bool drop_schedule_disable);
> > +	I915_SELFTEST_DECLARE(bool drop_deregister);
> >  };
> >  
> >  #endif /* __INTEL_CONTEXT_TYPES__ */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index cd8df078ca87..d13dc56bae43 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -2618,6 +2618,11 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
> >  
> >  	trace_intel_context_deregister_done(ce);
> >  
> > +	if (I915_SELFTEST_ONLY(ce->drop_deregister)) {
> > +		I915_SELFTEST_DECLARE(ce->drop_deregister = false;)
> 
> This macro wrapping is quite nasty, can't we just #ifdef this? Especially
> the _DECLARE name really doesn't expect a statement.
>

I had it like that originally, then remembered these macros; in the past
people have complained when I didn't use them, so yes, pretty much a
bikeshed. I prefer the plain ifdef myself.
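
Roughly what the open-coded version would look like, as a sketch (the
kerneldoc wording below is only placeholder text):

#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
	/**
	 * @drop_schedule_enable: (selftests only) force the next schedule
	 * enable G2H for this context to be dropped
	 */
	bool drop_schedule_enable;

	/** @drop_schedule_disable: as above, for the schedule disable G2H */
	bool drop_schedule_disable;

	/** @drop_deregister: as above, for the deregister done G2H */
	bool drop_deregister;
#endif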

Matt
 
> Aside from these bikesheds I don't have much to say on the test logic
> itself, since I'm far from knowledgeable on GuC stuff ...
> -Daniel
> 
> 
> > +		return 0;
> > +	}
> > +
> >  	if (context_wait_for_deregister_to_register(ce)) {
> >  		struct intel_runtime_pm *runtime_pm =
> >  			&ce->engine->gt->i915->runtime_pm;
> > @@ -2672,10 +2677,19 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
> >  	trace_intel_context_sched_done(ce);
> >  
> >  	if (context_pending_enable(ce)) {
> > +		if (I915_SELFTEST_ONLY(ce->drop_schedule_enable)) {
> > +			I915_SELFTEST_DECLARE(ce->drop_schedule_enable = false;)
> > +			return 0;
> > +		}
> >  		clr_context_pending_enable(ce);
> >  	} else if (context_pending_disable(ce)) {
> >  		bool banned;
> >  
> > +		if (I915_SELFTEST_ONLY(ce->drop_schedule_disable)) {
> > +			I915_SELFTEST_DECLARE(ce->drop_schedule_disable = false;)
> > +			return 0;
> > +		}
> > +
> >  		/*
> >  		 * Unpin must be done before __guc_signal_context_fence,
> >  		 * otherwise a race exists between the requests getting
> > @@ -3047,3 +3061,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
> >  
> >  	return false;
> >  }
> > +
> > +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > +#include "selftest_guc.c"
> > +#endif
> > diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> > new file mode 100644
> > index 000000000000..46ca6554f65d
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> > @@ -0,0 +1,126 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2021 Intel Corporation
> > + */
> > +
> > +#include "selftests/intel_scheduler_helpers.h"
> > +
> > +static struct i915_request *nop_user_request(struct intel_context *ce,
> > +					     struct i915_request *from)
> > +{
> > +	struct i915_request *rq;
> > +	int ret;
> > +
> > +	rq = intel_context_create_request(ce);
> > +	if (IS_ERR(rq))
> > +		return rq;
> > +
> > +	if (from) {
> > +		ret = i915_sw_fence_await_dma_fence(&rq->submit,
> > +						    &from->fence, 0,
> > +						    I915_FENCE_GFP);
> > +		if (ret < 0) {
> > +			i915_request_put(rq);
> > +			return ERR_PTR(ret);
> > +		}
> > +	}
> > +
> > +	i915_request_get(rq);
> > +	i915_request_add(rq);
> > +
> > +	return rq;
> > +}
> > +
> > +static int intel_guc_scrub_ctbs(void *arg)
> > +{
> > +	struct intel_gt *gt = arg;
> > +	int ret = 0;
> > +	int i;
> > +	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
> > +	intel_wakeref_t wakeref;
> > +	struct intel_engine_cs *engine;
> > +	struct intel_context *ce;
> > +
> > +	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
> > +	engine = intel_selftest_find_any_engine(gt);
> > +
> > +	/* Submit requests and inject errors forcing G2H to be dropped */
> > +	for (i = 0; i < 3; ++i) {
> > +		ce = intel_context_create(engine);
> > +		if (IS_ERR(ce)) {
> > +			ret = PTR_ERR(ce);
> > +			pr_err("Failed to create context, %d: %d\n", i, ret);
> > +			goto err;
> > +		}
> > +
> > +		switch(i) {
> > +		case 0:
> > +			ce->drop_schedule_enable = true;
> > +			break;
> > +		case 1:
> > +			ce->drop_schedule_disable = true;
> > +			break;
> > +		case 2:
> > +			ce->drop_deregister = true;
> > +			break;
> > +		}
> > +
> > +		rq = nop_user_request(ce, NULL);
> > +		intel_context_put(ce);
> > +
> > +		if (IS_ERR(rq)) {
> > +			ret = PTR_ERR(rq);
> > +			pr_err("Failed to create request, %d: %d\n", i, ret);
> > +			goto err;
> > +		}
> > +
> > +		last[i] = rq;
> > +	}
> > +
> > +	for (i = 0; i < 3; ++i) {
> > +		ret = i915_request_wait(last[i], 0, HZ);
> > +		if (ret < 0) {
> > +			pr_err("Last request failed to complete: %d\n", ret);
> > +			goto err;
> > +		}
> > +		i915_request_put(last[i]);
> > +		last[i] = NULL;
> > +	}
> > +
> > +	/* Force all H2G / G2H to be submitted / processed */
> > +	intel_gt_retire_requests(gt);
> > +	msleep(500);
> > +
> > +	/* Scrub missing G2H */
> > +	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
> > +
> > +	ret = intel_gt_wait_for_idle(gt, HZ);
> > +	if (ret < 0) {
> > +		pr_err("GT failed to idle: %d\n", ret);
> > +		goto err;
> > +	}
> > +
> > +err:
> > +	for (i = 0; i < 3; ++i)
> > +		if (last[i])
> > +			i915_request_put(last[i]);
> > +	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
> > +
> > +	return ret;
> > +}
> > +
> > +int intel_guc_live_selftests(struct drm_i915_private *i915)
> > +{
> > +	static const struct i915_subtest tests[] = {
> > +		SUBTEST(intel_guc_scrub_ctbs),
> > +	};
> > +	struct intel_gt *gt = &i915->gt;
> > +
> > +	if (intel_gt_is_wedged(gt))
> > +		return 0;
> > +
> > +	if (!intel_uc_uses_guc_submission(&gt->uc))
> > +		return 0;
> > +
> > +	return intel_gt_live_subtests(tests, gt);
> > +}
> > diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> > index cfa5c4165a4f..3cf6758931f9 100644
> > --- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> > +++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> > @@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
> >  selftest(ring_submission, intel_ring_submission_live_selftests)
> >  selftest(perf, i915_perf_live_selftests)
> >  selftest(slpc, intel_slpc_live_selftests)
> > +selftest(guc, intel_guc_live_selftests)
> >  /* Here be dragons: keep last to run last! */
> >  selftest(late_gt_pm, intel_gt_pm_late_selftests)
> > diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> > index 4b328346b48a..310fb83c527e 100644
> > --- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> > +++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> > @@ -14,6 +14,18 @@
> >  #define REDUCED_PREEMPT		10
> >  #define WAIT_FOR_RESET_TIME	10000
> >  
> > +struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
> > +{
> > +	struct intel_engine_cs *engine;
> > +	enum intel_engine_id id;
> > +
> > +	for_each_engine(engine, gt, id)
> > +		return engine;
> > +
> > +	pr_err("No valid engine found!\n");
> > +	return NULL;
> > +}
> > +
> >  int intel_selftest_modify_policy(struct intel_engine_cs *engine,
> >  				 struct intel_selftest_saved_policy *saved,
> >  				 u32 modify_type)
> > diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> > index 35c098601ac0..ae60bb507f45 100644
> > --- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> > +++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> > @@ -10,6 +10,7 @@
> >  
> >  struct i915_request;
> >  struct intel_engine_cs;
> > +struct intel_gt;
> >  
> >  struct intel_selftest_saved_policy {
> >  	u32 flags;
> > @@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
> >  	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
> >  };
> >  
> > +struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
> >  int intel_selftest_modify_policy(struct intel_engine_cs *engine,
> >  				 struct intel_selftest_saved_policy *saved,
> >  				 enum selftest_scheduler_modify modify_type);
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] drm/i915/guc: Fix several issues related to resets / request cancelation
  2021-08-09 19:35       ` [Intel-gfx] " Matthew Brost
@ 2021-08-10 10:32         ` Daniel Vetter
  -1 siblings, 0 replies; 23+ messages in thread
From: Daniel Vetter @ 2021-08-10 10:32 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:35:22PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 03:35:26PM +0200, Daniel Vetter wrote:
> > On Sun, Aug 08, 2021 at 11:07:55AM -0700, Matthew Brost wrote:
> > > Resets are notoriously hard to get fully working and notoriously racey,
> > > especially with selftests / IGTs that do all sorts of wild things that
> > > would be near impossible to hit during normal use cases. Even though
> > > likely impossible to hit, anything selftests / IGTs uncover needs to be
> > > fixed. This patch addresses 7 such issues.
> > > 
> > > 1. A small race that could result in incorrect accounting of the number
> > > of outstanding G2H. Basically prior to this patch we did not increment
> > > the number of outstanding G2H if we encountered a GT reset while sending
> > > a H2G. This was incorrect as the context state had already been updated
> > > to anticipate a G2H response thus the counter should be incremented.
> > > 
> > > 2. When unwinding requests on a reset context, if other requests in the
> > > context are in the priority list the requests could be resubmitted out
> > > of seqno order. Traverse the list of active requests in reverse and
> > > append to the head of the priority list to fix this.
> > > 
> > > 3. Don't drop ce->guc_active.lock when unwinding a context after reset.
> > > At one point we had to drop this because of a lock inversion but that is
> > > no longer the case. It is much safer to hold the lock so let's do that.
> > > 
> > > 4. Prior to this patch the blocked context counter was cleared on
> > > init_sched_state (used during registering a context & resets) which is
> > > incorrect. This state needs to be persistent or the counter can read the
> > > incorrect value.
> > > 
> > > 5. Flush the work queue for GuC generated G2H messages during a GT reset.
> > > 
> > > 6. Do not clear enable during a context reset if a schedule enable is in
> > > flight.
> > > 
> > > 7. When unblocking a context, do not enable scheduling if the context is
> > > banned.
> > 
> > I think each of the above should be a separate patch. I think it would
> > also be good if each fix references the commits that introduced/changed
> > something.
> >
> 
> Sure, I was just trying to cheat and make our lives easier with fewer
> patches to backport into DII.
>  
> > Most of this stuff is extremely hard to get right, and unfortunately our
> > current code is way too fond of lockless trickery (which really isn't a
> > great idea in the reset code). We need to apply as much care as possible
> > here.
> >
> 
> Yep, resets are hard. They are hard because something like ten other async
> things (e.g. a new submission, registering a context, banning a context,
> canceling a request, processing a G2H, trying to idle the GPU, unpinning
> a context) can all be happening at the same time. Hopefully when we move
> to the DRM scheduler we can remove some of these async operations;
> perma-pinned contexts would also help. We have a story for that plus a
> story to simplify the locking.

A bit of an aside, but drm/sched has a pretty solid story around resets,
including what to do if your reset domain escalates (probably more useful
for the execlist backend than GuC) and how it's all synchronized.

I do need to review the barriers for when you permanently wedge an engine,
and the support for that isn't well encapsulated or documented. But it's
there (amdgpu uses that when the reset fails, kinda like we do), and
that's about the only gap I've found thus far around drm/sched reset
support.

So this should all be substantially clarified once we get there.
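
For reference, the usual shape of that flow in a drm/sched driver is
roughly the following. This is only a sketch; the my_gpu naming is made
up and individual drivers differ in the details:

static enum drm_gpu_sched_stat my_timedout_job(struct drm_sched_job *bad)
{
	struct my_gpu *gpu = sched_to_my_gpu(bad->sched);

	/* Park the scheduler so nothing new hits the hardware */
	drm_sched_stop(&gpu->sched, bad);
	drm_sched_increase_karma(bad);

	my_gpu_reset(gpu);

	/* Requeue the jobs that were in flight and restart the scheduler */
	drm_sched_resubmit_jobs(&gpu->sched);
	drm_sched_start(&gpu->sched, true);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}
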
-Daniel

> 
> > Also expect me to ask a lot of annoying questions about all the atomic_t
> > you touch :-)
> 
> Looking forward to it.
> 
> Matt
> 
> > -Daniel
> > 
> > 
> > > 
> > > Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> > > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Cc: <stable@vger.kernel.org>
> > > ---
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++-------
> > >  1 file changed, 27 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index 87d8dc8f51b9..cd8df078ca87 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
> > >  {
> > >  	/* Only should be called from guc_lrc_desc_pin() */
> > >  	atomic_set(&ce->guc_sched_state_no_lock, 0);
> > > -	ce->guc_state.sched_state = 0;
> > > +	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
> > >  }
> > >  
> > >  static inline bool
> > > @@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
> > >  {
> > >  	int err;
> > >  
> > > -	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > > -
> > > -	if (!err && g2h_len_dw)
> > > +	if (g2h_len_dw)
> > >  		atomic_inc(&guc->outstanding_submission_g2h);
> > >  
> > > +	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > > +	if (err == -EBUSY && g2h_len_dw)
> > > +		atomic_dec(&guc->outstanding_submission_g2h);
> > > +
> > >  	return err;
> > >  }
> > >  
> > > @@ -725,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > >  		} while (!list_empty(&guc->ct.requests.incoming));
> > >  	}
> > > +
> > > +	/* Flush any GuC generated G2H */
> > > +	while (!list_empty(&guc->ct.requests.incoming))
> > > +		msleep(1);
> > > +
> > >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > >  }
> > >  
> > > @@ -797,14 +804,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
> > >  
> > >  	spin_lock_irqsave(&sched_engine->lock, flags);
> > >  	spin_lock(&ce->guc_active.lock);
> > > -	list_for_each_entry_safe(rq, rn,
> > > -				 &ce->guc_active.requests,
> > > -				 sched.link) {
> > > +	list_for_each_entry_safe_reverse(rq, rn,
> > > +					 &ce->guc_active.requests,
> > > +					 sched.link) {
> > >  		if (i915_request_completed(rq))
> > >  			continue;
> > >  
> > >  		list_del_init(&rq->sched.link);
> > > -		spin_unlock(&ce->guc_active.lock);
> > >  
> > >  		__i915_request_unsubmit(rq);
> > >  
> > > @@ -816,10 +822,8 @@ __unwind_incomplete_requests(struct intel_context *ce)
> > >  		}
> > >  		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
> > >  
> > > -		list_add_tail(&rq->sched.link, pl);
> > > +		list_add(&rq->sched.link, pl);
> > >  		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> > > -
> > > -		spin_lock(&ce->guc_active.lock);
> > >  	}
> > >  	spin_unlock(&ce->guc_active.lock);
> > >  	spin_unlock_irqrestore(&sched_engine->lock, flags);
> > > @@ -828,17 +832,23 @@ __unwind_incomplete_requests(struct intel_context *ce)
> > >  static void __guc_reset_context(struct intel_context *ce, bool stalled)
> > >  {
> > >  	struct i915_request *rq;
> > > +	unsigned long flags;
> > >  	u32 head;
> > >  
> > >  	intel_context_get(ce);
> > >  
> > >  	/*
> > > -	 * GuC will implicitly mark the context as non-schedulable
> > > -	 * when it sends the reset notification. Make sure our state
> > > -	 * reflects this change. The context will be marked enabled
> > > -	 * on resubmission.
> > > +	 * GuC will implicitly mark the context as non-schedulable when it sends
> > > +	 * the reset notification. Make sure our state reflects this change. The
> > > +	 * context will be marked enabled on resubmission. A small window exists
> > > +	 * where the context could be blocked & unblocked (scheduling enable) while
> > > +	 * this reset was in flight. If a scheduling enable is already in
> > > +	 * flight, do not clear the enable.
> > >  	 */
> > > -	clr_context_enabled(ce);
> > > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > > +	if (!context_pending_enable(ce))
> > > +		clr_context_enabled(ce);
> > > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > >  
> > >  	rq = intel_context_find_active_request(ce);
> > >  	if (!rq) {
> > > @@ -1562,6 +1572,7 @@ static void guc_context_unblock(struct intel_context *ce)
> > >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > >  
> > >  	if (unlikely(submission_disabled(guc) ||
> > > +		     intel_context_is_banned(ce) ||
> > >  		     !intel_context_is_pinned(ce) ||
> > >  		     context_pending_disable(ce) ||
> > >  		     context_blocked(ce) > 1)) {
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915/guc: Fix several issues related to resets / request cancelation
@ 2021-08-10 10:32         ` Daniel Vetter
  0 siblings, 0 replies; 23+ messages in thread
From: Daniel Vetter @ 2021-08-10 10:32 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:35:22PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 03:35:26PM +0200, Daniel Vetter wrote:
> > On Sun, Aug 08, 2021 at 11:07:55AM -0700, Matthew Brost wrote:
> > > Resets are notoriously hard to get fully working and notoriously racey,
> > > especially with selftests / IGTs that do all sorts of wild things that
> > > would be near impossible to hit during normal use cases. Even though
> > > likely impossible to hit, anything selftests / IGTs uncover needs to be
> > > fixed. This patch addresses 7 such issues.
> > > 
> > > 1. A small race that could result in incorrect accounting of the number
> > > of outstanding G2H. Basically prior to this patch we did not increment
> > > the number of outstanding G2H if we encountered a GT reset while sending
> > > a H2G. This was incorrect as the context state had already been updated
> > > to anticipate a G2H response thus the counter should be incremented.
> > > 
> > > 2. When unwinding requests on a reset context, if other requests in the
> > > context are in the priority list the requests could be resubmitted out
> > > of seqno order. Traverse the list of active requests in reverse and
> > > append to the head of the priority list to fix this.
> > > 
> > > 3. Don't drop ce->guc_active.lock when unwinding a context after reset.
> > > At one point we had to drop this because of a lock inversion but that is
> > > no longer the case. It is much safer to hold the lock so let's do that.
> > > 
> > > 4. Prior to this patch the blocked context counter was cleared on
> > > init_sched_state (used during registering a context & resets) which is
> > > incorrect. This state needs to be persistent or the counter can read the
> > > incorrect value.
> > > 
> > > 5. Flush the work queue for GuC generated G2H messages during a GT reset.
> > > 
> > > 6. Do not clear enable during a context reset if a schedule enable is in
> > > flight.
> > > 
> > > 7. When unblocking a context, do not enable scheduling if the context is
> > > banned.
> > 
> > I think each of the above should be a separate patch. I think it would
> > also be good if each fix references the commits that introduced/changed
> > something.
> >
> 
> Sure, I was just trying to cheat and make our lives easier with fewer
> patches to backport into DII.
>  
> > Most of this stuff is extremely hard to get right, and unfortunately our
> > current code is way too fond of lockless trickery (which really isn't a
> > great idea in the reset code). We need to apply as much care as possible
> > here.
> >
> 
> Yep, resets are hard. They are hard because something like ten other async
> things (e.g. a new submission, registering a context, banning a context,
> canceling a request, processing a G2H, trying to idle the GPU, unpinning
> a context) can all be happening at the same time. Hopefully when we move
> to the DRM scheduler we can remove some of these async operations;
> perma-pinned contexts would also help. We have a story for that plus a
> story to simplify the locking.

A bit of an aside, but drm/sched has a pretty solid story around resets,
including what to do if your reset domain escalates (probably more useful
for the execlist backend than GuC) and how it's all synchronized.

I do need to review the barriers for when you permanently wedge an engine,
and the support for that isn't well encapsulated or documented. But it's
there (amdgpu uses that when the reset fails, kinda like we do), and
that's about the only gap I've found thus far around drm/sched reset
support.

So this should all be substantially clarified once we get there.
-Daniel

> 
> > Also expect me to ask a lot of annoying questions about all the atomic_t
> > you touch :-)
> 
> Looking forward to it.
> 
> Matt
> 
> > -Daniel
> > 
> > 
> > > 
> > > Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> > > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Cc: <stable@vger.kernel.org>
> > > ---
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++-------
> > >  1 file changed, 27 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index 87d8dc8f51b9..cd8df078ca87 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
> > >  {
> > >  	/* Only should be called from guc_lrc_desc_pin() */
> > >  	atomic_set(&ce->guc_sched_state_no_lock, 0);
> > > -	ce->guc_state.sched_state = 0;
> > > +	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
> > >  }
> > >  
> > >  static inline bool
> > > @@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
> > >  {
> > >  	int err;
> > >  
> > > -	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > > -
> > > -	if (!err && g2h_len_dw)
> > > +	if (g2h_len_dw)
> > >  		atomic_inc(&guc->outstanding_submission_g2h);
> > >  
> > > +	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > > +	if (err == -EBUSY && g2h_len_dw)
> > > +		atomic_dec(&guc->outstanding_submission_g2h);
> > > +
> > >  	return err;
> > >  }
> > >  
> > > @@ -725,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > >  		} while (!list_empty(&guc->ct.requests.incoming));
> > >  	}
> > > +
> > > +	/* Flush any GuC generated G2H */
> > > +	while (!list_empty(&guc->ct.requests.incoming))
> > > +		msleep(1);
> > > +
> > >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > >  }
> > >  
> > > @@ -797,14 +804,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
> > >  
> > >  	spin_lock_irqsave(&sched_engine->lock, flags);
> > >  	spin_lock(&ce->guc_active.lock);
> > > -	list_for_each_entry_safe(rq, rn,
> > > -				 &ce->guc_active.requests,
> > > -				 sched.link) {
> > > +	list_for_each_entry_safe_reverse(rq, rn,
> > > +					 &ce->guc_active.requests,
> > > +					 sched.link) {
> > >  		if (i915_request_completed(rq))
> > >  			continue;
> > >  
> > >  		list_del_init(&rq->sched.link);
> > > -		spin_unlock(&ce->guc_active.lock);
> > >  
> > >  		__i915_request_unsubmit(rq);
> > >  
> > > @@ -816,10 +822,8 @@ __unwind_incomplete_requests(struct intel_context *ce)
> > >  		}
> > >  		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
> > >  
> > > -		list_add_tail(&rq->sched.link, pl);
> > > +		list_add(&rq->sched.link, pl);
> > >  		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> > > -
> > > -		spin_lock(&ce->guc_active.lock);
> > >  	}
> > >  	spin_unlock(&ce->guc_active.lock);
> > >  	spin_unlock_irqrestore(&sched_engine->lock, flags);
> > > @@ -828,17 +832,23 @@ __unwind_incomplete_requests(struct intel_context *ce)
> > >  static void __guc_reset_context(struct intel_context *ce, bool stalled)
> > >  {
> > >  	struct i915_request *rq;
> > > +	unsigned long flags;
> > >  	u32 head;
> > >  
> > >  	intel_context_get(ce);
> > >  
> > >  	/*
> > > -	 * GuC will implicitly mark the context as non-schedulable
> > > -	 * when it sends the reset notification. Make sure our state
> > > -	 * reflects this change. The context will be marked enabled
> > > -	 * on resubmission.
> > > +	 * GuC will implicitly mark the context as non-schedulable when it sends
> > > +	 * the reset notification. Make sure our state reflects this change. The
> > > +	 * context will be marked enabled on resubmission. A small window exists
> > > +	 * where the context could be blocked & unblocked (scheduling enable) while
> > > +	 * this reset was in flight. If a scheduling enable is already in
> > > +	 * flight, do not clear the enable.
> > >  	 */
> > > -	clr_context_enabled(ce);
> > > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > > +	if (!context_pending_enable(ce))
> > > +		clr_context_enabled(ce);
> > > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > >  
> > >  	rq = intel_context_find_active_request(ce);
> > >  	if (!rq) {
> > > @@ -1562,6 +1572,7 @@ static void guc_context_unblock(struct intel_context *ce)
> > >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > >  
> > >  	if (unlikely(submission_disabled(guc) ||
> > > +		     intel_context_is_banned(ce) ||
> > >  		     !intel_context_is_pinned(ce) ||
> > >  		     context_pending_disable(ce) ||
> > >  		     context_blocked(ce) > 1)) {
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Intel-gfx] [PATCH 2/3] drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  2021-08-09 19:37     ` Matthew Brost
@ 2021-08-10 10:33       ` Daniel Vetter
  0 siblings, 0 replies; 23+ messages in thread
From: Daniel Vetter @ 2021-08-10 10:33 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:37:39PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 03:38:38PM +0200, Daniel Vetter wrote:
> > On Sun, Aug 08, 2021 at 11:07:56AM -0700, Matthew Brost wrote:
> > > GuC submission has exposed an existing memory corruption in
> > > live_lrc_isolation. We believe that some writes to the watchdog offsets
> > > in the LRC (0x178 & 0x17c) can result in trashing of portions of the
> > > address space. With GuC submission there are additional objects which
> > > can move the context redzone into the space that is trashed. To
> > > work around this, avoid poisoning the watchdog.
> > 
> > A Bspec reference here would be good (we can quote anything that's marked
> > for public release, so doesn't have one of the IP markers).
> >
> 
> Let me see what I can dig up in the bspec.
> 
> BTW - Hopefully we can root cause this soon with a proper fix.

Well, if it's work-in-progress duct tape without a reference, that's fine
too; then perhaps sprinkle a JIRA number here (just not the full link,
Intel IT doesn't like those leaking). Just something so that in a few
months, when someone reads that code, they can stitch the story together
again.
-Daniel

>  
> > Also I think the above should be replicated in condensed form instead of
> > the XXX comment.
> >
> 
> Sure.
> 
> Matt
> 
> > With those: Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> since I
> > definitely don't have enough clue here for a detailed review.
> > -Daniel
> > 
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
> > >  1 file changed, 28 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > > index b0977a3b699b..6500e9fce8a0 100644
> > > --- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > > +++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > > @@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
> > >  	goto err_after;
> > >  }
> > >  
> > > +static u32 safe_offset(u32 offset, u32 reg)
> > > +{
> > > +	/* XXX skip testing of watchdog */
> > > +	if (offset == 0x178 || offset == 0x17c)
> > > +		reg = 0;
> > > +
> > > +	return reg;
> > > +}
> > > +
> > > +static int get_offset_mask(struct intel_engine_cs *engine)
> > > +{
> > > +	if (GRAPHICS_VER(engine->i915) < 12)
> > > +		return 0xfff;
> > > +
> > > +	switch (engine->class) {
> > > +	default:
> > > +	case RENDER_CLASS:
> > > +		return 0x07ff;
> > > +	case COPY_ENGINE_CLASS:
> > > +		return 0x0fff;
> > > +	case VIDEO_DECODE_CLASS:
> > > +	case VIDEO_ENHANCEMENT_CLASS:
> > > +		return 0x3fff;
> > > +	}
> > > +}
> > > +
> > >  static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
> > >  {
> > >  	struct i915_vma *batch;
> > > @@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
> > >  		len = (len + 1) / 2;
> > >  		*cs++ = MI_LOAD_REGISTER_IMM(len);
> > >  		while (len--) {
> > > -			*cs++ = hw[dw];
> > > +			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
> > > +					    hw[dw]);
> > >  			*cs++ = poison;
> > >  			dw += 2;
> > >  		}
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Intel-gfx] [PATCH 3/3] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  2021-08-09 19:41     ` Matthew Brost
@ 2021-08-10 10:36       ` Daniel Vetter
  0 siblings, 0 replies; 23+ messages in thread
From: Daniel Vetter @ 2021-08-10 10:36 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:41:29PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 04:03:28PM +0200, Daniel Vetter wrote:
> > On Sun, Aug 08, 2021 at 11:07:57AM -0700, Matthew Brost wrote:
> > > While debugging an issue with full GT resets I went down a rabbit hole
> > > thinking the scrubbing of lost G2H wasn't working correctly. That proved
> > > to be wrong, as the scrubbing was working just fine, but the chase
> > > inspired me to write a selftest to prove that it works. This simple
> > > selftest injects errors to drop various G2H messages and then issues a
> > > full GT reset, proving that the scrubbing of these G2H doesn't blow up.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  18 +++
> > >  drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
> > >  .../drm/i915/selftests/i915_live_selftests.h  |   1 +
> > >  .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
> > >  .../i915/selftests/intel_scheduler_helpers.h  |   2 +
> > >  6 files changed, 163 insertions(+)
> > >  create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > index e54351a170e2..fec5ff7ef168 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > @@ -198,6 +198,10 @@ struct intel_context {
> > >  	 */
> > >  	u8 guc_prio;
> > >  	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> > > +
> > 
> > I know the existing stuff isn't following this at all, but for anything
> > new we really should put some kerneldoc into structures. This probably
> > means you need to open-code the #ifdef here, since this macro will likely
> > upset kerneldoc parsing.
> > 
> 
> Ok, got it.
> 
> > > +	I915_SELFTEST_DECLARE(bool drop_schedule_enable);
> > > +	I915_SELFTEST_DECLARE(bool drop_schedule_disable);
> > > +	I915_SELFTEST_DECLARE(bool drop_deregister);
> > >  };
> > >  
> > >  #endif /* __INTEL_CONTEXT_TYPES__ */
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index cd8df078ca87..d13dc56bae43 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -2618,6 +2618,11 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
> > >  
> > >  	trace_intel_context_deregister_done(ce);
> > >  
> > > +	if (I915_SELFTEST_ONLY(ce->drop_deregister)) {
> > > +		I915_SELFTEST_DECLARE(ce->drop_deregister = false;)
> > 
> > This macro wrapping is quite nasty, can't we just #ifdef this? Especially
> > the _DECLARE name really doesn't expect a statement.
> >
> 
> I had it like that originally, then remembered these macros; in the past
> people have complained when I didn't use them, so yes, pretty much a
> bikeshed. I prefer the plain ifdef myself.

I think the plain #ifdef is much clearer in the code. The
I915_SELFTEST_ONLY macro makes some sense for compiling stuff out in some
cases and making sure it's wrapped in unlikely when enabled, and often
that's good enough. But not here.

Also, because it breaks kerneldoc I don't really like the macro in structs
either. Anything that discourages people from writing solid comments is
Not Good at All :-) So that's another reason not to like the
I915_SELFTEST_DECLARE() macro.
-Daniel

> 
> Matt
>  
> > Aside from these bikesheds I don't have much to say on the test logic
> > itself, since I'm far from knowledgeable on GuC stuff ...
> > -Daniel
> > 
> > 
> > > +		return 0;
> > > +	}
> > > +
> > >  	if (context_wait_for_deregister_to_register(ce)) {
> > >  		struct intel_runtime_pm *runtime_pm =
> > >  			&ce->engine->gt->i915->runtime_pm;
> > > @@ -2672,10 +2677,19 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
> > >  	trace_intel_context_sched_done(ce);
> > >  
> > >  	if (context_pending_enable(ce)) {
> > > +		if (I915_SELFTEST_ONLY(ce->drop_schedule_enable)) {
> > > +			I915_SELFTEST_DECLARE(ce->drop_schedule_enable = false;)
> > > +			return 0;
> > > +		}
> > >  		clr_context_pending_enable(ce);
> > >  	} else if (context_pending_disable(ce)) {
> > >  		bool banned;
> > >  
> > > +		if (I915_SELFTEST_ONLY(ce->drop_schedule_disable)) {
> > > +			I915_SELFTEST_DECLARE(ce->drop_schedule_disable = false;)
> > > +			return 0;
> > > +		}
> > > +
> > >  		/*
> > >  		 * Unpin must be done before __guc_signal_context_fence,
> > >  		 * otherwise a race exists between the requests getting
> > > @@ -3047,3 +3061,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
> > >  
> > >  	return false;
> > >  }
> > > +
> > > +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > > +#include "selftest_guc.c"
> > > +#endif
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> > > new file mode 100644
> > > index 000000000000..46ca6554f65d
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> > > @@ -0,0 +1,126 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2021 Intel Corporation
> > > + */
> > > +
> > > +#include "selftests/intel_scheduler_helpers.h"
> > > +
> > > +static struct i915_request *nop_user_request(struct intel_context *ce,
> > > +					     struct i915_request *from)
> > > +{
> > > +	struct i915_request *rq;
> > > +	int ret;
> > > +
> > > +	rq = intel_context_create_request(ce);
> > > +	if (IS_ERR(rq))
> > > +		return rq;
> > > +
> > > +	if (from) {
> > > +		ret = i915_sw_fence_await_dma_fence(&rq->submit,
> > > +						    &from->fence, 0,
> > > +						    I915_FENCE_GFP);
> > > +		if (ret < 0) {
> > > +			i915_request_put(rq);
> > > +			return ERR_PTR(ret);
> > > +		}
> > > +	}
> > > +
> > > +	i915_request_get(rq);
> > > +	i915_request_add(rq);
> > > +
> > > +	return rq;
> > > +}
> > > +
> > > +static int intel_guc_scrub_ctbs(void *arg)
> > > +{
> > > +	struct intel_gt *gt = arg;
> > > +	int ret = 0;
> > > +	int i;
> > > +	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
> > > +	intel_wakeref_t wakeref;
> > > +	struct intel_engine_cs *engine;
> > > +	struct intel_context *ce;
> > > +
> > > +	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
> > > +	engine = intel_selftest_find_any_engine(gt);
> > > +
> > > +	/* Submit requests and inject errors forcing G2H to be dropped */
> > > +	for (i = 0; i < 3; ++i) {
> > > +		ce = intel_context_create(engine);
> > > +		if (IS_ERR(ce)) {
> > > +			ret = PTR_ERR(ce);
> > > +			pr_err("Failed to create context, %d: %d\n", i, ret);
> > > +			goto err;
> > > +		}
> > > +
> > > +		switch(i) {
> > > +		case 0:
> > > +			ce->drop_schedule_enable = true;
> > > +			break;
> > > +		case 1:
> > > +			ce->drop_schedule_disable = true;
> > > +			break;
> > > +		case 2:
> > > +			ce->drop_deregister = true;
> > > +			break;
> > > +		}
> > > +
> > > +		rq = nop_user_request(ce, NULL);
> > > +		intel_context_put(ce);
> > > +
> > > +		if (IS_ERR(rq)) {
> > > +			ret = PTR_ERR(rq);
> > > +			pr_err("Failed to create request, %d: %d\n", i, ret);
> > > +			goto err;
> > > +		}
> > > +
> > > +		last[i] = rq;
> > > +	}
> > > +
> > > +	for (i = 0; i < 3; ++i) {
> > > +		ret = i915_request_wait(last[i], 0, HZ);
> > > +		if (ret < 0) {
> > > +			pr_err("Last request failed to complete: %d\n", ret);
> > > +			goto err;
> > > +		}
> > > +		i915_request_put(last[i]);
> > > +		last[i] = NULL;
> > > +	}
> > > +
> > > +	/* Force all H2G / G2H to be submitted / processed */
> > > +	intel_gt_retire_requests(gt);
> > > +	msleep(500);
> > > +
> > > +	/* Scrub missing G2H */
> > > +	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
> > > +
> > > +	ret = intel_gt_wait_for_idle(gt, HZ);
> > > +	if (ret < 0) {
> > > +		pr_err("GT failed to idle: %d\n", ret);
> > > +		goto err;
> > > +	}
> > > +
> > > +err:
> > > +	for (i = 0; i < 3; ++i)
> > > +		if (last[i])
> > > +			i915_request_put(last[i]);
> > > +	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +int intel_guc_live_selftests(struct drm_i915_private *i915)
> > > +{
> > > +	static const struct i915_subtest tests[] = {
> > > +		SUBTEST(intel_guc_scrub_ctbs),
> > > +	};
> > > +	struct intel_gt *gt = &i915->gt;
> > > +
> > > +	if (intel_gt_is_wedged(gt))
> > > +		return 0;
> > > +
> > > +	if (!intel_uc_uses_guc_submission(&gt->uc))
> > > +		return 0;
> > > +
> > > +	return intel_gt_live_subtests(tests, gt);
> > > +}
> > > diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> > > index cfa5c4165a4f..3cf6758931f9 100644
> > > --- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> > > +++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> > > @@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
> > >  selftest(ring_submission, intel_ring_submission_live_selftests)
> > >  selftest(perf, i915_perf_live_selftests)
> > >  selftest(slpc, intel_slpc_live_selftests)
> > > +selftest(guc, intel_guc_live_selftests)
> > >  /* Here be dragons: keep last to run last! */
> > >  selftest(late_gt_pm, intel_gt_pm_late_selftests)
> > > diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> > > index 4b328346b48a..310fb83c527e 100644
> > > --- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> > > +++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> > > @@ -14,6 +14,18 @@
> > >  #define REDUCED_PREEMPT		10
> > >  #define WAIT_FOR_RESET_TIME	10000
> > >  
> > > +struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
> > > +{
> > > +	struct intel_engine_cs *engine;
> > > +	enum intel_engine_id id;
> > > +
> > > +	for_each_engine(engine, gt, id)
> > > +		return engine;
> > > +
> > > +	pr_err("No valid engine found!\n");
> > > +	return NULL;
> > > +}
> > > +
> > >  int intel_selftest_modify_policy(struct intel_engine_cs *engine,
> > >  				 struct intel_selftest_saved_policy *saved,
> > >  				 u32 modify_type)
> > > diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> > > index 35c098601ac0..ae60bb507f45 100644
> > > --- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> > > +++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> > > @@ -10,6 +10,7 @@
> > >  
> > >  struct i915_request;
> > >  struct intel_engine_cs;
> > > +struct intel_gt;
> > >  
> > >  struct intel_selftest_saved_policy {
> > >  	u32 flags;
> > > @@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
> > >  	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
> > >  };
> > >  
> > > +struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
> > >  int intel_selftest_modify_policy(struct intel_engine_cs *engine,
> > >  				 struct intel_selftest_saved_policy *saved,
> > >  				 enum selftest_scheduler_modify modify_type);
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-08-10 10:36 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-08 18:07 [PATCH 0/3] Clean up some CI failures for GuC submission Matthew Brost
2021-08-08 18:07 ` [Intel-gfx] " Matthew Brost
2021-08-08 17:59 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2021-08-08 18:07 ` [PATCH 1/3] drm/i915/guc: Fix several issues related to resets / request cancelation Matthew Brost
2021-08-08 18:07   ` [Intel-gfx] " Matthew Brost
2021-08-09 13:35   ` Daniel Vetter
2021-08-09 13:35     ` [Intel-gfx] " Daniel Vetter
2021-08-09 19:35     ` Matthew Brost
2021-08-09 19:35       ` [Intel-gfx] " Matthew Brost
2021-08-10 10:32       ` Daniel Vetter
2021-08-10 10:32         ` [Intel-gfx] " Daniel Vetter
2021-08-08 18:07 ` [PATCH 2/3] drm/i915/selftests: Fix memory corruption in live_lrc_isolation Matthew Brost
2021-08-08 18:07   ` [Intel-gfx] " Matthew Brost
2021-08-09 13:38   ` Daniel Vetter
2021-08-09 19:37     ` Matthew Brost
2021-08-10 10:33       ` Daniel Vetter
2021-08-08 18:07 ` [PATCH 3/3] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H Matthew Brost
2021-08-08 18:07   ` [Intel-gfx] " Matthew Brost
2021-08-09 14:03   ` Daniel Vetter
2021-08-09 19:41     ` Matthew Brost
2021-08-10 10:36       ` Daniel Vetter
2021-08-08 18:26 ` [Intel-gfx] ✓ Fi.CI.BAT: success for Clean up some CI failures for GuC submission Patchwork
2021-08-08 19:40 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
