* [PATCH 0/9] Clean up some CI failures for GuC submission
@ 2021-08-11  1:16 Matthew Brost
  2021-08-11  1:16 ` [PATCH 1/9] drm/i915/guc: Fix blocked context accounting Matthew Brost
                   ` (8 more replies)
  0 siblings, 9 replies; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

Resets are notoriously hard to get fully working and notoriously racy,
especially with selftests / IGTs that do all sorts of wild things that
would be near impossible to hit during normal use cases. Even though
likely impossible to hit, anything selftests / IGTs uncover needs to be
fixed. This series addresses 7 such issues, adds a workaround to a
selftest, and adds another selftest to prove that scrubbing of G2H
during a reset works.

v2:
 (Daniel Vetter)
  - Split fixes into individual patches
  - Add kernel doc
  - Add a VLK reference for the selftest workaround
 (Checkpatch)
  - Fix warnings

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Matthew Brost (9):
  drm/i915/guc: Fix blocked context accounting
  drm/i915/guc: outstanding G2H accounting
  drm/i915/guc: Unwind context requests in reverse order
  drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
  drm/i915/guc: Flush the work queue for GuC generated G2H
  drm/i915/guc: Do not clear enable during reset if an enable is
    inflight
  drm/i915/guc: Don't enable scheduling on a banned context
  drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H

 drivers/gpu/drm/i915/gt/intel_context_types.h |  18 +++
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  29 +++-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  69 +++++++---
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 7 files changed, 239 insertions(+), 18 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

-- 
2.32.0



* [PATCH 1/9] drm/i915/guc: Fix blocked context accounting
  2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
@ 2021-08-11  1:16 ` Matthew Brost
  2021-08-11 10:30   ` Daniel Vetter
  2021-08-11  1:16 ` [PATCH 2/9] drm/i915/guc: outstanding G2H accounting Matthew Brost
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

Prior to this patch the blocked context counter was cleared on
init_sched_state (used during registering a context & resets), which is
incorrect. This state needs to be persistent or the counter can read an
incorrect value, resulting in scheduling never getting enabled again.
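
A minimal sketch of the accounting in play (the increment style follows
the SCHED_STATE_BLOCKED_MASK naming in the diff below; the exact layout
is an assumption):

	/* each block bumps the count, each unblock drops it; only the
	 * final unblock may re-enable scheduling, so the count must
	 * survive init_sched_state() being called during a reset */
	ce->guc_state.sched_state += SCHED_STATE_BLOCKED;	/* block */
	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;	/* unblock */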

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 87d8dc8f51b9..69faa39da178 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
 {
 	/* Only should be called from guc_lrc_desc_pin() */
 	atomic_set(&ce->guc_sched_state_no_lock, 0);
-	ce->guc_state.sched_state = 0;
+	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
 static inline bool
-- 
2.32.0



* [PATCH 2/9] drm/i915/guc: outstanding G2H accounting
  2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
  2021-08-11  1:16 ` [PATCH 1/9] drm/i915/guc: Fix blocked context accounting Matthew Brost
@ 2021-08-11  1:16 ` Matthew Brost
  2021-08-11  1:16 ` [PATCH 3/9] drm/i915/guc: Unwind context requests in reverse order Matthew Brost
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

Fix a small race that could result in incorrect accounting of the
number of outstanding G2H. Basically, prior to this patch we did not
increment the number of outstanding G2H if we encountered a GT reset
while sending an H2G. This was incorrect as the context state had
already been updated to anticipate a G2H response, thus the counter
should be incremented.

Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 69faa39da178..b5d3972ae164 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
 {
 	int err;
 
-	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
-
-	if (!err && g2h_len_dw)
+	if (g2h_len_dw)
 		atomic_inc(&guc->outstanding_submission_g2h);
 
+	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
+	if (err == -EBUSY && g2h_len_dw)
+		atomic_dec(&guc->outstanding_submission_g2h);
+
 	return err;
 }
 
-- 
2.32.0



* [PATCH 3/9] drm/i915/guc: Unwind context requests in reverse order
  2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
  2021-08-11  1:16 ` [PATCH 1/9] drm/i915/guc: Fix blocked context accounting Matthew Brost
  2021-08-11  1:16 ` [PATCH 2/9] drm/i915/guc: outstanding G2H accounting Matthew Brost
@ 2021-08-11  1:16 ` Matthew Brost
  2021-08-11  1:16 ` [PATCH 4/9] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context Matthew Brost
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

When unwinding requests on a reset context, if other requests in the
context are in the priority list, the requests could be resubmitted out
of seqno order. Traverse the list of active requests in reverse and
append to the head of the priority list to fix this.

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b5d3972ae164..bc51caba50d0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -799,9 +799,9 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
 	spin_lock(&ce->guc_active.lock);
-	list_for_each_entry_safe(rq, rn,
-				 &ce->guc_active.requests,
-				 sched.link) {
+	list_for_each_entry_safe_reverse(rq, rn,
+					 &ce->guc_active.requests,
+					 sched.link) {
 		if (i915_request_completed(rq))
 			continue;
 
@@ -818,7 +818,7 @@ __unwind_incomplete_requests(struct intel_context *ce)
 		}
 		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
 
-		list_add_tail(&rq->sched.link, pl);
+		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 
 		spin_lock(&ce->guc_active.lock);
-- 
2.32.0



* [PATCH 4/9] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
  2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
                   ` (2 preceding siblings ...)
  2021-08-11  1:16 ` [PATCH 3/9] drm/i915/guc: Unwind context requests in reverse order Matthew Brost
@ 2021-08-11  1:16 ` Matthew Brost
  2021-08-11  1:16 ` [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H Matthew Brost
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

Don't drop ce->guc_active.lock when unwinding a context after reset.
At one point we had to drop this because of a lock inversion but that is
no longer the case. It is much safer to hold the lock so let's do that.

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index bc51caba50d0..3cd2da6f5c03 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -806,8 +806,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
 			continue;
 
 		list_del_init(&rq->sched.link);
-		spin_unlock(&ce->guc_active.lock);
-
 		__i915_request_unsubmit(rq);
 
 		/* Push the request back into the queue for later resubmission. */
@@ -820,8 +818,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
-
-		spin_lock(&ce->guc_active.lock);
 	}
 	spin_unlock(&ce->guc_active.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
-- 
2.32.0



* [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H
  2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
                   ` (3 preceding siblings ...)
  2021-08-11  1:16 ` [PATCH 4/9] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context Matthew Brost
@ 2021-08-11  1:16 ` Matthew Brost
  2021-08-12 14:11   ` Daniel Vetter
  2021-08-11  1:16 ` [PATCH 6/9] drm/i915/guc: Do not clear enable during reset if an enable is inflight Matthew Brost
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

Flush the work queue for GuC generated G2H messages during a GT reset.
This is accomplished by spinning until the list of outstanding G2H goes
empty.

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3cd2da6f5c03..e5eb2df11b4a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 			wait_for_reset(guc, &guc->outstanding_submission_g2h);
 		} while (!list_empty(&guc->ct.requests.incoming));
 	}
+
+	/* Flush any GuC generated G2H */
+	while (!list_empty(&guc->ct.requests.incoming))
+		msleep(20);
+
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
 
-- 
2.32.0



* [PATCH 6/9] drm/i915/guc: Do not clear enable during reset if an enable is inflight
  2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
                   ` (4 preceding siblings ...)
  2021-08-11  1:16 ` [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H Matthew Brost
@ 2021-08-11  1:16 ` Matthew Brost
  2021-08-11  1:16 ` [PATCH 7/9] drm/i915/guc: Don't enable scheduling on a banned context Matthew Brost
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

Do not clear enable during a context reset if a schedule enable is in
flight. This can occur if the context is reset during a request
cancellation, which clears the offending request and then enables
scheduling.

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c    | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e5eb2df11b4a..4631b15eccaf 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -831,17 +831,23 @@ __unwind_incomplete_requests(struct intel_context *ce)
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
 	struct i915_request *rq;
+	unsigned long flags;
 	u32 head;
 
 	intel_context_get(ce);
 
 	/*
-	 * GuC will implicitly mark the context as non-schedulable
-	 * when it sends the reset notification. Make sure our state
-	 * reflects this change. The context will be marked enabled
-	 * on resubmission.
+	 * GuC will implicitly mark the context as non-schedulable when it sends
+	 * the reset notification. Make sure our state reflects this change. The
+	 * context will be marked enabled on resubmission. A small window exists
+	 * where the context could be blocked & unblocked (scheduling enable)
+	 * while this reset was in flight. If a scheduling enable is already in
+	 * flight, do not clear the enable.
 	 */
-	clr_context_enabled(ce);
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	if (!context_pending_enable(ce))
+		clr_context_enabled(ce);
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	rq = intel_context_find_active_request(ce);
 	if (!rq) {
-- 
2.32.0



* [PATCH 7/9] drm/i915/guc: Don't enable scheduling on a banned context
  2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
                   ` (5 preceding siblings ...)
  2021-08-11  1:16 ` [PATCH 6/9] drm/i915/guc: Do not clear enable during reset if an enable is inflight Matthew Brost
@ 2021-08-11  1:16 ` Matthew Brost
  2021-08-11  1:16 ` [PATCH 8/9] drm/i915/selftests: Fix memory corruption in live_lrc_isolation Matthew Brost
  2021-08-11  1:16 ` [PATCH 9/9] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H Matthew Brost
  8 siblings, 0 replies; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

When unblocking a context, do not enable scheduling if the context is
banned.

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 4631b15eccaf..da0f8af2c8ab 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1571,6 +1571,7 @@ static void guc_context_unblock(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	if (unlikely(submission_disabled(guc) ||
+		     intel_context_is_banned(ce) ||
 		     !intel_context_is_pinned(ce) ||
 		     context_pending_disable(ce) ||
 		     context_blocked(ce) > 1)) {
-- 
2.32.0



* [PATCH 8/9] drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
                   ` (6 preceding siblings ...)
  2021-08-11  1:16 ` [PATCH 7/9] drm/i915/guc: Don't enable scheduling on a banned context Matthew Brost
@ 2021-08-11  1:16 ` Matthew Brost
  2021-08-11  1:16 ` [PATCH 9/9] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H Matthew Brost
  8 siblings, 0 replies; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

GuC submission has exposed an existing memory corruption in
live_lrc_isolation. We believe that some writes to the watchdog offsets
in the LRC (0x178 & 0x17c) can result in trashing of portions of the
address space. With GuC submission there are additional objects which
can move the context redzone into the space that is trashed. To work
around this, avoid poisoning the watchdog.

v2:
 (Daniel Vetter)
  - Add VLK ref in code to workaround

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index b0977a3b699b..cdc6ae48a1e1 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
 	goto err_after;
 }
 
+static u32 safe_offset(u32 offset, u32 reg)
+{
+	/* XXX skip testing of watchdog - VLK-22772 */
+	if (offset == 0x178 || offset == 0x17c)
+		reg = 0;
+
+	return reg;
+}
+
+static int get_offset_mask(struct intel_engine_cs *engine)
+{
+	if (GRAPHICS_VER(engine->i915) < 12)
+		return 0xfff;
+
+	switch (engine->class) {
+	default:
+	case RENDER_CLASS:
+		return 0x07ff;
+	case COPY_ENGINE_CLASS:
+		return 0x0fff;
+	case VIDEO_DECODE_CLASS:
+	case VIDEO_ENHANCEMENT_CLASS:
+		return 0x3fff;
+	}
+}
+
 static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 {
 	struct i915_vma *batch;
@@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 		len = (len + 1) / 2;
 		*cs++ = MI_LOAD_REGISTER_IMM(len);
 		while (len--) {
-			*cs++ = hw[dw];
+			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
+					    hw[dw]);
 			*cs++ = poison;
 			dw += 2;
 		}
-- 
2.32.0



* [PATCH 9/9] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
                   ` (7 preceding siblings ...)
  2021-08-11  1:16 ` [PATCH 8/9] drm/i915/selftests: Fix memory corruption in live_lrc_isolation Matthew Brost
@ 2021-08-11  1:16 ` Matthew Brost
  8 siblings, 0 replies; 19+ messages in thread
From: Matthew Brost @ 2021-08-11  1:16 UTC (permalink / raw)
  To: gfx-internal-devel, dri-devel

While debugging an issue with full GT resets I went down a rabbit hole
thinking the scrubbing of lost G2H wasn't working correctly. That proved
to be incorrect as the scrubbing was working just fine, but the chase
inspired me to write a selftest to prove that it works. This simple
selftest injects errors dropping various G2H and then issues a full GT
reset, proving that the scrubbing of these G2H doesn't blow up.

v2:
 (Daniel Vetter)
  - Use ifdef instead of macros for selftests

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  18 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  25 ++++
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 6 files changed, 184 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index e54351a170e2..3a73f3117873 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -198,6 +198,24 @@ struct intel_context {
 	 */
 	u8 guc_prio;
 	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
+
+#ifdef CONFIG_DRM_I915_SELFTEST
+	/**
+	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
+	 */
+	bool drop_schedule_enable;
+
+	/**
+	 * @drop_schedule_disable: Force drop of schedule disable G2H for
+	 * selftest
+	 */
+	bool drop_schedule_disable;
+
+	/**
+	 * @drop_deregister: Force drop of deregister G2H for selftest
+	 */
+	bool drop_deregister;
+#endif
 };
 
 #endif /* __INTEL_CONTEXT_TYPES__ */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index da0f8af2c8ab..6914f2247c39 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2617,6 +2617,13 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 
 	trace_intel_context_deregister_done(ce);
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+	if (unlikely(ce->drop_deregister)) {
+		ce->drop_deregister = false;
+		return 0;
+	}
+#endif
+
 	if (context_wait_for_deregister_to_register(ce)) {
 		struct intel_runtime_pm *runtime_pm =
 			&ce->engine->gt->i915->runtime_pm;
@@ -2671,10 +2678,24 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 	trace_intel_context_sched_done(ce);
 
 	if (context_pending_enable(ce)) {
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_enable)) {
+			ce->drop_schedule_enable = false;
+			return 0;
+		}
+#endif
+
 		clr_context_pending_enable(ce);
 	} else if (context_pending_disable(ce)) {
 		bool banned;
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_disable)) {
+			ce->drop_schedule_disable = false;
+			return 0;
+		}
+#endif
+
 		/*
 		 * Unpin must be done before __guc_signal_context_fence,
 		 * otherwise a race exists between the requests getting
@@ -3046,3 +3067,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 
 	return false;
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftest_guc.c"
+#endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
new file mode 100644
index 000000000000..46ca6554f65d
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include "selftests/intel_scheduler_helpers.h"
+
+static struct i915_request *nop_user_request(struct intel_context *ce,
+					     struct i915_request *from)
+{
+	struct i915_request *rq;
+	int ret;
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	if (from) {
+		ret = i915_sw_fence_await_dma_fence(&rq->submit,
+						    &from->fence, 0,
+						    I915_FENCE_GFP);
+		if (ret < 0) {
+			i915_request_put(rq);
+			return ERR_PTR(ret);
+		}
+	}
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static int intel_guc_scrub_ctbs(void *arg)
+{
+	struct intel_gt *gt = arg;
+	int ret = 0;
+	int i;
+	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct intel_context *ce;
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+	engine = intel_selftest_find_any_engine(gt);
+
+	/* Submit requests and inject errors forcing G2H to be dropped */
+	for (i = 0; i < 3; ++i) {
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce)) {
+			ret = PTR_ERR(ce);
+			pr_err("Failed to create context, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		switch (i) {
+		case 0:
+			ce->drop_schedule_enable = true;
+			break;
+		case 1:
+			ce->drop_schedule_disable = true;
+			break;
+		case 2:
+			ce->drop_deregister = true;
+			break;
+		}
+
+		rq = nop_user_request(ce, NULL);
+		intel_context_put(ce);
+
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			pr_err("Failed to create request, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		last[i] = rq;
+	}
+
+	for (i = 0; i < 3; ++i) {
+		ret = i915_request_wait(last[i], 0, HZ);
+		if (ret < 0) {
+			pr_err("Last request failed to complete: %d\n", ret);
+			goto err;
+		}
+		i915_request_put(last[i]);
+		last[i] = NULL;
+	}
+
+	/* Force all H2G / G2H to be submitted / processed */
+	intel_gt_retire_requests(gt);
+	msleep(500);
+
+	/* Scrub missing G2H */
+	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
+
+	ret = intel_gt_wait_for_idle(gt, HZ);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err;
+	}
+
+err:
+	for (i = 0; i < 3; ++i)
+		if (last[i])
+			i915_request_put(last[i]);
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+
+	return ret;
+}
+
+int intel_guc_live_selftests(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_scrub_ctbs),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index cfa5c4165a4f..3cf6758931f9 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
 selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
+selftest(guc, intel_guc_live_selftests)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
index 4b328346b48a..310fb83c527e 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
@@ -14,6 +14,18 @@
 #define REDUCED_PREEMPT		10
 #define WAIT_FOR_RESET_TIME	10000
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, gt, id)
+		return engine;
+
+	pr_err("No valid engine found!\n");
+	return NULL;
+}
+
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 u32 modify_type)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
index 35c098601ac0..ae60bb507f45 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
@@ -10,6 +10,7 @@
 
 struct i915_request;
 struct intel_engine_cs;
+struct intel_gt;
 
 struct intel_selftest_saved_policy {
 	u32 flags;
@@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
 	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
 };
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 enum selftest_scheduler_modify modify_type);
-- 
2.32.0



* Re: [PATCH 1/9] drm/i915/guc: Fix blocked context accounting
  2021-08-11  1:16 ` [PATCH 1/9] drm/i915/guc: Fix blocked context accounting Matthew Brost
@ 2021-08-11 10:30   ` Daniel Vetter
  2021-08-11 17:28     ` Matthew Brost
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2021-08-11 10:30 UTC (permalink / raw)
  To: Matthew Brost; +Cc: gfx-internal-devel, dri-devel

On Wed, Aug 11, 2021 at 01:16:14AM +0000, Matthew Brost wrote:
> Prior to this patch the blocked context counter was cleared on
> init_sched_state (used during registering a context & resets), which is
> incorrect. This state needs to be persistent or the counter can read an
> incorrect value, resulting in scheduling never getting enabled again.
> 
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")

Tiny bikeshed on that commit, but for SCHED_STATE_BLOCKED_MASK you want
GENMASK. Also SCHED_STATE_BLOCKED is usually called
SCHED_STATE_BLOCKED_BIAS or similar if you put a counter into that field.
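
For illustration, that shape would be roughly (the bit positions below
are assumptions, not the driver's actual layout):

	#include <linux/bits.h>

	#define SCHED_STATE_BLOCKED_SHIFT	4
	#define SCHED_STATE_BLOCKED_MASK	GENMASK(19, SCHED_STATE_BLOCKED_SHIFT)
	/* the increment for one blocked reference, hence "BIAS" */
	#define SCHED_STATE_BLOCKED_BIAS	BIT(SCHED_STATE_BLOCKED_SHIFT)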

But also ... we can't afford a separate counter? Is all the bitshifting
actually worth the space savings? With a separate counter your bugfix
below would look a lot more reasonable too.


> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 87d8dc8f51b9..69faa39da178 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
>  {
>  	/* Only should be called from guc_lrc_desc_pin() */
>  	atomic_set(&ce->guc_sched_state_no_lock, 0);

atomic_t without barriers or anything like that. tsk, tsk.

Do we really need this?

Also I looked through the callchain of init_sched_state and I couldn't
find any locking, nor any description about ownership that would explain
why there's no locking.

E.g. scrub_guc_desc_for_outstanding_g2h has an xa_for_each with no
protection. No idea how that one works. I also couldn't figure out how
anything else in there is protected (no spinlocks to be seen anywhere at
least).

And then there's stuff like this:

		/* Flush context */
		spin_lock_irqsave(&ce->guc_state.lock, flags);
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);

This pattern seems very common, and it freaks me out.

Finally none of the locking or consistency rules are explained in the
kerneldoc (or even comments) in the relevant datastructures, which is not
great.

> -	ce->guc_state.sched_state = 0;
> +	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;

The patch itself matches the commit message and makes sense. But like I
said, would be cleaner I think if it's just a separate counter.

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

>  }
>  
>  static inline bool
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 1/9] drm/i915/guc: Fix blocked context accounting
  2021-08-11 10:30   ` Daniel Vetter
@ 2021-08-11 17:28     ` Matthew Brost
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Brost @ 2021-08-11 17:28 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: gfx-internal-devel, dri-devel

On Wed, Aug 11, 2021 at 12:30:52PM +0200, Daniel Vetter wrote:
> On Wed, Aug 11, 2021 at 01:16:14AM +0000, Matthew Brost wrote:
> > Prior to this patch the blocked context counter was cleared on
> > init_sched_state (used during registering a context & resets), which is
> > incorrect. This state needs to be persistent or the counter can read an
> > incorrect value, resulting in scheduling never getting enabled again.
> > 
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> 

Ah, realized I sent this series to the wrong list, let's stop replying
to this rev of the series after this patch.

> Tiny bikeshed on that commit, but for SCHED_STATE_BLOCKED_MASK you want
> GENMASK. Also SCHED_STATE_BLOCKED is usually called
> SCHED_STATE_BLOCKED_BIAS or similar if you put a counter into that field.
> 
> But also ... we can't afford a separate counter? Is all the bitshifting
> actually worth the space savings? With a separate counter your bugfix
> below would look a lot more reasonable too.
> 

Could add a counter, I suppose. A lot of this cleanup can be done over
time.

> 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 87d8dc8f51b9..69faa39da178 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
> >  {
> >  	/* Only should be called from guc_lrc_desc_pin() */
> >  	atomic_set(&ce->guc_sched_state_no_lock, 0);
> 
> atomic_t without barriers or anything like that. tsk, tsk.
> 
> Do we really need this?
> 
> Also I looked through the callchain of init_sched_state and I couldn't
> find any locking, nor any description about ownership that would explain
> why there's no locking.
> 
> E.g. scrub_guc_desc_for_outstanding_g2h has an xa_for_each with no
> protection. No idea how that one works. I also couldn't figure out how

In both the reset and context registration cases it is guaranteed no one
else can be touching this context. We can get a lock inversion if we use
this lock while registering the context, thus we don't use it there.
All this locking complexity is going to be greatly reduced when we move
to the DRM scheduler, and we have a task to clean up the locking then.

> anything else in there is protected (no spinlocks to be seen anywhere at
> least).
> 
> And then there's stuff like this:
> 
> 		/* Flush context */
> 		spin_lock_irqsave(&ce->guc_state.lock, flags);
> 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> 
> This pattern seems very common, and it freaks me out.
> 

I wrote some doc up on this in one of my initial posts. It basically
seals a bunch of races flying around related to resets. Cycling the lock
guarantees that contexts can see that a reset is in flight and not send
out an H2G that requires a G2H response. Losing a G2H is fatal as we
depend on that response to take further action (e.g. release a fence, do
a put, etc.).
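
Roughly this shape (the flag name and the send helper below are made up
for illustration):

	/* reset side: publish "reset in flight", then cycle the lock
	 * so any sender that sampled the old state has drained */
	set_bit(CONTEXT_RESET_IN_FLIGHT, &ce->flags);
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

	/* send side: only emit an H2G that expects a G2H reply while
	 * holding the lock with the flag still clear */
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	if (!test_bit(CONTEXT_RESET_IN_FLIGHT, &ce->flags))
		err = send_h2g_expecting_g2h(ce);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);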

> Finally none of the locking or consistency rules are explained in the
> kerneldoc (or even comments) in the relevant datastructures, which is not
> great.

Like I said, I have a patch for documentation but it hasn't made it into
the kernel yet. I'll include the doc patch in my next spin of the parallel
submission code.

> 
> > -	ce->guc_state.sched_state = 0;
> > +	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
> 
> The patch itself matches the commit message and makes sense. But like I
> said, would be cleaner I think if it's just a separate counter.
>

Can clean this in the future.

Matt

> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> >  }
> >  
> >  static inline bool
> > -- 
> > 2.32.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


* Re: [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H
  2021-08-11  1:16 ` [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H Matthew Brost
@ 2021-08-12 14:11   ` Daniel Vetter
  2021-08-12 15:23     ` Matthew Brost
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2021-08-12 14:11 UTC (permalink / raw)
  To: Matthew Brost; +Cc: gfx-internal-devel, dri-devel

On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote:
> Flush the work queue for GuC generated G2H messages during a GT reset.
> This is accomplished by spinning until the list of outstanding G2H goes
> empty.
> 
> Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3cd2da6f5c03..e5eb2df11b4a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
>  		} while (!list_empty(&guc->ct.requests.incoming));
>  	}
> +
> +	/* Flush any GuC generated G2H */
> +	while (!list_empty(&guc->ct.requests.incoming))
> +		msleep(20);

flush_work or flush_workqueue, because that comes with lockdep
annotations. Don't hand-roll stuff like this if at all possible.
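
E.g. something like this (assuming ct->requests.worker is the work item
that processes guc->ct.requests.incoming):

	/* waits for the G2H worker to finish and, unlike open-coded
	 * polling, records the dependency for lockdep */
	flush_work(&guc->ct.requests.worker);
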
-Daniel

> +
>  	scrub_guc_desc_for_outstanding_g2h(guc);
>  }
>  
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H
  2021-08-12 14:11   ` Daniel Vetter
@ 2021-08-12 15:23     ` Matthew Brost
  2021-08-12 19:47       ` Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Brost @ 2021-08-12 15:23 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: gfx-internal-devel, dri-devel

On Thu, Aug 12, 2021 at 04:11:28PM +0200, Daniel Vetter wrote:
> On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote:
> > Flush the work queue for GuC generated G2H messages during a GT reset.
> > This is accomplished by spinning until the list of outstanding G2H goes
> > empty.
> > 
> > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 3cd2da6f5c03..e5eb2df11b4a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> >  		} while (!list_empty(&guc->ct.requests.incoming));
> >  	}
> > +
> > +	/* Flush any GuC generated G2H */
> > +	while (!list_empty(&guc->ct.requests.incoming))
> > +		msleep(20);
> 
> flush_work or flush_workqueue, because that comes with lockdep
> annotations. Don't hand-roll stuff like this if at all possible.

lockdep puked when I used that.

Matt

> -Daniel
> 
> > +
> >  	scrub_guc_desc_for_outstanding_g2h(guc);
> >  }
> >  
> > -- 
> > 2.32.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


* Re: [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H
  2021-08-12 15:23     ` Matthew Brost
@ 2021-08-12 19:47       ` Daniel Vetter
  2021-08-12 22:38         ` Matthew Brost
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2021-08-12 19:47 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, gfx-internal-devel, dri-devel

On Thu, Aug 12, 2021 at 03:23:30PM +0000, Matthew Brost wrote:
> On Thu, Aug 12, 2021 at 04:11:28PM +0200, Daniel Vetter wrote:
> > On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote:
> > > Flush the work queue for GuC generated G2H messages during a GT reset.
> > > This is accomplished by spinning until the list of outstanding G2H goes
> > > empty.
> > > 
> > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Cc: <stable@vger.kernel.org>
> > > ---
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index 3cd2da6f5c03..e5eb2df11b4a 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > >  		} while (!list_empty(&guc->ct.requests.incoming));
> > >  	}
> > > +
> > > +	/* Flush any GuC generated G2H */
> > > +	while (!list_empty(&guc->ct.requests.incoming))
> > > +		msleep(20);
> > 
> > flush_work or flush_workqueue, because that comes with lockdep
> > annotations. Don't hand-roll stuff like this if at all possible.
> 
> lockdep puked when I used that.

Lockdep tends to be right ... So we definitely want that, but maybe a
different flavour, or there's something wrong with the workqueue setup.

This is the major reason why inventing any wheels locally in the kernel
isn't a good idea - aside from the duplicated code, there is likely
already a solution for whatever you need. There's all the validation
tools, careful thought about semantics (especially around races) and all
that stuff that you're always missing on your hand-rolled thing. Aside
from that, anything hand-rolled is much harder to read, since intent
isn't clear.
-Daniel


> 
> Matt
> 
> > -Daniel
> > 
> > > +
> > >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > >  }
> > >  
> > > -- 
> > > 2.32.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H
  2021-08-12 19:47       ` Daniel Vetter
@ 2021-08-12 22:38         ` Matthew Brost
  2021-08-13 15:11           ` Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Brost @ 2021-08-12 22:38 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: gfx-internal-devel, dri-devel

On Thu, Aug 12, 2021 at 09:47:23PM +0200, Daniel Vetter wrote:
> On Thu, Aug 12, 2021 at 03:23:30PM +0000, Matthew Brost wrote:
> > On Thu, Aug 12, 2021 at 04:11:28PM +0200, Daniel Vetter wrote:
> > > On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote:
> > > > Flush the work queue for GuC generated G2H messages during a GT reset.
> > > > This is accomplished by spinning until the list of outstanding G2H goes
> > > > empty.
> > > > 
> > > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: <stable@vger.kernel.org>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 3cd2da6f5c03..e5eb2df11b4a 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > > >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > > >  		} while (!list_empty(&guc->ct.requests.incoming));
> > > >  	}
> > > > +
> > > > +	/* Flush any GuC generated G2H */
> > > > +	while (!list_empty(&guc->ct.requests.incoming))
> > > > +		msleep(20);
> > > 
> > > flush_work or flush_workqueue, because that comes with lockdep
> > > annotations. Don't hand-roll stuff like this if at all possible.
> > 
> > lockdep puked when I used that.
> 
> Lockdep tends to be right ... So we definitely want that, but maybe a
> different flavour, or there's something wrong with the workqueue setup.
> 

Here is a dependency chain that lockdep doesn't like.

fs_reclaim_acquire -> &gt->reset.mutex (shrinker)
workqueue -> fs_reclaim_acquire (error capture in workqueue)
&gt->reset.mutex -> workqueue (reset)

In practice I don't think we could ever hit this, but lockdep does look
right here. Trying to work out how to fix this. We really need all G2H
to be done being processed before we proceed during a reset or we have
races. I have a few ideas of how to handle this but can't convince
myself any of them are fully safe.

Splat below:

[  154.625989] ======================================================
[  154.632195] WARNING: possible circular locking dependency detected
[  154.638393] 5.14.0-rc5-guc+ #50 Tainted: G     U
[  154.643991] ------------------------------------------------------
[  154.650196] i915_selftest/1673 is trying to acquire lock:
[  154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0
[  154.665826]
               but task is already holding lock:
[  154.671682] ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
[  154.680659]
               which lock already depends on the new lock.

[  154.688857]
               the existing dependency chain (in reverse order) is:
[  154.696365]
               -> #2 (&gt->reset.mutex){+.+.}-{3:3}:
[  154.702571]        lock_acquire+0xd2/0x300
[  154.706695]        i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
[  154.712959]        intel_gt_init_reset+0x61/0x80 [i915]
[  154.718258]        intel_gt_init_early+0xe6/0x120 [i915]
[  154.723648]        i915_driver_probe+0x592/0xdc0 [i915]
[  154.728942]        i915_pci_probe+0x43/0x1c0 [i915]
[  154.733891]        pci_device_probe+0x9b/0x110
[  154.738362]        really_probe+0x1a6/0x3a0
[  154.742568]        __driver_probe_device+0xf9/0x170
[  154.747468]        driver_probe_device+0x19/0x90
[  154.752114]        __driver_attach+0x99/0x170
[  154.756492]        bus_for_each_dev+0x73/0xc0
[  154.760870]        bus_add_driver+0x14b/0x1f0
[  154.765248]        driver_register+0x67/0xb0
[  154.769542]        i915_init+0x18/0x8c [i915]
[  154.773964]        do_one_initcall+0x53/0x2e0
[  154.778343]        do_init_module+0x56/0x210
[  154.782639]        load_module+0x25fc/0x29f0
[  154.786934]        __do_sys_finit_module+0xae/0x110
[  154.791835]        do_syscall_64+0x38/0xc0
[  154.795958]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  154.801558]
               -> #1 (fs_reclaim){+.+.}-{0:0}:
[  154.807241]        lock_acquire+0xd2/0x300
[  154.811361]        fs_reclaim_acquire+0x9e/0xd0
[  154.815914]        kmem_cache_alloc_trace+0x30/0x790
[  154.820899]        i915_gpu_coredump_alloc+0x53/0x1a0 [i915]
[  154.826649]        i915_gpu_coredump+0x39/0x560 [i915]
[  154.831866]        i915_capture_error_state+0xa/0x70 [i915]
[  154.837513]        intel_guc_context_reset_process_msg+0x174/0x1f0 [i915]
[  154.844383]        ct_incoming_request_worker_func+0x130/0x1b0 [i915]
[  154.850898]        process_one_work+0x264/0x590
[  154.855451]        worker_thread+0x4b/0x3a0
[  154.859655]        kthread+0x147/0x170
[  154.863428]        ret_from_fork+0x1f/0x30
[  154.867548]
               -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}:
[  154.875747]        check_prev_add+0x90/0xc30
[  154.880042]        __lock_acquire+0x1643/0x2110
[  154.884595]        lock_acquire+0xd2/0x300
[  154.888715]        __flush_work+0x373/0x4d0
[  154.892920]        intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
[  154.899606]        intel_uc_reset_prepare+0x40/0x50 [i915]
[  154.905166]        reset_prepare+0x55/0x60 [i915]
[  154.909946]        intel_gt_reset+0x11c/0x300 [i915]
[  154.914984]        do_device_reset+0x13/0x20 [i915]
[  154.919936]        check_whitelist_across_reset+0x166/0x250 [i915]
[  154.926212]        live_reset_whitelist.cold+0x6a/0x7a [i915]
[  154.932037]        __i915_subtests.cold+0x20/0x74 [i915]
[  154.937428]        __run_selftests.cold+0x96/0xee [i915]
[  154.942816]        i915_live_selftests+0x2c/0x60 [i915]
[  154.948125]        i915_pci_probe+0x93/0x1c0 [i915]
[  154.953076]        pci_device_probe+0x9b/0x110
[  154.957545]        really_probe+0x1a6/0x3a0
[  154.961749]        __driver_probe_device+0xf9/0x170
[  154.966653]        driver_probe_device+0x19/0x90
[  154.971290]        __driver_attach+0x99/0x170
[  154.975671]        bus_for_each_dev+0x73/0xc0
[  154.980053]        bus_add_driver+0x14b/0x1f0
[  154.984431]        driver_register+0x67/0xb0
[  154.988725]        i915_init+0x18/0x8c [i915]
[  154.993149]        do_one_initcall+0x53/0x2e0
[  154.997527]        do_init_module+0x56/0x210
[  155.001822]        load_module+0x25fc/0x29f0
[  155.006118]        __do_sys_finit_module+0xae/0x110
[  155.011019]        do_syscall_64+0x38/0xc0
[  155.015139]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  155.020729]
               other info that might help us debug this:

[  155.028752] Chain exists of:
                 (work_completion)(&ct->requests.worker) --> fs_reclaim --> &gt->reset.mutex

[  155.041294]  Possible unsafe locking scenario:

[  155.047240]        CPU0                    CPU1
[  155.051791]        ----                    ----
[  155.056344]   lock(&gt->reset.mutex);
[  155.060026]                                lock(fs_reclaim);
[  155.065706]                                lock(&gt->reset.mutex);
[  155.071912]   lock((work_completion)(&ct->requests.worker));
[  155.077595]
                *** DEADLOCK ***

[  155.083534] 2 locks held by i915_selftest/1673:
[  155.088086]  #0: ffff888103851240 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x8e/0x170
[  155.096460]  #1: ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
[  155.105845]
               stack backtrace:
[  155.110223] CPU: 6 PID: 1673 Comm: i915_selftest Tainted: G     U            5.14.0-rc5-guc+ #50
[  155.119035] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U LPDDR4/4x T4 RVP, BIOS TGLSFWI1.R00.3425.A00.2010162309 10/16/2020
[  155.132530] Call Trace:
[  155.134999]  dump_stack_lvl+0x57/0x7d
[  155.138690]  check_noncircular+0x12d/0x150
[  155.142814]  check_prev_add+0x90/0xc30
[  155.146587]  __lock_acquire+0x1643/0x2110
[  155.150618]  lock_acquire+0xd2/0x300
[  155.154218]  ? __flush_work+0x350/0x4d0
[  155.158073]  ? __lock_acquire+0x5f3/0x2110
[  155.162194]  __flush_work+0x373/0x4d0
[  155.165883]  ? __flush_work+0x350/0x4d0
[  155.169739]  ? mark_held_locks+0x49/0x70
[  155.173686]  ? _raw_spin_unlock_irqrestore+0x50/0x70
[  155.178679]  intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
[  155.184857]  ? _raw_spin_unlock_irqrestore+0x50/0x70
[  155.189843]  intel_uc_reset_prepare+0x40/0x50 [i915]
[  155.194891]  reset_prepare+0x55/0x60 [i915]
[  155.199145]  intel_gt_reset+0x11c/0x300 [i915]
[  155.203657]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
[  155.208641]  ? do_engine_reset+0x10/0x10 [i915]
[  155.213243]  do_device_reset+0x13/0x20 [i915]
[  155.217664]  check_whitelist_across_reset+0x166/0x250 [i915]
[  155.223415]  live_reset_whitelist.cold+0x6a/0x7a [i915]
[  155.228725]  __i915_subtests.cold+0x20/0x74 [i915]
[  155.233593]  ? __i915_live_teardown+0x50/0x50 [i915]
[  155.238644]  ? __intel_gt_live_setup+0x30/0x30 [i915]
[  155.243773]  __run_selftests.cold+0x96/0xee [i915]
[  155.248646]  i915_live_selftests+0x2c/0x60 [i915]
[  155.253425]  i915_pci_probe+0x93/0x1c0 [i915]
[  155.257854]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
[  155.262839]  pci_device_probe+0x9b/0x110
[  155.266785]  really_probe+0x1a6/0x3a0
[  155.270467]  __driver_probe_device+0xf9/0x170
[  155.274846]  driver_probe_device+0x19/0x90
[  155.278968]  __driver_attach+0x99/0x170
[  155.282824]  ? __device_attach_driver+0xd0/0xd0
[  155.287376]  ? __device_attach_driver+0xd0/0xd0
[  155.291928]  bus_for_each_dev+0x73/0xc0
[  155.295792]  bus_add_driver+0x14b/0x1f0
[  155.299648]  driver_register+0x67/0xb0
[  155.303419]  i915_init+0x18/0x8c [i915]
[  155.307328]  ? 0xffffffffa0188000
[  155.310669]  do_one_initcall+0x53/0x2e0
[  155.314525]  ? rcu_read_lock_sched_held+0x4d/0x80
[  155.319253]  ? kmem_cache_alloc_trace+0x5ad/0x790
[  155.323982]  do_init_module+0x56/0x210
[  155.327754]  load_module+0x25fc/0x29f0
[  155.331528]  ? __do_sys_finit_module+0xae/0x110
[  155.336081]  __do_sys_finit_module+0xae/0x110
[  155.340462]  do_syscall_64+0x38/0xc0
[  155.344060]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  155.349137] RIP: 0033:0x7efc4496389d
[  155.352735] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48
[  155.371530] RSP: 002b:00007ffeb3e23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  155.379123] RAX: ffffffffffffffda RBX: 0000557664b72240 RCX: 00007efc4496389d
[  155.386282] RDX: 0000000000000000 RSI: 0000557664b69610 RDI: 0000000000000004
[  155.393443] RBP: 0000000000000020 R08: 00007ffeb3e21ff0 R09: 0000557664b682f0
[  155.400603] R10: 00007ffeb3e23360 R11: 0000000000000246 R12: 0000557664b69610
[  155.407761] R13: 0000000000000000 R14: 0000557664b6ada0 R15: 0000557664b72240

> This is the major reason why inventing any wheels locally in the kernel
> isn't a good idea - aside from the duplicated code, there is likely
> already a solution for whatever you need. There's all the validation
> tools, careful thought about semantics (especially around races) and all
> that stuff that you're always missing on your hand-rolled thing. Aside
> from that, anything hand-rolled is much harder to read, since intent
> isn't clear.
> -Daniel
> 
> 
> > 
> > Matt
> > 
> > > -Daniel
> > > 
> > > > +
> > > >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > > >  }
> > > >  
> > > > -- 
> > > > 2.32.0
> > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


* Re: [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H
  2021-08-12 22:38         ` Matthew Brost
@ 2021-08-13 15:11           ` Daniel Vetter
  2021-08-13 19:02             ` Matthew Brost
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2021-08-13 15:11 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, gfx-internal-devel, dri-devel

On Thu, Aug 12, 2021 at 10:38:18PM +0000, Matthew Brost wrote:
> On Thu, Aug 12, 2021 at 09:47:23PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 12, 2021 at 03:23:30PM +0000, Matthew Brost wrote:
> > > On Thu, Aug 12, 2021 at 04:11:28PM +0200, Daniel Vetter wrote:
> > > > On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote:
> > > > > Flush the work queue for GuC generated G2H messages during a GT reset.
> > > > > This is accomplished by spinning until the list of outstanding G2H goes
> > > > > empty.
> > > > > 
> > > > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > Cc: <stable@vger.kernel.org>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
> > > > >  1 file changed, 5 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > index 3cd2da6f5c03..e5eb2df11b4a 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > > > >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > > > >  		} while (!list_empty(&guc->ct.requests.incoming));
> > > > >  	}
> > > > > +
> > > > > +	/* Flush any GuC generated G2H */
> > > > > +	while (!list_empty(&guc->ct.requests.incoming))
> > > > > +		msleep(20);
> > > > 
> > > > flush_work or flush_workqueue, because that comes with lockdep
> > > > annotations. Don't hand-roll stuff like this if at all possible.
> > > 
> > > lockdep puked when I used that.
> > 
> > Lockdep tends to be right ... So we definitely want that, but maybe a
> > different flavour, or there's something wrong with the workqueue setup.
> > 
> 
> Here is a dependency chain that lockdep doesn't like.
> 
> fs_reclaim_acquire -> &gt->reset.mutex (shrinker)
> workqueue -> fs_reclaim_acquire (error capture in workqueue)
> &gt->reset.mutex -> workqueue (reset)
> 
> In practice I don't think we could ever hit this, but lockdep does
> look right here. Trying to work out how to fix this. We really need
> all G2H to be fully processed before we proceed during a reset or we
> have races. I have a few ideas of how to handle this but can't convince
> myself any of them are fully safe.

It might be false sharing due to a single workqueue, or a single-threaded
workqueue.

Essentially the lockdep annotations for work_struct track two things:
- dependencies against the specific work item
- dependencies against anything queued on that work queue, if you flush
  the entire queue, or if you flush a work item that's on a
  single-threaded queue.

Because if the locking around guc/host communication is inverted like
this, you have a much bigger problem.

Note that if you pick a different workqueue for your guc work stuff then
you need to make sure that's all properly flushed on suspend and driver
unload.

It might also be that the reset work is on the wrong workqueue.

Either way, this must be fixed, because I've seen too many of these "it
never happens in practice" cases blow up, plus if your locking scheme is
engineered with quicksand, forget about anyone ever understanding it.
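
Roughly what I have in mind, as a sketch only - the names (guc_g2h_wq,
guc_g2h_wq_init/fini) are invented for illustration, this is not the
actual i915 code:

#include <linux/errno.h>
#include <linux/workqueue.h>

/*
 * Sketch: give G2H processing its own workqueue, so that flushing it
 * only pulls in the lock dependencies of the G2H work items themselves
 * and not those of unrelated work sharing a system workqueue.
 */
static struct workqueue_struct *guc_g2h_wq;

static int guc_g2h_wq_init(void)
{
        /* deliberately not ordered/single-threaded: flushing one work
         * item then doesn't also wait on (and inherit the dependencies
         * of) every other item queued on it */
        guc_g2h_wq = alloc_workqueue("guc-g2h", 0, 0);
        return guc_g2h_wq ? 0 : -ENOMEM;
}

static void guc_g2h_wq_fini(void)
{
        /* a driver-private workqueue must also be flushed on suspend
         * and driver unload, as noted above */
        flush_workqueue(guc_g2h_wq);
        destroy_workqueue(guc_g2h_wq);
}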
-Daniel

> 
> Splat below:
> 
> [  154.625989] ======================================================
> [  154.632195] WARNING: possible circular locking dependency detected
> [  154.638393] 5.14.0-rc5-guc+ #50 Tainted: G     U
> [  154.643991] ------------------------------------------------------
> [  154.650196] i915_selftest/1673 is trying to acquire lock:
> [  154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0
> [  154.665826]
>                but task is already holding lock:
> [  154.671682] ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> [  154.680659]
>                which lock already depends on the new lock.
> 
> [  154.688857]
>                the existing dependency chain (in reverse order) is:
> [  154.696365]
>                -> #2 (&gt->reset.mutex){+.+.}-{3:3}:
> [  154.702571]        lock_acquire+0xd2/0x300
> [  154.706695]        i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
> [  154.712959]        intel_gt_init_reset+0x61/0x80 [i915]
> [  154.718258]        intel_gt_init_early+0xe6/0x120 [i915]
> [  154.723648]        i915_driver_probe+0x592/0xdc0 [i915]
> [  154.728942]        i915_pci_probe+0x43/0x1c0 [i915]
> [  154.733891]        pci_device_probe+0x9b/0x110
> [  154.738362]        really_probe+0x1a6/0x3a0
> [  154.742568]        __driver_probe_device+0xf9/0x170
> [  154.747468]        driver_probe_device+0x19/0x90
> [  154.752114]        __driver_attach+0x99/0x170
> [  154.756492]        bus_for_each_dev+0x73/0xc0
> [  154.760870]        bus_add_driver+0x14b/0x1f0
> [  154.765248]        driver_register+0x67/0xb0
> [  154.769542]        i915_init+0x18/0x8c [i915]
> [  154.773964]        do_one_initcall+0x53/0x2e0
> [  154.778343]        do_init_module+0x56/0x210
> [  154.782639]        load_module+0x25fc/0x29f0
> [  154.786934]        __do_sys_finit_module+0xae/0x110
> [  154.791835]        do_syscall_64+0x38/0xc0
> [  154.795958]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  154.801558]
>                -> #1 (fs_reclaim){+.+.}-{0:0}:
> [  154.807241]        lock_acquire+0xd2/0x300
> [  154.811361]        fs_reclaim_acquire+0x9e/0xd0
> [  154.815914]        kmem_cache_alloc_trace+0x30/0x790
> [  154.820899]        i915_gpu_coredump_alloc+0x53/0x1a0 [i915]
> [  154.826649]        i915_gpu_coredump+0x39/0x560 [i915]
> [  154.831866]        i915_capture_error_state+0xa/0x70 [i915]
> [  154.837513]        intel_guc_context_reset_process_msg+0x174/0x1f0 [i915]
> [  154.844383]        ct_incoming_request_worker_func+0x130/0x1b0 [i915]
> [  154.850898]        process_one_work+0x264/0x590
> [  154.855451]        worker_thread+0x4b/0x3a0
> [  154.859655]        kthread+0x147/0x170
> [  154.863428]        ret_from_fork+0x1f/0x30
> [  154.867548]
>                -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}:
> [  154.875747]        check_prev_add+0x90/0xc30
> [  154.880042]        __lock_acquire+0x1643/0x2110
> [  154.884595]        lock_acquire+0xd2/0x300
> [  154.888715]        __flush_work+0x373/0x4d0
> [  154.892920]        intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> [  154.899606]        intel_uc_reset_prepare+0x40/0x50 [i915]
> [  154.905166]        reset_prepare+0x55/0x60 [i915]
> [  154.909946]        intel_gt_reset+0x11c/0x300 [i915]
> [  154.914984]        do_device_reset+0x13/0x20 [i915]
> [  154.919936]        check_whitelist_across_reset+0x166/0x250 [i915]
> [  154.926212]        live_reset_whitelist.cold+0x6a/0x7a [i915]
> [  154.932037]        __i915_subtests.cold+0x20/0x74 [i915]
> [  154.937428]        __run_selftests.cold+0x96/0xee [i915]
> [  154.942816]        i915_live_selftests+0x2c/0x60 [i915]
> [  154.948125]        i915_pci_probe+0x93/0x1c0 [i915]
> [  154.953076]        pci_device_probe+0x9b/0x110
> [  154.957545]        really_probe+0x1a6/0x3a0
> [  154.961749]        __driver_probe_device+0xf9/0x170
> [  154.966653]        driver_probe_device+0x19/0x90
> [  154.971290]        __driver_attach+0x99/0x170
> [  154.975671]        bus_for_each_dev+0x73/0xc0
> [  154.980053]        bus_add_driver+0x14b/0x1f0
> [  154.984431]        driver_register+0x67/0xb0
> [  154.988725]        i915_init+0x18/0x8c [i915]
> [  154.993149]        do_one_initcall+0x53/0x2e0
> [  154.997527]        do_init_module+0x56/0x210
> [  155.001822]        load_module+0x25fc/0x29f0
> [  155.006118]        __do_sys_finit_module+0xae/0x110
> [  155.011019]        do_syscall_64+0x38/0xc0
> [  155.015139]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  155.020729]
>                other info that might help us debug this:
> 
> [  155.028752] Chain exists of:
>                  (work_completion)(&ct->requests.worker) --> fs_reclaim --> &gt->reset.mutex
> 
> [  155.041294]  Possible unsafe locking scenario:
> 
> [  155.047240]        CPU0                    CPU1
> [  155.051791]        ----                    ----
> [  155.056344]   lock(&gt->reset.mutex);
> [  155.060026]                                lock(fs_reclaim);
> [  155.065706]                                lock(&gt->reset.mutex);
> [  155.071912]   lock((work_completion)(&ct->requests.worker));
> [  155.077595]
>                 *** DEADLOCK ***
> 
> [  155.083534] 2 locks held by i915_selftest/1673:
> [  155.088086]  #0: ffff888103851240 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x8e/0x170
> [  155.096460]  #1: ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> [  155.105845]
>                stack backtrace:
> [  155.110223] CPU: 6 PID: 1673 Comm: i915_selftest Tainted: G     U            5.14.0-rc5-guc+ #50
> [  155.119035] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U LPDDR4/4x T4 RVP, BIOS TGLSFWI1.R00.3425.A00.2010162309 10/16/2020
> [  155.132530] Call Trace:
> [  155.134999]  dump_stack_lvl+0x57/0x7d
> [  155.138690]  check_noncircular+0x12d/0x150
> [  155.142814]  check_prev_add+0x90/0xc30
> [  155.146587]  __lock_acquire+0x1643/0x2110
> [  155.150618]  lock_acquire+0xd2/0x300
> [  155.154218]  ? __flush_work+0x350/0x4d0
> [  155.158073]  ? __lock_acquire+0x5f3/0x2110
> [  155.162194]  __flush_work+0x373/0x4d0
> [  155.165883]  ? __flush_work+0x350/0x4d0
> [  155.169739]  ? mark_held_locks+0x49/0x70
> [  155.173686]  ? _raw_spin_unlock_irqrestore+0x50/0x70
> [  155.178679]  intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> [  155.184857]  ? _raw_spin_unlock_irqrestore+0x50/0x70
> [  155.189843]  intel_uc_reset_prepare+0x40/0x50 [i915]
> [  155.194891]  reset_prepare+0x55/0x60 [i915]
> [  155.199145]  intel_gt_reset+0x11c/0x300 [i915]
> [  155.203657]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
> [  155.208641]  ? do_engine_reset+0x10/0x10 [i915]
> [  155.213243]  do_device_reset+0x13/0x20 [i915]
> [  155.217664]  check_whitelist_across_reset+0x166/0x250 [i915]
> [  155.223415]  live_reset_whitelist.cold+0x6a/0x7a [i915]
> [  155.228725]  __i915_subtests.cold+0x20/0x74 [i915]
> [  155.233593]  ? __i915_live_teardown+0x50/0x50 [i915]
> [  155.238644]  ? __intel_gt_live_setup+0x30/0x30 [i915]
> [  155.243773]  __run_selftests.cold+0x96/0xee [i915]
> [  155.248646]  i915_live_selftests+0x2c/0x60 [i915]
> [  155.253425]  i915_pci_probe+0x93/0x1c0 [i915]
> [  155.257854]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
> [  155.262839]  pci_device_probe+0x9b/0x110
> [  155.266785]  really_probe+0x1a6/0x3a0
> [  155.270467]  __driver_probe_device+0xf9/0x170
> [  155.274846]  driver_probe_device+0x19/0x90
> [  155.278968]  __driver_attach+0x99/0x170
> [  155.282824]  ? __device_attach_driver+0xd0/0xd0
> [  155.287376]  ? __device_attach_driver+0xd0/0xd0
> [  155.291928]  bus_for_each_dev+0x73/0xc0
> [  155.295792]  bus_add_driver+0x14b/0x1f0
> [  155.299648]  driver_register+0x67/0xb0
> [  155.303419]  i915_init+0x18/0x8c [i915]
> [  155.307328]  ? 0xffffffffa0188000
> [  155.310669]  do_one_initcall+0x53/0x2e0
> [  155.314525]  ? rcu_read_lock_sched_held+0x4d/0x80
> [  155.319253]  ? kmem_cache_alloc_trace+0x5ad/0x790
> [  155.323982]  do_init_module+0x56/0x210
> [  155.327754]  load_module+0x25fc/0x29f0
> [  155.331528]  ? __do_sys_finit_module+0xae/0x110
> [  155.336081]  __do_sys_finit_module+0xae/0x110
> [  155.340462]  do_syscall_64+0x38/0xc0
> [  155.344060]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  155.349137] RIP: 0033:0x7efc4496389d
> [  155.352735] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48
> [  155.371530] RSP: 002b:00007ffeb3e23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [  155.379123] RAX: ffffffffffffffda RBX: 0000557664b72240 RCX: 00007efc4496389d
> [  155.386282] RDX: 0000000000000000 RSI: 0000557664b69610 RDI: 0000000000000004
> [  155.393443] RBP: 0000000000000020 R08: 00007ffeb3e21ff0 R09: 0000557664b682f0
> [  155.400603] R10: 00007ffeb3e23360 R11: 0000000000000246 R12: 0000557664b69610
> [  155.407761] R13: 0000000000000000 R14: 0000557664b6ada0 R15: 0000557664b72240
> 
> > This is the major reason why inventing any wheels locally in the kernel
> > isn't a good idea - aside from the duplicated code, there is likely
> > already a solution for whatever you need. There are all the validation
> > tools, careful thought about semantics (especially around races) and all
> > that stuff that's always missing from a hand-rolled thing. Aside from
> > that, anything hand-rolled is much harder to read, since intent isn't clear.
> > -Daniel
> > 
> > 
> > > 
> > > Matt
> > > 
> > > > -Daniel
> > > > 
> > > > > +
> > > > >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > > > >  }
> > > > >  
> > > > > -- 
> > > > > 2.32.0
> > > > > 
> > > > 
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H
  2021-08-13 15:11           ` Daniel Vetter
@ 2021-08-13 19:02             ` Matthew Brost
  2021-08-16 15:39               ` Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Brost @ 2021-08-13 19:02 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: gfx-internal-devel, dri-devel

On Fri, Aug 13, 2021 at 05:11:59PM +0200, Daniel Vetter wrote:
> On Thu, Aug 12, 2021 at 10:38:18PM +0000, Matthew Brost wrote:
> > On Thu, Aug 12, 2021 at 09:47:23PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 12, 2021 at 03:23:30PM +0000, Matthew Brost wrote:
> > > > On Thu, Aug 12, 2021 at 04:11:28PM +0200, Daniel Vetter wrote:
> > > > > On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote:
> > > > > > Flush the work queue for GuC generated G2H messages during a GT reset.
> > > > > > This is accomplished by spinning until the list of outstanding G2H
> > > > > > messages goes empty.
> > > > > > 
> > > > > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > Cc: <stable@vger.kernel.org>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
> > > > > >  1 file changed, 5 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > index 3cd2da6f5c03..e5eb2df11b4a 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > > > > >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > > > > >  		} while (!list_empty(&guc->ct.requests.incoming));
> > > > > >  	}
> > > > > > +
> > > > > > +	/* Flush any GuC generated G2H */
> > > > > > +	while (!list_empty(&guc->ct.requests.incoming))
> > > > > > +		msleep(20);
> > > > > 
> > > > > flush_work or flush_workqueue, because that comes with lockdep
> > > > > annotations. Don't hand-roll stuff like this if at all possible.
> > > > 
> > > > lockdep puked when I used that.
> > > 
> > > Lockdep tends to be right ... So we definitely want that, but maybe a
> > > different flavour, or there's something wrong with the workqueue setup.
> > > 
> > 
> > Here is a dependency chain that lockdep doesn't like.
> > 
> > fs_reclaim_acquire -> &gt->reset.mutex (shrinker)
> > workqueue -> fs_reclaim_acquire (error capture in workqueue)
> > &gt->reset.mutex -> workqueue (reset)
> > 
> > In practice I don't think we could ever hit this, but lockdep does
> > look right here. Trying to work out how to fix this. We really need
> > all G2H to be fully processed before we proceed during a reset or we
> > have races. I have a few ideas of how to handle this but can't convince
> > myself any of them are fully safe.
> 
> It might be false sharing due to a single workqueue, or a single-threaded
> workqueue.
> 
> Essentially the lockdep annotations for work_struct track two things:
> - dependencies against the specific work item
> - dependencies against anything queued on that work queue, if you flush
>   the entire queue, or if you flush a work item that's on a
>   single-threaded queue.
> 
> Because if the locking around guc/host communication is inverted like
> this, you have a much bigger problem.
> 
> Note that if you pick a different workqueue for your guc work stuff then
> you need to make sure that's all properly flushed on suspend and driver
> unload.
> 
> It might also be that the reset work is on the wrong workqueue.
> 
> Either way, this must be fixed, because I've seen too many of these "it
> never happens in practice" cases blow up, plus if your locking scheme is
> engineered with quicksand, forget about anyone ever understanding it.

The solution is to allocate the memory for the error capture atomically
(GFP_ATOMIC) when the error capture is being done from the G2H work
queue. That means the allocation can possibly fail if the system is
under memory pressure, but that is better than the lockdep splat and
the deadlock it warns about.
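
Something like this, as a hand-wavy sketch - the helper name and the
exact plumbing into the capture code are assumptions, not the actual
patch:

#include <linux/slab.h>

/*
 * Sketch: allocation for the error capture when running from the G2H
 * worker.  GFP_ATOMIC never enters fs_reclaim, so the
 * fs_reclaim -> gt->reset.mutex edge from the splat can't be taken
 * from this path; the price is that the allocation can fail under
 * memory pressure, in which case we drop the capture rather than
 * risk a deadlock.
 */
static void *capture_alloc_from_g2h(size_t size)
{
        return kzalloc(size, GFP_ATOMIC | __GFP_NOWARN);
}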

Matt

> -Daniel
> 
> > 
> > Splat below:
> > 
> > [  154.625989] ======================================================
> > [  154.632195] WARNING: possible circular locking dependency detected
> > [  154.638393] 5.14.0-rc5-guc+ #50 Tainted: G     U
> > [  154.643991] ------------------------------------------------------
> > [  154.650196] i915_selftest/1673 is trying to acquire lock:
> > [  154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0
> > [  154.665826]
> >                but task is already holding lock:
> > [  154.671682] ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> > [  154.680659]
> >                which lock already depends on the new lock.
> > 
> > [  154.688857]
> >                the existing dependency chain (in reverse order) is:
> > [  154.696365]
> >                -> #2 (&gt->reset.mutex){+.+.}-{3:3}:
> > [  154.702571]        lock_acquire+0xd2/0x300
> > [  154.706695]        i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
> > [  154.712959]        intel_gt_init_reset+0x61/0x80 [i915]
> > [  154.718258]        intel_gt_init_early+0xe6/0x120 [i915]
> > [  154.723648]        i915_driver_probe+0x592/0xdc0 [i915]
> > [  154.728942]        i915_pci_probe+0x43/0x1c0 [i915]
> > [  154.733891]        pci_device_probe+0x9b/0x110
> > [  154.738362]        really_probe+0x1a6/0x3a0
> > [  154.742568]        __driver_probe_device+0xf9/0x170
> > [  154.747468]        driver_probe_device+0x19/0x90
> > [  154.752114]        __driver_attach+0x99/0x170
> > [  154.756492]        bus_for_each_dev+0x73/0xc0
> > [  154.760870]        bus_add_driver+0x14b/0x1f0
> > [  154.765248]        driver_register+0x67/0xb0
> > [  154.769542]        i915_init+0x18/0x8c [i915]
> > [  154.773964]        do_one_initcall+0x53/0x2e0
> > [  154.778343]        do_init_module+0x56/0x210
> > [  154.782639]        load_module+0x25fc/0x29f0
> > [  154.786934]        __do_sys_finit_module+0xae/0x110
> > [  154.791835]        do_syscall_64+0x38/0xc0
> > [  154.795958]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [  154.801558]
> >                -> #1 (fs_reclaim){+.+.}-{0:0}:
> > [  154.807241]        lock_acquire+0xd2/0x300
> > [  154.811361]        fs_reclaim_acquire+0x9e/0xd0
> > [  154.815914]        kmem_cache_alloc_trace+0x30/0x790
> > [  154.820899]        i915_gpu_coredump_alloc+0x53/0x1a0 [i915]
> > [  154.826649]        i915_gpu_coredump+0x39/0x560 [i915]
> > [  154.831866]        i915_capture_error_state+0xa/0x70 [i915]
> > [  154.837513]        intel_guc_context_reset_process_msg+0x174/0x1f0 [i915]
> > [  154.844383]        ct_incoming_request_worker_func+0x130/0x1b0 [i915]
> > [  154.850898]        process_one_work+0x264/0x590
> > [  154.855451]        worker_thread+0x4b/0x3a0
> > [  154.859655]        kthread+0x147/0x170
> > [  154.863428]        ret_from_fork+0x1f/0x30
> > [  154.867548]
> >                -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}:
> > [  154.875747]        check_prev_add+0x90/0xc30
> > [  154.880042]        __lock_acquire+0x1643/0x2110
> > [  154.884595]        lock_acquire+0xd2/0x300
> > [  154.888715]        __flush_work+0x373/0x4d0
> > [  154.892920]        intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> > [  154.899606]        intel_uc_reset_prepare+0x40/0x50 [i915]
> > [  154.905166]        reset_prepare+0x55/0x60 [i915]
> > [  154.909946]        intel_gt_reset+0x11c/0x300 [i915]
> > [  154.914984]        do_device_reset+0x13/0x20 [i915]
> > [  154.919936]        check_whitelist_across_reset+0x166/0x250 [i915]
> > [  154.926212]        live_reset_whitelist.cold+0x6a/0x7a [i915]
> > [  154.932037]        __i915_subtests.cold+0x20/0x74 [i915]
> > [  154.937428]        __run_selftests.cold+0x96/0xee [i915]
> > [  154.942816]        i915_live_selftests+0x2c/0x60 [i915]
> > [  154.948125]        i915_pci_probe+0x93/0x1c0 [i915]
> > [  154.953076]        pci_device_probe+0x9b/0x110
> > [  154.957545]        really_probe+0x1a6/0x3a0
> > [  154.961749]        __driver_probe_device+0xf9/0x170
> > [  154.966653]        driver_probe_device+0x19/0x90
> > [  154.971290]        __driver_attach+0x99/0x170
> > [  154.975671]        bus_for_each_dev+0x73/0xc0
> > [  154.980053]        bus_add_driver+0x14b/0x1f0
> > [  154.984431]        driver_register+0x67/0xb0
> > [  154.988725]        i915_init+0x18/0x8c [i915]
> > [  154.993149]        do_one_initcall+0x53/0x2e0
> > [  154.997527]        do_init_module+0x56/0x210
> > [  155.001822]        load_module+0x25fc/0x29f0
> > [  155.006118]        __do_sys_finit_module+0xae/0x110
> > [  155.011019]        do_syscall_64+0x38/0xc0
> > [  155.015139]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [  155.020729]
> >                other info that might help us debug this:
> > 
> > [  155.028752] Chain exists of:
> >                  (work_completion)(&ct->requests.worker) --> fs_reclaim --> &gt->reset.mutex
> > 
> > [  155.041294]  Possible unsafe locking scenario:
> > 
> > [  155.047240]        CPU0                    CPU1
> > [  155.051791]        ----                    ----
> > [  155.056344]   lock(&gt->reset.mutex);
> > [  155.060026]                                lock(fs_reclaim);
> > [  155.065706]                                lock(&gt->reset.mutex);
> > [  155.071912]   lock((work_completion)(&ct->requests.worker));
> > [  155.077595]
> >                 *** DEADLOCK ***
> > 
> > [  155.083534] 2 locks held by i915_selftest/1673:
> > [  155.088086]  #0: ffff888103851240 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x8e/0x170
> > [  155.096460]  #1: ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> > [  155.105845]
> >                stack backtrace:
> > [  155.110223] CPU: 6 PID: 1673 Comm: i915_selftest Tainted: G     U            5.14.0-rc5-guc+ #50
> > [  155.119035] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U LPDDR4/4x T4 RVP, BIOS TGLSFWI1.R00.3425.A00.2010162309 10/16/2020
> > [  155.132530] Call Trace:
> > [  155.134999]  dump_stack_lvl+0x57/0x7d
> > [  155.138690]  check_noncircular+0x12d/0x150
> > [  155.142814]  check_prev_add+0x90/0xc30
> > [  155.146587]  __lock_acquire+0x1643/0x2110
> > [  155.150618]  lock_acquire+0xd2/0x300
> > [  155.154218]  ? __flush_work+0x350/0x4d0
> > [  155.158073]  ? __lock_acquire+0x5f3/0x2110
> > [  155.162194]  __flush_work+0x373/0x4d0
> > [  155.165883]  ? __flush_work+0x350/0x4d0
> > [  155.169739]  ? mark_held_locks+0x49/0x70
> > [  155.173686]  ? _raw_spin_unlock_irqrestore+0x50/0x70
> > [  155.178679]  intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> > [  155.184857]  ? _raw_spin_unlock_irqrestore+0x50/0x70
> > [  155.189843]  intel_uc_reset_prepare+0x40/0x50 [i915]
> > [  155.194891]  reset_prepare+0x55/0x60 [i915]
> > [  155.199145]  intel_gt_reset+0x11c/0x300 [i915]
> > [  155.203657]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
> > [  155.208641]  ? do_engine_reset+0x10/0x10 [i915]
> > [  155.213243]  do_device_reset+0x13/0x20 [i915]
> > [  155.217664]  check_whitelist_across_reset+0x166/0x250 [i915]
> > [  155.223415]  live_reset_whitelist.cold+0x6a/0x7a [i915]
> > [  155.228725]  __i915_subtests.cold+0x20/0x74 [i915]
> > [  155.233593]  ? __i915_live_teardown+0x50/0x50 [i915]
> > [  155.238644]  ? __intel_gt_live_setup+0x30/0x30 [i915]
> > [  155.243773]  __run_selftests.cold+0x96/0xee [i915]
> > [  155.248646]  i915_live_selftests+0x2c/0x60 [i915]
> > [  155.253425]  i915_pci_probe+0x93/0x1c0 [i915]
> > [  155.257854]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
> > [  155.262839]  pci_device_probe+0x9b/0x110
> > [  155.266785]  really_probe+0x1a6/0x3a0
> > [  155.270467]  __driver_probe_device+0xf9/0x170
> > [  155.274846]  driver_probe_device+0x19/0x90
> > [  155.278968]  __driver_attach+0x99/0x170
> > [  155.282824]  ? __device_attach_driver+0xd0/0xd0
> > [  155.287376]  ? __device_attach_driver+0xd0/0xd0
> > [  155.291928]  bus_for_each_dev+0x73/0xc0
> > [  155.295792]  bus_add_driver+0x14b/0x1f0
> > [  155.299648]  driver_register+0x67/0xb0
> > [  155.303419]  i915_init+0x18/0x8c [i915]
> > [  155.307328]  ? 0xffffffffa0188000
> > [  155.310669]  do_one_initcall+0x53/0x2e0
> > [  155.314525]  ? rcu_read_lock_sched_held+0x4d/0x80
> > [  155.319253]  ? kmem_cache_alloc_trace+0x5ad/0x790
> > [  155.323982]  do_init_module+0x56/0x210
> > [  155.327754]  load_module+0x25fc/0x29f0
> > [  155.331528]  ? __do_sys_finit_module+0xae/0x110
> > [  155.336081]  __do_sys_finit_module+0xae/0x110
> > [  155.340462]  do_syscall_64+0x38/0xc0
> > [  155.344060]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [  155.349137] RIP: 0033:0x7efc4496389d
> > [  155.352735] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48
> > [  155.371530] RSP: 002b:00007ffeb3e23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > [  155.379123] RAX: ffffffffffffffda RBX: 0000557664b72240 RCX: 00007efc4496389d
> > [  155.386282] RDX: 0000000000000000 RSI: 0000557664b69610 RDI: 0000000000000004
> > [  155.393443] RBP: 0000000000000020 R08: 00007ffeb3e21ff0 R09: 0000557664b682f0
> > [  155.400603] R10: 00007ffeb3e23360 R11: 0000000000000246 R12: 0000557664b69610
> > [  155.407761] R13: 0000000000000000 R14: 0000557664b6ada0 R15: 0000557664b72240
> > 
> > > This is the major reason why inventing any wheels locally in the kernel
> > > isn't a good idea - aside from the duplicated code, there is likely
> > > already a solution for whatever you need. There are all the validation
> > > tools, careful thought about semantics (especially around races) and all
> > > that stuff that's always missing from a hand-rolled thing. Aside from
> > > that, anything hand-rolled is much harder to read, since intent isn't clear.
> > > -Daniel
> > > 
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > > -Daniel
> > > > > 
> > > > > > +
> > > > > >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > > > > >  }
> > > > > >  
> > > > > > -- 
> > > > > > 2.32.0
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H
  2021-08-13 19:02             ` Matthew Brost
@ 2021-08-16 15:39               ` Daniel Vetter
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Vetter @ 2021-08-16 15:39 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, gfx-internal-devel, dri-devel

On Fri, Aug 13, 2021 at 07:02:55PM +0000, Matthew Brost wrote:
> On Fri, Aug 13, 2021 at 05:11:59PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 12, 2021 at 10:38:18PM +0000, Matthew Brost wrote:
> > > On Thu, Aug 12, 2021 at 09:47:23PM +0200, Daniel Vetter wrote:
> > > > On Thu, Aug 12, 2021 at 03:23:30PM +0000, Matthew Brost wrote:
> > > > > On Thu, Aug 12, 2021 at 04:11:28PM +0200, Daniel Vetter wrote:
> > > > > > On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote:
> > > > > > > Flush the work queue for GuC generated G2H messages during a GT reset.
> > > > > > > This is accomplished by spinning until the list of outstanding G2H
> > > > > > > messages goes empty.
> > > > > > > 
> > > > > > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > Cc: <stable@vger.kernel.org>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
> > > > > > >  1 file changed, 5 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > index 3cd2da6f5c03..e5eb2df11b4a 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > > > > > >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > > > > > >  		} while (!list_empty(&guc->ct.requests.incoming));
> > > > > > >  	}
> > > > > > > +
> > > > > > > +	/* Flush any GuC generated G2H */
> > > > > > > +	while (!list_empty(&guc->ct.requests.incoming))
> > > > > > > +		msleep(20);
> > > > > > 
> > > > > > flush_work or flush_workqueue, because that comes with lockdep
> > > > > > annotations. Don't hand-roll stuff like this if at all possible.
> > > > > 
> > > > > lockdep puked when I used that.
> > > > 
> > > > Lockdep tends to be right ... So we definitely want that, but maybe a
> > > > different flavour, or there's something wrong with the workqueue setup.
> > > > 
> > > 
> > > Here is a dependency chain that lockdep doesn't like.
> > > 
> > > fs_reclaim_acquire -> &gt->reset.mutex (shrinker)
> > > workqueue -> fs_reclaim_acquire (error capture in workqueue)
> > > &gt->reset.mutex -> workqueue (reset)
> > > 
> > > In practice I don't think we could ever hit this, but lockdep does
> > > look right here. Trying to work out how to fix this. We really need
> > > all G2H to be fully processed before we proceed during a reset or we
> > > have races. I have a few ideas of how to handle this but can't convince
> > > myself any of them are fully safe.
> > 
> > It might be false sharing due to a single workqueue, or a single-threaded
> > workqueue.
> > 
> > Essentially the lockdep annotations for work_struct track two things:
> > - dependencies against the specific work item
> > - dependencies against anything queued on that work queue, if you flush
> >   the entire queue, or if you flush a work item that's on a
> >   single-threaded queue.
> > 
> > Because if the locking around guc/host communication is inverted like
> > this, you have a much bigger problem.
> > 
> > Note that if you pick a different workqueue for your guc work stuff then
> > you need to make sure that's all properly flushed on suspend and driver
> > unload.
> > 
> > It might also be that the reset work is on the wrong workqueue.
> > 
> > Either way, this must be fixed, because I've seen too many of these "it
> > never happens in practice" cases blow up, plus if your locking scheme is
> > engineered with quicksand, forget about anyone ever understanding it.
> 
> The solution is to allocate the memory for the error capture atomically
> (GFP_ATOMIC) when the error capture is being done from the G2H work
> queue. That means the allocation can possibly fail if the system is
> under memory pressure, but that is better than the lockdep splat and
> the deadlock it warns about.

Ah yeah, if this is for error capture then GFP_ATOMIC is the right option.
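
And with the capture allocation atomic, the reset path can then use the
properly annotated flush instead of the open-coded polling - a sketch
building on the driver's existing intel_guc_ct type, not the actual
patch:

#include <linux/workqueue.h>
/* assumes i915's internal header for struct intel_guc_ct */

static void guc_flush_g2h(struct intel_guc_ct *ct)
{
        /* The G2H worker no longer enters fs_reclaim, so flushing it
         * while holding gt->reset.mutex no longer closes the lock
         * cycle from the splat - and lockdep can now verify that. */
        flush_work(&ct->requests.worker);
}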
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > > 
> > > Splat below:
> > > 
> > > [  154.625989] ======================================================
> > > [  154.632195] WARNING: possible circular locking dependency detected
> > > [  154.638393] 5.14.0-rc5-guc+ #50 Tainted: G     U
> > > [  154.643991] ------------------------------------------------------
> > > [  154.650196] i915_selftest/1673 is trying to acquire lock:
> > > [  154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0
> > > [  154.665826]
> > >                but task is already holding lock:
> > > [  154.671682] ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> > > [  154.680659]
> > >                which lock already depends on the new lock.
> > > 
> > > [  154.688857]
> > >                the existing dependency chain (in reverse order) is:
> > > [  154.696365]
> > >                -> #2 (&gt->reset.mutex){+.+.}-{3:3}:
> > > [  154.702571]        lock_acquire+0xd2/0x300
> > > [  154.706695]        i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
> > > [  154.712959]        intel_gt_init_reset+0x61/0x80 [i915]
> > > [  154.718258]        intel_gt_init_early+0xe6/0x120 [i915]
> > > [  154.723648]        i915_driver_probe+0x592/0xdc0 [i915]
> > > [  154.728942]        i915_pci_probe+0x43/0x1c0 [i915]
> > > [  154.733891]        pci_device_probe+0x9b/0x110
> > > [  154.738362]        really_probe+0x1a6/0x3a0
> > > [  154.742568]        __driver_probe_device+0xf9/0x170
> > > [  154.747468]        driver_probe_device+0x19/0x90
> > > [  154.752114]        __driver_attach+0x99/0x170
> > > [  154.756492]        bus_for_each_dev+0x73/0xc0
> > > [  154.760870]        bus_add_driver+0x14b/0x1f0
> > > [  154.765248]        driver_register+0x67/0xb0
> > > [  154.769542]        i915_init+0x18/0x8c [i915]
> > > [  154.773964]        do_one_initcall+0x53/0x2e0
> > > [  154.778343]        do_init_module+0x56/0x210
> > > [  154.782639]        load_module+0x25fc/0x29f0
> > > [  154.786934]        __do_sys_finit_module+0xae/0x110
> > > [  154.791835]        do_syscall_64+0x38/0xc0
> > > [  154.795958]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [  154.801558]
> > >                -> #1 (fs_reclaim){+.+.}-{0:0}:
> > > [  154.807241]        lock_acquire+0xd2/0x300
> > > [  154.811361]        fs_reclaim_acquire+0x9e/0xd0
> > > [  154.815914]        kmem_cache_alloc_trace+0x30/0x790
> > > [  154.820899]        i915_gpu_coredump_alloc+0x53/0x1a0 [i915]
> > > [  154.826649]        i915_gpu_coredump+0x39/0x560 [i915]
> > > [  154.831866]        i915_capture_error_state+0xa/0x70 [i915]
> > > [  154.837513]        intel_guc_context_reset_process_msg+0x174/0x1f0 [i915]
> > > [  154.844383]        ct_incoming_request_worker_func+0x130/0x1b0 [i915]
> > > [  154.850898]        process_one_work+0x264/0x590
> > > [  154.855451]        worker_thread+0x4b/0x3a0
> > > [  154.859655]        kthread+0x147/0x170
> > > [  154.863428]        ret_from_fork+0x1f/0x30
> > > [  154.867548]
> > >                -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}:
> > > [  154.875747]        check_prev_add+0x90/0xc30
> > > [  154.880042]        __lock_acquire+0x1643/0x2110
> > > [  154.884595]        lock_acquire+0xd2/0x300
> > > [  154.888715]        __flush_work+0x373/0x4d0
> > > [  154.892920]        intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> > > [  154.899606]        intel_uc_reset_prepare+0x40/0x50 [i915]
> > > [  154.905166]        reset_prepare+0x55/0x60 [i915]
> > > [  154.909946]        intel_gt_reset+0x11c/0x300 [i915]
> > > [  154.914984]        do_device_reset+0x13/0x20 [i915]
> > > [  154.919936]        check_whitelist_across_reset+0x166/0x250 [i915]
> > > [  154.926212]        live_reset_whitelist.cold+0x6a/0x7a [i915]
> > > [  154.932037]        __i915_subtests.cold+0x20/0x74 [i915]
> > > [  154.937428]        __run_selftests.cold+0x96/0xee [i915]
> > > [  154.942816]        i915_live_selftests+0x2c/0x60 [i915]
> > > [  154.948125]        i915_pci_probe+0x93/0x1c0 [i915]
> > > [  154.953076]        pci_device_probe+0x9b/0x110
> > > [  154.957545]        really_probe+0x1a6/0x3a0
> > > [  154.961749]        __driver_probe_device+0xf9/0x170
> > > [  154.966653]        driver_probe_device+0x19/0x90
> > > [  154.971290]        __driver_attach+0x99/0x170
> > > [  154.975671]        bus_for_each_dev+0x73/0xc0
> > > [  154.980053]        bus_add_driver+0x14b/0x1f0
> > > [  154.984431]        driver_register+0x67/0xb0
> > > [  154.988725]        i915_init+0x18/0x8c [i915]
> > > [  154.993149]        do_one_initcall+0x53/0x2e0
> > > [  154.997527]        do_init_module+0x56/0x210
> > > [  155.001822]        load_module+0x25fc/0x29f0
> > > [  155.006118]        __do_sys_finit_module+0xae/0x110
> > > [  155.011019]        do_syscall_64+0x38/0xc0
> > > [  155.015139]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [  155.020729]
> > >                other info that might help us debug this:
> > > 
> > > [  155.028752] Chain exists of:
> > >                  (work_completion)(&ct->requests.worker) --> fs_reclaim --> &gt->reset.mutex
> > > 
> > > [  155.041294]  Possible unsafe locking scenario:
> > > 
> > > [  155.047240]        CPU0                    CPU1
> > > [  155.051791]        ----                    ----
> > > [  155.056344]   lock(&gt->reset.mutex);
> > > [  155.060026]                                lock(fs_reclaim);
> > > [  155.065706]                                lock(&gt->reset.mutex);
> > > [  155.071912]   lock((work_completion)(&ct->requests.worker));
> > > [  155.077595]
> > >                 *** DEADLOCK ***
> > > 
> > > [  155.083534] 2 locks held by i915_selftest/1673:
> > > [  155.088086]  #0: ffff888103851240 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x8e/0x170
> > > [  155.096460]  #1: ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> > > [  155.105845]
> > >                stack backtrace:
> > > [  155.110223] CPU: 6 PID: 1673 Comm: i915_selftest Tainted: G     U            5.14.0-rc5-guc+ #50
> > > [  155.119035] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U LPDDR4/4x T4 RVP, BIOS TGLSFWI1.R00.3425.A00.2010162309 10/16/2020
> > > [  155.132530] Call Trace:
> > > [  155.134999]  dump_stack_lvl+0x57/0x7d
> > > [  155.138690]  check_noncircular+0x12d/0x150
> > > [  155.142814]  check_prev_add+0x90/0xc30
> > > [  155.146587]  __lock_acquire+0x1643/0x2110
> > > [  155.150618]  lock_acquire+0xd2/0x300
> > > [  155.154218]  ? __flush_work+0x350/0x4d0
> > > [  155.158073]  ? __lock_acquire+0x5f3/0x2110
> > > [  155.162194]  __flush_work+0x373/0x4d0
> > > [  155.165883]  ? __flush_work+0x350/0x4d0
> > > [  155.169739]  ? mark_held_locks+0x49/0x70
> > > [  155.173686]  ? _raw_spin_unlock_irqrestore+0x50/0x70
> > > [  155.178679]  intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> > > [  155.184857]  ? _raw_spin_unlock_irqrestore+0x50/0x70
> > > [  155.189843]  intel_uc_reset_prepare+0x40/0x50 [i915]
> > > [  155.194891]  reset_prepare+0x55/0x60 [i915]
> > > [  155.199145]  intel_gt_reset+0x11c/0x300 [i915]
> > > [  155.203657]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
> > > [  155.208641]  ? do_engine_reset+0x10/0x10 [i915]
> > > [  155.213243]  do_device_reset+0x13/0x20 [i915]
> > > [  155.217664]  check_whitelist_across_reset+0x166/0x250 [i915]
> > > [  155.223415]  live_reset_whitelist.cold+0x6a/0x7a [i915]
> > > [  155.228725]  __i915_subtests.cold+0x20/0x74 [i915]
> > > [  155.233593]  ? __i915_live_teardown+0x50/0x50 [i915]
> > > [  155.238644]  ? __intel_gt_live_setup+0x30/0x30 [i915]
> > > [  155.243773]  __run_selftests.cold+0x96/0xee [i915]
> > > [  155.248646]  i915_live_selftests+0x2c/0x60 [i915]
> > > [  155.253425]  i915_pci_probe+0x93/0x1c0 [i915]
> > > [  155.257854]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
> > > [  155.262839]  pci_device_probe+0x9b/0x110
> > > [  155.266785]  really_probe+0x1a6/0x3a0
> > > [  155.270467]  __driver_probe_device+0xf9/0x170
> > > [  155.274846]  driver_probe_device+0x19/0x90
> > > [  155.278968]  __driver_attach+0x99/0x170
> > > [  155.282824]  ? __device_attach_driver+0xd0/0xd0
> > > [  155.287376]  ? __device_attach_driver+0xd0/0xd0
> > > [  155.291928]  bus_for_each_dev+0x73/0xc0
> > > [  155.295792]  bus_add_driver+0x14b/0x1f0
> > > [  155.299648]  driver_register+0x67/0xb0
> > > [  155.303419]  i915_init+0x18/0x8c [i915]
> > > [  155.307328]  ? 0xffffffffa0188000
> > > [  155.310669]  do_one_initcall+0x53/0x2e0
> > > [  155.314525]  ? rcu_read_lock_sched_held+0x4d/0x80
> > > [  155.319253]  ? kmem_cache_alloc_trace+0x5ad/0x790
> > > [  155.323982]  do_init_module+0x56/0x210
> > > [  155.327754]  load_module+0x25fc/0x29f0
> > > [  155.331528]  ? __do_sys_finit_module+0xae/0x110
> > > [  155.336081]  __do_sys_finit_module+0xae/0x110
> > > [  155.340462]  do_syscall_64+0x38/0xc0
> > > [  155.344060]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [  155.349137] RIP: 0033:0x7efc4496389d
> > > [  155.352735] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48
> > > [  155.371530] RSP: 002b:00007ffeb3e23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > > [  155.379123] RAX: ffffffffffffffda RBX: 0000557664b72240 RCX: 00007efc4496389d
> > > [  155.386282] RDX: 0000000000000000 RSI: 0000557664b69610 RDI: 0000000000000004
> > > [  155.393443] RBP: 0000000000000020 R08: 00007ffeb3e21ff0 R09: 0000557664b682f0
> > > [  155.400603] R10: 00007ffeb3e23360 R11: 0000000000000246 R12: 0000557664b69610
> > > [  155.407761] R13: 0000000000000000 R14: 0000557664b6ada0 R15: 0000557664b72240
> > > 
> > > > This is the major reason why inventing any wheels locally in the kernel
> > > > isn't a good idea - aside from the duplicated code because likely there is
> > > > a solution for whatever you need. There's all the validation tools,
> > > > careful thought about semantics (especially around races) and all that
> > > > stuff that you're always missing on your hand-rolled thing. Aside from
> > > > anything hand-rolled is much harder to read, since intent isn't clear.
> > > > -Daniel
> > > > 
> > > > 
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > -Daniel
> > > > > > 
> > > > > > > +
> > > > > > >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > > > > > >  }
> > > > > > >  
> > > > > > > -- 
> > > > > > > 2.32.0
> > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch
> > > > 
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-08-16 15:39 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-11  1:16 [PATCH 0/9] Clean up some CI failures for GuC submission Matthew Brost
2021-08-11  1:16 ` [PATCH 1/9] drm/i915/guc: Fix blocked context accounting Matthew Brost
2021-08-11 10:30   ` Daniel Vetter
2021-08-11 17:28     ` Matthew Brost
2021-08-11  1:16 ` [PATCH 2/9] drm/i915/guc: outstanding G2H accounting Matthew Brost
2021-08-11  1:16 ` [PATCH 3/9] drm/i915/guc: Unwind context requests in reverse order Matthew Brost
2021-08-11  1:16 ` [PATCH 4/9] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context Matthew Brost
2021-08-11  1:16 ` [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H Matthew Brost
2021-08-12 14:11   ` Daniel Vetter
2021-08-12 15:23     ` Matthew Brost
2021-08-12 19:47       ` Daniel Vetter
2021-08-12 22:38         ` Matthew Brost
2021-08-13 15:11           ` Daniel Vetter
2021-08-13 19:02             ` Matthew Brost
2021-08-16 15:39               ` Daniel Vetter
2021-08-11  1:16 ` [PATCH 6/9] drm/i915/guc: Do not clear enable during reset in an enable is inflight Matthew Brost
2021-08-11  1:16 ` [PATCH 7/9] drm/i915/guc: Don't enable scheduling on a banned context Matthew Brost
2021-08-11  1:16 ` [PATCH 8/9] drm/i915/selftests: Fix memory corruption in live_lrc_isolation Matthew Brost
2021-08-11  1:16 ` [PATCH 9/9] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H Matthew Brost

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.