intel-gfx.lists.freedesktop.org archive mirror
* [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC
@ 2021-08-16 13:51 Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 01/22] drm/i915/guc: Fix blocked context accounting Matthew Brost
                   ` (25 more replies)
  0 siblings, 26 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Daniel Vetter pointed out that locking in the GuC submission code was
overly complicated, let's clean this up a bit before introducing more
features in the GuC submission backend.

Also fix some CI failures, port fixes from our internal tree, and add a
few more selftests for coverage.

Lastly, add some kernel DOC explaining how the GuC submission backend
works.

v2: Fix logic error in 'Workaround reset G2H is received after schedule
done G2H', don't propagate errors to dependent fences in execlists
submission, resolve checkpatch issues, resend to correct lists

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Matthew Brost (22):
  drm/i915/guc: Fix blocked context accounting
  drm/i915/guc: Fix outstanding G2H accounting
  drm/i915/guc: Unwind context requests in reverse order
  drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
  drm/i915/guc: Workaround reset G2H is received after schedule done G2H
  drm/i915/execlists: Do not propagate errors to dependent fences
  drm/i915/selftests: Add a cancel request selftest that triggers a
    reset
  drm/i915/guc: Don't enable scheduling on a banned context, guc_id
    invalid, not registered
  drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  drm/i915/guc: Take context ref when cancelling request
  drm/i915/guc: Don't touch guc_state.sched_state without a lock
  drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
  drm/i915: Allocate error capture in atomic context
  drm/i915/guc: Flush G2H work queue during reset
  drm/i915/guc: Release submit fence from an IRQ
  drm/i915/guc: Move guc_blocked fence to struct guc_state
  drm/i915/guc: Rework and simplify locking
  drm/i915/guc: Proper xarray usage for contexts_lookup
  drm/i915/guc: Drop pin count check trick between sched_disable and
    re-pin
  drm/i915/guc: Move GuC priority fields in context under guc_active
  drm/i915/guc: Add GuC kernel doc

 drivers/gpu/drm/i915/gt/intel_context.c       |   5 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  68 +-
 .../drm/i915/gt/intel_execlists_submission.c  |   4 -
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  29 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  19 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 690 +++++++++++-------
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++
 drivers/gpu/drm/i915/i915_gpu_error.c         |  37 +-
 drivers/gpu/drm/i915/i915_request.h           |  23 +-
 drivers/gpu/drm/i915/i915_trace.h             |   8 +-
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 drivers/gpu/drm/i915/selftests/i915_request.c | 100 +++
 .../i915/selftests/intel_scheduler_helpers.c  |  12 +
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 14 files changed, 813 insertions(+), 311 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

-- 
2.32.0


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 01/22] drm/i915/guc: Fix blocked context accounting
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 02/22] drm/i915/guc: Fix outstanding G2H accounting Matthew Brost
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Prior to this patch the blocked context counter was cleared in
init_sched_state (used during context registration & resets), which is
incorrect. This state needs to be persistent, or the counter can be read
with an incorrect value, resulting in scheduling never being enabled
again.
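
A minimal userspace sketch of why the masked clear matters (the shift
and mask values here are illustrative, not necessarily the driver's
actual layout):

  #include <assert.h>
  #include <stdint.h>

  #define SCHED_STATE_BLOCKED_SHIFT 4
  #define SCHED_STATE_BLOCKED_MASK  (0xfffu << SCHED_STATE_BLOCKED_SHIFT)

  /* Drop the transient flags but keep the persistent blocked counter. */
  static uint32_t init_sched_state(uint32_t sched_state)
  {
          return sched_state & SCHED_STATE_BLOCKED_MASK;
  }

  int main(void)
  {
          /* Context blocked twice, plus two transient flag bits set. */
          uint32_t s = (2u << SCHED_STATE_BLOCKED_SHIFT) | 0x3;

          s = init_sched_state(s);
          assert(s == 2u << SCHED_STATE_BLOCKED_SHIFT);
          return 0;
  }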

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 87d8dc8f51b9..69faa39da178 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
 {
 	/* Only should be called from guc_lrc_desc_pin() */
 	atomic_set(&ce->guc_sched_state_no_lock, 0);
-	ce->guc_state.sched_state = 0;
+	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
 static inline bool
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 02/22] drm/i915/guc: Fix outstanding G2H accounting
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 01/22] drm/i915/guc: Fix blocked context accounting Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17  9:39   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 03/22] drm/i915/guc: Unwind context requests in reverse order Matthew Brost
                   ` (23 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Fix a small race that could result in incorrect accounting of the
number of outstanding G2H. Prior to this patch we did not increment the
number of outstanding G2H if we encountered a GT reset while sending a
H2G. This was incorrect, as the context state had already been updated
to anticipate a G2H response, so the counter should be incremented.
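
A minimal userspace sketch of the corrected ordering (stdatomic and a
stubbed send path standing in for the GuC H2G machinery; -16 models
-EBUSY):

  #include <assert.h>
  #include <stdatomic.h>

  static atomic_int outstanding_g2h;

  /* Stub: -16 models -EBUSY; 0 models success, including a send that
   * races with a GT reset, where a G2H is still anticipated. */
  static int send_h2g(int simulate_busy)
  {
          return simulate_busy ? -16 : 0;
  }

  static int send_busy_loop(int expects_g2h, int simulate_busy)
  {
          int err;

          /* Account for the expected G2H *before* sending ... */
          if (expects_g2h)
                  atomic_fetch_add(&outstanding_g2h, 1);

          err = send_h2g(simulate_busy);
          /* ... and unwind only when the send definitively failed. */
          if (err == -16 && expects_g2h)
                  atomic_fetch_sub(&outstanding_g2h, 1);

          return err;
  }

  int main(void)
  {
          send_busy_loop(1, 0);   /* success: one G2H outstanding */
          assert(atomic_load(&outstanding_g2h) == 1);
          send_busy_loop(1, 1);   /* -EBUSY: accounting unwound */
          assert(atomic_load(&outstanding_g2h) == 1);
          return 0;
  }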

Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 69faa39da178..b5d3972ae164 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
 {
 	int err;
 
-	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
-
-	if (!err && g2h_len_dw)
+	if (g2h_len_dw)
 		atomic_inc(&guc->outstanding_submission_g2h);
 
+	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
+	if (err == -EBUSY && g2h_len_dw)
+		atomic_dec(&guc->outstanding_submission_g2h);
+
 	return err;
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 03/22] drm/i915/guc: Unwind context requests in reverse order
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 01/22] drm/i915/guc: Fix blocked context accounting Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 02/22] drm/i915/guc: Fix outstanding G2H accounting Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 04/22] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context Matthew Brost
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

When unwinding requests on a reset context, if other requests from the
same context are already in the priority list, the unwound requests
could be resubmitted out of seqno order. To fix this, traverse the list
of active requests in reverse and add each one to the head of the
priority list.
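
The intent, as a tiny standalone sketch (plain arrays standing in for
the kernel's list_head machinery):

  #include <assert.h>

  int main(void)
  {
          int active[] = { 1, 2, 3 };     /* requests in seqno order */
          int pl[8];
          int n = 0, i;

          /* Walk the active list in reverse, adding to the head ... */
          for (i = 2; i >= 0; i--) {
                  int j;

                  for (j = n; j > 0; j--)
                          pl[j] = pl[j - 1];
                  pl[0] = active[i];
                  n++;
          }

          /* ... which preserves seqno order for resubmission. */
          for (i = 0; i < 3; i++)
                  assert(pl[i] == i + 1);
          return 0;
  }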

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b5d3972ae164..bc51caba50d0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -799,9 +799,9 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
 	spin_lock(&ce->guc_active.lock);
-	list_for_each_entry_safe(rq, rn,
-				 &ce->guc_active.requests,
-				 sched.link) {
+	list_for_each_entry_safe_reverse(rq, rn,
+					 &ce->guc_active.requests,
+					 sched.link) {
 		if (i915_request_completed(rq))
 			continue;
 
@@ -818,7 +818,7 @@ __unwind_incomplete_requests(struct intel_context *ce)
 		}
 		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
 
-		list_add_tail(&rq->sched.link, pl);
+		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 
 		spin_lock(&ce->guc_active.lock);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 04/22] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (2 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 03/22] drm/i915/guc: Unwind context requests in reverse order Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 05/22] drm/i915/guc: Workaround reset G2H is received after schedule done G2H Matthew Brost
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Don't drop ce->guc_active.lock when unwinding a context after reset.
At one point we had to drop the lock to avoid a lock inversion, but
that is no longer the case. It is much safer to hold the lock across
the unwind, so do that.

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index bc51caba50d0..3cd2da6f5c03 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -806,8 +806,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
 			continue;
 
 		list_del_init(&rq->sched.link);
-		spin_unlock(&ce->guc_active.lock);
-
 		__i915_request_unsubmit(rq);
 
 		/* Push the request back into the queue for later resubmission. */
@@ -820,8 +818,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
-
-		spin_lock(&ce->guc_active.lock);
 	}
 	spin_unlock(&ce->guc_active.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 05/22] drm/i915/guc: Workaround reset G2H is received after schedule done G2H
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (3 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 04/22] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17  9:32   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 06/22] drm/i915/execlists: Do not propagate errors to dependent fences Matthew Brost
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

If the context is reset as a result of request cancellation, the
context reset G2H is received after the schedule disable done G2H, which
is likely the wrong order. The schedule disable done G2H releases the
waiting request cancellation code, which resubmits the context. This
races with the context reset G2H, which also wants to resubmit the
context, but in this case the resubmit really should be a NOP as the
request cancellation code owns it. Check the context state to seal this
race until if / when the GuC firmware is fixed.
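
A minimal userspace sketch of the check-under-the-lock workaround (a
pthread mutex stands in for ce->guc_state.lock; the struct and field
names are illustrative):

  #include <pthread.h>
  #include <stdbool.h>

  struct ctx {
          pthread_mutex_t lock;
          bool pending_enable;    /* set by the cancellation path */
          bool enabled;
  };

  /* Context-reset G2H handler: becomes a NOP when a request
   * cancellation owns the resubmit. */
  static void handle_context_reset(struct ctx *ce)
  {
          bool skip;

          pthread_mutex_lock(&ce->lock);
          skip = ce->pending_enable;
          if (!skip)
                  ce->enabled = false;
          pthread_mutex_unlock(&ce->lock);

          if (skip)
                  return;

          /* ... find the active request, unwind and replay ... */
  }

  int main(void)
  {
          struct ctx ce = { PTHREAD_MUTEX_INITIALIZER, true, true };

          handle_context_reset(&ce);      /* cancellation in flight */
          return !ce.enabled;             /* handler was a NOP: exit 0 */
  }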

v2:
 (Checkpatch)
  - Fix typos

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++++++---
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3cd2da6f5c03..c3b7bf7319dd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -826,17 +826,35 @@ __unwind_incomplete_requests(struct intel_context *ce)
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
 	struct i915_request *rq;
+	unsigned long flags;
 	u32 head;
+	bool skip = false;
 
 	intel_context_get(ce);
 
 	/*
-	 * GuC will implicitly mark the context as non-schedulable
-	 * when it sends the reset notification. Make sure our state
-	 * reflects this change. The context will be marked enabled
-	 * on resubmission.
+	 * GuC will implicitly mark the context as non-schedulable when it sends
+	 * the reset notification. Make sure our state reflects this change. The
+	 * context will be marked enabled on resubmission.
+	 *
+	 * XXX: If the context is reset as a result of the request cancellation
+	 * this G2H is received after the schedule disable complete G2H which is
+	 * likely wrong as this creates a race between the request cancellation
+	 * code re-submitting the context and this G2H handler. This likely
+	 * should be fixed in the GuC but until if / when that gets fixed we
+	 * need to workaround this. Convert this function to a NOP if a pending
+	 * enable is in flight as this indicates that a request cancellation has
+	 * occurred.
 	 */
-	clr_context_enabled(ce);
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	if (likely(!context_pending_enable(ce))) {
+		clr_context_enabled(ce);
+	} else {
+		skip = true;
+	}
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	if (unlikely(skip))
+		goto out_put;
 
 	rq = intel_context_find_active_request(ce);
 	if (!rq) {
@@ -855,6 +873,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
 out_replay:
 	guc_reset_state(ce, head, stalled);
 	__unwind_incomplete_requests(ce);
+out_put:
 	intel_context_put(ce);
 }
 
@@ -1599,6 +1618,13 @@ static void guc_context_cancel_request(struct intel_context *ce,
 			guc_reset_state(ce, intel_ring_wrap(ce->ring, rq->head),
 					true);
 		}
+
+		/*
+		 * XXX: Racey if context is reset, see comment in
+		 * __guc_reset_context().
+		 */
+		flush_work(&ce_to_guc(ce)->ct.requests.worker);
+
 		guc_context_unblock(ce);
 	}
 }
@@ -2719,7 +2745,12 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
-	if (likely(!intel_context_is_banned(ce))) {
+	/*
+	 * XXX: Racey if request cancellation has occurred, see comment in
+	 * __guc_reset_context().
+	 */
+	if (likely(!intel_context_is_banned(ce) &&
+		   !context_blocked(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
 	}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 06/22] drm/i915/execlists: Do not propagate errors to dependent fences
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (4 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 05/22] drm/i915/guc: Workaround reset G2H is received after schedule done G2H Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17  9:21   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 07/22] drm/i915/selftests: Add a cancel request selftest that triggers a reset Matthew Brost
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Propagating errors to dependent fences is wrong, don't do it. A
dependent request may still execute and complete successfully after the
request it depends on is cancelled; the selftest in the following patch
exposes this bug.

Fixes: 8e9f84cf5cac ("drm/i915/gt: Propagate change in error status to children on unhold")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index de5f9c86b9a4..cafb0608ffb4 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2140,10 +2140,6 @@ static void __execlists_unhold(struct i915_request *rq)
 			if (p->flags & I915_DEPENDENCY_WEAK)
 				continue;
 
-			/* Propagate any change in error status */
-			if (rq->fence.error)
-				i915_request_set_error_once(w, rq->fence.error);
-
 			if (w->engine != rq->engine)
 				continue;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 07/22] drm/i915/selftests: Add a cancel request selftest that triggers a reset
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (5 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 06/22] drm/i915/execlists: Do not propagate errors to dependent fences Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 08/22] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered Matthew Brost
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Add a cancel request selftest that results in an engine reset to cancel
the request, as it is non-preemptible. Also insert a NOP request after
the cancelled request and confirm that it completes successfully.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/selftests/i915_request.c | 100 ++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index d67710d10615..e2c5db77f087 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -772,6 +772,98 @@ static int __cancel_completed(struct intel_engine_cs *engine)
 	return err;
 }
 
+static int __cancel_reset(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq, *nop;
+	unsigned long preempt_timeout_ms;
+	int err = 0;
+
+	preempt_timeout_ms = engine->props.preempt_timeout_ms;
+	engine->props.preempt_timeout_ms = 100;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		goto out_restore;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_NOOP);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling active request\n", engine->name);
+	i915_request_get(rq);
+	i915_request_add(rq);
+	if (!igt_wait_for_spinner(&spin, rq)) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("Failed to start spinner on %s\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	nop = intel_context_create_request(ce);
+	if (IS_ERR(nop))
+		goto out_rq;
+	i915_request_get(nop);
+	i915_request_add(nop);
+
+	i915_request_cancel(rq, -EINTR);
+
+	if (i915_request_wait(rq, 0, HZ) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel hung request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_nop;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+		goto out_nop;
+	}
+
+	if (i915_request_wait(nop, 0, HZ) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to complete nop request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_nop;
+	}
+
+	if (nop->fence.error != 0) {
+		pr_err("%s: Nop request errored (%u)\n",
+		       engine->name, nop->fence.error);
+		err = -EINVAL;
+	}
+
+out_nop:
+	i915_request_put(nop);
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+out_restore:
+	engine->props.preempt_timeout_ms = preempt_timeout_ms;
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
 static int live_cancel_request(void *arg)
 {
 	struct drm_i915_private *i915 = arg;
@@ -804,6 +896,14 @@ static int live_cancel_request(void *arg)
 			return err;
 		if (err2)
 			return err2;
+
+		/* Expects reset so call outside of igt_live_test_* */
+		err = __cancel_reset(engine);
+		if (err)
+			return err;
+
+		if (igt_flush_test(i915))
+			return -EIO;
 	}
 
 	return 0;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 08/22] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (6 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 07/22] drm/i915/selftests: Add a cancel request selftest that triggers a reset Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17  9:47   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 09/22] drm/i915/selftests: Fix memory corruption in live_lrc_isolation Matthew Brost
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

When unblocking a context, do not enable scheduling if the context is
banned, its guc_id is invalid, or it is not registered.

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index c3b7bf7319dd..353899634fa8 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1579,6 +1579,9 @@ static void guc_context_unblock(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	if (unlikely(submission_disabled(guc) ||
+		     intel_context_is_banned(ce) ||
+		     context_guc_id_invalid(ce) ||
+		     !lrc_desc_registered(guc, ce->guc_id) ||
 		     !intel_context_is_pinned(ce) ||
 		     context_pending_disable(ce) ||
 		     context_blocked(ce) > 1)) {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 09/22] drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (7 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 08/22] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 10/22] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H Matthew Brost
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

GuC submission has exposed an existing memory corruption in
live_lrc_isolation. We believe that some writes to the watchdog offsets
in the LRC (0x178 & 0x17c) can result in trashing of portions of the
address space. With GuC submission there are additional objects which
can move the context redzone into the space that is trashed. To work
around this, avoid poisoning the watchdog.
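
A quick standalone check of the skip logic (the watchdog offsets are
from the commit message; the mask value mirrors the render-class case
in the diff below):

  #include <assert.h>
  #include <stdint.h>

  /* Return 0 (a harmless no-op offset) for the watchdog registers. */
  static uint32_t safe_offset(uint32_t offset, uint32_t reg)
  {
          if (offset == 0x178 || offset == 0x17c)
                  reg = 0;
          return reg;
  }

  int main(void)
  {
          uint32_t mask = 0x07ff; /* render class on Gen12+, per the diff */

          assert(safe_offset(0x178 & mask, 0x178) == 0);
          assert(safe_offset(0x17c & mask, 0x17c) == 0);
          assert(safe_offset(0x234 & mask, 0x234) == 0x234);
          return 0;
  }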

v2:
 (Daniel Vetter)
  - Add VLK ref in code to workaround

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index b0977a3b699b..cdc6ae48a1e1 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
 	goto err_after;
 }
 
+static u32 safe_offset(u32 offset, u32 reg)
+{
+	/* XXX skip testing of watchdog - VLK-22772 */
+	if (offset == 0x178 || offset == 0x17c)
+		reg = 0;
+
+	return reg;
+}
+
+static int get_offset_mask(struct intel_engine_cs *engine)
+{
+	if (GRAPHICS_VER(engine->i915) < 12)
+		return 0xfff;
+
+	switch (engine->class) {
+	default:
+	case RENDER_CLASS:
+		return 0x07ff;
+	case COPY_ENGINE_CLASS:
+		return 0x0fff;
+	case VIDEO_DECODE_CLASS:
+	case VIDEO_ENHANCEMENT_CLASS:
+		return 0x3fff;
+	}
+}
+
 static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 {
 	struct i915_vma *batch;
@@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 		len = (len + 1) / 2;
 		*cs++ = MI_LOAD_REGISTER_IMM(len);
 		while (len--) {
-			*cs++ = hw[dw];
+			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
+					    hw[dw]);
 			*cs++ = poison;
 			dw += 2;
 		}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 10/22] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (8 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 09/22] drm/i915/selftests: Fix memory corruption in live_lrc_isolation Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 11/22] drm/i915/guc: Take context ref when cancelling request Matthew Brost
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

While debugging an issue with full GT resets I went down a rabbit hole
thinking the scrubbing of lost G2H wasn't working correctly. This proved
to be incorrect, as the scrubbing was working just fine, but the chase
inspired me to write a selftest proving that it works. This simple
selftest injects errors by dropping various G2H messages and then issues
a full GT reset, proving that the scrubbing of these G2H doesn't blow
up.

v2:
 (Daniel Vetter)
  - Use ifdef instead of macros for selftests
v3:
 (Checkpatch)
  - A space after 'switch' statement

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  18 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  25 ++++
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 6 files changed, 184 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index e54351a170e2..3a73f3117873 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -198,6 +198,24 @@ struct intel_context {
 	 */
 	u8 guc_prio;
 	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
+
+#ifdef CONFIG_DRM_I915_SELFTEST
+	/**
+	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
+	 */
+	bool drop_schedule_enable;
+
+	/**
+	 * @drop_schedule_disable: Force drop of schedule disable G2H for
+	 * selftest
+	 */
+	bool drop_schedule_disable;
+
+	/**
+	 * @drop_deregister: Force drop of deregister G2H for selftest
+	 */
+	bool drop_deregister;
+#endif
 };
 
 #endif /* __INTEL_CONTEXT_TYPES__ */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 353899634fa8..bffd0199dc15 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2634,6 +2634,13 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 
 	trace_intel_context_deregister_done(ce);
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+	if (unlikely(ce->drop_deregister)) {
+		ce->drop_deregister = false;
+		return 0;
+	}
+#endif
+
 	if (context_wait_for_deregister_to_register(ce)) {
 		struct intel_runtime_pm *runtime_pm =
 			&ce->engine->gt->i915->runtime_pm;
@@ -2688,10 +2695,24 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 	trace_intel_context_sched_done(ce);
 
 	if (context_pending_enable(ce)) {
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_enable)) {
+			ce->drop_schedule_enable = false;
+			return 0;
+		}
+#endif
+
 		clr_context_pending_enable(ce);
 	} else if (context_pending_disable(ce)) {
 		bool banned;
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_disable)) {
+			ce->drop_schedule_disable = false;
+			return 0;
+		}
+#endif
+
 		/*
 		 * Unpin must be done before __guc_signal_context_fence,
 		 * otherwise a race exists between the requests getting
@@ -3068,3 +3089,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 
 	return false;
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftest_guc.c"
+#endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
new file mode 100644
index 000000000000..264e2f705c17
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include "selftests/intel_scheduler_helpers.h"
+
+static struct i915_request *nop_user_request(struct intel_context *ce,
+					     struct i915_request *from)
+{
+	struct i915_request *rq;
+	int ret;
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	if (from) {
+		ret = i915_sw_fence_await_dma_fence(&rq->submit,
+						    &from->fence, 0,
+						    I915_FENCE_GFP);
+		if (ret < 0) {
+			i915_request_put(rq);
+			return ERR_PTR(ret);
+		}
+	}
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static int intel_guc_scrub_ctbs(void *arg)
+{
+	struct intel_gt *gt = arg;
+	int ret = 0;
+	int i;
+	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct intel_context *ce;
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+	engine = intel_selftest_find_any_engine(gt);
+
+	/* Submit requests and inject errors forcing G2H to be dropped */
+	for (i = 0; i < 3; ++i) {
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce)) {
+			ret = PTR_ERR(ce);
+			pr_err("Failed to create context, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		switch (i) {
+		case 0:
+			ce->drop_schedule_enable = true;
+			break;
+		case 1:
+			ce->drop_schedule_disable = true;
+			break;
+		case 2:
+			ce->drop_deregister = true;
+			break;
+		}
+
+		rq = nop_user_request(ce, NULL);
+		intel_context_put(ce);
+
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			pr_err("Failed to create request, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		last[i] = rq;
+	}
+
+	for (i = 0; i < 3; ++i) {
+		ret = i915_request_wait(last[i], 0, HZ);
+		if (ret < 0) {
+			pr_err("Last request failed to complete: %d\n", ret);
+			goto err;
+		}
+		i915_request_put(last[i]);
+		last[i] = NULL;
+	}
+
+	/* Force all H2G / G2H to be submitted / processed */
+	intel_gt_retire_requests(gt);
+	msleep(500);
+
+	/* Scrub missing G2H */
+	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
+
+	ret = intel_gt_wait_for_idle(gt, HZ);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err;
+	}
+
+err:
+	for (i = 0; i < 3; ++i)
+		if (last[i])
+			i915_request_put(last[i]);
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+
+	return ret;
+}
+
+int intel_guc_live_selftests(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_scrub_ctbs),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index cfa5c4165a4f..3cf6758931f9 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
 selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
+selftest(guc, intel_guc_live_selftests)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
index 4b328346b48a..310fb83c527e 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
@@ -14,6 +14,18 @@
 #define REDUCED_PREEMPT		10
 #define WAIT_FOR_RESET_TIME	10000
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, gt, id)
+		return engine;
+
+	pr_err("No valid engine found!\n");
+	return NULL;
+}
+
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 u32 modify_type)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
index 35c098601ac0..ae60bb507f45 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
@@ -10,6 +10,7 @@
 
 struct i915_request;
 struct intel_engine_cs;
+struct intel_gt;
 
 struct intel_selftest_saved_policy {
 	u32 flags;
@@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
 	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
 };
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 enum selftest_scheduler_modify modify_type);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 11/22] drm/i915/guc: Take context ref when cancelling request
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (9 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 10/22] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 12/22] drm/i915/guc: Don't touch guc_state.sched_state without a lock Matthew Brost
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

A context can be destroyed while a request on it is being cancelled, so
take a reference to the context when cancelling a request.
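
A toy refcount model of why the extra reference matters (userspace
stand-ins; in the driver the get/put pair brackets the blocking
i915_sw_fence_wait()):

  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>

  struct ctx {
          atomic_int ref;
          bool alive;
  };

  static void ctx_get(struct ctx *ce)
  {
          atomic_fetch_add(&ce->ref, 1);
  }

  static void ctx_put(struct ctx *ce)
  {
          /* Previous value 1 means this was the last reference. */
          if (atomic_fetch_sub(&ce->ref, 1) == 1)
                  ce->alive = false;      /* stand-in for the final free */
  }

  int main(void)
  {
          struct ctx ce = { 1, true };    /* one external reference */

          ctx_get(&ce);   /* pin taken before blocking on the fence */
          ctx_put(&ce);   /* external reference drops mid-cancel */
          assert(ce.alive);       /* the pin keeps the context valid */
          ctx_put(&ce);   /* cancellation done: last reference */
          assert(!ce.alive);
          return 0;
  }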

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index bffd0199dc15..89126be26786 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1613,8 +1613,10 @@ static void guc_context_cancel_request(struct intel_context *ce,
 				       struct i915_request *rq)
 {
 	if (i915_sw_fence_signaled(&rq->submit)) {
-		struct i915_sw_fence *fence = guc_context_block(ce);
+		struct i915_sw_fence *fence;
 
+		intel_context_get(ce);
+		fence = guc_context_block(ce);
 		i915_sw_fence_wait(fence);
 		if (!i915_request_completed(rq)) {
 			__i915_request_skip(rq);
@@ -1629,6 +1631,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
 		flush_work(&ce_to_guc(ce)->ct.requests.worker);
 
 		guc_context_unblock(ce);
+		intel_context_put(ce);
 	}
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 12/22] drm/i915/guc: Don't touch guc_state.sched_state without a lock
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (10 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 11/22] drm/i915/guc: Take context ref when cancelling request Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17  7:21   ` kernel test robot
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 13/22] drm/i915/guc: Reset LRC descriptor if register returns -ENODEV Matthew Brost
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Previously we used some clever tricks to avoid taking a lock when
touching guc_state.sched_state in certain cases. Don't do that; enforce
the use of the lock.

Part of this is removing a dead code path from guc_lrc_desc_pin where a
context could be deregistered when the aforementioned function was
called from the submission path. Remove this dead code and add a
GEM_BUG_ON if this path is ever attempted.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 57 ++++++++++---------
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 89126be26786..8d45585773f3 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -150,11 +150,22 @@ static inline void clr_context_registered(struct intel_context *ce)
 #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
 static inline void init_sched_state(struct intel_context *ce)
 {
-	/* Only should be called from guc_lrc_desc_pin() */
+	lockdep_assert_held(&ce->guc_state.lock);
 	atomic_set(&ce->guc_sched_state_no_lock, 0);
 	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
+static inline bool sched_state_is_init(struct intel_context *ce)
+{
+	/*
+	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
+	 * suspend.
+	 */
+	return !(atomic_read(&ce->guc_sched_state_no_lock) &
+		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
+		!(ce->guc_state.sched_state & ~SCHED_STATE_BLOCKED_MASK);
+}
+
 static inline bool
 context_wait_for_deregister_to_register(struct intel_context *ce)
 {
@@ -165,7 +176,7 @@ context_wait_for_deregister_to_register(struct intel_context *ce)
 static inline void
 set_context_wait_for_deregister_to_register(struct intel_context *ce)
 {
-	/* Only should be called from guc_lrc_desc_pin() without lock */
+	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state |=
 		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
@@ -599,9 +610,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 	bool pending_disable, pending_enable, deregister, destroyed, banned;
 
 	xa_for_each(&guc->context_lookup, index, ce) {
-		/* Flush context */
 		spin_lock_irqsave(&ce->guc_state.lock, flags);
-		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 		/*
 		 * Once we are at this point submission_disabled() is guaranteed
@@ -617,6 +626,8 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 		banned = context_banned(ce);
 		init_sched_state(ce);
 
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+
 		if (pending_enable || destroyed || deregister) {
 			atomic_dec(&guc->outstanding_submission_g2h);
 			if (deregister)
@@ -1318,6 +1329,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	int ret = 0;
 
 	GEM_BUG_ON(!engine->mask);
+	GEM_BUG_ON(!sched_state_is_init(ce));
 
 	/*
 	 * Ensure LRC + CT vmas are is same region as write barrier is done
@@ -1346,7 +1358,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	desc->priority = ce->guc_prio;
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 	guc_context_policy_init(engine, desc);
-	init_sched_state(ce);
 
 	/*
 	 * The context_lookup xarray is used to determine if the hardware
@@ -1357,26 +1368,23 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	 * registering this context.
 	 */
 	if (context_registered) {
+		bool disabled;
+		unsigned long flags;
+
 		trace_intel_context_steal_guc_id(ce);
-		if (!loop) {
+		GEM_BUG_ON(!loop);
+
+		/* Seal race with Reset */
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
+		disabled = submission_disabled(guc);
+		if (likely(!disabled)) {
 			set_context_wait_for_deregister_to_register(ce);
 			intel_context_get(ce);
-		} else {
-			bool disabled;
-			unsigned long flags;
-
-			/* Seal race with Reset */
-			spin_lock_irqsave(&ce->guc_state.lock, flags);
-			disabled = submission_disabled(guc);
-			if (likely(!disabled)) {
-				set_context_wait_for_deregister_to_register(ce);
-				intel_context_get(ce);
-			}
-			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-			if (unlikely(disabled)) {
-				reset_lrc_desc(guc, desc_idx);
-				return 0;	/* Will get registered later */
-			}
+		}
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		if (unlikely(disabled)) {
+			reset_lrc_desc(guc, desc_idx);
+			return 0;	/* Will get registered later */
 		}
 
 		/*
@@ -1385,10 +1393,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 		 */
 		with_intel_runtime_pm(runtime_pm, wakeref)
 			ret = deregister_context(ce, ce->guc_id, loop);
-		if (unlikely(ret == -EBUSY)) {
-			clr_context_wait_for_deregister_to_register(ce);
-			intel_context_put(ce);
-		} else if (unlikely(ret == -ENODEV)) {
+		if (unlikely(ret == -ENODEV)) {
 			ret = 0;	/* Will get registered later */
 		}
 	} else {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 13/22] drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (11 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 12/22] drm/i915/guc: Don't touch guc_state.sched_state without a lock Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 14/22] drm/i915: Allocate error capture in atomic context Matthew Brost
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Reset the LRC descriptor if a context registration returns -ENODEV, as
this means we are mid-reset.

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 8d45585773f3..3a01743e09ea 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1399,10 +1399,12 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	} else {
 		with_intel_runtime_pm(runtime_pm, wakeref)
 			ret = register_context(ce, loop);
-		if (unlikely(ret == -EBUSY))
+		if (unlikely(ret == -EBUSY)) {
+			reset_lrc_desc(guc, desc_idx);
+		} else if (unlikely(ret == -ENODEV)) {
 			reset_lrc_desc(guc, desc_idx);
-		else if (unlikely(ret == -ENODEV))
 			ret = 0;	/* Will get registered later */
+		}
 	}
 
 	return ret;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 14/22] drm/i915: Allocate error capture in atomic context
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (12 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 13/22] drm/i915/guc: Reset LRC descriptor if register returns -ENODEV Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17 10:06   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 15/22] drm/i915/guc: Flush G2H work queue during reset Matthew Brost
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Error capture can now be done from the work queue that processes G2H
messages, and the reset path must flush that work queue to avoid races
in the missing-G2H cleanup. This makes resets depend on memory
allocations and dma fences (i915_requests), while requests in turn
depend on resets: a circular dependency. To break it, perform the error
capture allocations in atomic context so they can never block on
reclaim.
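
For reference, the two flag sets involved, as defined in
i915_gpu_error.c (the first is removed by the diff below; the second is
what every capture-path allocation switches to):

  /* May sleep waiting on reclaim: unusable once capture runs from the
   * G2H work queue that the reset path must flush. */
  #define ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)

  /* Never sleeps; under memory pressure the capture simply fails. */
  #define ATOMIC_MAYFAIL (GFP_ATOMIC | __GFP_NOWARN)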

Fixes: dc0dad365c5e ("Fix for error capture after full GPU reset with GuC")
Fixes: 573ba126aef3 ("Capture error state on context reset")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 37 +++++++++++++--------------
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 0f08bcfbe964..453376aa6d9f 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -49,7 +49,6 @@
 #include "i915_memcpy.h"
 #include "i915_scatterlist.h"
 
-#define ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
 #define ATOMIC_MAYFAIL (GFP_ATOMIC | __GFP_NOWARN)
 
 static void __sg_set_buf(struct scatterlist *sg,
@@ -79,7 +78,7 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
 	if (e->cur == e->end) {
 		struct scatterlist *sgl;
 
-		sgl = (typeof(sgl))__get_free_page(ALLOW_FAIL);
+		sgl = (typeof(sgl))__get_free_page(ATOMIC_MAYFAIL);
 		if (!sgl) {
 			e->err = -ENOMEM;
 			return false;
@@ -99,10 +98,10 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
 	}
 
 	e->size = ALIGN(len + 1, SZ_64K);
-	e->buf = kmalloc(e->size, ALLOW_FAIL);
+	e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
 	if (!e->buf) {
 		e->size = PAGE_ALIGN(len + 1);
-		e->buf = kmalloc(e->size, GFP_KERNEL);
+		e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
 	}
 	if (!e->buf) {
 		e->err = -ENOMEM;
@@ -243,12 +242,12 @@ static bool compress_init(struct i915_vma_compress *c)
 {
 	struct z_stream_s *zstream = &c->zstream;
 
-	if (pool_init(&c->pool, ALLOW_FAIL))
+	if (pool_init(&c->pool, ATOMIC_MAYFAIL))
 		return false;
 
 	zstream->workspace =
 		kmalloc(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL),
-			ALLOW_FAIL);
+			ATOMIC_MAYFAIL);
 	if (!zstream->workspace) {
 		pool_fini(&c->pool);
 		return false;
@@ -256,7 +255,7 @@ static bool compress_init(struct i915_vma_compress *c)
 
 	c->tmp = NULL;
 	if (i915_has_memcpy_from_wc())
-		c->tmp = pool_alloc(&c->pool, ALLOW_FAIL);
+		c->tmp = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 
 	return true;
 }
@@ -280,7 +279,7 @@ static void *compress_next_page(struct i915_vma_compress *c,
 	if (dst->page_count >= dst->num_pages)
 		return ERR_PTR(-ENOSPC);
 
-	page = pool_alloc(&c->pool, ALLOW_FAIL);
+	page = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 	if (!page)
 		return ERR_PTR(-ENOMEM);
 
@@ -376,7 +375,7 @@ struct i915_vma_compress {
 
 static bool compress_init(struct i915_vma_compress *c)
 {
-	return pool_init(&c->pool, ALLOW_FAIL) == 0;
+	return pool_init(&c->pool, ATOMIC_MAYFAIL) == 0;
 }
 
 static bool compress_start(struct i915_vma_compress *c)
@@ -391,7 +390,7 @@ static int compress_page(struct i915_vma_compress *c,
 {
 	void *ptr;
 
-	ptr = pool_alloc(&c->pool, ALLOW_FAIL);
+	ptr = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 	if (!ptr)
 		return -ENOMEM;
 
@@ -997,7 +996,7 @@ i915_vma_coredump_create(const struct intel_gt *gt,
 
 	num_pages = min_t(u64, vma->size, vma->obj->base.size) >> PAGE_SHIFT;
 	num_pages = DIV_ROUND_UP(10 * num_pages, 8); /* worstcase zlib growth */
-	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ALLOW_FAIL);
+	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ATOMIC_MAYFAIL);
 	if (!dst)
 		return NULL;
 
@@ -1433,7 +1432,7 @@ capture_engine(struct intel_engine_cs *engine,
 	struct i915_request *rq = NULL;
 	unsigned long flags;
 
-	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
+	ee = intel_engine_coredump_alloc(engine, ATOMIC_MAYFAIL);
 	if (!ee)
 		return NULL;
 
@@ -1481,7 +1480,7 @@ gt_record_engines(struct intel_gt_coredump *gt,
 		struct intel_engine_coredump *ee;
 
 		/* Refill our page pool before entering atomic section */
-		pool_refill(&compress->pool, ALLOW_FAIL);
+		pool_refill(&compress->pool, ATOMIC_MAYFAIL);
 
 		ee = capture_engine(engine, compress);
 		if (!ee)
@@ -1507,7 +1506,7 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	const struct intel_uc *uc = &gt->_gt->uc;
 	struct intel_uc_coredump *error_uc;
 
-	error_uc = kzalloc(sizeof(*error_uc), ALLOW_FAIL);
+	error_uc = kzalloc(sizeof(*error_uc), ATOMIC_MAYFAIL);
 	if (!error_uc)
 		return NULL;
 
@@ -1518,8 +1517,8 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	 * As modparams are generally accesible from the userspace make
 	 * explicit copies of the firmware paths.
 	 */
-	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ALLOW_FAIL);
-	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ALLOW_FAIL);
+	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ATOMIC_MAYFAIL);
+	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ATOMIC_MAYFAIL);
 	error_uc->guc_log =
 		i915_vma_coredump_create(gt->_gt,
 					 uc->guc.log.vma, "GuC log buffer",
@@ -1778,7 +1777,7 @@ i915_vma_capture_prepare(struct intel_gt_coredump *gt)
 {
 	struct i915_vma_compress *compress;
 
-	compress = kmalloc(sizeof(*compress), ALLOW_FAIL);
+	compress = kmalloc(sizeof(*compress), ATOMIC_MAYFAIL);
 	if (!compress)
 		return NULL;
 
@@ -1811,11 +1810,11 @@ i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
 	if (IS_ERR(error))
 		return error;
 
-	error = i915_gpu_coredump_alloc(i915, ALLOW_FAIL);
+	error = i915_gpu_coredump_alloc(i915, ATOMIC_MAYFAIL);
 	if (!error)
 		return ERR_PTR(-ENOMEM);
 
-	error->gt = intel_gt_coredump_alloc(gt, ALLOW_FAIL);
+	error->gt = intel_gt_coredump_alloc(gt, ATOMIC_MAYFAIL);
 	if (error->gt) {
 		struct i915_vma_compress *compress;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 15/22] drm/i915/guc: Flush G2H work queue during reset
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (13 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 14/22] drm/i915: Allocate error capture in atomic context Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17 10:06   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 16/22] drm/i915/guc: Release submit fence from an IRQ Matthew Brost
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

It isn't safe to scrub for missing G2H or continue with the reset until
all G2H processing is complete. Flush the G2H work queue during reset to
ensure it has finished running.

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3a01743e09ea..8c560ed14976 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -707,8 +707,6 @@ static void guc_flush_submissions(struct intel_guc *guc)
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -724,20 +722,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
+
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 16/22] drm/i915/guc: Release submit fence from an IRQ
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (14 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 15/22] drm/i915/guc: Flush G2H work queue during reset Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17 10:08   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 17/22] drm/i915/guc: Move guc_blocked fence to struct guc_state Matthew Brost
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

A subsequent patch will flip the locking hierarchy from
ce->guc_state.lock -> sched_engine->lock to sched_engine->lock ->
ce->guc_state.lock. As such we need to release the submit fence for a
request from an IRQ to break a lock inversion: the fence must be
released while holding ce->guc_state.lock, and releasing it can acquire
sched_engine->lock.
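
A minimal userspace sketch of the inversion being broken (pthread
mutexes standing in for the two locks; the deferred callback models the
irq_work added below):

  #include <pthread.h>

  static pthread_mutex_t engine_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Completing the submit fence may take the engine lock. */
  static void complete_submit_fence(void)
  {
          pthread_mutex_lock(&engine_lock);
          pthread_mutex_unlock(&engine_lock);
  }

  /* Deferred callback: runs with no locks held, so the engine lock
   * is never acquired while the state lock is held. */
  static void submit_work_cb(void)
  {
          complete_submit_fence();
  }

  static void signal_context_fences(void)
  {
          pthread_mutex_lock(&state_lock);
          /* Queue submit_work_cb() here instead of calling
           * complete_submit_fence() directly under state_lock. */
          pthread_mutex_unlock(&state_lock);

          submit_work_cb();       /* stand-in for irq_work running later */
  }

  int main(void)
  {
          signal_context_fences();
          return 0;
  }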

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
 drivers/gpu/drm/i915/i915_request.h               |  5 +++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 8c560ed14976..9ae4633aa7cb 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2017,6 +2017,14 @@ static const struct intel_context_ops guc_context_ops = {
 	.create_virtual = guc_create_virtual,
 };
 
+static void submit_work_cb(struct irq_work *wrk)
+{
+	struct i915_request *rq = container_of(wrk, typeof(*rq), submit_work);
+
+	might_lock(&rq->engine->sched_engine->lock);
+	i915_sw_fence_complete(&rq->submit);
+}
+
 static void __guc_signal_context_fence(struct intel_context *ce)
 {
 	struct i915_request *rq;
@@ -2026,8 +2034,12 @@ static void __guc_signal_context_fence(struct intel_context *ce)
 	if (!list_empty(&ce->guc_state.fences))
 		trace_intel_context_fence_release(ce);
 
+	/*
+	 * Use an IRQ to ensure locking order of sched_engine->lock ->
+	 * ce->guc_state.lock is preserved.
+	 */
 	list_for_each_entry(rq, &ce->guc_state.fences, guc_fence_link)
-		i915_sw_fence_complete(&rq->submit);
+		irq_work_queue(&rq->submit_work);
 
 	INIT_LIST_HEAD(&ce->guc_state.fences);
 }
@@ -2137,6 +2149,7 @@ static int guc_request_alloc(struct i915_request *rq)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	if (context_wait_for_deregister_to_register(ce) ||
 	    context_pending_disable(ce)) {
+		init_irq_work(&rq->submit_work, submit_work_cb);
 		i915_sw_fence_await(&rq->submit);
 
 		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 1bc1349ba3c2..d818cfbfc41d 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -218,6 +218,11 @@ struct i915_request {
 	};
 	struct llist_head execute_cb;
 	struct i915_sw_fence semaphore;
+	/**
+	 * @submit_work: complete submit fence from an IRQ if needed for
+	 * locking hierarchy reasons.
+	 */
+	struct irq_work submit_work;
 
 	/*
 	 * A list of everyone we wait upon, and everyone who waits upon us.
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 17/22] drm/i915/guc: Move guc_blocked fence to struct guc_state
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (15 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 16/22] drm/i915/guc: Release submit fence from an IRQ Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17 10:10   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 18/22] drm/i915/guc: Rework and simplify locking Matthew Brost
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Move guc_blocked fence to struct guc_state as the lock which protects
the fence lives there.

s/ce->guc_blocked/ce->guc_state.blocked_fence/g

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c        |  5 +++--
 drivers/gpu/drm/i915/gt/intel_context_types.h  |  5 ++---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +++++++++---------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 745e84c72c90..0e48939ec85f 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -405,8 +405,9 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
 	 */
-	i915_sw_fence_init(&ce->guc_blocked, sw_fence_dummy_notify);
-	i915_sw_fence_commit(&ce->guc_blocked);
+	i915_sw_fence_init(&ce->guc_state.blocked_fence,
+			   sw_fence_dummy_notify);
+	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
 
 	i915_active_init(&ce->active,
 			 __intel_context_active, __intel_context_retire, 0);
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 3a73f3117873..c06171ee8792 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -167,6 +167,8 @@ struct intel_context {
 		 * fence related to GuC submission
 		 */
 		struct list_head fences;
+		/* GuC context blocked fence */
+		struct i915_sw_fence blocked_fence;
 	} guc_state;
 
 	struct {
@@ -190,9 +192,6 @@ struct intel_context {
 	 */
 	struct list_head guc_id_link;
 
-	/* GuC context blocked fence */
-	struct i915_sw_fence guc_blocked;
-
 	/*
 	 * GuC priority management
 	 */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9ae4633aa7cb..7aa16371908a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1482,24 +1482,24 @@ static void guc_blocked_fence_complete(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 
-	if (!i915_sw_fence_done(&ce->guc_blocked))
-		i915_sw_fence_complete(&ce->guc_blocked);
+	if (!i915_sw_fence_done(&ce->guc_state.blocked_fence))
+		i915_sw_fence_complete(&ce->guc_state.blocked_fence);
 }
 
 static void guc_blocked_fence_reinit(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_blocked));
+	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_state.blocked_fence));
 
 	/*
 	 * This fence is always complete unless a pending schedule disable is
 	 * outstanding. We arm the fence here and complete it when we receive
 	 * the pending schedule disable complete message.
 	 */
-	i915_sw_fence_fini(&ce->guc_blocked);
-	i915_sw_fence_reinit(&ce->guc_blocked);
-	i915_sw_fence_await(&ce->guc_blocked);
-	i915_sw_fence_commit(&ce->guc_blocked);
+	i915_sw_fence_fini(&ce->guc_state.blocked_fence);
+	i915_sw_fence_reinit(&ce->guc_state.blocked_fence);
+	i915_sw_fence_await(&ce->guc_state.blocked_fence);
+	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
 }
 
 static u16 prep_context_pending_disable(struct intel_context *ce)
@@ -1539,7 +1539,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 		if (enabled)
 			clr_context_enabled(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-		return &ce->guc_blocked;
+		return &ce->guc_state.blocked_fence;
 	}
 
 	/*
@@ -1555,7 +1555,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 	with_intel_runtime_pm(runtime_pm, wakeref)
 		__guc_context_sched_disable(guc, ce, guc_id);
 
-	return &ce->guc_blocked;
+	return &ce->guc_state.blocked_fence;
 }
 
 static void guc_context_unblock(struct intel_context *ce)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 18/22] drm/i915/guc: Rework and simplify locking
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (16 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 17/22] drm/i915/guc: Move guc_blocked fence to struct guc_state Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17 10:15   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 19/22] drm/i915/guc: Proper xarray usage for contexts_lookup Matthew Brost
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Rework and simplify the locking in GuC submission. Drop
sched_state_no_lock, fold all scheduling-state flags into
guc_state.sched_state, and protect them with guc_state.lock. This requires
changing the locking hierarchy from guc_state.lock -> sched_engine.lock
to sched_engine.lock -> guc_state.lock.
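
For orientation, a sketch of the new nesting (illustrative only;
guc_add_request() in the diff is a real instance, with sched_engine->lock
already held by the submit path):

static void ordered_locking_sketch(struct i915_sched_engine *se,
				   struct intel_context *ce)
{
	unsigned long flags;

	spin_lock_irqsave(&se->lock, flags);	/* outer: submission lock */
	spin_lock(&ce->guc_state.lock);		/* inner: per-context state */

	/* ... read/modify ce->guc_state.sched_state ... */

	spin_unlock(&ce->guc_state.lock);
	spin_unlock_irqrestore(&se->lock, flags);
}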

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |   5 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 186 ++++++++----------
 drivers/gpu/drm/i915/i915_trace.h             |   6 +-
 3 files changed, 89 insertions(+), 108 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index c06171ee8792..d5d643b04d54 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -161,7 +161,7 @@ struct intel_context {
 		 * sched_state: scheduling state of this context using GuC
 		 * submission
 		 */
-		u16 sched_state;
+		u32 sched_state;
 		/*
 		 * fences: maintains of list of requests that have a submit
 		 * fence related to GuC submission
@@ -178,9 +178,6 @@ struct intel_context {
 		struct list_head requests;
 	} guc_active;
 
-	/* GuC scheduling state flags that do not require a lock. */
-	atomic_t guc_sched_state_no_lock;
-
 	/* GuC LRC descriptor ID */
 	u16 guc_id;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 7aa16371908a..ba19b99173fc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -72,86 +72,23 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
-/*
- * Below is a set of functions which control the GuC scheduling state which do
- * not require a lock as all state transitions are mutually exclusive. i.e. It
- * is not possible for the context pinning code and submission, for the same
- * context, to be executing simultaneously. We still need an atomic as it is
- * possible for some of the bits to changing at the same time though.
- */
-#define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
-#define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
-#define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
-static inline bool context_enabled(struct intel_context *ce)
-{
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_ENABLED);
-}
-
-static inline void set_context_enabled(struct intel_context *ce)
-{
-	atomic_or(SCHED_STATE_NO_LOCK_ENABLED, &ce->guc_sched_state_no_lock);
-}
-
-static inline void clr_context_enabled(struct intel_context *ce)
-{
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
-		   &ce->guc_sched_state_no_lock);
-}
-
-static inline bool context_pending_enable(struct intel_context *ce)
-{
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_PENDING_ENABLE);
-}
-
-static inline void set_context_pending_enable(struct intel_context *ce)
-{
-	atomic_or(SCHED_STATE_NO_LOCK_PENDING_ENABLE,
-		  &ce->guc_sched_state_no_lock);
-}
-
-static inline void clr_context_pending_enable(struct intel_context *ce)
-{
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_PENDING_ENABLE,
-		   &ce->guc_sched_state_no_lock);
-}
-
-static inline bool context_registered(struct intel_context *ce)
-{
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_REGISTERED);
-}
-
-static inline void set_context_registered(struct intel_context *ce)
-{
-	atomic_or(SCHED_STATE_NO_LOCK_REGISTERED,
-		  &ce->guc_sched_state_no_lock);
-}
-
-static inline void clr_context_registered(struct intel_context *ce)
-{
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_REGISTERED,
-		   &ce->guc_sched_state_no_lock);
-}
-
 /*
  * Below is a set of functions which control the GuC scheduling state which
- * require a lock, aside from the special case where the functions are called
- * from guc_lrc_desc_pin(). In that case it isn't possible for any other code
- * path to be executing on the context.
+ * require a lock.
  */
 #define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
 #define SCHED_STATE_DESTROYED				BIT(1)
 #define SCHED_STATE_PENDING_DISABLE			BIT(2)
 #define SCHED_STATE_BANNED				BIT(3)
-#define SCHED_STATE_BLOCKED_SHIFT			4
+#define SCHED_STATE_ENABLED				BIT(4)
+#define SCHED_STATE_PENDING_ENABLE			BIT(5)
+#define SCHED_STATE_REGISTERED				BIT(6)
+#define SCHED_STATE_BLOCKED_SHIFT			7
 #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
 #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
 static inline void init_sched_state(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	atomic_set(&ce->guc_sched_state_no_lock, 0);
 	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
@@ -161,9 +98,8 @@ static inline bool sched_state_is_init(struct intel_context *ce)
 	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
 	 * suspend.
 	 */
-	return !(atomic_read(&ce->guc_sched_state_no_lock) &
-		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
-		!(ce->guc_state.sched_state &= ~SCHED_STATE_BLOCKED_MASK);
+	return !(ce->guc_state.sched_state &
+		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
 }
 
 static inline bool
@@ -236,6 +172,57 @@ static inline void clr_context_banned(struct intel_context *ce)
 	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
 }
 
+static inline bool context_enabled(struct intel_context *ce)
+{
+	return ce->guc_state.sched_state & SCHED_STATE_ENABLED;
+}
+
+static inline void set_context_enabled(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_ENABLED;
+}
+
+static inline void clr_context_enabled(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
+}
+
+static inline bool context_pending_enable(struct intel_context *ce)
+{
+	return ce->guc_state.sched_state & SCHED_STATE_PENDING_ENABLE;
+}
+
+static inline void set_context_pending_enable(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_PENDING_ENABLE;
+}
+
+static inline void clr_context_pending_enable(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_ENABLE;
+}
+
+static inline bool context_registered(struct intel_context *ce)
+{
+	return ce->guc_state.sched_state & SCHED_STATE_REGISTERED;
+}
+
+static inline void set_context_registered(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_REGISTERED;
+}
+
+static inline void clr_context_registered(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_REGISTERED;
+}
+
 static inline u32 context_blocked(struct intel_context *ce)
 {
 	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
@@ -244,7 +231,6 @@ static inline u32 context_blocked(struct intel_context *ce)
 
 static inline void incr_context_blocked(struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->engine->sched_engine->lock);
 	lockdep_assert_held(&ce->guc_state.lock);
 
 	ce->guc_state.sched_state += SCHED_STATE_BLOCKED;
@@ -254,7 +240,6 @@ static inline void incr_context_blocked(struct intel_context *ce)
 
 static inline void decr_context_blocked(struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->engine->sched_engine->lock);
 	lockdep_assert_held(&ce->guc_state.lock);
 
 	GEM_BUG_ON(!context_blocked(ce));	/* Underflow check */
@@ -443,6 +428,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	u32 g2h_len_dw = 0;
 	bool enabled;
 
+	lockdep_assert_held(&rq->engine->sched_engine->lock);
+
 	/*
 	 * Corner case where requests were sitting in the priority list or a
 	 * request resubmitted after the context was banned.
@@ -450,7 +437,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	if (unlikely(intel_context_is_banned(ce))) {
 		i915_request_put(i915_request_mark_eio(rq));
 		intel_engine_signal_breadcrumbs(ce->engine);
-		goto out;
+		return 0;
 	}
 
 	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
@@ -463,9 +450,11 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
 		err = guc_lrc_desc_pin(ce, false);
 		if (unlikely(err))
-			goto out;
+			return err;
 	}
 
+	spin_lock(&ce->guc_state.lock);
+
 	/*
 	 * The request / context will be run on the hardware when scheduling
 	 * gets enabled in the unblock.
@@ -500,6 +489,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		trace_i915_request_guc_submit(rq);
 
 out:
+	spin_unlock(&ce->guc_state.lock);
 	return err;
 }
 
@@ -720,8 +710,6 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	spin_lock_irq(&guc_to_gt(guc)->irq_lock);
 	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
 
-	guc_flush_submissions(guc);
-
 	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
@@ -1125,7 +1113,11 @@ static int steal_guc_id(struct intel_guc *guc)
 
 		list_del_init(&ce->guc_id_link);
 		guc_id = ce->guc_id;
+
+		spin_lock(&ce->guc_state.lock);
 		clr_context_registered(ce);
+		spin_unlock(&ce->guc_state.lock);
+
 		set_context_guc_id_invalid(ce);
 		return guc_id;
 	} else {
@@ -1161,6 +1153,8 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 try_again:
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 
+	might_lock(&ce->guc_state.lock);
+
 	if (context_guc_id_invalid(ce)) {
 		ret = assign_guc_id(guc, &ce->guc_id);
 		if (ret)
@@ -1240,8 +1234,13 @@ static int register_context(struct intel_context *ce, bool loop)
 	trace_intel_context_register(ce);
 
 	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
-	if (likely(!ret))
+	if (likely(!ret)) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		set_context_registered(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	}
 
 	return ret;
 }
@@ -1517,7 +1516,6 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
 static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
-	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
 	unsigned long flags;
 	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	intel_wakeref_t wakeref;
@@ -1526,13 +1524,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
-	/*
-	 * Sync with submission path, increment before below changes to context
-	 * state.
-	 */
-	spin_lock(&sched_engine->lock);
 	incr_context_blocked(ce);
-	spin_unlock(&sched_engine->lock);
 
 	enabled = context_enabled(ce);
 	if (unlikely(!enabled || submission_disabled(guc))) {
@@ -1561,7 +1553,6 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 static void guc_context_unblock(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
-	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
 	unsigned long flags;
 	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	intel_wakeref_t wakeref;
@@ -1586,13 +1577,7 @@ static void guc_context_unblock(struct intel_context *ce)
 		intel_context_get(ce);
 	}
 
-	/*
-	 * Sync with submission path, decrement after above changes to context
-	 * state.
-	 */
-	spin_lock(&sched_engine->lock);
 	decr_context_blocked(ce);
-	spin_unlock(&sched_engine->lock);
 
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
@@ -1702,7 +1687,9 @@ static void guc_context_sched_disable(struct intel_context *ce)
 
 	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
 	    !lrc_desc_registered(guc, ce->guc_id)) {
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		clr_context_enabled(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		goto unpin;
 	}
 
@@ -1752,7 +1739,6 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
 	GEM_BUG_ON(context_enabled(ce));
 
-	clr_context_registered(ce);
 	deregister_context(ce, ce->guc_id, true);
 }
 
@@ -1825,8 +1811,10 @@ static void guc_context_destroy(struct kref *kref)
 	/* Seal race with Reset */
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	disabled = submission_disabled(guc);
-	if (likely(!disabled))
+	if (likely(!disabled)) {
 		set_context_destroyed(ce);
+		clr_context_registered(ce);
+	}
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	if (unlikely(disabled)) {
 		release_guc_id(guc, ce);
@@ -2695,8 +2683,7 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 		     (!context_pending_enable(ce) &&
 		     !context_pending_disable(ce)))) {
 		drm_err(&guc_to_gt(guc)->i915->drm,
-			"Bad context sched_state 0x%x, 0x%x, desc_idx %u",
-			atomic_read(&ce->guc_sched_state_no_lock),
+			"Bad context sched_state 0x%x, desc_idx %u",
 			ce->guc_state.sched_state, desc_idx);
 		return -EPROTO;
 	}
@@ -2711,7 +2698,9 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 		}
 #endif
 
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		clr_context_pending_enable(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	} else if (context_pending_disable(ce)) {
 		bool banned;
 
@@ -2985,9 +2974,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 			   atomic_read(&ce->pin_count));
 		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
 			   atomic_read(&ce->guc_id_ref));
-		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
-			   ce->guc_state.sched_state,
-			   atomic_read(&ce->guc_sched_state_no_lock));
+		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
+			   ce->guc_state.sched_state);
 
 		guc_log_context_priority(p, ce);
 	}
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 806ad688274b..0a77eb2944b5 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -903,7 +903,6 @@ DECLARE_EVENT_CLASS(intel_context,
 			     __field(u32, guc_id)
 			     __field(int, pin_count)
 			     __field(u32, sched_state)
-			     __field(u32, guc_sched_state_no_lock)
 			     __field(u8, guc_prio)
 			     ),
 
@@ -911,15 +910,12 @@ DECLARE_EVENT_CLASS(intel_context,
 			   __entry->guc_id = ce->guc_id;
 			   __entry->pin_count = atomic_read(&ce->pin_count);
 			   __entry->sched_state = ce->guc_state.sched_state;
-			   __entry->guc_sched_state_no_lock =
-			   atomic_read(&ce->guc_sched_state_no_lock);
 			   __entry->guc_prio = ce->guc_prio;
 			   ),
 
-		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x,0x%x, guc_prio=%u",
+		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
 			      __entry->guc_id, __entry->pin_count,
 			      __entry->sched_state,
-			      __entry->guc_sched_state_no_lock,
 			      __entry->guc_prio)
 );
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 19/22] drm/i915/guc: Proper xarray usage for contexts_lookup
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (17 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 18/22] drm/i915/guc: Rework and simplify locking Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17 10:27   ` Daniel Vetter
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 20/22] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin Matthew Brost
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Lock the xarray and take a reference to the context if needed.
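
The iteration pattern being introduced, as an illustrative sketch; the one
wrinkle in the real code is the destroy corner case below, where the walk
must carry on even if the ref is already zero:

static void walk_contexts_sketch(struct intel_guc *guc)
{
	struct intel_context *ce;
	unsigned long index, flags;

	xa_lock_irqsave(&guc->context_lookup, flags);
	xa_for_each(&guc->context_lookup, index, ce) {
		/* pin the entry so it can't be freed once the lock drops */
		if (!kref_get_unless_zero(&ce->ref))
			continue;

		xa_unlock(&guc->context_lookup);

		/* ... work on ce without holding the xarray lock ... */

		intel_context_put(ce);
		xa_lock(&guc->context_lookup);
	}
	xa_unlock_irqrestore(&guc->context_lookup, flags);
}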

v2:
 (Checkpatch)
  - Add new line after declaration

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 84 ++++++++++++++++---
 1 file changed, 73 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ba19b99173fc..2ecb2f002bed 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -599,8 +599,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 	unsigned long index, flags;
 	bool pending_disable, pending_enable, deregister, destroyed, banned;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		spin_lock_irqsave(&ce->guc_state.lock, flags);
+		/*
+		 * Corner case where the ref count on the object is zero but a
+		 * deregister G2H was lost. In this case we don't touch the ref
+		 * count and finish the destroy of the context.
+		 */
+		bool do_put = kref_get_unless_zero(&ce->ref);
+
+		xa_unlock(&guc->context_lookup);
+
+		spin_lock(&ce->guc_state.lock);
 
 		/*
 		 * Once we are at this point submission_disabled() is guaranteed
@@ -616,7 +626,9 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 		banned = context_banned(ce);
 		init_sched_state(ce);
 
-		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		spin_unlock(&ce->guc_state.lock);
+
+		GEM_BUG_ON(!do_put && !destroyed);
 
 		if (pending_enable || destroyed || deregister) {
 			atomic_dec(&guc->outstanding_submission_g2h);
@@ -645,7 +657,12 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 
 			intel_context_put(ce);
 		}
+
+		if (do_put)
+			intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 static inline bool
@@ -866,16 +883,26 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
 
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
 	}
 
-	xa_for_each(&guc->context_lookup, index, ce)
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	xa_for_each(&guc->context_lookup, index, ce) {
+		intel_context_get(ce);
+		xa_unlock(&guc->context_lookup);
+
 		if (intel_context_is_pinned(ce))
 			__guc_reset_context(ce, stalled);
 
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
+	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	/* GuC is blown away, drop all references to contexts */
 	xa_destroy(&guc->context_lookup);
 }
@@ -950,11 +977,21 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
+
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	xa_for_each(&guc->context_lookup, index, ce) {
+		intel_context_get(ce);
+		xa_unlock(&guc->context_lookup);
 
-	xa_for_each(&guc->context_lookup, index, ce)
 		if (intel_context_is_pinned(ce))
 			guc_cancel_context_requests(ce);
 
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
+	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	guc_cancel_sched_engine_requests(guc->sched_engine);
 
 	/* GuC is blown away, drop all references to contexts */
@@ -2848,21 +2885,26 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 	struct intel_context *ce;
 	struct i915_request *rq;
 	unsigned long index;
+	unsigned long flags;
 
 	/* Reset called during driver load? GuC not yet initialised! */
 	if (unlikely(!guc_submission_initialized(guc)))
 		return;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
+		intel_context_get(ce);
+		xa_unlock(&guc->context_lookup);
+
 		if (!intel_context_is_pinned(ce))
-			continue;
+			goto next;
 
 		if (intel_engine_is_virtual(ce->engine)) {
 			if (!(ce->engine->mask & engine->mask))
-				continue;
+				goto next;
 		} else {
 			if (ce->engine != engine)
-				continue;
+				goto next;
 		}
 
 		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
@@ -2872,9 +2914,17 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 			intel_engine_set_hung_context(engine, ce);
 
 			/* Can only cope with one hang at a time... */
-			return;
+			intel_context_put(ce);
+			xa_lock(&guc->context_lookup);
+			goto done;
 		}
+next:
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
+
 	}
+done:
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
@@ -2890,23 +2940,32 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
 	if (unlikely(!guc_submission_initialized(guc)))
 		return;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
+		intel_context_get(ce);
+		xa_unlock(&guc->context_lookup);
+
 		if (!intel_context_is_pinned(ce))
-			continue;
+			goto next;
 
 		if (intel_engine_is_virtual(ce->engine)) {
 			if (!(ce->engine->mask & engine->mask))
-				continue;
+				goto next;
 		} else {
 			if (ce->engine != engine)
-				continue;
+				goto next;
 		}
 
 		spin_lock_irqsave(&ce->guc_active.lock, flags);
 		intel_engine_dump_active_requests(&ce->guc_active.requests,
 						  hung_rq, m);
 		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
+
+next:
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 void intel_guc_submission_print_info(struct intel_guc *guc,
@@ -2960,7 +3019,9 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
 		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
 		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
@@ -2979,6 +3040,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 
 		guc_log_context_priority(p, ce);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 static struct intel_context *
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 20/22] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (18 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 19/22] drm/i915/guc: Proper xarray usage for contexts_lookup Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 21/22] drm/i915/guc: Move GuC priority fields in context under guc_active Matthew Brost
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Drop the pin count check trick between a sched_disable and re-pin; instead
rely on the lock and a counter of committed requests to determine whether
scheduling should be disabled on the context.
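
In sketch form (illustrative only): the counter is incremented when a
request is committed and decremented when it is removed from the context,
both under ce->guc_state.lock, so checking it under that same lock cannot
race with a new commit.

static bool can_sched_disable_sketch(struct intel_context *ce)
{
	lockdep_assert_held(&ce->guc_state.lock);

	/* a committed request means the context will run again soon */
	return !ce->guc_state.number_committed_requests;
}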

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  2 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 49 ++++++++++++-------
 2 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index d5d643b04d54..524a35a78bf4 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -169,6 +169,8 @@ struct intel_context {
 		struct list_head fences;
 		/* GuC context blocked fence */
 		struct i915_sw_fence blocked_fence;
+		/* GuC committed requests */
+		int number_committed_requests;
 	} guc_state;
 
 	struct {
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 2ecb2f002bed..c6ae6b4417c2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -247,6 +247,25 @@ static inline void decr_context_blocked(struct intel_context *ce)
 	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
 }
 
+static inline bool context_has_committed_requests(struct intel_context *ce)
+{
+	return !!ce->guc_state.number_committed_requests;
+}
+
+static inline void incr_context_committed_requests(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	++ce->guc_state.number_committed_requests;
+	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
+}
+
+static inline void decr_context_committed_requests(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	--ce->guc_state.number_committed_requests;
+	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
+}
+
 static inline bool context_guc_id_invalid(struct intel_context *ce)
 {
 	return ce->guc_id == GUC_INVALID_LRC_ID;
@@ -1736,14 +1755,11 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	/*
-	 * We have to check if the context has been disabled by another thread.
-	 * We also have to check if the context has been pinned again as another
-	 * pin operation is allowed to pass this function. Checking the pin
-	 * count, within ce->guc_state.lock, synchronizes this function with
-	 * guc_request_alloc ensuring a request doesn't slip through the
-	 * 'context_pending_disable' fence. Checking within the spin lock (can't
-	 * sleep) ensures another process doesn't pin this context and generate
-	 * a request before we set the 'context_pending_disable' flag here.
+	 * We have to check if the context has been disabled by another thread,
+	 * check if submission has been disabled to seal a race with reset and
+	 * finally check if any more requests have been committed to the
+	 * context, ensuring that a request doesn't slip through the
+	 * 'context_pending_disable' fence.
 	 */
 	enabled = context_enabled(ce);
 	if (unlikely(!enabled || submission_disabled(guc))) {
@@ -1752,7 +1768,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		goto unpin;
 	}
-	if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
+	if (unlikely(context_has_committed_requests(ce))) {
+		intel_context_sched_disable_unpin(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		return;
 	}
@@ -1785,6 +1802,7 @@ static void __guc_context_destroy(struct intel_context *ce)
 		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
 		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
 		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
+	GEM_BUG_ON(ce->guc_state.number_committed_requests);
 
 	lrc_fini(ce);
 	intel_context_fini(ce);
@@ -2015,6 +2033,10 @@ static void remove_from_context(struct i915_request *rq)
 
 	spin_unlock_irq(&ce->guc_active.lock);
 
+	spin_lock_irq(&ce->guc_state.lock);
+	decr_context_committed_requests(ce);
+	spin_unlock_irq(&ce->guc_state.lock);
+
 	atomic_dec(&ce->guc_id_ref);
 	i915_request_notify_execute_cb_imm(rq);
 }
@@ -2162,15 +2184,7 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * schedule enable or context registration if either G2H is pending
 	 * respectfully. Once a G2H returns, the fence is released that is
 	 * blocking these requests (see guc_signal_context_fence).
-	 *
-	 * We can safely check the below fields outside of the lock as it isn't
-	 * possible for these fields to transition from being clear to set but
-	 * converse is possible, hence the need for the check within the lock.
 	 */
-	if (likely(!context_wait_for_deregister_to_register(ce) &&
-		   !context_pending_disable(ce)))
-		return 0;
-
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	if (context_wait_for_deregister_to_register(ce) ||
 	    context_pending_disable(ce)) {
@@ -2179,6 +2193,7 @@ static int guc_request_alloc(struct i915_request *rq)
 
 		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
 	}
+	incr_context_committed_requests(ce);
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	return 0;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 21/22] drm/i915/guc: Move GuC priority fields in context under guc_active
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (19 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 20/22] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc Matthew Brost
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Move the GuC priority management fields in the context under the guc_active
struct as this is where the lock that protects these fields lives. Also
only set the guc_prio field once during context init.
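
The lazy one-shot init relies on the usual test_bit()/set_bit() idiom,
sketched here; it assumes guc_context_init() marks CONTEXT_GUC_INIT once
its setup is done:

static void maybe_guc_init_sketch(struct intel_context *ce)
{
	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags))) {
		/* one-time setup, e.g. seeding ce->guc_active.guc_prio */
		set_bit(CONTEXT_GUC_INIT, &ce->flags);
	}
}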

Fixes: ee242ca704d3 ("drm/i915/guc: Implement GuC priority management")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 68 +++++++++++--------
 drivers/gpu/drm/i915/i915_trace.h             |  2 +-
 3 files changed, 45 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 524a35a78bf4..f6989e6807f7 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -112,6 +112,7 @@ struct intel_context {
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
 #define CONTEXT_NOPREEMPT		8
 #define CONTEXT_LRCA_DIRTY		9
+#define CONTEXT_GUC_INIT		10
 
 	struct {
 		u64 timeout_us;
@@ -178,6 +179,11 @@ struct intel_context {
 		spinlock_t lock;
 		/** requests: active requests on this context */
 		struct list_head requests;
+		/*
+		 * GuC priority management
+		 */
+		u8 guc_prio;
+		u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
 	} guc_active;
 
 	/* GuC LRC descriptor ID */
@@ -191,12 +197,6 @@ struct intel_context {
 	 */
 	struct list_head guc_id_link;
 
-	/*
-	 * GuC priority management
-	 */
-	u8 guc_prio;
-	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
-
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
 	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index c6ae6b4417c2..eb06a4c7534e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1354,8 +1354,6 @@ static void guc_context_policy_init(struct intel_engine_cs *engine,
 	desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
 }
 
-static inline u8 map_i915_prio_to_guc_prio(int prio);
-
 static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 {
 	struct intel_engine_cs *engine = ce->engine;
@@ -1363,8 +1361,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	struct intel_guc *guc = &engine->gt->uc.guc;
 	u32 desc_idx = ce->guc_id;
 	struct guc_lrc_desc *desc;
-	const struct i915_gem_context *ctx;
-	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
 	bool context_registered;
 	intel_wakeref_t wakeref;
 	int ret = 0;
@@ -1381,12 +1377,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 
 	context_registered = lrc_desc_registered(guc, desc_idx);
 
-	rcu_read_lock();
-	ctx = rcu_dereference(ce->gem_context);
-	if (ctx)
-		prio = ctx->sched.priority;
-	rcu_read_unlock();
-
 	reset_lrc_desc(guc, desc_idx);
 	set_lrc_desc_registered(guc, desc_idx, ce);
 
@@ -1395,8 +1385,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	desc->engine_submit_mask = adjust_engine_mask(engine->class,
 						      engine->mask);
 	desc->hw_context_desc = ce->lrc.lrca;
-	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
-	desc->priority = ce->guc_prio;
+	desc->priority = ce->guc_active.guc_prio;
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 	guc_context_policy_init(engine, desc);
 
@@ -1798,10 +1787,10 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 
 static void __guc_context_destroy(struct intel_context *ce)
 {
-	GEM_BUG_ON(ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
+	GEM_BUG_ON(ce->guc_active.guc_prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
+		   ce->guc_active.guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
+		   ce->guc_active.guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
+		   ce->guc_active.guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
 	GEM_BUG_ON(ce->guc_state.number_committed_requests);
 
 	lrc_fini(ce);
@@ -1911,14 +1900,17 @@ static void guc_context_set_prio(struct intel_guc *guc,
 
 	GEM_BUG_ON(prio < GUC_CLIENT_PRIORITY_KMD_HIGH ||
 		   prio > GUC_CLIENT_PRIORITY_NORMAL);
+	lockdep_assert_held(&ce->guc_active.lock);
 
-	if (ce->guc_prio == prio || submission_disabled(guc) ||
-	    !context_registered(ce))
+	if (ce->guc_active.guc_prio == prio || submission_disabled(guc) ||
+	    !context_registered(ce)) {
+		ce->guc_active.guc_prio = prio;
 		return;
+	}
 
 	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action), 0, true);
 
-	ce->guc_prio = prio;
+	ce->guc_active.guc_prio = prio;
 	trace_intel_context_set_prio(ce);
 }
 
@@ -1938,24 +1930,24 @@ static inline void add_context_inflight_prio(struct intel_context *ce,
 					     u8 guc_prio)
 {
 	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.guc_prio_count));
 
-	++ce->guc_prio_count[guc_prio];
+	++ce->guc_active.guc_prio_count[guc_prio];
 
 	/* Overflow protection */
-	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_active.guc_prio_count[guc_prio]);
 }
 
 static inline void sub_context_inflight_prio(struct intel_context *ce,
 					     u8 guc_prio)
 {
 	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.guc_prio_count));
 
 	/* Underflow protection */
-	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_active.guc_prio_count[guc_prio]);
 
-	--ce->guc_prio_count[guc_prio];
+	--ce->guc_active.guc_prio_count[guc_prio];
 }
 
 static inline void update_context_prio(struct intel_context *ce)
@@ -1968,8 +1960,8 @@ static inline void update_context_prio(struct intel_context *ce)
 
 	lockdep_assert_held(&ce->guc_active.lock);
 
-	for (i = 0; i < ARRAY_SIZE(ce->guc_prio_count); ++i) {
-		if (ce->guc_prio_count[i]) {
+	for (i = 0; i < ARRAY_SIZE(ce->guc_active.guc_prio_count); ++i) {
+		if (ce->guc_active.guc_prio_count[i]) {
 			guc_context_set_prio(guc, ce, i);
 			break;
 		}
@@ -2108,6 +2100,23 @@ static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
 		!submission_disabled(ce_to_guc(ce));
 }
 
+static void guc_context_init(struct intel_context *ce)
+{
+	const struct i915_gem_context *ctx;
+	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
+
+	rcu_read_lock();
+	ctx = rcu_dereference(ce->gem_context);
+	if (ctx)
+		prio = ctx->sched.priority;
+	rcu_read_unlock();
+
+	ce->guc_active.guc_prio = map_i915_prio_to_guc_prio(prio);
+
+	/* mark done so the one-time init is skipped on later requests */
+	set_bit(CONTEXT_GUC_INIT, &ce->flags);
+}
+
 static int guc_request_alloc(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
@@ -2139,6 +2145,9 @@ static int guc_request_alloc(struct i915_request *rq)
 
 	rq->reserved_space -= GUC_REQUEST_SIZE;
 
+	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))
+		guc_context_init(ce);
+
 	/*
 	 * Call pin_guc_id here rather than in the pinning step as with
 	 * dma_resv, contexts can be repeatedly pinned / unpinned trashing the
@@ -3018,13 +3027,12 @@ static inline void guc_log_context_priority(struct drm_printer *p,
 {
 	int i;
 
-	drm_printf(p, "\t\tPriority: %d\n",
-		   ce->guc_prio);
+	drm_printf(p, "\t\tPriority: %d\n", ce->guc_active.guc_prio);
 	drm_printf(p, "\t\tNumber Requests (lower index == higher priority)\n");
 	for (i = GUC_CLIENT_PRIORITY_KMD_HIGH;
 	     i < GUC_CLIENT_PRIORITY_NUM; ++i) {
 		drm_printf(p, "\t\tNumber requests in priority band[%d]: %d\n",
-			   i, ce->guc_prio_count[i]);
+			   i, ce->guc_active.guc_prio_count[i]);
 	}
 	drm_printf(p, "\n");
 }
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 0a77eb2944b5..518a6fa2cca7 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -910,7 +910,7 @@ DECLARE_EVENT_CLASS(intel_context,
 			   __entry->guc_id = ce->guc_id;
 			   __entry->pin_count = atomic_read(&ce->pin_count);
 			   __entry->sched_state = ce->guc_state.sched_state;
-			   __entry->guc_prio = ce->guc_prio;
+			   __entry->guc_prio = ce->guc_active.guc_prio;
 			   ),
 
 		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (20 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 21/22] drm/i915/guc: Move GuC priority fields in context under guc_active Matthew Brost
@ 2021-08-16 13:51 ` Matthew Brost
  2021-08-17 11:11   ` Daniel Vetter
  2021-08-17 12:49 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev2) Patchwork
                   ` (3 subsequent siblings)
  25 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-16 13:51 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Add GuC kernel doc for all structures added thus far for GuC submission
and update the main GuC submission section with the new interface
details.
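
As a hedged sketch of the guc_id allocation scheme the new DOC section
describes (the bound and names here are assumptions; the ida API is real):

#include <linux/idr.h>

static DEFINE_IDA(guc_ids_sketch);

static int alloc_guc_id_sketch(void)
{
	/* 64k guc_ids per the DOC; hand out the lowest free one */
	return ida_simple_get(&guc_ids_sketch, 0, 1 << 16, GFP_KERNEL);
}

static void release_guc_id_sketch(int id)
{
	ida_simple_remove(&guc_ids_sketch, id);
}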

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  42 +++++---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  19 +++-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 101 ++++++++++++++----
 drivers/gpu/drm/i915/i915_request.h           |  18 ++--
 4 files changed, 131 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index f6989e6807f7..75d609a1bc33 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -156,44 +156,56 @@ struct intel_context {
 	u8 wa_bb_page; /* if set, page num reserved for context workarounds */
 
 	struct {
-		/** lock: protects everything in guc_state */
+		/** @lock: protects everything in guc_state */
 		spinlock_t lock;
 		/**
-		 * sched_state: scheduling state of this context using GuC
+		 * @sched_state: scheduling state of this context using GuC
 		 * submission
 		 */
 		u32 sched_state;
 		/*
-		 * fences: maintains of list of requests that have a submit
-		 * fence related to GuC submission
+		 * @fences: maintains a list of requests that are currently
+		 * being fenced until a GuC operation completes
 		 */
 		struct list_head fences;
-		/* GuC context blocked fence */
+		/**
+		 * @blocked_fence: fence used to signal when the blocking of a
+		 * context's submissions is complete.
+		 */
 		struct i915_sw_fence blocked_fence;
-		/* GuC committed requests */
+		/** @number_committed_requests: number of committed requests */
 		int number_committed_requests;
 	} guc_state;
 
 	struct {
-		/** lock: protects everything in guc_active */
+		/** @lock: protects everything in guc_active */
 		spinlock_t lock;
-		/** requests: active requests on this context */
+		/** @requests: list of active requests on this context */
 		struct list_head requests;
-		/*
-		 * GuC priority management
-		 */
+		/** @guc_prio: the context's current guc priority */
 		u8 guc_prio;
+		/**
+		 * @guc_prio_count: a counter of the number of requests inflight in
+		 * each priority bucket
+		 */
 		u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
 	} guc_active;
 
-	/* GuC LRC descriptor ID */
+	/**
+	 * @guc_id: unique handle which is used to communicate information with
+	 * the GuC about this context, protected by guc->contexts_lock
+	 */
 	u16 guc_id;
 
-	/* GuC LRC descriptor reference count */
+	/**
+	 * @guc_id_ref: the number of references to the guc_id; protected by
+	 * guc->contexts_lock when transitioning in and out of zero
+	 */
 	atomic_t guc_id_ref;
 
-	/*
-	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
+	/**
+	 * @guc_id_link: in guc->guc_id_list when the guc_id has no refs but is
+	 * still valid, protected by guc->contexts_lock
 	 */
 	struct list_head guc_id_link;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 2e27fe59786b..c0b3fdb601f0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -41,6 +41,10 @@ struct intel_guc {
 	spinlock_t irq_lock;
 	unsigned int msg_enabled_mask;
 
+	/**
+	 * @outstanding_submission_g2h: number of outstanding G2H related to GuC
+	 * submission, used to determine if the GT is idle
+	 */
 	atomic_t outstanding_submission_g2h;
 
 	struct {
@@ -49,12 +53,16 @@ struct intel_guc {
 		void (*disable)(struct intel_guc *guc);
 	} interrupts;
 
-	/*
-	 * contexts_lock protects the pool of free guc ids and a linked list of
-	 * guc ids available to be stolen
+	/**
+	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id, and
+	 * ce->guc_id_ref when transitioning in and out of zero
 	 */
 	spinlock_t contexts_lock;
+	/** @guc_ids: used to allocate new guc_ids */
 	struct ida guc_ids;
+	/**
+	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
+	 */
 	struct list_head guc_id_list;
 
 	bool submission_supported;
@@ -70,7 +78,10 @@ struct intel_guc {
 	struct i915_vma *lrc_desc_pool;
 	void *lrc_desc_pool_vaddr;
 
-	/* guc_id to intel_context lookup */
+	/**
+	 * @context_lookup: used to look up an intel_context from a guc_id, if
+	 * a context is present in this structure it is registered with the GuC
+	 */
 	struct xarray context_lookup;
 
 	/* Control params for fw initialization */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index eb06a4c7534e..18ef363c6e5d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -28,21 +28,6 @@
 /**
  * DOC: GuC-based command submission
  *
- * IMPORTANT NOTE: GuC submission is currently not supported in i915. The GuC
- * firmware is moving to an updated submission interface and we plan to
- * turn submission back on when that lands. The below documentation (and related
- * code) matches the old submission model and will be updated as part of the
- * upgrade to the new flow.
- *
- * GuC stage descriptor:
- * During initialization, the driver allocates a static pool of 1024 such
- * descriptors, and shares them with the GuC. Currently, we only use one
- * descriptor. This stage descriptor lets the GuC know about the workqueue and
- * process descriptor. Theoretically, it also lets the GuC know about our HW
- * contexts (context ID, etc...), but we actually employ a kind of submission
- * where the GuC uses the LRCA sent via the work item instead. This is called
- * a "proxy" submission.
- *
  * The Scratch registers:
  * There are 16 MMIO-based registers start from 0xC180. The kernel driver writes
  * a value to the action register (SOFT_SCRATCH_0) along with any data. It then
@@ -51,14 +36,86 @@
  * processes the request. The kernel driver polls waiting for this update and
  * then proceeds.
  *
- * Work Items:
- * There are several types of work items that the host may place into a
- * workqueue, each with its own requirements and limitations. Currently only
- * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
- * represents in-order queue. The kernel driver packs ring tail pointer and an
- * ELSP context descriptor dword into Work Item.
- * See guc_add_request()
+ * Command Transport buffers (CTBs):
+ * Covered in detail in other sections but CTBs (host-to-GuC H2G, GuC-to-host
+ * G2H) are a message interface between the i915 and the GuC used to control
+ * submissions.
+ *
+ * Context registration:
+ * Before a context can be submitted it must be registered with the GuC via a
+ * H2G. A unique guc_id is associated with each context. The context is either
+ * registered at request creation time (normal operation) or at submission time
+ * (abnormal operation, e.g. after a reset).
+ *
+ * Context submission:
+ * The i915 updates the LRC tail value in memory. Either a schedule enable H2G
+ * or context submit H2G is used to submit a context.
+ *
+ * Context unpin:
+ * To unpin a context a H2G is used to disable scheduling and when the
+ * corresponding G2H returns indicating the scheduling disable operation has
+ * completed it is safe to unpin the context. While a disable is in flight it
+ * isn't safe to resubmit the context so a fence is used to stall all future
+ * requests until the G2H is returned.
+ *
+ * Context deregistration:
+ * Before a context can be destroyed or we steal its guc_id we must deregister
+ * the context with the GuC via H2G. If stealing the guc_id it isn't safe to
+ * submit anything to this guc_id until the deregister completes so a fence is
+ * used to stall all requests associated with this guc_id until the
+ * corresponding G2H returns indicating the guc_id has been deregistered.
+ *
+ * guc_ids:
+ * Unique number associated with private GuC context data passed in during
+ * context registration / submission / deregistration. 64k available. Simple ida
+ * is used for allocation.
+ *
+ * Stealing guc_ids:
+ * If no guc_ids are available they can be stolen from another context at
+ * request creation time if that context is unpinned. If a guc_id can't be found
+ * we punt this problem to the user as we believe this is near impossible to hit
+ * during normal use cases.
+ *
+ * Locking:
+ * In the GuC submission code we have 4 basic spin locks which protect
+ * everything. Details about each below.
+ *
+ * sched_engine->lock
+ * This is the submission lock for all contexts that share an i915 schedule
+ * engine (sched_engine), thus only 1 context sharing a sched_engine can be
+ * submitting at a time. Currently only 1 sched_engine is used for all of GuC
+ * submission but that could change in the future.
+ *
+ * guc->contexts_lock
+ * Protects guc_id allocation. Global lock, i.e. only 1 context that uses GuC
+ * submission can hold this at a time.
+ *
+ * ce->guc_state.lock
+ * Protects everything under ce->guc_state. Ensures that a context is in the
+ * correct state before issuing a H2G, e.g. we don't issue a schedule disable
+ * on a disabled context (bad idea), we don't issue a schedule enable when a
+ * schedule disable is inflight, etc... Lock is individual to each context.
+ *
+ * ce->guc_active.lock
+ * Protects everything under ce->guc_active, which is the list of requests
+ * inflight on the context and its priority management. Lock is individual
+ * to each context.
+ *
+ * Lock ordering rules:
+ * sched_engine->lock -> ce->guc_active.lock
+ * sched_engine->lock -> ce->guc_state.lock
+ * guc->contexts_lock -> ce->guc_state.lock
  *
+ * Reset races:
+ * When a GPU full reset is triggered it is assumed that some G2H responses to
+ * a H2G can be lost as the GuC is likely toast. Losing these G2H can prove
+ * fatal as we do certain operations upon receiving a G2H (e.g. destroy
+ * contexts, release guc_ids, etc...). Luckily when this occurs we can scrub
+ * context state and clean up appropriately, however this is quite racy. To
+ * avoid races the rule is to check for submission being disabled (i.e. a
+ * reset in progress) with the appropriate lock held. If submission is disabled
+ * don't send the H2G or update the context state. The reset code must disable
+ * submission and grab all these locks before scrubbing for the missing G2H.
  */
 
 /* GuC Virtual Engine */
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index d818cfbfc41d..177eaf55adff 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -290,18 +290,20 @@ struct i915_request {
 		struct hrtimer timer;
 	} watchdog;
 
-	/*
-	 * Requests may need to be stalled when using GuC submission waiting for
-	 * certain GuC operations to complete. If that is the case, stalled
-	 * requests are added to a per context list of stalled requests. The
-	 * below list_head is the link in that list.
+	/**
+	 * @guc_fence_link: Requests may need to be stalled when using GuC
+	 * submission, waiting for certain GuC operations to complete. If that
+	 * is the case, stalled requests are added to a per-context list of
+	 * stalled requests. The below list_head is the link in that list.
+	 * Protected by ce->guc_state.lock.
 	 */
 	struct list_head guc_fence_link;
 
 	/**
-	 * Priority level while the request is inflight. Differs from i915
-	 * scheduler priority. See comment above
-	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details.
+	 * @guc_prio: Priority level while the request is inflight. Differs from
+	 * i915 scheduler priority. See comment above
+	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details. Protected by
+	 * ce->guc_active.lock.
 	 */
 #define	GUC_PRIO_INIT	0xff
 #define	GUC_PRIO_FINI	0xfe
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 12/22] drm/i915/guc: Don't touch guc_state.sched_state without a lock
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 12/22] drm/i915/guc: Don't touch guc_state.sched_state without a lock Matthew Brost
@ 2021-08-17  7:21   ` kernel test robot
  0 siblings, 0 replies; 56+ messages in thread
From: kernel test robot @ 2021-08-17  7:21 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: clang-built-linux, kbuild-all, daniel.vetter

[-- Attachment #1: Type: text/plain, Size: 2534 bytes --]

Hi Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on drm-tip/drm-tip next-20210816]
[cannot apply to drm-exynos/exynos-drm-next tegra-drm/drm/tegra/for-next drm/drm-next v5.14-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Clean-up-GuC-CI-failures-simplify-locking-and-kernel-DOC/20210816-220020
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-a006-20210817 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 2c6448cdc2f68f8c28fd0bd9404182b81306e6e6)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/e423aedb52eccddd07fb104ba0a6bed20ff9481a
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Clean-up-GuC-CI-failures-simplify-locking-and-kernel-DOC/20210816-220020
        git checkout e423aedb52eccddd07fb104ba0a6bed20ff9481a
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:158:20: warning: function 'sched_state_is_init' is not needed and will not be emitted [-Wunneeded-internal-declaration]
   static inline bool sched_state_is_init(struct intel_context *ce)
                      ^
   1 warning generated.


vim +/sched_state_is_init +158 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c

   157	
 > 158	static inline bool sched_state_is_init(struct intel_context *ce)
   159	{
   160		/*
   161		 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
   162		 * suspend.
   163		 */
   164		return !(atomic_read(&ce->guc_sched_state_no_lock) &
   165			 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
   166			!(ce->guc_state.sched_state &= ~SCHED_STATE_BLOCKED_MASK);
   167	}
   168	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33506 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 06/22] drm/i915/execlists: Do not propagate errors to dependent fences
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 06/22] drm/i915/execlists: Do not propagate errors to dependent fences Matthew Brost
@ 2021-08-17  9:21   ` Daniel Vetter
  2021-08-17 15:08     ` Matthew Brost
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17  9:21 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:23AM -0700, Matthew Brost wrote:
> Propagating errors to dependent fences is wrong, don't do it. A selftest
> in a following patch exposes this bug.

Please explain what "this bug" is, it's hard to read minds, especially at
a distance in spacetime :-)

> Fixes: 8e9f84cf5cac ("drm/i915/gt: Propagate change in error status to children on unhold")

I think it would be better to outright revert this, instead of just
disabling it like this.

Also please cite the dma_fence error propagation revert from Jason:

commit 93a2711cddd5760e2f0f901817d71c93183c3b87
Author: Jason Ekstrand <jason@jlekstrand.net>
Date:   Wed Jul 14 14:34:16 2021 -0500

    Revert "drm/i915: Propagate errors on awaiting already signaled fences"

Maybe in full, if you need the justification.

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>

Unless "this bug" is some real world impact thing I wouldn't put cc:
stable on this.
-Daniel
> ---
>  drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index de5f9c86b9a4..cafb0608ffb4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -2140,10 +2140,6 @@ static void __execlists_unhold(struct i915_request *rq)
>  			if (p->flags & I915_DEPENDENCY_WEAK)
>  				continue;
>  
> -			/* Propagate any change in error status */
> -			if (rq->fence.error)
> -				i915_request_set_error_once(w, rq->fence.error);
> -
>  			if (w->engine != rq->engine)
>  				continue;
>  
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 05/22] drm/i915/guc: Workaround reset G2H is received after schedule done G2H
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 05/22] drm/i915/guc: Workaround reset G2H is received after schedule done G2H Matthew Brost
@ 2021-08-17  9:32   ` Daniel Vetter
  2021-08-17 15:03     ` Matthew Brost
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17  9:32 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:22AM -0700, Matthew Brost wrote:
> If the context is reset as a result of request cancellation, the context
> reset G2H is received after the schedule disable done G2H, which is
> likely the wrong order. The schedule disable done G2H releases the
> waiting request cancellation code, which resubmits the context. This
> races with the context reset G2H, which also wants to resubmit the
> context, but in this case it really should be a NOP as the request
> cancellation code owns the resubmit. Use some clever checks of the
> context state to seal this race until if / when the GuC firmware is
> fixed.
> 
> v2:
>  (Checkpatch)
>   - Fix typos
> 
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++++++---
>  1 file changed, 37 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3cd2da6f5c03..c3b7bf7319dd 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -826,17 +826,35 @@ __unwind_incomplete_requests(struct intel_context *ce)
>  static void __guc_reset_context(struct intel_context *ce, bool stalled)
>  {
>  	struct i915_request *rq;
> +	unsigned long flags;
>  	u32 head;
> +	bool skip = false;
>  
>  	intel_context_get(ce);
>  
>  	/*
> -	 * GuC will implicitly mark the context as non-schedulable
> -	 * when it sends the reset notification. Make sure our state
> -	 * reflects this change. The context will be marked enabled
> -	 * on resubmission.
> +	 * GuC will implicitly mark the context as non-schedulable when it sends
> +	 * the reset notification. Make sure our state reflects this change. The
> +	 * context will be marked enabled on resubmission.
> +	 *
> +	 * XXX: If the context is reset as a result of the request cancellation
> +	 * this G2H is received after the schedule disable complete G2H which is
> +	 * likely wrong as this creates a race between the request cancellation
> +	 * code re-submitting the context and this G2H handler. This likely
> +	 * should be fixed in the GuC but until if / when that gets fixed we
> +	 * need to work around this. Convert this function to a NOP if a pending
> +	 * enable is in flight as this indicates that a request cancellation has
> +	 * occurred.
>  	 */
> -	clr_context_enabled(ce);
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	if (likely(!context_pending_enable(ce))) {
> +		clr_context_enabled(ce);
> +	} else {
> +		skip = true;
> +	}
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	if (unlikely(skip))
> +		goto out_put;
>  
>  	rq = intel_context_find_active_request(ce);
>  	if (!rq) {
> @@ -855,6 +873,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
>  out_replay:
>  	guc_reset_state(ce, head, stalled);
>  	__unwind_incomplete_requests(ce);
> +out_put:
>  	intel_context_put(ce);
>  }
>  
> @@ -1599,6 +1618,13 @@ static void guc_context_cancel_request(struct intel_context *ce,
>  			guc_reset_state(ce, intel_ring_wrap(ce->ring, rq->head),
>  					true);
>  		}
> +
> +		/*
> +		 * XXX: Racey if context is reset, see comment in
> +		 * __guc_reset_context().
> +		 */
> +		flush_work(&ce_to_guc(ce)->ct.requests.worker);

This looks racy, and I think that holds in general for all the flush_work
calls you're adding: this only flushes the processing of the work, it
doesn't stop any re-queueing (as far as I can tell at least), which means
it doesn't do a whole lot.

Worse, your work item requeues itself because it only processes one item
at a time. That means flush_work only flushes the first invocation and
doesn't even drain them all. So even if you do prevent requeueing
somehow, this isn't what you want. Two solutions:

- flush_work_sync, which flushes until self-requeues are all done too

- Or more preferred, make your worker a bit more standard for this
  stuff: a) under the spinlock, take the entire list, not just the first
  entry, with list_move or similar to a local list b) process that local
  list in a loop c) don't requeue yourself. See the sketch below.
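
Roughly like this (just a sketch - I'm assuming the ct->requests
lock/incoming/worker fields keep their current names and that
ct_process_request()/ct_free_msg() keep their current roles):

	static void ct_incoming_request_worker_func(struct work_struct *w)
	{
		struct intel_guc_ct *ct =
			container_of(w, struct intel_guc_ct, requests.worker);
		struct ct_incoming_msg *request, *n;
		LIST_HEAD(incoming);
		unsigned long flags;

		/* a) grab the entire list under the lock */
		spin_lock_irqsave(&ct->requests.lock, flags);
		list_splice_init(&ct->requests.incoming, &incoming);
		spin_unlock_irqrestore(&ct->requests.lock, flags);

		/* b) process the local list in a loop */
		list_for_each_entry_safe(request, n, &incoming, link) {
			ct_process_request(ct, request);
			ct_free_msg(request);
		}

		/* c) no self-requeue: new messages queue the work again */
	}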

Cheers, Daniel
> +
>  		guc_context_unblock(ce);
>  	}
>  }
> @@ -2719,7 +2745,12 @@ static void guc_handle_context_reset(struct intel_guc *guc,
>  {
>  	trace_intel_context_reset(ce);
>  
> -	if (likely(!intel_context_is_banned(ce))) {
> +	/*
> +	 * XXX: Racey if request cancellation has occurred, see comment in
> +	 * __guc_reset_context().
> +	 */
> +	if (likely(!intel_context_is_banned(ce) &&
> +		   !context_blocked(ce))) {
>  		capture_error_state(guc, ce);
>  		guc_context_replay(ce);
>  	}
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 02/22] drm/i915/guc: Fix outstanding G2H accounting
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 02/22] drm/i915/guc: Fix outstanding G2H accounting Matthew Brost
@ 2021-08-17  9:39   ` Daniel Vetter
  2021-08-17 18:17     ` Matthew Brost
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17  9:39 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:19AM -0700, Matthew Brost wrote:
> Fix a small race that could result in incorrect accounting of the number
> of outstanding G2H. Basically, prior to this patch we did not increment
> the number of outstanding G2H if we encountered a GT reset while sending
> an H2G. This was incorrect as the context state had already been updated
> to anticipate a G2H response, thus the counter should be incremented.
> 
> Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 69faa39da178..b5d3972ae164 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
>  {
>  	int err;
>  
> -	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> -
> -	if (!err && g2h_len_dw)
> +	if (g2h_len_dw)
>  		atomic_inc(&guc->outstanding_submission_g2h);
>  
> +	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);

I'm majorly confused by the _busy_loop naming scheme, especially here.
Like "why do we want to send a busy loop command to guc, this doesn't make
sense".

It seems like you're using _busy_loop as a suffix for "this is ok to be
called in atomic context". The linux kernel bikeshed for this is generally
_atomic() (or _in_atomic() or something like that). Would be good to
rename to make this slightly less confusing.
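
E.g. (name purely illustrative):

	/* hypothetical rename, same signature and behaviour */
	static int guc_submission_send_atomic(struct intel_guc *guc,
					      const u32 *action, u32 len,
					      u32 g2h_len_dw, bool loop);
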
-Daniel

> +	if (err == -EBUSY && g2h_len_dw)
> +		atomic_dec(&guc->outstanding_submission_g2h);
> +
>  	return err;
>  }
>  
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 08/22] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 08/22] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered Matthew Brost
@ 2021-08-17  9:47   ` Daniel Vetter
  2021-08-17  9:57     ` Daniel Vetter
  2021-08-17 16:44     ` Matthew Brost
  0 siblings, 2 replies; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17  9:47 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:25AM -0700, Matthew Brost wrote:
> When unblocking a context, do not enable scheduling if the context is
> banned, guc_id invalid, or not registered.
> 
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index c3b7bf7319dd..353899634fa8 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1579,6 +1579,9 @@ static void guc_context_unblock(struct intel_context *ce)
>  	spin_lock_irqsave(&ce->guc_state.lock, flags);
>  
>  	if (unlikely(submission_disabled(guc) ||
> +		     intel_context_is_banned(ce) ||
> +		     context_guc_id_invalid(ce) ||
> +		     !lrc_desc_registered(guc, ce->guc_id) ||
>  		     !intel_context_is_pinned(ce) ||
>  		     context_pending_disable(ce) ||
>  		     context_blocked(ce) > 1)) {

I think this entire if condition here is screaming that our intel_context
state machinery for guc is way too complex, and on the wrong side of
incomprehensible.

Also some of these checks look at state outside of the context, and we
don't seem to hold spinlocks for those, or anything else.

In general I have no idea which of these are defensive programming and
cannot ever happen, and which actually can happen. There's for sure way
too many races going on given that this is all context-local stuff.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 08/22] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
  2021-08-17  9:47   ` Daniel Vetter
@ 2021-08-17  9:57     ` Daniel Vetter
  2021-08-17 16:44     ` Matthew Brost
  1 sibling, 0 replies; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17  9:57 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 11:47:53AM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:25AM -0700, Matthew Brost wrote:
> > When unblocking a context, do not enable scheduling if the context is
> > banned, guc_id invalid, or not registered.
> > 
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index c3b7bf7319dd..353899634fa8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1579,6 +1579,9 @@ static void guc_context_unblock(struct intel_context *ce)
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  
> >  	if (unlikely(submission_disabled(guc) ||
> > +		     intel_context_is_banned(ce) ||
> > +		     context_guc_id_invalid(ce) ||
> > +		     !lrc_desc_registered(guc, ce->guc_id) ||
> >  		     !intel_context_is_pinned(ce) ||
> >  		     context_pending_disable(ce) ||
> >  		     context_blocked(ce) > 1)) {
> 
> I think this entire if condition here is screaming that our intel_context
> state machinery for guc is way too complex, and on the wrong side of
> incomprehensible.
> 
> Also some of these checks look at state outside of the context, and we
> don't seem to hold spinlocks for those, or anything else.
> 
> In general I have no idea which of these are defensive programming and
> cannot ever happen, and which actually can happen. There's for sure way
> too many races going on given that this is all context-local stuff.

Races here meaning that we seem to be dropping locks while the context is
in an inconsistent state, which then means that every other code path
touching contexts needs to check whether the context is in an inconsistent
state.

This is a bit of an example of protecting code vs protecting data
structures. Protecting code is having bits of intermediate/transitional
state leak outside of the locked section (like context_blocked), so that
every other piece of code must be aware of the transition and not make
things worse when it races.

This means your review and validation effort scales O(N^2) with the amount
of code and features you have. Which doesn't work.

Data-structure or object-oriented locking design goes differently:

1. You figure out what the invariants of your data structure are. That
means what should hold after each state transition is finished. I have no
idea what the solution is for all of them here, but e.g. why is
context_blocked even visible to other threads? The usual approach is a)
take the lock b) do whatever is necessary (we're talking about reset stuff
here, so performance really doesn't matter) c) unlock. I know that
i915-gem is full of these leaky counting things, but that's really not a
good design.

2. Next up, for every piece of state you decide how it's protected with a
per-object lock. The fewer locks you have (but still per-object so it's
not becoming a mess for different reasons) the higher the chances that you
don't leak inconsistent state to other threads. This is a bit tricky when
multiple objects are involved, or if you have to split your locks for a
single object because some of it needs to be accessed from irq context
(like a tasklet).

3. Document your rules in kerneldoc, so that when new code gets added you
don't have to review everything for consistency against the rules. This
way you get overall O(N) effort for validation and review, because all you
have to do is check every function that changes state against the overall
contract, and not everything against everything else.

If you have a pile of if checks every time you grab a lock, your locking
design has too much state that leaks outside of the locked sections.
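
To make the a)/b)/c) shape from point 1 concrete, an illustrative sketch
only (function name made up, flag names borrowed from later in this
series, and the real transition obviously also has to talk to the GuC):

	static void guc_context_set_blocked(struct intel_context *ce)
	{
		unsigned long flags;

		spin_lock_irqsave(&ce->guc_state.lock, flags);
		/*
		 * Do the entire transition inside the critical section so
		 * no half-updated state (a bumped blocked count without
		 * the enable bit cleared) is ever visible to other threads.
		 */
		ce->guc_state.sched_state += SCHED_STATE_BLOCKED;
		ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
	}
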
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 14/22] drm/i915: Allocate error capture in atomic context
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 14/22] drm/i915: Allocate error capture in atomic context Matthew Brost
@ 2021-08-17 10:06   ` Daniel Vetter
  2021-08-17 16:12     ` Matthew Brost
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 10:06 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:31AM -0700, Matthew Brost wrote:
> Error captures can now be done in a work queue that processes G2H
> messages. These messages must be completely processed in the reset path
> to avoid races in the missing-G2H cleanup, which creates a dependency on
> memory allocations and dma fences (i915_requests). Requests depend on
> resets, so we now have a circular dependency. To break it, allocate the
> error capture in an atomic context.
> 
> Fixes: dc0dad365c5e ("Fix for error capture after full GPU reset with GuC")
> Fixes: 573ba126aef3 ("Capture error state on context reset")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 37 +++++++++++++--------------
>  1 file changed, 18 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 0f08bcfbe964..453376aa6d9f 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -49,7 +49,6 @@
>  #include "i915_memcpy.h"
>  #include "i915_scatterlist.h"
>  
> -#define ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
>  #define ATOMIC_MAYFAIL (GFP_ATOMIC | __GFP_NOWARN)

This one doesn't make much sense. GFP_ATOMIC essentially means we're
high-priority and failure would be a pretty bad day. Meanwhile
__GFP_NOWARN means we can totally cope with failure, pls don't holler.

GFP_NOWAIT | __GFP_NOWARN would be the more consistent one here I think.

gfp.h for all the docs for this.
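
I.e. something like this (sketch only, assuming nothing in the capture
path actually needs GFP_ATOMIC's access to the emergency reserves):

	/* best effort only, all callers cope with NULL */
	#define ATOMIC_MAYFAIL (GFP_NOWAIT | __GFP_NOWARN)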

Separate patch ofc. This one is definitely the right direction, since
GFP_KERNEL from the reset worker is not a good idea.
-Daniel

>  
>  static void __sg_set_buf(struct scatterlist *sg,
> @@ -79,7 +78,7 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
>  	if (e->cur == e->end) {
>  		struct scatterlist *sgl;
>  
> -		sgl = (typeof(sgl))__get_free_page(ALLOW_FAIL);
> +		sgl = (typeof(sgl))__get_free_page(ATOMIC_MAYFAIL);
>  		if (!sgl) {
>  			e->err = -ENOMEM;
>  			return false;
> @@ -99,10 +98,10 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
>  	}
>  
>  	e->size = ALIGN(len + 1, SZ_64K);
> -	e->buf = kmalloc(e->size, ALLOW_FAIL);
> +	e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
>  	if (!e->buf) {
>  		e->size = PAGE_ALIGN(len + 1);
> -		e->buf = kmalloc(e->size, GFP_KERNEL);
> +		e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
>  	}
>  	if (!e->buf) {
>  		e->err = -ENOMEM;
> @@ -243,12 +242,12 @@ static bool compress_init(struct i915_vma_compress *c)
>  {
>  	struct z_stream_s *zstream = &c->zstream;
>  
> -	if (pool_init(&c->pool, ALLOW_FAIL))
> +	if (pool_init(&c->pool, ATOMIC_MAYFAIL))
>  		return false;
>  
>  	zstream->workspace =
>  		kmalloc(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL),
> -			ALLOW_FAIL);
> +			ATOMIC_MAYFAIL);
>  	if (!zstream->workspace) {
>  		pool_fini(&c->pool);
>  		return false;
> @@ -256,7 +255,7 @@ static bool compress_init(struct i915_vma_compress *c)
>  
>  	c->tmp = NULL;
>  	if (i915_has_memcpy_from_wc())
> -		c->tmp = pool_alloc(&c->pool, ALLOW_FAIL);
> +		c->tmp = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
>  
>  	return true;
>  }
> @@ -280,7 +279,7 @@ static void *compress_next_page(struct i915_vma_compress *c,
>  	if (dst->page_count >= dst->num_pages)
>  		return ERR_PTR(-ENOSPC);
>  
> -	page = pool_alloc(&c->pool, ALLOW_FAIL);
> +	page = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
>  	if (!page)
>  		return ERR_PTR(-ENOMEM);
>  
> @@ -376,7 +375,7 @@ struct i915_vma_compress {
>  
>  static bool compress_init(struct i915_vma_compress *c)
>  {
> -	return pool_init(&c->pool, ALLOW_FAIL) == 0;
> +	return pool_init(&c->pool, ATOMIC_MAYFAIL) == 0;
>  }
>  
>  static bool compress_start(struct i915_vma_compress *c)
> @@ -391,7 +390,7 @@ static int compress_page(struct i915_vma_compress *c,
>  {
>  	void *ptr;
>  
> -	ptr = pool_alloc(&c->pool, ALLOW_FAIL);
> +	ptr = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
>  	if (!ptr)
>  		return -ENOMEM;
>  
> @@ -997,7 +996,7 @@ i915_vma_coredump_create(const struct intel_gt *gt,
>  
>  	num_pages = min_t(u64, vma->size, vma->obj->base.size) >> PAGE_SHIFT;
>  	num_pages = DIV_ROUND_UP(10 * num_pages, 8); /* worstcase zlib growth */
> -	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ALLOW_FAIL);
> +	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ATOMIC_MAYFAIL);
>  	if (!dst)
>  		return NULL;
>  
> @@ -1433,7 +1432,7 @@ capture_engine(struct intel_engine_cs *engine,
>  	struct i915_request *rq = NULL;
>  	unsigned long flags;
>  
> -	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
> +	ee = intel_engine_coredump_alloc(engine, ATOMIC_MAYFAIL);
>  	if (!ee)
>  		return NULL;
>  
> @@ -1481,7 +1480,7 @@ gt_record_engines(struct intel_gt_coredump *gt,
>  		struct intel_engine_coredump *ee;
>  
>  		/* Refill our page pool before entering atomic section */
> -		pool_refill(&compress->pool, ALLOW_FAIL);
> +		pool_refill(&compress->pool, ATOMIC_MAYFAIL);
>  
>  		ee = capture_engine(engine, compress);
>  		if (!ee)
> @@ -1507,7 +1506,7 @@ gt_record_uc(struct intel_gt_coredump *gt,
>  	const struct intel_uc *uc = &gt->_gt->uc;
>  	struct intel_uc_coredump *error_uc;
>  
> -	error_uc = kzalloc(sizeof(*error_uc), ALLOW_FAIL);
> +	error_uc = kzalloc(sizeof(*error_uc), ATOMIC_MAYFAIL);
>  	if (!error_uc)
>  		return NULL;
>  
> @@ -1518,8 +1517,8 @@ gt_record_uc(struct intel_gt_coredump *gt,
>  	 * As modparams are generally accesible from the userspace make
>  	 * explicit copies of the firmware paths.
>  	 */
> -	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ALLOW_FAIL);
> -	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ALLOW_FAIL);
> +	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ATOMIC_MAYFAIL);
> +	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ATOMIC_MAYFAIL);
>  	error_uc->guc_log =
>  		i915_vma_coredump_create(gt->_gt,
>  					 uc->guc.log.vma, "GuC log buffer",
> @@ -1778,7 +1777,7 @@ i915_vma_capture_prepare(struct intel_gt_coredump *gt)
>  {
>  	struct i915_vma_compress *compress;
>  
> -	compress = kmalloc(sizeof(*compress), ALLOW_FAIL);
> +	compress = kmalloc(sizeof(*compress), ATOMIC_MAYFAIL);
>  	if (!compress)
>  		return NULL;
>  
> @@ -1811,11 +1810,11 @@ i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
>  	if (IS_ERR(error))
>  		return error;
>  
> -	error = i915_gpu_coredump_alloc(i915, ALLOW_FAIL);
> +	error = i915_gpu_coredump_alloc(i915, ATOMIC_MAYFAIL);
>  	if (!error)
>  		return ERR_PTR(-ENOMEM);
>  
> -	error->gt = intel_gt_coredump_alloc(gt, ALLOW_FAIL);
> +	error->gt = intel_gt_coredump_alloc(gt, ATOMIC_MAYFAIL);
>  	if (error->gt) {
>  		struct i915_vma_compress *compress;
>  
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 15/22] drm/i915/guc: Flush G2H work queue during reset
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 15/22] drm/i915/guc: Flush G2H work queue during reset Matthew Brost
@ 2021-08-17 10:06   ` Daniel Vetter
  0 siblings, 0 replies; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 10:06 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:32AM -0700, Matthew Brost wrote:
> It isn't safe to scrub for missing G2H or continue with the reset until
> all G2H processing is complete. Flush the G2H work queue during reset to
> ensure it is done running.
> 
> Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 ++----------------
>  1 file changed, 2 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3a01743e09ea..8c560ed14976 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -707,8 +707,6 @@ static void guc_flush_submissions(struct intel_guc *guc)
>  
>  void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>  {
> -	int i;
> -
>  	if (unlikely(!guc_submission_initialized(guc))) {
>  		/* Reset called during driver load? GuC not yet initialised! */
>  		return;
> @@ -724,20 +722,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>  
>  	guc_flush_submissions(guc);
>  
> -	/*
> -	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> -	 * each pass as interrupt have been disabled. We always scrub for
> -	 * outstanding G2H as it is possible for outstanding_submission_g2h to
> -	 * be incremented after the context state update.
> -	 */
> -	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
> -		intel_guc_to_host_event_handler(guc);
> -#define wait_for_reset(guc, wait_var) \
> -		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
> -		do {
> -			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> -		} while (!list_empty(&guc->ct.requests.incoming));
> -	}
> +	flush_work(&guc->ct.requests.worker);

Same thing about flush_work as in an earlier patch.
-Daniel

> +
>  	scrub_guc_desc_for_outstanding_g2h(guc);
>  }
>  
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 16/22] drm/i915/guc: Release submit fence from an IRQ
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 16/22] drm/i915/guc: Release submit fence from an IRQ Matthew Brost
@ 2021-08-17 10:08   ` Daniel Vetter
  0 siblings, 0 replies; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 10:08 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:33AM -0700, Matthew Brost wrote:
> A subsequent patch will flip the locking hierarchy from
> ce->guc_state.lock -> sched_engine->lock to sched_engine->lock ->
> ce->guc_state.lock. As such we need to release the submit fence for a
> request from an IRQ to break a lock inversion - i.e. the fence must be
> released while holding ce->guc_state.lock, and releasing it can
> acquire sched_engine->lock.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Title should be "irq work", otherwise it reads a bit strange. Also these
kind of nestings would be good to document in the kerneldoc too (maybe as
you go even).
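
E.g. something along these lines in the lock ordering section of the doc
(sketch only):

 * Lock ordering rules:
 * sched_engine->lock -> ce->guc_state.lock
 *
 * Releasing a request's submit fence can take sched_engine->lock, so it
 * must never be completed directly while holding ce->guc_state.lock;
 * defer the completion to an irq_work instead.
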
-Daniel

> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>  drivers/gpu/drm/i915/i915_request.h               |  5 +++++
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 8c560ed14976..9ae4633aa7cb 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2017,6 +2017,14 @@ static const struct intel_context_ops guc_context_ops = {
>  	.create_virtual = guc_create_virtual,
>  };
>  
> +static void submit_work_cb(struct irq_work *wrk)
> +{
> +	struct i915_request *rq = container_of(wrk, typeof(*rq), submit_work);
> +
> +	might_lock(&rq->engine->sched_engine->lock);
> +	i915_sw_fence_complete(&rq->submit);
> +}
> +
>  static void __guc_signal_context_fence(struct intel_context *ce)
>  {
>  	struct i915_request *rq;
> @@ -2026,8 +2034,12 @@ static void __guc_signal_context_fence(struct intel_context *ce)
>  	if (!list_empty(&ce->guc_state.fences))
>  		trace_intel_context_fence_release(ce);
>  
> +	/*
> +	 * Use an IRQ to ensure locking order of sched_engine->lock ->
> +	 * ce->guc_state.lock is preserved.
> +	 */
>  	list_for_each_entry(rq, &ce->guc_state.fences, guc_fence_link)
> -		i915_sw_fence_complete(&rq->submit);
> +		irq_work_queue(&rq->submit_work);
>  
>  	INIT_LIST_HEAD(&ce->guc_state.fences);
>  }
> @@ -2137,6 +2149,7 @@ static int guc_request_alloc(struct i915_request *rq)
>  	spin_lock_irqsave(&ce->guc_state.lock, flags);
>  	if (context_wait_for_deregister_to_register(ce) ||
>  	    context_pending_disable(ce)) {
> +		init_irq_work(&rq->submit_work, submit_work_cb);
>  		i915_sw_fence_await(&rq->submit);
>  
>  		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 1bc1349ba3c2..d818cfbfc41d 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -218,6 +218,11 @@ struct i915_request {
>  	};
>  	struct llist_head execute_cb;
>  	struct i915_sw_fence semaphore;
> +	/**
> +	 * @submit_work: complete submit fence from an IRQ if needed for
> +	 * locking hierarchy reasons.
> +	 */
> +	struct irq_work submit_work;
>  
>  	/*
>  	 * A list of everyone we wait upon, and everyone who waits upon us.
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 17/22] drm/i915/guc: Move guc_blocked fence to struct guc_state
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 17/22] drm/i915/guc: Move guc_blocked fence to struct guc_state Matthew Brost
@ 2021-08-17 10:10   ` Daniel Vetter
  0 siblings, 0 replies; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 10:10 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:34AM -0700, Matthew Brost wrote:
> Move guc_blocked fence to struct guc_state as the lock which protects
> the fence lives there.
> 
> s/ce->guc_blocked/ce->guc_state.blocked_fence/g
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

General comment, but at the latest when you combine your counting state
with a wait queue you're very far into "reinventing a mutex/semaphore,
badly" land.

I think we really need to look into why we can't just protect this all
with a mutex and make sure the awkward transition states are never visible
to anyone else.
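
Strawman sketch of what I mean (ce->guc_transition_lock and
wait_for_pending_disable() are made up, and this ignores any
atomic-context callers):

	static void guc_context_block(struct intel_context *ce)
	{
		struct intel_guc *guc = ce_to_guc(ce);

		/* sleeping lock held across the entire transition */
		mutex_lock(&ce->guc_transition_lock);
		__guc_context_sched_disable(guc, ce, ce->guc_id);
		/* sleep until the schedule disable done G2H arrives */
		wait_for_pending_disable(ce);
		mutex_unlock(&ce->guc_transition_lock);
		/* no blocked counter, no fence: state is consistent here */
	}
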
-Daniel

> ---
>  drivers/gpu/drm/i915/gt/intel_context.c        |  5 +++--
>  drivers/gpu/drm/i915/gt/intel_context_types.h  |  5 ++---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +++++++++---------
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 745e84c72c90..0e48939ec85f 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -405,8 +405,9 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>  	 * Initialize fence to be complete as this is expected to be complete
>  	 * unless there is a pending schedule disable outstanding.
>  	 */
> -	i915_sw_fence_init(&ce->guc_blocked, sw_fence_dummy_notify);
> -	i915_sw_fence_commit(&ce->guc_blocked);
> +	i915_sw_fence_init(&ce->guc_state.blocked_fence,
> +			   sw_fence_dummy_notify);
> +	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
>  
>  	i915_active_init(&ce->active,
>  			 __intel_context_active, __intel_context_retire, 0);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 3a73f3117873..c06171ee8792 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -167,6 +167,8 @@ struct intel_context {
>  		 * fence related to GuC submission
>  		 */
>  		struct list_head fences;
> +		/* GuC context blocked fence */
> +		struct i915_sw_fence blocked_fence;
>  	} guc_state;
>  
>  	struct {
> @@ -190,9 +192,6 @@ struct intel_context {
>  	 */
>  	struct list_head guc_id_link;
>  
> -	/* GuC context blocked fence */
> -	struct i915_sw_fence guc_blocked;
> -
>  	/*
>  	 * GuC priority management
>  	 */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 9ae4633aa7cb..7aa16371908a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1482,24 +1482,24 @@ static void guc_blocked_fence_complete(struct intel_context *ce)
>  {
>  	lockdep_assert_held(&ce->guc_state.lock);
>  
> -	if (!i915_sw_fence_done(&ce->guc_blocked))
> -		i915_sw_fence_complete(&ce->guc_blocked);
> +	if (!i915_sw_fence_done(&ce->guc_state.blocked_fence))
> +		i915_sw_fence_complete(&ce->guc_state.blocked_fence);
>  }
>  
>  static void guc_blocked_fence_reinit(struct intel_context *ce)
>  {
>  	lockdep_assert_held(&ce->guc_state.lock);
> -	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_blocked));
> +	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_state.blocked_fence));
>  
>  	/*
>  	 * This fence is always complete unless a pending schedule disable is
>  	 * outstanding. We arm the fence here and complete it when we receive
>  	 * the pending schedule disable complete message.
>  	 */
> -	i915_sw_fence_fini(&ce->guc_blocked);
> -	i915_sw_fence_reinit(&ce->guc_blocked);
> -	i915_sw_fence_await(&ce->guc_blocked);
> -	i915_sw_fence_commit(&ce->guc_blocked);
> +	i915_sw_fence_fini(&ce->guc_state.blocked_fence);
> +	i915_sw_fence_reinit(&ce->guc_state.blocked_fence);
> +	i915_sw_fence_await(&ce->guc_state.blocked_fence);
> +	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
>  }
>  
>  static u16 prep_context_pending_disable(struct intel_context *ce)
> @@ -1539,7 +1539,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>  		if (enabled)
>  			clr_context_enabled(ce);
>  		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> -		return &ce->guc_blocked;
> +		return &ce->guc_state.blocked_fence;
>  	}
>  
>  	/*
> @@ -1555,7 +1555,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>  	with_intel_runtime_pm(runtime_pm, wakeref)
>  		__guc_context_sched_disable(guc, ce, guc_id);
>  
> -	return &ce->guc_blocked;
> +	return &ce->guc_state.blocked_fence;
>  }
>  
>  static void guc_context_unblock(struct intel_context *ce)
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 18/22] drm/i915/guc: Rework and simplify locking
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 18/22] drm/i915/guc: Rework and simplify locking Matthew Brost
@ 2021-08-17 10:15   ` Daniel Vetter
  2021-08-17 15:30     ` Matthew Brost
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 10:15 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:35AM -0700, Matthew Brost wrote:
> Rework and simplify the locking with GuC submission. Drop
> sched_state_no_lock, move all the fields under guc_state.sched_state,
> and protect them with guc_state.lock. This requires changing the locking
> hierarchy from guc_state.lock -> sched_engine.lock to sched_engine.lock
> -> guc_state.lock.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Yeah this is definitely going in the right direction. Especially
sprinkling lockdep_assert_held around.

One comment below.

> ---
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   5 +-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 186 ++++++++----------
>  drivers/gpu/drm/i915/i915_trace.h             |   6 +-
>  3 files changed, 89 insertions(+), 108 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index c06171ee8792..d5d643b04d54 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -161,7 +161,7 @@ struct intel_context {
>  		 * sched_state: scheduling state of this context using GuC
>  		 * submission
>  		 */
> -		u16 sched_state;
> +		u32 sched_state;
>  		/*
>  		 * fences: maintains of list of requests that have a submit
>  		 * fence related to GuC submission
> @@ -178,9 +178,6 @@ struct intel_context {
>  		struct list_head requests;
>  	} guc_active;
>  
> -	/* GuC scheduling state flags that do not require a lock. */
> -	atomic_t guc_sched_state_no_lock;
> -
>  	/* GuC LRC descriptor ID */
>  	u16 guc_id;
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 7aa16371908a..ba19b99173fc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -72,86 +72,23 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
>  
>  #define GUC_REQUEST_SIZE 64 /* bytes */
>  
> -/*
> - * Below is a set of functions which control the GuC scheduling state which do
> - * not require a lock as all state transitions are mutually exclusive. i.e. It
> - * is not possible for the context pinning code and submission, for the same
> - * context, to be executing simultaneously. We still need an atomic as it is
> - * possible for some of the bits to changing at the same time though.
> - */
> -#define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
> -#define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
> -#define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
> -static inline bool context_enabled(struct intel_context *ce)
> -{
> -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> -		SCHED_STATE_NO_LOCK_ENABLED);
> -}
> -
> -static inline void set_context_enabled(struct intel_context *ce)
> -{
> -	atomic_or(SCHED_STATE_NO_LOCK_ENABLED, &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline void clr_context_enabled(struct intel_context *ce)
> -{
> -	atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
> -		   &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline bool context_pending_enable(struct intel_context *ce)
> -{
> -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> -		SCHED_STATE_NO_LOCK_PENDING_ENABLE);
> -}
> -
> -static inline void set_context_pending_enable(struct intel_context *ce)
> -{
> -	atomic_or(SCHED_STATE_NO_LOCK_PENDING_ENABLE,
> -		  &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline void clr_context_pending_enable(struct intel_context *ce)
> -{
> -	atomic_and((u32)~SCHED_STATE_NO_LOCK_PENDING_ENABLE,
> -		   &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline bool context_registered(struct intel_context *ce)
> -{
> -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> -		SCHED_STATE_NO_LOCK_REGISTERED);
> -}
> -
> -static inline void set_context_registered(struct intel_context *ce)
> -{
> -	atomic_or(SCHED_STATE_NO_LOCK_REGISTERED,
> -		  &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline void clr_context_registered(struct intel_context *ce)
> -{
> -	atomic_and((u32)~SCHED_STATE_NO_LOCK_REGISTERED,
> -		   &ce->guc_sched_state_no_lock);
> -}
> -
>  /*
>   * Below is a set of functions which control the GuC scheduling state which
> - * require a lock, aside from the special case where the functions are called
> - * from guc_lrc_desc_pin(). In that case it isn't possible for any other code
> - * path to be executing on the context.
> + * require a lock.
>   */
>  #define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
>  #define SCHED_STATE_DESTROYED				BIT(1)
>  #define SCHED_STATE_PENDING_DISABLE			BIT(2)
>  #define SCHED_STATE_BANNED				BIT(3)
> -#define SCHED_STATE_BLOCKED_SHIFT			4
> +#define SCHED_STATE_ENABLED				BIT(4)
> +#define SCHED_STATE_PENDING_ENABLE			BIT(5)
> +#define SCHED_STATE_REGISTERED				BIT(6)
> +#define SCHED_STATE_BLOCKED_SHIFT			7
>  #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
>  #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
>  static inline void init_sched_state(struct intel_context *ce)
>  {
>  	lockdep_assert_held(&ce->guc_state.lock);
> -	atomic_set(&ce->guc_sched_state_no_lock, 0);
>  	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
>  }
>  
> @@ -161,9 +98,8 @@ static inline bool sched_state_is_init(struct intel_context *ce)
>  	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
>  	 * suspend.
>  	 */
> -	return !(atomic_read(&ce->guc_sched_state_no_lock) &
> -		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
> -		!(ce->guc_state.sched_state &= ~SCHED_STATE_BLOCKED_MASK);
> +	return !(ce->guc_state.sched_state &
> +		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
>  }
>  
>  static inline bool
> @@ -236,6 +172,57 @@ static inline void clr_context_banned(struct intel_context *ce)
>  	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
>  }
>  
> +static inline bool context_enabled(struct intel_context *ce)

No static inline in .c files. The compiler is better at this than you
are. Especially once you add stuff like asserts and everything, it's just
not worth the cognitive effort to have to reevaluate these.

One-line helpers in headers are the only exception where static inline is
ok.
-Daniel

> +{
> +	return ce->guc_state.sched_state & SCHED_STATE_ENABLED;
> +}
> +
> +static inline void set_context_enabled(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state |= SCHED_STATE_ENABLED;
> +}
> +
> +static inline void clr_context_enabled(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
> +}
> +
> +static inline bool context_pending_enable(struct intel_context *ce)
> +{
> +	return ce->guc_state.sched_state & SCHED_STATE_PENDING_ENABLE;
> +}
> +
> +static inline void set_context_pending_enable(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state |= SCHED_STATE_PENDING_ENABLE;
> +}
> +
> +static inline void clr_context_pending_enable(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_ENABLE;
> +}
> +
> +static inline bool context_registered(struct intel_context *ce)
> +{
> +	return ce->guc_state.sched_state & SCHED_STATE_REGISTERED;
> +}
> +
> +static inline void set_context_registered(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state |= SCHED_STATE_REGISTERED;
> +}
> +
> +static inline void clr_context_registered(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_REGISTERED;
> +}
> +
>  static inline u32 context_blocked(struct intel_context *ce)
>  {
>  	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
> @@ -244,7 +231,6 @@ static inline u32 context_blocked(struct intel_context *ce)
>  
>  static inline void incr_context_blocked(struct intel_context *ce)
>  {
> -	lockdep_assert_held(&ce->engine->sched_engine->lock);
>  	lockdep_assert_held(&ce->guc_state.lock);
>  
>  	ce->guc_state.sched_state += SCHED_STATE_BLOCKED;
> @@ -254,7 +240,6 @@ static inline void incr_context_blocked(struct intel_context *ce)
>  
>  static inline void decr_context_blocked(struct intel_context *ce)
>  {
> -	lockdep_assert_held(&ce->engine->sched_engine->lock);
>  	lockdep_assert_held(&ce->guc_state.lock);
>  
>  	GEM_BUG_ON(!context_blocked(ce));	/* Underflow check */
> @@ -443,6 +428,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>  	u32 g2h_len_dw = 0;
>  	bool enabled;
>  
> +	lockdep_assert_held(&rq->engine->sched_engine->lock);
> +
>  	/*
>  	 * Corner case where requests were sitting in the priority list or a
>  	 * request resubmitted after the context was banned.
> @@ -450,7 +437,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>  	if (unlikely(intel_context_is_banned(ce))) {
>  		i915_request_put(i915_request_mark_eio(rq));
>  		intel_engine_signal_breadcrumbs(ce->engine);
> -		goto out;
> +		return 0;
>  	}
>  
>  	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
> @@ -463,9 +450,11 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>  	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
>  		err = guc_lrc_desc_pin(ce, false);
>  		if (unlikely(err))
> -			goto out;
> +			return err;
>  	}
>  
> +	spin_lock(&ce->guc_state.lock);
> +
>  	/*
>  	 * The request / context will be run on the hardware when scheduling
>  	 * gets enabled in the unblock.
> @@ -500,6 +489,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>  		trace_i915_request_guc_submit(rq);
>  
>  out:
> +	spin_unlock(&ce->guc_state.lock);
>  	return err;
>  }
>  
> @@ -720,8 +710,6 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>  	spin_lock_irq(&guc_to_gt(guc)->irq_lock);
>  	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
>  
> -	guc_flush_submissions(guc);
> -
>  	flush_work(&guc->ct.requests.worker);
>  
>  	scrub_guc_desc_for_outstanding_g2h(guc);
> @@ -1125,7 +1113,11 @@ static int steal_guc_id(struct intel_guc *guc)
>  
>  		list_del_init(&ce->guc_id_link);
>  		guc_id = ce->guc_id;
> +
> +		spin_lock(&ce->guc_state.lock);
>  		clr_context_registered(ce);
> +		spin_unlock(&ce->guc_state.lock);
> +
>  		set_context_guc_id_invalid(ce);
>  		return guc_id;
>  	} else {
> @@ -1161,6 +1153,8 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  try_again:
>  	spin_lock_irqsave(&guc->contexts_lock, flags);
>  
> +	might_lock(&ce->guc_state.lock);
> +
>  	if (context_guc_id_invalid(ce)) {
>  		ret = assign_guc_id(guc, &ce->guc_id);
>  		if (ret)
> @@ -1240,8 +1234,13 @@ static int register_context(struct intel_context *ce, bool loop)
>  	trace_intel_context_register(ce);
>  
>  	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
> -	if (likely(!ret))
> +	if (likely(!ret)) {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
>  		set_context_registered(ce);
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	}
>  
>  	return ret;
>  }
> @@ -1517,7 +1516,6 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
>  static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
> -	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
>  	unsigned long flags;
>  	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
>  	intel_wakeref_t wakeref;
> @@ -1526,13 +1524,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>  
>  	spin_lock_irqsave(&ce->guc_state.lock, flags);
>  
> -	/*
> -	 * Sync with submission path, increment before below changes to context
> -	 * state.
> -	 */
> -	spin_lock(&sched_engine->lock);
>  	incr_context_blocked(ce);
> -	spin_unlock(&sched_engine->lock);
>  
>  	enabled = context_enabled(ce);
>  	if (unlikely(!enabled || submission_disabled(guc))) {
> @@ -1561,7 +1553,6 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>  static void guc_context_unblock(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
> -	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
>  	unsigned long flags;
>  	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
>  	intel_wakeref_t wakeref;
> @@ -1586,13 +1577,7 @@ static void guc_context_unblock(struct intel_context *ce)
>  		intel_context_get(ce);
>  	}
>  
> -	/*
> -	 * Sync with submission path, decrement after above changes to context
> -	 * state.
> -	 */
> -	spin_lock(&sched_engine->lock);
>  	decr_context_blocked(ce);
> -	spin_unlock(&sched_engine->lock);
>  
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  
> @@ -1702,7 +1687,9 @@ static void guc_context_sched_disable(struct intel_context *ce)
>  
>  	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
>  	    !lrc_desc_registered(guc, ce->guc_id)) {
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
>  		clr_context_enabled(ce);
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  		goto unpin;
>  	}
>  
> @@ -1752,7 +1739,6 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>  	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
>  	GEM_BUG_ON(context_enabled(ce));
>  
> -	clr_context_registered(ce);
>  	deregister_context(ce, ce->guc_id, true);
>  }
>  
> @@ -1825,8 +1811,10 @@ static void guc_context_destroy(struct kref *kref)
>  	/* Seal race with Reset */
>  	spin_lock_irqsave(&ce->guc_state.lock, flags);
>  	disabled = submission_disabled(guc);
> -	if (likely(!disabled))
> +	if (likely(!disabled)) {
>  		set_context_destroyed(ce);
> +		clr_context_registered(ce);
> +	}
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  	if (unlikely(disabled)) {
>  		release_guc_id(guc, ce);
> @@ -2695,8 +2683,7 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
>  		     (!context_pending_enable(ce) &&
>  		     !context_pending_disable(ce)))) {
>  		drm_err(&guc_to_gt(guc)->i915->drm,
> -			"Bad context sched_state 0x%x, 0x%x, desc_idx %u",
> -			atomic_read(&ce->guc_sched_state_no_lock),
> +			"Bad context sched_state 0x%x, desc_idx %u",
>  			ce->guc_state.sched_state, desc_idx);
>  		return -EPROTO;
>  	}
> @@ -2711,7 +2698,9 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
>  		}
>  #endif
>  
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
>  		clr_context_pending_enable(ce);
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  	} else if (context_pending_disable(ce)) {
>  		bool banned;
>  
> @@ -2985,9 +2974,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>  			   atomic_read(&ce->pin_count));
>  		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
>  			   atomic_read(&ce->guc_id_ref));
> -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> -			   ce->guc_state.sched_state,
> -			   atomic_read(&ce->guc_sched_state_no_lock));
> +		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> +			   ce->guc_state.sched_state);
>  
>  		guc_log_context_priority(p, ce);
>  	}
> diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> index 806ad688274b..0a77eb2944b5 100644
> --- a/drivers/gpu/drm/i915/i915_trace.h
> +++ b/drivers/gpu/drm/i915/i915_trace.h
> @@ -903,7 +903,6 @@ DECLARE_EVENT_CLASS(intel_context,
>  			     __field(u32, guc_id)
>  			     __field(int, pin_count)
>  			     __field(u32, sched_state)
> -			     __field(u32, guc_sched_state_no_lock)
>  			     __field(u8, guc_prio)
>  			     ),
>  
> @@ -911,15 +910,12 @@ DECLARE_EVENT_CLASS(intel_context,
>  			   __entry->guc_id = ce->guc_id;
>  			   __entry->pin_count = atomic_read(&ce->pin_count);
>  			   __entry->sched_state = ce->guc_state.sched_state;
> -			   __entry->guc_sched_state_no_lock =
> -			   atomic_read(&ce->guc_sched_state_no_lock);
>  			   __entry->guc_prio = ce->guc_prio;
>  			   ),
>  
> -		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x,0x%x, guc_prio=%u",
> +		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
>  			      __entry->guc_id, __entry->pin_count,
>  			      __entry->sched_state,
> -			      __entry->guc_sched_state_no_lock,
>  			      __entry->guc_prio)
>  );
>  
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 19/22] drm/i915/guc: Proper xarray usage for contexts_lookup
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 19/22] drm/i915/guc: Proper xarray usage for contexts_lookup Matthew Brost
@ 2021-08-17 10:27   ` Daniel Vetter
  2021-08-17 15:26     ` Matthew Brost
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 10:27 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:36AM -0700, Matthew Brost wrote:
> Lock the xarray and take ref to the context if needed.
> 
> v2:
>  (Checkpatch)
>   - Add new line after declaration
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 84 ++++++++++++++++---
>  1 file changed, 73 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ba19b99173fc..2ecb2f002bed 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -599,8 +599,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>  	unsigned long index, flags;
>  	bool pending_disable, pending_enable, deregister, destroyed, banned;
>  
> +	xa_lock_irqsave(&guc->context_lookup, flags);
>  	xa_for_each(&guc->context_lookup, index, ce) {
> -		spin_lock_irqsave(&ce->guc_state.lock, flags);
> +		/*
> +		 * Corner case where the ref count on the object is zero but a
> +		 * deregister G2H was lost. In this case we don't touch the ref
> +		 * count and finish the destroy of the context.
> +		 */
> +		bool do_put = kref_get_unless_zero(&ce->ref);

This looks really scary, because in another loop below you have an
unconditional refcount increase. This means sometimes the guc->context_lookup
xarray guarantees we hold a full reference on the context, sometimes we
don't. So we're right back in "protect the code" O(N^2) review complexity
instead of invariant rules about the datastructure, which is linear.

Essentially anytime you feel like you have to add a comment to explain
what's going on about concurrent stuff you're racing with, you're
protecting code, not data.

Since guc can't do a whole lot without the guc_id registered and all that,
I kinda expected you'd always have a full reference here. If there are
intermediate stages (e.g. around unregister) where this is currently not
always the case, then those should make sure a full reference is held.
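
E.g. a sketch only, with the full-reference invariant enforced at
insert/remove time (both function names here are made up):

static int guc_context_lookup_insert(struct intel_guc *guc,
				     struct intel_context *ce, u16 guc_id)
{
	int ret;

	intel_context_get(ce); /* the xarray itself now owns a reference */
	ret = xa_err(xa_store_irq(&guc->context_lookup, guc_id, ce,
				  GFP_KERNEL));
	if (ret)
		intel_context_put(ce);

	return ret;
}

static void guc_context_lookup_remove(struct intel_guc *guc, u16 guc_id)
{
	struct intel_context *ce;

	ce = xa_erase_irq(&guc->context_lookup, guc_id);
	if (ce)
		intel_context_put(ce); /* drop the xarray's reference */
}

Every xa_for_each() walker could then take a plain intel_context_get()
under the xa lock instead of the kref_get_unless_zero() dance.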

Another option would be to treat ->context_lookup as a weak reference that
we lazily clean up when the context is finalized. That works too, but
probably not with a spinlock, since you most likely have to wait for all
pending guc transactions to complete.

Either way I think standard process is needed here for locking design,
i.e.
1. come up with the right invariants ("we always have a full reference
when a context is on the guc->context_lookup xarray")
2. come up with the locks. From the guc side the xa_lock is maybe good
enough, but from the context side this doesn't protect against a
re-registering racing against a deregistering. So this probably needs more
rules on top, and then you have a nice lock inversion in a few places like
here.
3. document it and roll it out.

The other thing is that this is a very tricky iterator, and there are a few
copies of it - that is, if this is the right solution at all. As-is this
should be abstracted away into guc_context_iter_begin/next/end() helpers,
e.g. like drm_connector_list_iter_begin/next/end. Something like the sketch
below:
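
(All names here are made up, and this simply skips contexts whose refcount
already hit zero, so the scrub path's "ref == 0 but destroy G2H lost"
corner case would still need handling on top.)

struct guc_context_iter {
	struct intel_guc *guc;
	struct intel_context *ce;
	unsigned long index;
	unsigned long flags;
	bool started;
};

static void guc_context_iter_begin(struct guc_context_iter *it,
				   struct intel_guc *guc)
{
	memset(it, 0, sizeof(*it));
	it->guc = guc;
	xa_lock_irqsave(&guc->context_lookup, it->flags);
}

static struct intel_context *
guc_context_iter_next(struct guc_context_iter *it)
{
	struct intel_context *ce;

	if (it->ce) {
		/* The xa lock was dropped while the caller used it->ce. */
		intel_context_put(it->ce);
		it->ce = NULL;
		xa_lock(&it->guc->context_lookup);
	}

	do {
		if (!it->started) {
			ce = xa_find(&it->guc->context_lookup, &it->index,
				     ULONG_MAX, XA_PRESENT);
			it->started = true;
		} else {
			ce = xa_find_after(&it->guc->context_lookup,
					   &it->index, ULONG_MAX, XA_PRESENT);
		}
		/* Skip contexts that are already on their way out. */
	} while (ce && !kref_get_unless_zero(&ce->ref));

	it->ce = ce;
	if (ce)
		xa_unlock(&it->guc->context_lookup);

	return ce;
}

static void guc_context_iter_end(struct guc_context_iter *it)
{
	if (it->ce) {
		/* The caller bailed out early with the xa lock dropped. */
		intel_context_put(it->ce);
		it->ce = NULL;
		xa_lock(&it->guc->context_lookup);
	}
	xa_unlock_irqrestore(&it->guc->context_lookup, it->flags);
}

Callers then do begin(); while ((ce = next(it))) { ... }; end(); and the
whole lock/ref dance lives in exactly one place.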

Cheers, Daniel

> +
> +		xa_unlock(&guc->context_lookup);
> +
> +		spin_lock(&ce->guc_state.lock);
>  
>  		/*
>  		 * Once we are at this point submission_disabled() is guaranteed
> @@ -616,7 +626,9 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>  		banned = context_banned(ce);
>  		init_sched_state(ce);
>  
> -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		GEM_BUG_ON(!do_put && !destroyed);
>  
>  		if (pending_enable || destroyed || deregister) {
>  			atomic_dec(&guc->outstanding_submission_g2h);
> @@ -645,7 +657,12 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>  
>  			intel_context_put(ce);
>  		}
> +
> +		if (do_put)
> +			intel_context_put(ce);
> +		xa_lock(&guc->context_lookup);
>  	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>  }
>  
>  static inline bool
> @@ -866,16 +883,26 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
>  {
>  	struct intel_context *ce;
>  	unsigned long index;
> +	unsigned long flags;
>  
>  	if (unlikely(!guc_submission_initialized(guc))) {
>  		/* Reset called during driver load? GuC not yet initialised! */
>  		return;
>  	}
>  
> -	xa_for_each(&guc->context_lookup, index, ce)
> +	xa_lock_irqsave(&guc->context_lookup, flags);
> +	xa_for_each(&guc->context_lookup, index, ce) {
> +		intel_context_get(ce);
> +		xa_unlock(&guc->context_lookup);
> +
>  		if (intel_context_is_pinned(ce))
>  			__guc_reset_context(ce, stalled);
>  
> +		intel_context_put(ce);
> +		xa_lock(&guc->context_lookup);
> +	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> +
>  	/* GuC is blown away, drop all references to contexts */
>  	xa_destroy(&guc->context_lookup);
>  }
> @@ -950,11 +977,21 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
>  {
>  	struct intel_context *ce;
>  	unsigned long index;
> +	unsigned long flags;
> +
> +	xa_lock_irqsave(&guc->context_lookup, flags);
> +	xa_for_each(&guc->context_lookup, index, ce) {
> +		intel_context_get(ce);
> +		xa_unlock(&guc->context_lookup);
>  
> -	xa_for_each(&guc->context_lookup, index, ce)
>  		if (intel_context_is_pinned(ce))
>  			guc_cancel_context_requests(ce);
>  
> +		intel_context_put(ce);
> +		xa_lock(&guc->context_lookup);
> +	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> +
>  	guc_cancel_sched_engine_requests(guc->sched_engine);
>  
>  	/* GuC is blown away, drop all references to contexts */
> @@ -2848,21 +2885,26 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>  	struct intel_context *ce;
>  	struct i915_request *rq;
>  	unsigned long index;
> +	unsigned long flags;
>  
>  	/* Reset called during driver load? GuC not yet initialised! */
>  	if (unlikely(!guc_submission_initialized(guc)))
>  		return;
>  
> +	xa_lock_irqsave(&guc->context_lookup, flags);
>  	xa_for_each(&guc->context_lookup, index, ce) {
> +		intel_context_get(ce);
> +		xa_unlock(&guc->context_lookup);
> +
>  		if (!intel_context_is_pinned(ce))
> -			continue;
> +			goto next;
>  
>  		if (intel_engine_is_virtual(ce->engine)) {
>  			if (!(ce->engine->mask & engine->mask))
> -				continue;
> +				goto next;
>  		} else {
>  			if (ce->engine != engine)
> -				continue;
> +				goto next;
>  		}
>  
>  		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
> @@ -2872,9 +2914,17 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>  			intel_engine_set_hung_context(engine, ce);
>  
>  			/* Can only cope with one hang at a time... */
> -			return;
> +			intel_context_put(ce);
> +			xa_lock(&guc->context_lookup);
> +			goto done;
>  		}
> +next:
> +		intel_context_put(ce);
> +		xa_lock(&guc->context_lookup);
> +
>  	}
> +done:
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>  }
>  
>  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> @@ -2890,23 +2940,32 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>  	if (unlikely(!guc_submission_initialized(guc)))
>  		return;
>  
> +	xa_lock_irqsave(&guc->context_lookup, flags);
>  	xa_for_each(&guc->context_lookup, index, ce) {
> +		intel_context_get(ce);
> +		xa_unlock(&guc->context_lookup);
> +
>  		if (!intel_context_is_pinned(ce))
> -			continue;
> +			goto next;
>  
>  		if (intel_engine_is_virtual(ce->engine)) {
>  			if (!(ce->engine->mask & engine->mask))
> -				continue;
> +				goto next;
>  		} else {
>  			if (ce->engine != engine)
> -				continue;
> +				goto next;
>  		}
>  
>  		spin_lock_irqsave(&ce->guc_active.lock, flags);
>  		intel_engine_dump_active_requests(&ce->guc_active.requests,
>  						  hung_rq, m);
>  		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
> +
> +next:
> +		intel_context_put(ce);
> +		xa_lock(&guc->context_lookup);
>  	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>  }
>  
>  void intel_guc_submission_print_info(struct intel_guc *guc,
> @@ -2960,7 +3019,9 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>  {
>  	struct intel_context *ce;
>  	unsigned long index;
> +	unsigned long flags;
>  
> +	xa_lock_irqsave(&guc->context_lookup, flags);
>  	xa_for_each(&guc->context_lookup, index, ce) {
>  		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
>  		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> @@ -2979,6 +3040,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>  
>  		guc_log_context_priority(p, ce);
>  	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>  }
>  
>  static struct intel_context *
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc Matthew Brost
@ 2021-08-17 11:11   ` Daniel Vetter
  2021-08-17 16:36     ` Matthew Brost
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 11:11 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter

On Mon, Aug 16, 2021 at 06:51:39AM -0700, Matthew Brost wrote:
> Add GuC kernel doc for all structures added thus far for GuC submission
> and update the main GuC submission section with the new interface
> details.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

There's quite a bit more, e.g. intel_guc_ct, which has its own world of
locking design that also doesn't feel too consistent.

> ---
>  drivers/gpu/drm/i915/gt/intel_context_types.h |  42 +++++---
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  19 +++-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 101 ++++++++++++++----
>  drivers/gpu/drm/i915/i915_request.h           |  18 ++--
>  4 files changed, 131 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index f6989e6807f7..75d609a1bc33 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -156,44 +156,56 @@ struct intel_context {
>  	u8 wa_bb_page; /* if set, page num reserved for context workarounds */
>  
>  	struct {
> -		/** lock: protects everything in guc_state */
> +		/** @lock: protects everything in guc_state */
>  		spinlock_t lock;
>  		/**
> -		 * sched_state: scheduling state of this context using GuC
> +		 * @sched_state: scheduling state of this context using GuC
>  		 * submission
>  		 */
>  		u32 sched_state;
>  		/*
> -		 * fences: maintains of list of requests that have a submit
> -		 * fence related to GuC submission
> +		 * @fences: maintains a list of requests that are currently
> +		 * being fenced until a GuC operation completes
>  		 */
>  		struct list_head fences;
> -		/* GuC context blocked fence */
> +		/**
> +		 * @blocked_fence: fence used to signal when the blocking of a
> +		 * context's submissions is complete.
> +		 */
>  		struct i915_sw_fence blocked_fence;
> -		/* GuC committed requests */
> +		/** @number_committed_requests: number of committed requests */
>  		int number_committed_requests;
>  	} guc_state;
>  
>  	struct {
> -		/** lock: protects everything in guc_active */
> +		/** @lock: protects everything in guc_active */
>  		spinlock_t lock;

Why do we have two locks spinlocks to protect guc context state?

I do understand the need for a spinlock (at least for now) because of how
the i915 scheduler runs in tasklet context. But beyond that we really
shouldn't need more than two locks to protect context state. You still
have an entire pile here, plus some atomics, plus more.

And this is on a single context, where concurrently submitting stuff
really isn't a thing. I'd expect actual benchmarking would show a perf
hit, since all these locks and atomics aren't free. This is at least the
case with execbuf and the various i915_vma locks we currently have.

What I expect intel_context locking to be is roughly:

- One lock to protect all intel_context state. This probably should be a
  dma_resv_lock for a few reasons, least so we can pin state objects
  underneath that lock.

- A separate lock if there's anything you need to coordinate with the
  backend scheduler while that's running, to avoid dma_fence inversions.
  Right now this separate lock might need to be a spinlock because our
  scheduler runs in tasklets, and that might mean we need both a mutex and
  a spinlock here.

Anything that goes beyond that is premature optimization and kills us
code-complexity-wise. I'd be _extremely_ surprised if an IA core cannot keep up
with GuC, and therefore anything that goes beyond "one lock per object",
plus/minus execution context issues like the above tasklet issue, is
likely just going to slow everything down.
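
I.e. as a sketch (field names invented, and ignoring all the state that
hangs off this):

struct intel_context {
	/*
	 * Protects all of this context's state; being a dma_resv means we
	 * can also pin state objects underneath it.
	 */
	struct dma_resv resv;

	struct {
		/*
		 * Only for coordinating with the backend scheduler, which
		 * runs in tasklet context, hence a spinlock.
		 */
		spinlock_t lock;
		struct list_head requests;
	} sched;

	/* ... everything else lives under resv ... */
};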

> -		/** requests: active requests on this context */
> +		/** @requests: list of active requests on this context */
>  		struct list_head requests;
> -		/*
> -		 * GuC priority management
> -		 */
> +		/** @guc_prio: the context's current guc priority */
>  		u8 guc_prio;
> +		/**
> +		 * @guc_prio_count: a counter of the number of requests inflight in
> +		 * each priority bucket
> +		 */
>  		u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
>  	} guc_active;
>  
> -	/* GuC LRC descriptor ID */
> +	/**
> +	 * @guc_id: unique handle which is used to communicate information with
> +	 * the GuC about this context, protected by guc->contexts_lock
> +	 */
>  	u16 guc_id;
>  
> -	/* GuC LRC descriptor reference count */
> +	/**
> +	 * @guc_id_ref: the number of references to the guc_id, protected by
> +	 * guc->contexts_lock when transitioning in and out of zero
> +	 */
>  	atomic_t guc_id_ref;

All this guc_id related stuff (including the guc->context_lookup xarray I
guess) also has quite a pile of atomics and locks.

>  
> -	/*
> -	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
> +	/**
> +	 * @guc_id_link: in guc->guc_id_list when the guc_id has no refs but is
> +	 * still valid, protected by guc->contexts_lock
>  	 */
>  	struct list_head guc_id_link;
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 2e27fe59786b..c0b3fdb601f0 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -41,6 +41,10 @@ struct intel_guc {
>  	spinlock_t irq_lock;
>  	unsigned int msg_enabled_mask;
>  
> +	/**
> +	 * @outstanding_submission_g2h: number of outstanding G2H related to GuC
> +	 * submission, used to determine if the GT is idle
> +	 */
>  	atomic_t outstanding_submission_g2h;

atomic_t is good for statistics, but not for code flow control. If you use
it for that you either need a lot of barriers and comments, which means
there need to be some real perf numbers showing that this is required in
a workload we care about.

Or you stuff this into a related lock. E.g. from a high-level view,
stuffing this into intel_guc_ct (which definitely also has way more locks
than it needs) could make sense?

>  
>  	struct {
> @@ -49,12 +53,16 @@ struct intel_guc {
>  		void (*disable)(struct intel_guc *guc);
>  	} interrupts;
>  
> -	/*
> -	 * contexts_lock protects the pool of free guc ids and a linked list of
> -	 * guc ids available to be stolen
> +	/**
> +	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id, and
> +	 * ce->guc_id_ref when transitioning in and out of zero
>  	 */
>  	spinlock_t contexts_lock;
> +	/** @guc_ids: used to allocate new guc_ids */
>  	struct ida guc_ids;
> +	/**
> +	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> +	 */
>  	struct list_head guc_id_list;
>  
>  	bool submission_supported;
> @@ -70,7 +78,10 @@ struct intel_guc {
>  	struct i915_vma *lrc_desc_pool;
>  	void *lrc_desc_pool_vaddr;
>  
> -	/* guc_id to intel_context lookup */
> +	/**
> +	 * @context_lookup: used to look up an intel_context from a guc_id; if a
> +	 * context is present in this structure it is registered with the GuC
> +	 */
>  	struct xarray context_lookup;
>  
>  	/* Control params for fw initialization */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index eb06a4c7534e..18ef363c6e5d 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -28,21 +28,6 @@
>  /**
>   * DOC: GuC-based command submission
>   *
> - * IMPORTANT NOTE: GuC submission is currently not supported in i915. The GuC
> - * firmware is moving to an updated submission interface and we plan to
> - * turn submission back on when that lands. The below documentation (and related
> - * code) matches the old submission model and will be updated as part of the
> - * upgrade to the new flow.
> - *
> - * GuC stage descriptor:
> - * During initialization, the driver allocates a static pool of 1024 such
> - * descriptors, and shares them with the GuC. Currently, we only use one
> - * descriptor. This stage descriptor lets the GuC know about the workqueue and
> - * process descriptor. Theoretically, it also lets the GuC know about our HW
> - * contexts (context ID, etc...), but we actually employ a kind of submission
> - * where the GuC uses the LRCA sent via the work item instead. This is called
> - * a "proxy" submission.
> - *
>   * The Scratch registers:
>   * There are 16 MMIO-based registers start from 0xC180. The kernel driver writes
>   * a value to the action register (SOFT_SCRATCH_0) along with any data. It then
> @@ -51,14 +36,86 @@
>   * processes the request. The kernel driver polls waiting for this update and
>   * then proceeds.
>   *
> - * Work Items:
> - * There are several types of work items that the host may place into a
> - * workqueue, each with its own requirements and limitations. Currently only
> - * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
> - * represents in-order queue. The kernel driver packs ring tail pointer and an
> - * ELSP context descriptor dword into Work Item.
> - * See guc_add_request()
> + * Command Transport buffers (CTBs):
> + * Covered in detail in other sections, but CTBs (host-to-GuC, H2G, and
> + * GuC-to-host, G2H) are a message interface between the i915 and GuC used
> + * to control submissions.
> + *
> + * Context registration:
> + * Before a context can be submitted it must be registered with the GuC via a
> + * H2G. A unique guc_id is associated with each context. The context is either
> + * registered at request creation time (normal operation) or at submission time
> + * (abnormal operation, e.g. after a reset).
> + *
> + * Context submission:
> + * The i915 updates the LRC tail value in memory. Either a schedule enable H2G
> + * or context submit H2G is used to submit a context.
> + *
> + * Context unpin:
> + * To unpin a context a H2G is used to disable scheduling and when the
> + * corresponding G2H returns indicating the scheduling disable operation has
> + * completed it is safe to unpin the context. While a disable is in flight it
> + * isn't safe to resubmit the context so a fence is used to stall all future
> + * requests until the G2H is returned.
> + *
> + * Context deregistration:
> + * Before a context can be destroyed or we steal its guc_id we must deregister
> + * the context with the GuC via H2G. If stealing the guc_id it isn't safe to
> + * submit anything to this guc_id until the deregister completes so a fence is
> + * used to stall all requests associated with this guc_id until the
> + * corresponding G2H returns indicating the guc_id has been deregistered.
> + *
> + * guc_ids:
> + * Unique number associated with private GuC context data passed in during
> + * context registration / submission / deregistration. 64k available. Simple ida
> + * is used for allocation.
> + *
> + * Stealing guc_ids:
> + * If no guc_ids are available they can be stolen from another context at
> + * request creation time if that context is unpinned. If a guc_id can't be found
> + * we punt this problem to the user as we believe this is near impossible to hit
> + * during normal use cases.
> + *
> + * Locking:
> + * In the GuC submission code we have 4 basic spin locks which protect
> + * everything. Details about each below.
> + *
> + * sched_engine->lock
> + * This is the submission lock for all contexts that share an i915 scheduling
> + * engine (sched_engine), thus only 1 context that shares a sched_engine can
> + * be submitting at a time. Currently only 1 sched_engine is used for all of
> + * GuC submission but that could change in the future.

There are at least 3 more spinlocks for intel_guc_ct ...

> + *
> + * guc->contexts_lock
> + * Protects guc_id allocation. Global lock, i.e. only 1 context that uses GuC
> + * submission can hold this at a time.

Plus you forgot the spinlock of the xarray, which is also used in the
code with this patch set, not just internally in the xarray, so we have to
think about that one too.

Iow still way too many locks.

> + *
> + * ce->guc_state.lock
> + * Protects everything under ce->guc_state. Ensures that a context is in the
> + * correct state before issuing a H2G. e.g. We don't issue a schedule disable
> + * on a disabled context (bad idea), we don't issue a schedule enable when a
> + * schedule disable is inflight, etc... Lock is individual to each context.
> + *
> + * ce->guc_active.lock
> + * Protects everything under ce->guc_active which is the current requests
> + * inflight on the context / priority management. Lock is individual to each
> + * context.
> + *
> + * Lock ordering rules:
> + * sched_engine->lock -> ce->guc_active.lock
> + * sched_engine->lock -> ce->guc_state.lock
> + * guc->contexts_lock -> ce->guc_state.lock
>   *
> + * Reset races:
> + * When a GPU full reset is triggered it is assumed that some G2H responses to
> + * a H2G can be lost as the GuC is likely toast. Losing these G2H can prove
> + * fatal as we do certain operations upon receiving a G2H (e.g. destroy
> + * contexts, release guc_ids, etc...). Luckily when this occurs we can scrub
> + * context state and clean up appropriately, however this is quite racy. To
> + * avoid races the rule is to check for submission being disabled (i.e. check for
> + * mid reset) with the appropriate lock being held. If submission is disabled
> + * don't send the H2G or update the context state. The reset code must disable
> + * submission and grab all these locks before scrubbing for the missing G2H.

Can we make this all a lot less racy? Instead of a huge state machinery
can't we just do all that under a context lock, i.e.

1. take context lock
2. send guc message that is tricky, like register or deregister or
whatever
3. wait for that reply, our context is blocked anyway, no harm holding a
lock, other contexts can keep processing
4. the lower-level guc_ct code guarantees that we either get the reply, or
a -ERESET or whatever indicating that we raced with a reset, in which case
we can just restart whatever it is we wanted to do (or for deregister, do
nothing since the guc reset has solved that problem)
5. unlock

Massively lockless state machines are cool, but also very hard to maintain
and keep correct.
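
Roughly, as a sketch where every name below (including -ERESET itself) is
made up:

#define ERESET 512 /* stand-in errno for "raced with a guc reset" */

static int guc_context_register_locked(struct intel_context *ce)
{
	int err;

	mutex_lock(&ce->guc_mutex);			/* 1 */
	do {
		err = send_register_h2g(ce);		/* 2 */
		if (!err)
			err = wait_register_g2h(ce);	/* 3 */
		/*
		 * 4: guc_ct guarantees we either got the reply or -ERESET,
		 * in which case we just restart (or, for deregister, do
		 * nothing since the guc reset already took care of it).
		 */
	} while (err == -ERESET);
	mutex_unlock(&ce->guc_mutex);			/* 5 */

	return err;
}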
-Daniel

>   */
>  
>  /* GuC Virtual Engine */
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index d818cfbfc41d..177eaf55adff 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -290,18 +290,20 @@ struct i915_request {
>  		struct hrtimer timer;
>  	} watchdog;
>  
> -	/*
> -	 * Requests may need to be stalled when using GuC submission waiting for
> -	 * certain GuC operations to complete. If that is the case, stalled
> -	 * requests are added to a per context list of stalled requests. The
> -	 * below list_head is the link in that list.
> +	/**
> +	 * @guc_fence_link: Requests may need to be stalled when using GuC
> +	 * submission, waiting for certain GuC operations to complete. If that is
> +	 * the case, stalled requests are added to a per context list of stalled
> +	 * requests. The below list_head is the link in that list. Protected by
> +	 * ce->guc_state.lock.
>  	 */
>  	struct list_head guc_fence_link;
>  
>  	/**
> -	 * Priority level while the request is inflight. Differs from i915
> -	 * scheduler priority. See comment above
> -	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details.
> +	 * @guc_prio: Priority level while the request is inflight. Differs from
> +	 * i915 scheduler priority. See comment above
> +	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details. Protected by
> +	 * ce->guc_active.lock.
>  	 */
>  #define	GUC_PRIO_INIT	0xff
>  #define	GUC_PRIO_FINI	0xfe
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev2)
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (21 preceding siblings ...)
  2021-08-16 13:51 ` [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc Matthew Brost
@ 2021-08-17 12:49 ` Patchwork
  2021-08-17 12:51 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 56+ messages in thread
From: Patchwork @ 2021-08-17 12:49 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Clean up GuC CI failures, simplify locking, and kernel DOC (rev2)
URL   : https://patchwork.freedesktop.org/series/93704/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
b9583f1b134e drm/i915/guc: Fix blocked context accounting
5703206b5f51 drm/i915/guc: Fix outstanding G2H accounting
ee0ecb333df9 drm/i915/guc: Unwind context requests in reverse order
97ee783e2b00 drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
6f42cfc0eeb4 drm/i915/guc: Workaround reset G2H is received after schedule done G2H
-:7: WARNING:TYPO_SPELLING: 'cancelation' may be misspelled - perhaps 'cancellation'?
#7: 
If the context is reset as a result of the request cancelation the
                                                   ^^^^^^^^^^^

-:10: WARNING:TYPO_SPELLING: 'cancelation' may be misspelled - perhaps 'cancellation'?
#10: 
waiting request cancelation code which resubmits the context. This races
                ^^^^^^^^^^^

-:12: WARNING:TYPO_SPELLING: 'cancelation' may be misspelled - perhaps 'cancellation'?
#12: 
in this case it really should be a NOP as request cancelation code owns
                                                  ^^^^^^^^^^^

-:58: WARNING:BRACES: braces {} are not necessary for any arm of this statement
#58: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:850:
+	if (likely(!context_pending_enable(ce))) {
[...]
+	} else {
[...]

total: 0 errors, 4 warnings, 0 checks, 73 lines checked
806479ce9909 drm/i915/execlists: Do not propagate errors to dependent fences
862260cf6795 drm/i915/selftests: Add a cancel request selftest that triggers a reset
ba1d218343a3 drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
221846309949 drm/i915/selftests: Fix memory corruption in live_lrc_isolation
241da61be83d drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
-:104: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#104: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 232 lines checked
957737f84734 drm/i915/guc: Take context ref when cancelling request
738284a940e2 drm/i915/guc: Don't touch guc_state.sched_state without a lock
48c820953477 drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
e21f028c082e drm/i915: Allocate error capture in atomic context
31fbd295c9f5 drm/i915/guc: Flush G2H work queue during reset
dcd9725de04f drm/i915/guc: Release submit fence from an IRQ
c8b83840007d drm/i915/guc: Move guc_blocked fence to struct guc_state
fa32fd7346d0 drm/i915/guc: Rework and simplify locking
69e885b61035 drm/i915/guc: Proper xarray usage for contexts_lookup
492de6bdaacb drm/i915/guc: Drop pin count check trick between sched_disable and re-pin
bb901831764c drm/i915/guc: Move GuC priority fields in context under guc_active
c4c34f7bb22c drm/i915/guc: Add GuC kernel doc



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev2)
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (22 preceding siblings ...)
  2021-08-17 12:49 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev2) Patchwork
@ 2021-08-17 12:51 ` Patchwork
  2021-08-17 13:22 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
  2021-08-17 14:39 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
  25 siblings, 0 replies; 56+ messages in thread
From: Patchwork @ 2021-08-17 12:51 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Clean up GuC CI failures, simplify locking, and kernel DOC (rev2)
URL   : https://patchwork.freedesktop.org/series/93704/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
+drivers/gpu/drm/i915/selftests/i915_syncmap.c:80:54: warning: dubious: x | !y



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for Clean up GuC CI failures, simplify locking, and kernel DOC (rev2)
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (23 preceding siblings ...)
  2021-08-17 12:51 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
@ 2021-08-17 13:22 ` Patchwork
  2021-08-17 14:39 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
  25 siblings, 0 replies; 56+ messages in thread
From: Patchwork @ 2021-08-17 13:22 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 5526 bytes --]

== Series Details ==

Series: Clean up GuC CI failures, simplify locking, and kernel DOC (rev2)
URL   : https://patchwork.freedesktop.org/series/93704/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_10490 -> Patchwork_20833
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/index.html

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_20833:

### IGT changes ###

#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@i915_selftest@live@gt_heartbeat:
    - {fi-ehl-2}:         [PASS][1] -> [DMESG-FAIL][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/fi-ehl-2/igt@i915_selftest@live@gt_heartbeat.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/fi-ehl-2/igt@i915_selftest@live@gt_heartbeat.html

  
New tests
---------

  New tests have been introduced between CI_DRM_10490 and Patchwork_20833:

### New IGT tests (1) ###

  * igt@i915_selftest@live@guc:
    - Statuses : 30 pass(s)
    - Exec time: [0.40, 5.19] s

  

Known issues
------------

  Here are the changes found in Patchwork_20833 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@semaphore:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][3] ([fdo#109271]) +27 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/fi-bdw-5557u/igt@amdgpu/amd_basic@semaphore.html

  * igt@core_hotunplug@unbind-rebind:
    - fi-bdw-5557u:       NOTRUN -> [WARN][4] ([i915#3718])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/fi-bdw-5557u/igt@core_hotunplug@unbind-rebind.html

  * igt@gem_exec_parallel@engines@userptr:
    - fi-pnv-d510:        [PASS][5] -> [INCOMPLETE][6] ([i915#299])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/fi-pnv-d510/igt@gem_exec_parallel@engines@userptr.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/fi-pnv-d510/igt@gem_exec_parallel@engines@userptr.html

  * igt@kms_chamelium@dp-crc-fast:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][7] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/fi-bdw-5557u/igt@kms_chamelium@dp-crc-fast.html

  * igt@runner@aborted:
    - fi-pnv-d510:        NOTRUN -> [FAIL][8] ([i915#2403] / [i915#2505] / [i915#2722])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/fi-pnv-d510/igt@runner@aborted.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#2403]: https://gitlab.freedesktop.org/drm/intel/issues/2403
  [i915#2505]: https://gitlab.freedesktop.org/drm/intel/issues/2505
  [i915#2722]: https://gitlab.freedesktop.org/drm/intel/issues/2722
  [i915#299]: https://gitlab.freedesktop.org/drm/intel/issues/299
  [i915#3718]: https://gitlab.freedesktop.org/drm/intel/issues/3718


Participating hosts (36 -> 34)
------------------------------

  Missing    (2): fi-bsw-cyan fi-bdw-samus 


Build changes
-------------

  * Linux: CI_DRM_10490 -> Patchwork_20833

  CI-20190529: 20190529
  CI_DRM_10490: 3bd74b377986fcb89cf4563629f97c5b3199ca6f @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6177: f474644e7226dd319195ca03b3cde82ad10ac54c @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_20833: c4c34f7bb22c9a83377812d75d8eb207a44a1b9b @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

c4c34f7bb22c drm/i915/guc: Add GuC kernel doc
bb901831764c drm/i915/guc: Move GuC priority fields in context under guc_active
492de6bdaacb drm/i915/guc: Drop pin count check trick between sched_disable and re-pin
69e885b61035 drm/i915/guc: Proper xarray usage for contexts_lookup
fa32fd7346d0 drm/i915/guc: Rework and simplify locking
c8b83840007d drm/i915/guc: Move guc_blocked fence to struct guc_state
dcd9725de04f drm/i915/guc: Release submit fence from an IRQ
31fbd295c9f5 drm/i915/guc: Flush G2H work queue during reset
e21f028c082e drm/i915: Allocate error capture in atomic context
48c820953477 drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
738284a940e2 drm/i915/guc: Don't touch guc_state.sched_state without a lock
957737f84734 drm/i915/guc: Take context ref when cancelling request
241da61be83d drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
221846309949 drm/i915/selftests: Fix memory corruption in live_lrc_isolation
ba1d218343a3 drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
862260cf6795 drm/i915/selftests: Add a cancel request selftest that triggers a reset
806479ce9909 drm/i915/execlists: Do not propagate errors to dependent fences
6f42cfc0eeb4 drm/i915/guc: Workaround reset G2H is received after schedule done G2H
97ee783e2b00 drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
ee0ecb333df9 drm/i915/guc: Unwind context requests in reverse order
5703206b5f51 drm/i915/guc: Fix outstanding G2H accounting
b9583f1b134e drm/i915/guc: Fix blocked context accounting

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/index.html

[-- Attachment #2: Type: text/html, Size: 6496 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Intel-gfx] ✗ Fi.CI.IGT: failure for Clean up GuC CI failures, simplify locking, and kernel DOC (rev2)
  2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (24 preceding siblings ...)
  2021-08-17 13:22 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
@ 2021-08-17 14:39 ` Patchwork
  25 siblings, 0 replies; 56+ messages in thread
From: Patchwork @ 2021-08-17 14:39 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 30288 bytes --]

== Series Details ==

Series: Clean up GuC CI failures, simplify locking, and kernel DOC (rev2)
URL   : https://patchwork.freedesktop.org/series/93704/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_10490_full -> Patchwork_20833_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_20833_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_20833_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_20833_full:

### IGT changes ###

#### Possible regressions ####

  * igt@kms_flip_tiling@flip-to-yf-tiled@edp-1-pipe-a:
    - shard-skl:          [PASS][1] -> [FAIL][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl10/igt@kms_flip_tiling@flip-to-yf-tiled@edp-1-pipe-a.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl7/igt@kms_flip_tiling@flip-to-yf-tiled@edp-1-pipe-a.html

  
New tests
---------

  New tests have been introduced between CI_DRM_10490_full and Patchwork_20833_full:

### New IGT tests (1) ###

  * igt@i915_selftest@live@guc:
    - Statuses : 5 pass(s)
    - Exec time: [0.95, 4.69] s

  

Known issues
------------

  Here are the changes found in Patchwork_20833_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_ctx_persistence@legacy-engines-queued:
    - shard-snb:          NOTRUN -> [SKIP][3] ([fdo#109271] / [i915#1099]) +2 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-snb5/igt@gem_ctx_persistence@legacy-engines-queued.html

  * igt@gem_eio@unwedge-stress:
    - shard-tglb:         [PASS][4] -> [TIMEOUT][5] ([i915#2369] / [i915#3063] / [i915#3648])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-tglb7/igt@gem_eio@unwedge-stress.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb7/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_fair@basic-none-solo@rcs0:
    - shard-kbl:          [PASS][6] -> [FAIL][7] ([i915#2842]) +5 similar issues
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-kbl6/igt@gem_exec_fair@basic-none-solo@rcs0.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl4/igt@gem_exec_fair@basic-none-solo@rcs0.html

  * igt@gem_exec_fair@basic-pace@vcs1:
    - shard-iclb:         NOTRUN -> [FAIL][8] ([i915#2842]) +1 similar issue
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb2/igt@gem_exec_fair@basic-pace@vcs1.html

  * igt@gem_exec_fair@basic-throttle@rcs0:
    - shard-glk:          [PASS][9] -> [FAIL][10] ([i915#2842]) +2 similar issues
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-glk5/igt@gem_exec_fair@basic-throttle@rcs0.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-glk9/igt@gem_exec_fair@basic-throttle@rcs0.html

  * igt@gem_exec_suspend@basic-s4-devices:
    - shard-glk:          NOTRUN -> [DMESG-WARN][11] ([i915#1610])
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-glk6/igt@gem_exec_suspend@basic-s4-devices.html

  * igt@gem_exec_whisper@basic-fds-forked:
    - shard-glk:          [PASS][12] -> [DMESG-WARN][13] ([i915#118] / [i915#95])
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-glk7/igt@gem_exec_whisper@basic-fds-forked.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-glk7/igt@gem_exec_whisper@basic-fds-forked.html

  * igt@gem_huc_copy@huc-copy:
    - shard-tglb:         [PASS][14] -> [SKIP][15] ([i915#2190])
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-tglb2/igt@gem_huc_copy@huc-copy.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb6/igt@gem_huc_copy@huc-copy.html

  * igt@gem_pwrite@basic-exhaustion:
    - shard-tglb:         NOTRUN -> [WARN][16] ([i915#2658])
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb8/igt@gem_pwrite@basic-exhaustion.html

  * igt@gem_render_copy@y-tiled-to-vebox-y-tiled:
    - shard-iclb:         NOTRUN -> [SKIP][17] ([i915#768])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@gem_render_copy@y-tiled-to-vebox-y-tiled.html

  * igt@gem_userptr_blits@access-control:
    - shard-tglb:         NOTRUN -> [SKIP][18] ([i915#3297]) +1 similar issue
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@gem_userptr_blits@access-control.html
    - shard-iclb:         NOTRUN -> [SKIP][19] ([i915#3297]) +1 similar issue
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@gem_userptr_blits@access-control.html

  * igt@gem_userptr_blits@input-checking:
    - shard-apl:          NOTRUN -> [DMESG-WARN][20] ([i915#3002])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl7/igt@gem_userptr_blits@input-checking.html

  * igt@gen9_exec_parse@allowed-single:
    - shard-skl:          [PASS][21] -> [DMESG-WARN][22] ([i915#1436] / [i915#716])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl4/igt@gen9_exec_parse@allowed-single.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl3/igt@gen9_exec_parse@allowed-single.html

  * igt@i915_suspend@sysfs-reader:
    - shard-apl:          NOTRUN -> [DMESG-WARN][23] ([i915#180]) +2 similar issues
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl3/igt@i915_suspend@sysfs-reader.html

  * igt@kms_addfb_basic@invalid-smem-bo-on-discrete:
    - shard-tglb:         NOTRUN -> [SKIP][24] ([i915#3826])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@kms_addfb_basic@invalid-smem-bo-on-discrete.html
    - shard-iclb:         NOTRUN -> [SKIP][25] ([i915#3826])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_addfb_basic@invalid-smem-bo-on-discrete.html

  * igt@kms_atomic@plane-primary-overlay-mutable-zpos:
    - shard-tglb:         NOTRUN -> [SKIP][26] ([i915#404])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@kms_atomic@plane-primary-overlay-mutable-zpos.html
    - shard-iclb:         NOTRUN -> [SKIP][27] ([i915#404])
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_atomic@plane-primary-overlay-mutable-zpos.html

  * igt@kms_big_fb@x-tiled-max-hw-stride-32bpp-rotate-0-hflip:
    - shard-skl:          NOTRUN -> [SKIP][28] ([fdo#109271] / [i915#3777])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl4/igt@kms_big_fb@x-tiled-max-hw-stride-32bpp-rotate-0-hflip.html

  * igt@kms_big_fb@yf-tiled-8bpp-rotate-180:
    - shard-tglb:         NOTRUN -> [SKIP][29] ([fdo#111615])
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb8/igt@kms_big_fb@yf-tiled-8bpp-rotate-180.html

  * igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][30] ([i915#3689]) +2 similar issues
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_ccs.html

  * igt@kms_ccs@pipe-b-crc-sprite-planes-basic-y_tiled_gen12_rc_ccs_cc:
    - shard-iclb:         NOTRUN -> [SKIP][31] ([fdo#109278] / [i915#3886])
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_ccs@pipe-b-crc-sprite-planes-basic-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-missing-ccs-buffer-y_tiled_gen12_rc_ccs_cc:
    - shard-kbl:          NOTRUN -> [SKIP][32] ([fdo#109271] / [i915#3886]) +1 similar issue
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl6/igt@kms_ccs@pipe-b-missing-ccs-buffer-y_tiled_gen12_rc_ccs_cc.html
    - shard-apl:          NOTRUN -> [SKIP][33] ([fdo#109271] / [i915#3886]) +9 similar issues
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl8/igt@kms_ccs@pipe-b-missing-ccs-buffer-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-c-crc-sprite-planes-basic-y_tiled_gen12_mc_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][34] ([i915#3689] / [i915#3886])
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb8/igt@kms_ccs@pipe-c-crc-sprite-planes-basic-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-d-crc-primary-rotation-180-y_tiled_ccs:
    - shard-iclb:         NOTRUN -> [SKIP][35] ([fdo#109278])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_ccs@pipe-d-crc-primary-rotation-180-y_tiled_ccs.html

  * igt@kms_cdclk@plane-scaling:
    - shard-iclb:         NOTRUN -> [SKIP][36] ([i915#3742])
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_cdclk@plane-scaling.html
    - shard-tglb:         NOTRUN -> [SKIP][37] ([i915#3742])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@kms_cdclk@plane-scaling.html

  * igt@kms_chamelium@dp-audio-edid:
    - shard-skl:          NOTRUN -> [SKIP][38] ([fdo#109271] / [fdo#111827]) +1 similar issue
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl6/igt@kms_chamelium@dp-audio-edid.html

  * igt@kms_color_chamelium@pipe-a-ctm-0-25:
    - shard-snb:          NOTRUN -> [SKIP][39] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-snb5/igt@kms_color_chamelium@pipe-a-ctm-0-25.html

  * igt@kms_color_chamelium@pipe-a-ctm-blue-to-red:
    - shard-kbl:          NOTRUN -> [SKIP][40] ([fdo#109271] / [fdo#111827])
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl6/igt@kms_color_chamelium@pipe-a-ctm-blue-to-red.html

  * igt@kms_color_chamelium@pipe-a-ctm-limited-range:
    - shard-apl:          NOTRUN -> [SKIP][41] ([fdo#109271] / [fdo#111827]) +15 similar issues
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl7/igt@kms_color_chamelium@pipe-a-ctm-limited-range.html

  * igt@kms_cursor_crc@pipe-a-cursor-32x32-random:
    - shard-tglb:         NOTRUN -> [SKIP][42] ([i915#3319])
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb8/igt@kms_cursor_crc@pipe-a-cursor-32x32-random.html

  * igt@kms_cursor_crc@pipe-a-cursor-512x512-rapid-movement:
    - shard-iclb:         NOTRUN -> [SKIP][43] ([fdo#109278] / [fdo#109279])
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_cursor_crc@pipe-a-cursor-512x512-rapid-movement.html
    - shard-tglb:         NOTRUN -> [SKIP][44] ([fdo#109279] / [i915#3359])
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@kms_cursor_crc@pipe-a-cursor-512x512-rapid-movement.html

  * igt@kms_cursor_crc@pipe-a-cursor-suspend:
    - shard-skl:          [PASS][45] -> [FAIL][46] ([i915#3444])
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl10/igt@kms_cursor_crc@pipe-a-cursor-suspend.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl7/igt@kms_cursor_crc@pipe-a-cursor-suspend.html

  * igt@kms_cursor_crc@pipe-c-cursor-suspend:
    - shard-apl:          [PASS][47] -> [DMESG-WARN][48] ([i915#180]) +1 similar issue
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-apl7/igt@kms_cursor_crc@pipe-c-cursor-suspend.html
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl1/igt@kms_cursor_crc@pipe-c-cursor-suspend.html

  * igt@kms_cursor_crc@pipe-d-cursor-256x256-rapid-movement:
    - shard-kbl:          NOTRUN -> [SKIP][49] ([fdo#109271]) +31 similar issues
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl6/igt@kms_cursor_crc@pipe-d-cursor-256x256-rapid-movement.html

  * igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-legacy:
    - shard-apl:          [PASS][50] -> [DMESG-WARN][51] ([IGT#6])
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-apl3/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-legacy.html
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl1/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-legacy.html

  * igt@kms_cursor_legacy@pipe-d-torture-bo:
    - shard-apl:          NOTRUN -> [SKIP][52] ([fdo#109271] / [i915#533]) +1 similar issue
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl3/igt@kms_cursor_legacy@pipe-d-torture-bo.html

  * igt@kms_flip@2x-flip-vs-panning-vs-hang:
    - shard-skl:          NOTRUN -> [SKIP][53] ([fdo#109271]) +21 similar issues
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl4/igt@kms_flip@2x-flip-vs-panning-vs-hang.html

  * igt@kms_flip@2x-plain-flip:
    - shard-iclb:         NOTRUN -> [SKIP][54] ([fdo#109274])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_flip@2x-plain-flip.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@c-edp1:
    - shard-skl:          [PASS][55] -> [FAIL][56] ([i915#79])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl10/igt@kms_flip@flip-vs-expired-vblank-interruptible@c-edp1.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl1/igt@kms_flip@flip-vs-expired-vblank-interruptible@c-edp1.html

  * igt@kms_flip@flip-vs-suspend-interruptible@a-dp1:
    - shard-kbl:          [PASS][57] -> [DMESG-WARN][58] ([i915#180]) +2 similar issues
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-kbl3/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl1/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html

  * igt@kms_flip@plain-flip-fb-recreate@b-edp1:
    - shard-skl:          [PASS][59] -> [FAIL][60] ([i915#2122])
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl3/igt@kms_flip@plain-flip-fb-recreate@b-edp1.html
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl8/igt@kms_flip@plain-flip-fb-recreate@b-edp1.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-mmap-cpu:
    - shard-iclb:         NOTRUN -> [SKIP][61] ([fdo#109280]) +4 similar issues
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-mmap-cpu.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-cur-indfb-draw-mmap-cpu:
    - shard-tglb:         NOTRUN -> [SKIP][62] ([fdo#111825]) +8 similar issues
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-cur-indfb-draw-mmap-cpu.html

  * igt@kms_hdr@bpc-switch:
    - shard-skl:          [PASS][63] -> [FAIL][64] ([i915#1188])
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl8/igt@kms_hdr@bpc-switch.html
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl4/igt@kms_hdr@bpc-switch.html

  * igt@kms_pipe_b_c_ivb@disable-pipe-b-enable-pipe-c:
    - shard-apl:          NOTRUN -> [SKIP][65] ([fdo#109271]) +161 similar issues
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl7/igt@kms_pipe_b_c_ivb@disable-pipe-b-enable-pipe-c.html

  * igt@kms_pipe_crc_basic@read-crc-pipe-d:
    - shard-kbl:          NOTRUN -> [SKIP][66] ([fdo#109271] / [i915#533])
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl6/igt@kms_pipe_crc_basic@read-crc-pipe-d.html

  * igt@kms_plane_alpha_blend@pipe-a-alpha-transparent-fb:
    - shard-apl:          NOTRUN -> [FAIL][67] ([i915#265])
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl3/igt@kms_plane_alpha_blend@pipe-a-alpha-transparent-fb.html

  * igt@kms_plane_alpha_blend@pipe-b-alpha-basic:
    - shard-apl:          NOTRUN -> [FAIL][68] ([fdo#108145] / [i915#265]) +1 similar issue
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl7/igt@kms_plane_alpha_blend@pipe-b-alpha-basic.html

  * igt@kms_plane_alpha_blend@pipe-c-coverage-7efc:
    - shard-skl:          [PASS][69] -> [FAIL][70] ([fdo#108145] / [i915#265]) +2 similar issues
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl4/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl3/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html

  * igt@kms_plane_lowres@pipe-a-tiling-x:
    - shard-iclb:         NOTRUN -> [SKIP][71] ([i915#3536])
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_plane_lowres@pipe-a-tiling-x.html
    - shard-tglb:         NOTRUN -> [SKIP][72] ([i915#3536])
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@kms_plane_lowres@pipe-a-tiling-x.html

  * igt@kms_plane_lowres@pipe-c-tiling-y:
    - shard-snb:          NOTRUN -> [SKIP][73] ([fdo#109271]) +151 similar issues
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-snb5/igt@kms_plane_lowres@pipe-c-tiling-y.html

  * igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-2:
    - shard-apl:          NOTRUN -> [SKIP][74] ([fdo#109271] / [i915#658]) +1 similar issue
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl7/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-2.html

  * igt@kms_psr2_sf@plane-move-sf-dmg-area-2:
    - shard-tglb:         NOTRUN -> [SKIP][75] ([i915#2920])
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@kms_psr2_sf@plane-move-sf-dmg-area-2.html
    - shard-iclb:         NOTRUN -> [SKIP][76] ([i915#658])
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_psr2_sf@plane-move-sf-dmg-area-2.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-3:
    - shard-skl:          NOTRUN -> [SKIP][77] ([fdo#109271] / [i915#658])
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl4/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-3.html

  * igt@kms_psr@psr2_no_drrs:
    - shard-iclb:         [PASS][78] -> [SKIP][79] ([fdo#109441]) +1 similar issue
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-iclb2/igt@kms_psr@psr2_no_drrs.html
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb6/igt@kms_psr@psr2_no_drrs.html

  * igt@kms_vblank@pipe-d-query-busy-hang:
    - shard-glk:          NOTRUN -> [SKIP][80] ([fdo#109271])
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-glk6/igt@kms_vblank@pipe-d-query-busy-hang.html

  * igt@kms_vrr@flipline:
    - shard-tglb:         NOTRUN -> [SKIP][81] ([fdo#109502])
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb5/igt@kms_vrr@flipline.html
    - shard-iclb:         NOTRUN -> [SKIP][82] ([fdo#109502])
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb4/igt@kms_vrr@flipline.html

  * igt@perf@blocking:
    - shard-skl:          [PASS][83] -> [FAIL][84] ([i915#1542]) +1 similar issue
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl8/igt@perf@blocking.html
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl1/igt@perf@blocking.html

  * igt@sysfs_clients@fair-1:
    - shard-apl:          NOTRUN -> [SKIP][85] ([fdo#109271] / [i915#2994]) +2 similar issues
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl7/igt@sysfs_clients@fair-1.html

  * igt@sysfs_clients@sema-10:
    - shard-skl:          NOTRUN -> [SKIP][86] ([fdo#109271] / [i915#2994])
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl4/igt@sysfs_clients@sema-10.html

  
#### Possible fixes ####

  * igt@gem_eio@unwedge-stress:
    - shard-iclb:         [TIMEOUT][87] ([i915#2369] / [i915#2481] / [i915#3070]) -> [PASS][88]
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-iclb8/igt@gem_eio@unwedge-stress.html
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb6/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_fair@basic-none@vecs0:
    - shard-apl:          [FAIL][89] ([i915#2842] / [i915#3468]) -> [PASS][90]
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-apl3/igt@gem_exec_fair@basic-none@vecs0.html
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl1/igt@gem_exec_fair@basic-none@vecs0.html

  * igt@gem_exec_fair@basic-pace@rcs0:
    - shard-kbl:          [FAIL][91] ([i915#2842]) -> [PASS][92] +1 similar issue
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-kbl7/igt@gem_exec_fair@basic-pace@rcs0.html
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl6/igt@gem_exec_fair@basic-pace@rcs0.html
    - shard-tglb:         [FAIL][93] ([i915#2842]) -> [PASS][94]
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-tglb5/igt@gem_exec_fair@basic-pace@rcs0.html
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-tglb1/igt@gem_exec_fair@basic-pace@rcs0.html

  * igt@gem_ppgtt@flink-and-close-vma-leak:
    - shard-skl:          [FAIL][95] ([i915#644]) -> [PASS][96]
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl2/igt@gem_ppgtt@flink-and-close-vma-leak.html
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl6/igt@gem_ppgtt@flink-and-close-vma-leak.html

  * igt@gen9_exec_parse@allowed-all:
    - shard-glk:          [DMESG-WARN][97] ([i915#1436] / [i915#716]) -> [PASS][98]
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-glk4/igt@gen9_exec_parse@allowed-all.html
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-glk6/igt@gen9_exec_parse@allowed-all.html

  * igt@i915_pm_rpm@system-suspend:
    - shard-skl:          [INCOMPLETE][99] ([i915#151]) -> [PASS][100]
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl6/igt@i915_pm_rpm@system-suspend.html
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl4/igt@i915_pm_rpm@system-suspend.html

  * igt@i915_suspend@forcewake:
    - shard-kbl:          [INCOMPLETE][101] ([i915#636]) -> [PASS][102]
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-kbl4/igt@i915_suspend@forcewake.html
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl7/igt@i915_suspend@forcewake.html

  * igt@kms_cursor_crc@pipe-b-cursor-suspend:
    - shard-kbl:          [DMESG-WARN][103] ([i915#180]) -> [PASS][104]
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-kbl7/igt@kms_cursor_crc@pipe-b-cursor-suspend.html
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl6/igt@kms_cursor_crc@pipe-b-cursor-suspend.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions:
    - shard-skl:          [FAIL][105] ([i915#2346]) -> [PASS][106]
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl5/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl10/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html

  * igt@kms_flip@flip-vs-expired-vblank@a-edp1:
    - shard-skl:          [FAIL][107] ([i915#79]) -> [PASS][108] +1 similar issue
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl2/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl2/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html

  * igt@kms_flip@flip-vs-suspend-interruptible@a-dp1:
    - shard-apl:          [DMESG-WARN][109] ([i915#180]) -> [PASS][110] +1 similar issue
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-apl6/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl2/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html

  * igt@kms_hdr@bpc-switch-dpms:
    - shard-skl:          [FAIL][111] ([i915#1188]) -> [PASS][112]
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl4/igt@kms_hdr@bpc-switch-dpms.html
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl8/igt@kms_hdr@bpc-switch-dpms.html

  * igt@kms_psr2_su@page_flip:
    - shard-iclb:         [SKIP][113] ([fdo#109642] / [fdo#111068] / [i915#658]) -> [PASS][114]
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-iclb3/igt@kms_psr2_su@page_flip.html
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb2/igt@kms_psr2_su@page_flip.html

  * igt@kms_psr@psr2_cursor_plane_move:
    - shard-iclb:         [SKIP][115] ([fdo#109441]) -> [PASS][116] +1 similar issue
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-iclb4/igt@kms_psr@psr2_cursor_plane_move.html
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb2/igt@kms_psr@psr2_cursor_plane_move.html

  
#### Warnings ####

  * igt@i915_pm_dc@dc3co-vpb-simulation:
    - shard-iclb:         [SKIP][117] ([i915#658]) -> [SKIP][118] ([i915#588])
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-iclb7/igt@i915_pm_dc@dc3co-vpb-simulation.html
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb2/igt@i915_pm_dc@dc3co-vpb-simulation.html

  * igt@i915_pm_rc6_residency@rc6-fence:
    - shard-iclb:         [WARN][119] ([i915#2684]) -> [WARN][120] ([i915#1804] / [i915#2684]) +1 similar issue
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-iclb8/igt@i915_pm_rc6_residency@rc6-fence.html
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb6/igt@i915_pm_rc6_residency@rc6-fence.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-2:
    - shard-iclb:         [SKIP][121] ([i915#658]) -> [SKIP][122] ([i915#2920]) +1 similar issue
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-iclb7/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-2.html
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-2.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-4:
    - shard-iclb:         [SKIP][123] ([i915#2920]) -> [SKIP][124] ([i915#658]) +1 similar issue
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-iclb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-4.html
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-iclb6/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-4.html

  * igt@runner@aborted:
    - shard-kbl:          ([FAIL][125], [FAIL][126], [FAIL][127]) ([i915#1814] / [i915#2505] / [i915#3002] / [i915#3363]) -> ([FAIL][128], [FAIL][129], [FAIL][130], [FAIL][131]) ([i915#180] / [i915#1814] / [i915#3002] / [i915#3363])
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-kbl6/igt@runner@aborted.html
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-kbl3/igt@runner@aborted.html
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-kbl7/igt@runner@aborted.html
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl1/igt@runner@aborted.html
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl1/igt@runner@aborted.html
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl2/igt@runner@aborted.html
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-kbl1/igt@runner@aborted.html
    - shard-apl:          ([FAIL][132], [FAIL][133], [FAIL][134], [FAIL][135], [FAIL][136]) ([fdo#109271] / [i915#180] / [i915#1814] / [i915#3002] / [i915#3363]) -> ([FAIL][137], [FAIL][138], [FAIL][139], [FAIL][140], [FAIL][141]) ([i915#180] / [i915#1814] / [i915#3002] / [i915#3363])
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-apl1/igt@runner@aborted.html
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-apl1/igt@runner@aborted.html
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-apl6/igt@runner@aborted.html
   [135]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-apl2/igt@runner@aborted.html
   [136]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-apl8/igt@runner@aborted.html
   [137]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl3/igt@runner@aborted.html
   [138]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl2/igt@runner@aborted.html
   [139]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl7/igt@runner@aborted.html
   [140]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl3/igt@runner@aborted.html
   [141]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-apl1/igt@runner@aborted.html
    - shard-skl:          ([FAIL][142], [FAIL][143]) ([i915#3002] / [i915#3363]) -> ([FAIL][144], [FAIL][145], [FAIL][146]) ([i915#1436] / [i915#3002] / [i915#3363])
   [142]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl7/igt@runner@aborted.html
   [143]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10490/shard-skl10/igt@runner@aborted.html
   [144]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl3/igt@runner@aborted.html
   [145]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl3/igt@runner@aborted.html
   [146]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/shard-skl7/igt@runner@aborted.html

  
  [IGT#6]: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/6
  [fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
  [fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
  [fdo#109279]: https://bugs.freedesktop.org/show_bug.cgi?id=109279

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20833/index.html


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 05/22] drm/i915/guc: Workaround reset G2H is received after schedule done G2H
  2021-08-17  9:32   ` Daniel Vetter
@ 2021-08-17 15:03     ` Matthew Brost
  0 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-17 15:03 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 11:32:56AM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:22AM -0700, Matthew Brost wrote:
> > If the context is reset as a result of request cancellation, the
> > context reset G2H is received after the schedule disable done G2H,
> > which is likely the wrong order. The schedule disable done G2H releases
> > the waiting request cancellation code, which resubmits the context.
> > This races with the context reset G2H, which also wants to resubmit the
> > context, but in this case it really should be a NOP as the request
> > cancellation code owns the resubmit. Use some careful checks of the
> > context state to seal this race until if / when the GuC firmware is
> > fixed.
> > 
> > v2:
> >  (Checkpatch)
> >   - Fix typos
> > 
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++++++---
> >  1 file changed, 37 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 3cd2da6f5c03..c3b7bf7319dd 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -826,17 +826,35 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >  static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >  {
> >  	struct i915_request *rq;
> > +	unsigned long flags;
> >  	u32 head;
> > +	bool skip = false;
> >  
> >  	intel_context_get(ce);
> >  
> >  	/*
> > -	 * GuC will implicitly mark the context as non-schedulable
> > -	 * when it sends the reset notification. Make sure our state
> > -	 * reflects this change. The context will be marked enabled
> > -	 * on resubmission.
> > +	 * GuC will implicitly mark the context as non-schedulable when it sends
> > +	 * the reset notification. Make sure our state reflects this change. The
> > +	 * context will be marked enabled on resubmission.
> > +	 *
> > +	 * XXX: If the context is reset as a result of the request cancellation
> > +	 * this G2H is received after the schedule disable complete G2H which is
> > +	 * likely wrong as this creates a race between the request cancellation
> > +	 * code re-submitting the context and this G2H handler. This likely
> > +	 * should be fixed in the GuC but until if / when that gets fixed we
> > +	 * need to workaround this. Convert this function to a NOP if a pending
> > +	 * enable is in flight as this indicates that a request cancellation has
> > +	 * occurred.
> >  	 */
> > -	clr_context_enabled(ce);
> > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +	if (likely(!context_pending_enable(ce))) {
> > +		clr_context_enabled(ce);
> > +	} else {
> > +		skip = true;
> > +	}
> > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	if (unlikely(skip))
> > +		goto out_put;
> >  
> >  	rq = intel_context_find_active_request(ce);
> >  	if (!rq) {
> > @@ -855,6 +873,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >  out_replay:
> >  	guc_reset_state(ce, head, stalled);
> >  	__unwind_incomplete_requests(ce);
> > +out_put:
> >  	intel_context_put(ce);
> >  }
> >  
> > @@ -1599,6 +1618,13 @@ static void guc_context_cancel_request(struct intel_context *ce,
> >  			guc_reset_state(ce, intel_ring_wrap(ce->ring, rq->head),
> >  					true);
> >  		}
> > +
> > +		/*
> > +		 * XXX: Racey if context is reset, see comment in
> > +		 * __guc_reset_context().
> > +		 */
> > +		flush_work(&ce_to_guc(ce)->ct.requests.worker);
> 
> This looks racy, and I think that holds in general for all the flush_work
> you're adding: this only flushes the processing of the work, it doesn't
> stop any re-queueing (as far as I can tell at least), which means it
> doesn't do a whole lot.
> 
> Worse, your work item re-queues itself because it only processes one item
> at a time. That means flush_work only flushes the first invocation and
> doesn't even drain them all. So even if you do prevent requeueing somehow,
> this isn't what you want. Two solutions.
> 
> - flush_work_sync, which flushes until self-requeues are all done too
> 
> - Or more preferred, make your worker a bit more standard for this
>   stuff: a) under the spinlock, take the entire list, not just the first
>   entry, with list_move or similar to a local list, b) process that local
>   list in a loop, c) don't requeue yourself.

This seems better; not sure why it currently doesn't do that, as I
didn't write that code.
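
A minimal sketch of that splice-under-lock worker pattern, for reference
(all struct, field, and function names here are hypothetical, not the
actual ct code):

	struct ct_state {
		spinlock_t lock;
		struct list_head pending;	/* guarded by lock */
		struct work_struct worker;
	};

	struct ct_request {
		struct list_head link;
		/* payload... */
	};

	static void ct_worker_func(struct work_struct *w)
	{
		struct ct_state *ct = container_of(w, struct ct_state, worker);
		struct ct_request *req, *next;
		LIST_HEAD(todo);
		unsigned long flags;

		/* a) grab the whole pending list in one go under the lock */
		spin_lock_irqsave(&ct->lock, flags);
		list_splice_init(&ct->pending, &todo);
		spin_unlock_irqrestore(&ct->lock, flags);

		/* b) process the local list in a loop; c) never requeue */
		list_for_each_entry_safe(req, next, &todo, link) {
			list_del(&req->link);
			process_one(ct, req);	/* hypothetical handler */
		}
	}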

Also BTW, confirmed with the GuC team that the order of the G2H is
incorrect and will get fixed in an upcoming release; once that happens,
most of this patch can get dropped.

Matt 

> 
> Cheers, Daniel
> > +
> >  		guc_context_unblock(ce);
> >  	}
> >  }
> > @@ -2719,7 +2745,12 @@ static void guc_handle_context_reset(struct intel_guc *guc,
> >  {
> >  	trace_intel_context_reset(ce);
> >  
> > -	if (likely(!intel_context_is_banned(ce))) {
> > +	/*
> > +	 * XXX: Racey if request cancellation has occurred, see comment in
> > +	 * __guc_reset_context().
> > +	 */
> > +	if (likely(!intel_context_is_banned(ce) &&
> > +		   !context_blocked(ce))) {
> >  		capture_error_state(guc, ce);
> >  		guc_context_replay(ce);
> >  	}
> > -- 
> > 2.32.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 06/22] drm/i915/execlists: Do not propagate errors to dependent fences
  2021-08-17  9:21   ` Daniel Vetter
@ 2021-08-17 15:08     ` Matthew Brost
  2021-08-17 15:49       ` Daniel Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-17 15:08 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 11:21:27AM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:23AM -0700, Matthew Brost wrote:
> > Propagating errors to dependent fences is wrong, don't do it. A
> > selftest in a following patch exposes this bug.
> 
> Please explain what "this bug" is, it's hard to read minds, especially at
> a distance in spacetime :-)
> 

Not a very good explanation.

> > Fixes: 8e9f84cf5cac ("drm/i915/gt: Propagate change in error status to children on unhold")
> 
> I think it would be better to outright revert this, instead of just
> disabling it like this.
>

I tried a revert and git did some really odd things that I couldn't
resolve, hence the new patch.
 
> Also please cite the dma_fence error propagation revert from Jason:
> 
> commit 93a2711cddd5760e2f0f901817d71c93183c3b87
> Author: Jason Ekstrand <jason@jlekstrand.net>
> Date:   Wed Jul 14 14:34:16 2021 -0500
> 
>     Revert "drm/i915: Propagate errors on awaiting already signaled fences"
> 
> Maybe in full, if you need the justification.
>

Will cite.

> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> 
> Unless "this bug" is some real world impact thing I wouldn't put cc:
> stable on this.

Got it.

Matt

> -Daniel
> > ---
> >  drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 4 ----
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > index de5f9c86b9a4..cafb0608ffb4 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > @@ -2140,10 +2140,6 @@ static void __execlists_unhold(struct i915_request *rq)
> >  			if (p->flags & I915_DEPENDENCY_WEAK)
> >  				continue;
> >  
> > -			/* Propagate any change in error status */
> > -			if (rq->fence.error)
> > -				i915_request_set_error_once(w, rq->fence.error);
> > -
> >  			if (w->engine != rq->engine)
> >  				continue;
> >  
> > -- 
> > 2.32.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 19/22] drm/i915/guc: Proper xarray usage for contexts_lookup
  2021-08-17 10:27   ` Daniel Vetter
@ 2021-08-17 15:26     ` Matthew Brost
  2021-08-17 17:13       ` Daniel Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-17 15:26 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 12:27:29PM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:36AM -0700, Matthew Brost wrote:
> > Lock the xarray and take a ref to the context if needed.
> > 
> > v2:
> >  (Checkpatch)
> >   - Add new line after declaration
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 84 ++++++++++++++++---
> >  1 file changed, 73 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index ba19b99173fc..2ecb2f002bed 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -599,8 +599,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >  	unsigned long index, flags;
> >  	bool pending_disable, pending_enable, deregister, destroyed, banned;
> >  
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> >  	xa_for_each(&guc->context_lookup, index, ce) {
> > -		spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +		/*
> > +		 * Corner case where the ref count on the object is zero but the
> > +		 * deregister G2H was lost. In this case we don't touch the ref
> > +		 * count and finish the destroy of the context.
> > +		 */
> > +		bool do_put = kref_get_unless_zero(&ce->ref);
> 
> This looks really scary, because in another loop below you have an
> unconditional refcount increase. This means sometimes guc->context_lookup

Yea, good catch, those loops need something like this too.

> xarray guarantees we hold a full reference on the context, sometimes we
> don't. So we're right back in "protect the code" O(N^2) review complexity
> instead of invariant rules about the datastructure, which is linear.
> 
> Essentially anytime you feel like you have to add a comment to explain
> what's going on about concurrent stuff you're racing with, you're
> protecting code, not data.
> 
> Since guc can't do a whole lot without the guc_id registered and all that,
> I kinda expected you'd always have a full reference here. If there's

The deregister is triggered by the ref count going to zero, and we can't
fully release the guc_id until that operation completes, hence why the
context is still in the xarray. I think the solution here is to use an
iterator, like you mention below, that ref counts this correctly.

> intermediate stages (e.g. around unregister) where this is currently not
> always the case, then those should make sure a full reference is held.
> 
> Another option would be to treat ->context_lookup as a weak reference that
> we lazily clean up when the context is finalized. That works too, but
> probably not with a spinlock (since you most likely have to wait for all
> pending guc transactions to complete), but it's another option.
> 
> Either way I think standard process is needed here for locking design,
> i.e.
> 1. come up with the right invariants ("we always have a full reference
> when a context is on the guc->context_lookup xarray")
> 2. come up with the locks. From the guc side the xa_lock is maybe good
> enough, but from the context side this doesn't protect against a
> re-registering racing against a deregistering. So probably needs more
> rules on top, and then you have a nice lock inversion in a few places like
> here.
> 3. document it and roll it out.
> 
> The other thing is that this is a very tricky iterator, and there are a
> few copies of it. That is, if this is the right solution. As-is this
> should be abstracted away into guc_context_iter_begin/next/end() helpers,
> e.g. like we have drm_connector_list_iter_begin/next/end as an example.
>

I can check this out.
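
As a strawman, such an iterator could look roughly like this (an entirely
hypothetical shape mirroring drm_connector_list_iter; the begin/end
plumbing is elided):

	struct guc_context_iter {
		struct intel_guc *guc;
		unsigned long index;
		bool started;
	};

	static struct intel_context *
	guc_context_iter_next(struct guc_context_iter *it)
	{
		struct xarray *xa = &it->guc->context_lookup;
		struct intel_context *ce;

		xa_lock_irq(xa);
		do {
			if (!it->started) {
				ce = xa_find(xa, &it->index, ULONG_MAX,
					     XA_PRESENT);
				it->started = true;
			} else {
				ce = xa_find_after(xa, &it->index, ULONG_MAX,
						   XA_PRESENT);
			}
			/* skip entries whose final put already ran */
		} while (ce && !kref_get_unless_zero(&ce->ref));
		xa_unlock_irq(xa);

		return ce; /* caller drops the ref with intel_context_put() */
	}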

Matt
 
> Cheers, Daniel
> 
> > +
> > +		xa_unlock(&guc->context_lookup);
> > +
> > +		spin_lock(&ce->guc_state.lock);
> >  
> >  		/*
> >  		 * Once we are at this point submission_disabled() is guaranteed
> > @@ -616,7 +626,9 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >  		banned = context_banned(ce);
> >  		init_sched_state(ce);
> >  
> > -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +		spin_unlock(&ce->guc_state.lock);
> > +
> > +		GEM_BUG_ON(!do_put && !destroyed);
> >  
> >  		if (pending_enable || destroyed || deregister) {
> >  			atomic_dec(&guc->outstanding_submission_g2h);
> > @@ -645,7 +657,12 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >  
> >  			intel_context_put(ce);
> >  		}
> > +
> > +		if (do_put)
> > +			intel_context_put(ce);
> > +		xa_lock(&guc->context_lookup);
> >  	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >  }
> >  
> >  static inline bool
> > @@ -866,16 +883,26 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> >  {
> >  	struct intel_context *ce;
> >  	unsigned long index;
> > +	unsigned long flags;
> >  
> >  	if (unlikely(!guc_submission_initialized(guc))) {
> >  		/* Reset called during driver load? GuC not yet initialised! */
> >  		return;
> >  	}
> >  
> > -	xa_for_each(&guc->context_lookup, index, ce)
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > +	xa_for_each(&guc->context_lookup, index, ce) {
> > +		intel_context_get(ce);
> > +		xa_unlock(&guc->context_lookup);
> > +
> >  		if (intel_context_is_pinned(ce))
> >  			__guc_reset_context(ce, stalled);
> >  
> > +		intel_context_put(ce);
> > +		xa_lock(&guc->context_lookup);
> > +	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > +
> >  	/* GuC is blown away, drop all references to contexts */
> >  	xa_destroy(&guc->context_lookup);
> >  }
> > @@ -950,11 +977,21 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
> >  {
> >  	struct intel_context *ce;
> >  	unsigned long index;
> > +	unsigned long flags;
> > +
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > +	xa_for_each(&guc->context_lookup, index, ce) {
> > +		intel_context_get(ce);
> > +		xa_unlock(&guc->context_lookup);
> >  
> > -	xa_for_each(&guc->context_lookup, index, ce)
> >  		if (intel_context_is_pinned(ce))
> >  			guc_cancel_context_requests(ce);
> >  
> > +		intel_context_put(ce);
> > +		xa_lock(&guc->context_lookup);
> > +	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > +
> >  	guc_cancel_sched_engine_requests(guc->sched_engine);
> >  
> >  	/* GuC is blown away, drop all references to contexts */
> > @@ -2848,21 +2885,26 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
> >  	struct intel_context *ce;
> >  	struct i915_request *rq;
> >  	unsigned long index;
> > +	unsigned long flags;
> >  
> >  	/* Reset called during driver load? GuC not yet initialised! */
> >  	if (unlikely(!guc_submission_initialized(guc)))
> >  		return;
> >  
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> >  	xa_for_each(&guc->context_lookup, index, ce) {
> > +		intel_context_get(ce);
> > +		xa_unlock(&guc->context_lookup);
> > +
> >  		if (!intel_context_is_pinned(ce))
> > -			continue;
> > +			goto next;
> >  
> >  		if (intel_engine_is_virtual(ce->engine)) {
> >  			if (!(ce->engine->mask & engine->mask))
> > -				continue;
> > +				goto next;
> >  		} else {
> >  			if (ce->engine != engine)
> > -				continue;
> > +				goto next;
> >  		}
> >  
> >  		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
> > @@ -2872,9 +2914,17 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
> >  			intel_engine_set_hung_context(engine, ce);
> >  
> >  			/* Can only cope with one hang at a time... */
> > -			return;
> > +			intel_context_put(ce);
> > +			xa_lock(&guc->context_lookup);
> > +			goto done;
> >  		}
> > +next:
> > +		intel_context_put(ce);
> > +		xa_lock(&guc->context_lookup);
> > +
> >  	}
> > +done:
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >  }
> >  
> >  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> > @@ -2890,23 +2940,32 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> >  	if (unlikely(!guc_submission_initialized(guc)))
> >  		return;
> >  
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> >  	xa_for_each(&guc->context_lookup, index, ce) {
> > +		intel_context_get(ce);
> > +		xa_unlock(&guc->context_lookup);
> > +
> >  		if (!intel_context_is_pinned(ce))
> > -			continue;
> > +			goto next;
> >  
> >  		if (intel_engine_is_virtual(ce->engine)) {
> >  			if (!(ce->engine->mask & engine->mask))
> > -				continue;
> > +				goto next;
> >  		} else {
> >  			if (ce->engine != engine)
> > -				continue;
> > +				goto next;
> >  		}
> >  
> >  		spin_lock_irqsave(&ce->guc_active.lock, flags);
> >  		intel_engine_dump_active_requests(&ce->guc_active.requests,
> >  						  hung_rq, m);
> >  		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
> > +
> > +next:
> > +		intel_context_put(ce);
> > +		xa_lock(&guc->context_lookup);
> >  	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >  }
> >  
> >  void intel_guc_submission_print_info(struct intel_guc *guc,
> > @@ -2960,7 +3019,9 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >  {
> >  	struct intel_context *ce;
> >  	unsigned long index;
> > +	unsigned long flags;
> >  
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> >  	xa_for_each(&guc->context_lookup, index, ce) {
> >  		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> >  		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > @@ -2979,6 +3040,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >  
> >  		guc_log_context_priority(p, ce);
> >  	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >  }
> >  
> >  static struct intel_context *
> > -- 
> > 2.32.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 18/22] drm/i915/guc: Rework and simplify locking
  2021-08-17 10:15   ` Daniel Vetter
@ 2021-08-17 15:30     ` Matthew Brost
  0 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-17 15:30 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 12:15:21PM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:35AM -0700, Matthew Brost wrote:
> > Rework and simplify the locking with GuC submission. Drop
> > sched_state_no_lock, move all fields under guc_state.sched_state,
> > and protect all these fields with guc_state.lock. This requires
> > changing the locking hierarchy from guc_state.lock -> sched_engine.lock
> > to sched_engine.lock -> guc_state.lock.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> Yeah this is definitely going in the right direction. Especially
> sprinkling lockdep_assert_held around.
> 
> One comment below.
> 
> > ---
> >  drivers/gpu/drm/i915/gt/intel_context_types.h |   5 +-
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 186 ++++++++----------
> >  drivers/gpu/drm/i915/i915_trace.h             |   6 +-
> >  3 files changed, 89 insertions(+), 108 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index c06171ee8792..d5d643b04d54 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -161,7 +161,7 @@ struct intel_context {
> >  		 * sched_state: scheduling state of this context using GuC
> >  		 * submission
> >  		 */
> > -		u16 sched_state;
> > +		u32 sched_state;
> >  		/*
> >  		 * fences: maintains of list of requests that have a submit
> >  		 * fence related to GuC submission
> > @@ -178,9 +178,6 @@ struct intel_context {
> >  		struct list_head requests;
> >  	} guc_active;
> >  
> > -	/* GuC scheduling state flags that do not require a lock. */
> > -	atomic_t guc_sched_state_no_lock;
> > -
> >  	/* GuC LRC descriptor ID */
> >  	u16 guc_id;
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 7aa16371908a..ba19b99173fc 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -72,86 +72,23 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> >  
> >  #define GUC_REQUEST_SIZE 64 /* bytes */
> >  
> > -/*
> > - * Below is a set of functions which control the GuC scheduling state which do
> > - * not require a lock as all state transitions are mutually exclusive. i.e. It
> > - * is not possible for the context pinning code and submission, for the same
> > - * context, to be executing simultaneously. We still need an atomic as it is
> > - * possible for some of the bits to changing at the same time though.
> > - */
> > -#define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
> > -#define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
> > -#define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
> > -static inline bool context_enabled(struct intel_context *ce)
> > -{
> > -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> > -		SCHED_STATE_NO_LOCK_ENABLED);
> > -}
> > -
> > -static inline void set_context_enabled(struct intel_context *ce)
> > -{
> > -	atomic_or(SCHED_STATE_NO_LOCK_ENABLED, &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline void clr_context_enabled(struct intel_context *ce)
> > -{
> > -	atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
> > -		   &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline bool context_pending_enable(struct intel_context *ce)
> > -{
> > -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> > -		SCHED_STATE_NO_LOCK_PENDING_ENABLE);
> > -}
> > -
> > -static inline void set_context_pending_enable(struct intel_context *ce)
> > -{
> > -	atomic_or(SCHED_STATE_NO_LOCK_PENDING_ENABLE,
> > -		  &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline void clr_context_pending_enable(struct intel_context *ce)
> > -{
> > -	atomic_and((u32)~SCHED_STATE_NO_LOCK_PENDING_ENABLE,
> > -		   &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline bool context_registered(struct intel_context *ce)
> > -{
> > -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> > -		SCHED_STATE_NO_LOCK_REGISTERED);
> > -}
> > -
> > -static inline void set_context_registered(struct intel_context *ce)
> > -{
> > -	atomic_or(SCHED_STATE_NO_LOCK_REGISTERED,
> > -		  &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline void clr_context_registered(struct intel_context *ce)
> > -{
> > -	atomic_and((u32)~SCHED_STATE_NO_LOCK_REGISTERED,
> > -		   &ce->guc_sched_state_no_lock);
> > -}
> > -
> >  /*
> >   * Below is a set of functions which control the GuC scheduling state which
> > - * require a lock, aside from the special case where the functions are called
> > - * from guc_lrc_desc_pin(). In that case it isn't possible for any other code
> > - * path to be executing on the context.
> > + * require a lock.
> >   */
> >  #define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
> >  #define SCHED_STATE_DESTROYED				BIT(1)
> >  #define SCHED_STATE_PENDING_DISABLE			BIT(2)
> >  #define SCHED_STATE_BANNED				BIT(3)
> > -#define SCHED_STATE_BLOCKED_SHIFT			4
> > +#define SCHED_STATE_ENABLED				BIT(4)
> > +#define SCHED_STATE_PENDING_ENABLE			BIT(5)
> > +#define SCHED_STATE_REGISTERED				BIT(6)
> > +#define SCHED_STATE_BLOCKED_SHIFT			7
> >  #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
> >  #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
> >  static inline void init_sched_state(struct intel_context *ce)
> >  {
> >  	lockdep_assert_held(&ce->guc_state.lock);
> > -	atomic_set(&ce->guc_sched_state_no_lock, 0);
> >  	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
> >  }
> >  
> > @@ -161,9 +98,8 @@ static inline bool sched_state_is_init(struct intel_context *ce)
> >  	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
> >  	 * suspend.
> >  	 */
> > -	return !(atomic_read(&ce->guc_sched_state_no_lock) &
> > -		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
> > -		!(ce->guc_state.sched_state &= ~SCHED_STATE_BLOCKED_MASK);
> > +	return !(ce->guc_state.sched_state &=
> > +		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
> >  }
> >  
> >  static inline bool
> > @@ -236,6 +172,57 @@ static inline void clr_context_banned(struct intel_context *ce)
> >  	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
> >  }
> >  
> > +static inline bool context_enabled(struct intel_context *ce)
> 
> No static inline in .c files. The compiler is better at this than you
> are. Especially once you add stuff like asserts and everything, it's just
> not worth the cognitive effort to have to reevaluate these.
> 

Will clean up this whole file in one patch at the end, dropping all the
static inlines from this file.

Matt

> One-line helpers in headers are the only exception where static inline is
> ok.
> -Daniel
> 
> > +{
> > +	return ce->guc_state.sched_state & SCHED_STATE_ENABLED;
> > +}
> > +
> > +static inline void set_context_enabled(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state |= SCHED_STATE_ENABLED;
> > +}
> > +
> > +static inline void clr_context_enabled(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
> > +}
> > +
> > +static inline bool context_pending_enable(struct intel_context *ce)
> > +{
> > +	return ce->guc_state.sched_state & SCHED_STATE_PENDING_ENABLE;
> > +}
> > +
> > +static inline void set_context_pending_enable(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state |= SCHED_STATE_PENDING_ENABLE;
> > +}
> > +
> > +static inline void clr_context_pending_enable(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_ENABLE;
> > +}
> > +
> > +static inline bool context_registered(struct intel_context *ce)
> > +{
> > +	return ce->guc_state.sched_state & SCHED_STATE_REGISTERED;
> > +}
> > +
> > +static inline void set_context_registered(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state |= SCHED_STATE_REGISTERED;
> > +}
> > +
> > +static inline void clr_context_registered(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state &= ~SCHED_STATE_REGISTERED;
> > +}
> > +
> >  static inline u32 context_blocked(struct intel_context *ce)
> >  {
> >  	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
> > @@ -244,7 +231,6 @@ static inline u32 context_blocked(struct intel_context *ce)
> >  
> >  static inline void incr_context_blocked(struct intel_context *ce)
> >  {
> > -	lockdep_assert_held(&ce->engine->sched_engine->lock);
> >  	lockdep_assert_held(&ce->guc_state.lock);
> >  
> >  	ce->guc_state.sched_state += SCHED_STATE_BLOCKED;
> > @@ -254,7 +240,6 @@ static inline void incr_context_blocked(struct intel_context *ce)
> >  
> >  static inline void decr_context_blocked(struct intel_context *ce)
> >  {
> > -	lockdep_assert_held(&ce->engine->sched_engine->lock);
> >  	lockdep_assert_held(&ce->guc_state.lock);
> >  
> >  	GEM_BUG_ON(!context_blocked(ce));	/* Underflow check */
> > @@ -443,6 +428,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >  	u32 g2h_len_dw = 0;
> >  	bool enabled;
> >  
> > +	lockdep_assert_held(&rq->engine->sched_engine->lock);
> > +
> >  	/*
> >  	 * Corner case where requests were sitting in the priority list or a
> >  	 * request resubmitted after the context was banned.
> > @@ -450,7 +437,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >  	if (unlikely(intel_context_is_banned(ce))) {
> >  		i915_request_put(i915_request_mark_eio(rq));
> >  		intel_engine_signal_breadcrumbs(ce->engine);
> > -		goto out;
> > +		return 0;
> >  	}
> >  
> >  	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
> > @@ -463,9 +450,11 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >  	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
> >  		err = guc_lrc_desc_pin(ce, false);
> >  		if (unlikely(err))
> > -			goto out;
> > +			return err;
> >  	}
> >  
> > +	spin_lock(&ce->guc_state.lock);
> > +
> >  	/*
> >  	 * The request / context will be run on the hardware when scheduling
> >  	 * gets enabled in the unblock.
> > @@ -500,6 +489,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >  		trace_i915_request_guc_submit(rq);
> >  
> >  out:
> > +	spin_unlock(&ce->guc_state.lock);
> >  	return err;
> >  }
> >  
> > @@ -720,8 +710,6 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >  	spin_lock_irq(&guc_to_gt(guc)->irq_lock);
> >  	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
> >  
> > -	guc_flush_submissions(guc);
> > -
> >  	flush_work(&guc->ct.requests.worker);
> >  
> >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > @@ -1125,7 +1113,11 @@ static int steal_guc_id(struct intel_guc *guc)
> >  
> >  		list_del_init(&ce->guc_id_link);
> >  		guc_id = ce->guc_id;
> > +
> > +		spin_lock(&ce->guc_state.lock);
> >  		clr_context_registered(ce);
> > +		spin_unlock(&ce->guc_state.lock);
> > +
> >  		set_context_guc_id_invalid(ce);
> >  		return guc_id;
> >  	} else {
> > @@ -1161,6 +1153,8 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >  try_again:
> >  	spin_lock_irqsave(&guc->contexts_lock, flags);
> >  
> > +	might_lock(&ce->guc_state.lock);
> > +
> >  	if (context_guc_id_invalid(ce)) {
> >  		ret = assign_guc_id(guc, &ce->guc_id);
> >  		if (ret)
> > @@ -1240,8 +1234,13 @@ static int register_context(struct intel_context *ce, bool loop)
> >  	trace_intel_context_register(ce);
> >  
> >  	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
> > -	if (likely(!ret))
> > +	if (likely(!ret)) {
> > +		unsigned long flags;
> > +
> > +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  		set_context_registered(ce);
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	}
> >  
> >  	return ret;
> >  }
> > @@ -1517,7 +1516,6 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
> >  static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> >  {
> >  	struct intel_guc *guc = ce_to_guc(ce);
> > -	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
> >  	unsigned long flags;
> >  	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
> >  	intel_wakeref_t wakeref;
> > @@ -1526,13 +1524,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> >  
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  
> > -	/*
> > -	 * Sync with submission path, increment before below changes to context
> > -	 * state.
> > -	 */
> > -	spin_lock(&sched_engine->lock);
> >  	incr_context_blocked(ce);
> > -	spin_unlock(&sched_engine->lock);
> >  
> >  	enabled = context_enabled(ce);
> >  	if (unlikely(!enabled || submission_disabled(guc))) {
> > @@ -1561,7 +1553,6 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> >  static void guc_context_unblock(struct intel_context *ce)
> >  {
> >  	struct intel_guc *guc = ce_to_guc(ce);
> > -	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
> >  	unsigned long flags;
> >  	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
> >  	intel_wakeref_t wakeref;
> > @@ -1586,13 +1577,7 @@ static void guc_context_unblock(struct intel_context *ce)
> >  		intel_context_get(ce);
> >  	}
> >  
> > -	/*
> > -	 * Sync with submission path, decrement after above changes to context
> > -	 * state.
> > -	 */
> > -	spin_lock(&sched_engine->lock);
> >  	decr_context_blocked(ce);
> > -	spin_unlock(&sched_engine->lock);
> >  
> >  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >  
> > @@ -1702,7 +1687,9 @@ static void guc_context_sched_disable(struct intel_context *ce)
> >  
> >  	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
> >  	    !lrc_desc_registered(guc, ce->guc_id)) {
> > +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  		clr_context_enabled(ce);
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >  		goto unpin;
> >  	}
> >  
> > @@ -1752,7 +1739,6 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> >  	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
> >  	GEM_BUG_ON(context_enabled(ce));
> >  
> > -	clr_context_registered(ce);
> >  	deregister_context(ce, ce->guc_id, true);
> >  }
> >  
> > @@ -1825,8 +1811,10 @@ static void guc_context_destroy(struct kref *kref)
> >  	/* Seal race with Reset */
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  	disabled = submission_disabled(guc);
> > -	if (likely(!disabled))
> > +	if (likely(!disabled)) {
> >  		set_context_destroyed(ce);
> > +		clr_context_registered(ce);
> > +	}
> >  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >  	if (unlikely(disabled)) {
> >  		release_guc_id(guc, ce);
> > @@ -2695,8 +2683,7 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
> >  		     (!context_pending_enable(ce) &&
> >  		     !context_pending_disable(ce)))) {
> >  		drm_err(&guc_to_gt(guc)->i915->drm,
> > -			"Bad context sched_state 0x%x, 0x%x, desc_idx %u",
> > -			atomic_read(&ce->guc_sched_state_no_lock),
> > +			"Bad context sched_state 0x%x, desc_idx %u",
> >  			ce->guc_state.sched_state, desc_idx);
> >  		return -EPROTO;
> >  	}
> > @@ -2711,7 +2698,9 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
> >  		}
> >  #endif
> >  
> > +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  		clr_context_pending_enable(ce);
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >  	} else if (context_pending_disable(ce)) {
> >  		bool banned;
> >  
> > @@ -2985,9 +2974,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >  			   atomic_read(&ce->pin_count));
> >  		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> >  			   atomic_read(&ce->guc_id_ref));
> > -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > -			   ce->guc_state.sched_state,
> > -			   atomic_read(&ce->guc_sched_state_no_lock));
> > +		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> > +			   ce->guc_state.sched_state);
> >  
> >  		guc_log_context_priority(p, ce);
> >  	}
> > diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> > index 806ad688274b..0a77eb2944b5 100644
> > --- a/drivers/gpu/drm/i915/i915_trace.h
> > +++ b/drivers/gpu/drm/i915/i915_trace.h
> > @@ -903,7 +903,6 @@ DECLARE_EVENT_CLASS(intel_context,
> >  			     __field(u32, guc_id)
> >  			     __field(int, pin_count)
> >  			     __field(u32, sched_state)
> > -			     __field(u32, guc_sched_state_no_lock)
> >  			     __field(u8, guc_prio)
> >  			     ),
> >  
> > @@ -911,15 +910,12 @@ DECLARE_EVENT_CLASS(intel_context,
> >  			   __entry->guc_id = ce->guc_id;
> >  			   __entry->pin_count = atomic_read(&ce->pin_count);
> >  			   __entry->sched_state = ce->guc_state.sched_state;
> > -			   __entry->guc_sched_state_no_lock =
> > -			   atomic_read(&ce->guc_sched_state_no_lock);
> >  			   __entry->guc_prio = ce->guc_prio;
> >  			   ),
> >  
> > -		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x,0x%x, guc_prio=%u",
> > +		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
> >  			      __entry->guc_id, __entry->pin_count,
> >  			      __entry->sched_state,
> > -			      __entry->guc_sched_state_no_lock,
> >  			      __entry->guc_prio)
> >  );
> >  
> > -- 
> > 2.32.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 06/22] drm/i915/execlists: Do not propagate errors to dependent fences
  2021-08-17 15:08     ` Matthew Brost
@ 2021-08-17 15:49       ` Daniel Vetter
  0 siblings, 0 replies; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 15:49 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 17, 2021 at 5:13 PM Matthew Brost <matthew.brost@intel.com> wrote:
> On Tue, Aug 17, 2021 at 11:21:27AM +0200, Daniel Vetter wrote:
> > On Mon, Aug 16, 2021 at 06:51:23AM -0700, Matthew Brost wrote:
> > > Propagating errors to dependent fences is wrong, don't do it. A
> > > selftest in a following patch exposes this bug.
> >
> > Please explain what "this bug" is, it's hard to read minds, especially at
> > a distance in spacetime :-)
> >
>
> Not a very good explanation.
>
> > > Fixes: 8e9f84cf5cac ("drm/i915/gt: Propagate change in error status to children on unhold")
> >
> > I think it would be better to outright revert this, instead of just
> > disabling it like this.
> >
>
> I tried a revert and git did some really odd things that I couldn't
> resolve, hence the new patch.

If there's any conflict, git just gives you your current code and what
was there with the revert applied, with conflict markers. Then it's
your job to manually apply that change.

Occasionally (when there's been a ridiculous amount of code movement)
it gets completely lost and puts these into very non-intuitive places.
In that case just delete the markers, keep the current code, and check
what change you're missing that still needs to be manually reverted.
Also sometimes there's a follow-up patch that you should revert first,
which makes the revert clean. In that case it's generally the right
thing to revert the follow-up first, and then apply your revert. Often
there are subtle functional dependencies hiding.
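
Roughly, the flow for a conflicting revert is just:

  $ git revert 8e9f84cf5cac
  $ # resolve the conflict markers by hand, then:
  $ git add drivers/gpu/drm/i915/gt/intel_execlists_submission.c
  $ git revert --continue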
-Daniel

>
> > Also please cite the dma_fence error propagation revert from Jason:
> >
> > commit 93a2711cddd5760e2f0f901817d71c93183c3b87
> > Author: Jason Ekstrand <jason@jlekstrand.net>
> > Date:   Wed Jul 14 14:34:16 2021 -0500
> >
> >     Revert "drm/i915: Propagate errors on awaiting already signaled fences"
> >
> > Maybe in full, if you need the justification.
> >
>
> Will site.
>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Cc: <stable@vger.kernel.org>
> >
> > Unless "this bug" is some real world impact thing I wouldn't put cc:
> > stable on this.
>
> Got it.
>
> Matt
>
> > -Daniel
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 4 ----
> > >  1 file changed, 4 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > index de5f9c86b9a4..cafb0608ffb4 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > @@ -2140,10 +2140,6 @@ static void __execlists_unhold(struct i915_request *rq)
> > >                     if (p->flags & I915_DEPENDENCY_WEAK)
> > >                             continue;
> > >
> > > -                   /* Propagate any change in error status */
> > > -                   if (rq->fence.error)
> > > -                           i915_request_set_error_once(w, rq->fence.error);
> > > -
> > >                     if (w->engine != rq->engine)
> > >                             continue;
> > >
> > > --
> > > 2.32.0
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 14/22] drm/i915: Allocate error capture in atomic context
  2021-08-17 10:06   ` Daniel Vetter
@ 2021-08-17 16:12     ` Matthew Brost
  0 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-17 16:12 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 12:06:16PM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:31AM -0700, Matthew Brost wrote:
> > Error captures can now be done in a work queue processing G2H messages.
> > These messages need to be fully processed in the reset path, to avoid
> > races in the missing G2H cleanup, which creates a dependency on memory
> > allocations and dma fences (i915_requests). Requests depend on resets,
> > so now we have a circular dependency. To work around this, allocate the
> > error capture in an atomic context.
> > 
> > Fixes: dc0dad365c5e ("Fix for error capture after full GPU reset with GuC")
> > Fixes: 573ba126aef3 ("Capture error state on context reset")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gpu_error.c | 37 +++++++++++++--------------
> >  1 file changed, 18 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > index 0f08bcfbe964..453376aa6d9f 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > @@ -49,7 +49,6 @@
> >  #include "i915_memcpy.h"
> >  #include "i915_scatterlist.h"
> >  
> > -#define ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
> >  #define ATOMIC_MAYFAIL (GFP_ATOMIC | __GFP_NOWARN)
> 
> This one doesn't make much sense. GFP_ATOMIC essentially means we're
> high-priority and failure would be a pretty bad day. Meanwhile
> __GFP_NOWARN means we can totally cope with failure, pls don't holler.
> 
> GFP_NOWAIT | __GFP_NOWARN would be the more consistent one here I think.
> 
> See gfp.h for all the docs on this.
> 
> Separate patch ofc. This one is definitely the right direction, since
> GFP_KERNEL from the reset worker is not a good idea.

Lockdep is happy with GFP_NOWAIT so this works for me.
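
I.e. presumably just:

  #define ATOMIC_MAYFAIL (GFP_NOWAIT | __GFP_NOWARN)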

Matt

> -Daniel
> 
> >  
> >  static void __sg_set_buf(struct scatterlist *sg,
> > @@ -79,7 +78,7 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
> >  	if (e->cur == e->end) {
> >  		struct scatterlist *sgl;
> >  
> > -		sgl = (typeof(sgl))__get_free_page(ALLOW_FAIL);
> > +		sgl = (typeof(sgl))__get_free_page(ATOMIC_MAYFAIL);
> >  		if (!sgl) {
> >  			e->err = -ENOMEM;
> >  			return false;
> > @@ -99,10 +98,10 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
> >  	}
> >  
> >  	e->size = ALIGN(len + 1, SZ_64K);
> > -	e->buf = kmalloc(e->size, ALLOW_FAIL);
> > +	e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
> >  	if (!e->buf) {
> >  		e->size = PAGE_ALIGN(len + 1);
> > -		e->buf = kmalloc(e->size, GFP_KERNEL);
> > +		e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
> >  	}
> >  	if (!e->buf) {
> >  		e->err = -ENOMEM;
> > @@ -243,12 +242,12 @@ static bool compress_init(struct i915_vma_compress *c)
> >  {
> >  	struct z_stream_s *zstream = &c->zstream;
> >  
> > -	if (pool_init(&c->pool, ALLOW_FAIL))
> > +	if (pool_init(&c->pool, ATOMIC_MAYFAIL))
> >  		return false;
> >  
> >  	zstream->workspace =
> >  		kmalloc(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL),
> > -			ALLOW_FAIL);
> > +			ATOMIC_MAYFAIL);
> >  	if (!zstream->workspace) {
> >  		pool_fini(&c->pool);
> >  		return false;
> > @@ -256,7 +255,7 @@ static bool compress_init(struct i915_vma_compress *c)
> >  
> >  	c->tmp = NULL;
> >  	if (i915_has_memcpy_from_wc())
> > -		c->tmp = pool_alloc(&c->pool, ALLOW_FAIL);
> > +		c->tmp = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
> >  
> >  	return true;
> >  }
> > @@ -280,7 +279,7 @@ static void *compress_next_page(struct i915_vma_compress *c,
> >  	if (dst->page_count >= dst->num_pages)
> >  		return ERR_PTR(-ENOSPC);
> >  
> > -	page = pool_alloc(&c->pool, ALLOW_FAIL);
> > +	page = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
> >  	if (!page)
> >  		return ERR_PTR(-ENOMEM);
> >  
> > @@ -376,7 +375,7 @@ struct i915_vma_compress {
> >  
> >  static bool compress_init(struct i915_vma_compress *c)
> >  {
> > -	return pool_init(&c->pool, ALLOW_FAIL) == 0;
> > +	return pool_init(&c->pool, ATOMIC_MAYFAIL) == 0;
> >  }
> >  
> >  static bool compress_start(struct i915_vma_compress *c)
> > @@ -391,7 +390,7 @@ static int compress_page(struct i915_vma_compress *c,
> >  {
> >  	void *ptr;
> >  
> > -	ptr = pool_alloc(&c->pool, ALLOW_FAIL);
> > +	ptr = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
> >  	if (!ptr)
> >  		return -ENOMEM;
> >  
> > @@ -997,7 +996,7 @@ i915_vma_coredump_create(const struct intel_gt *gt,
> >  
> >  	num_pages = min_t(u64, vma->size, vma->obj->base.size) >> PAGE_SHIFT;
> >  	num_pages = DIV_ROUND_UP(10 * num_pages, 8); /* worstcase zlib growth */
> > -	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ALLOW_FAIL);
> > +	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ATOMIC_MAYFAIL);
> >  	if (!dst)
> >  		return NULL;
> >  
> > @@ -1433,7 +1432,7 @@ capture_engine(struct intel_engine_cs *engine,
> >  	struct i915_request *rq = NULL;
> >  	unsigned long flags;
> >  
> > -	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
> > +	ee = intel_engine_coredump_alloc(engine, ATOMIC_MAYFAIL);
> >  	if (!ee)
> >  		return NULL;
> >  
> > @@ -1481,7 +1480,7 @@ gt_record_engines(struct intel_gt_coredump *gt,
> >  		struct intel_engine_coredump *ee;
> >  
> >  		/* Refill our page pool before entering atomic section */
> > -		pool_refill(&compress->pool, ALLOW_FAIL);
> > +		pool_refill(&compress->pool, ATOMIC_MAYFAIL);
> >  
> >  		ee = capture_engine(engine, compress);
> >  		if (!ee)
> > @@ -1507,7 +1506,7 @@ gt_record_uc(struct intel_gt_coredump *gt,
> >  	const struct intel_uc *uc = &gt->_gt->uc;
> >  	struct intel_uc_coredump *error_uc;
> >  
> > -	error_uc = kzalloc(sizeof(*error_uc), ALLOW_FAIL);
> > +	error_uc = kzalloc(sizeof(*error_uc), ATOMIC_MAYFAIL);
> >  	if (!error_uc)
> >  		return NULL;
> >  
> > @@ -1518,8 +1517,8 @@ gt_record_uc(struct intel_gt_coredump *gt,
> >  	 * As modparams are generally accesible from the userspace make
> >  	 * explicit copies of the firmware paths.
> >  	 */
> > -	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ALLOW_FAIL);
> > -	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ALLOW_FAIL);
> > +	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ATOMIC_MAYFAIL);
> > +	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ATOMIC_MAYFAIL);
> >  	error_uc->guc_log =
> >  		i915_vma_coredump_create(gt->_gt,
> >  					 uc->guc.log.vma, "GuC log buffer",
> > @@ -1778,7 +1777,7 @@ i915_vma_capture_prepare(struct intel_gt_coredump *gt)
> >  {
> >  	struct i915_vma_compress *compress;
> >  
> > -	compress = kmalloc(sizeof(*compress), ALLOW_FAIL);
> > +	compress = kmalloc(sizeof(*compress), ATOMIC_MAYFAIL);
> >  	if (!compress)
> >  		return NULL;
> >  
> > @@ -1811,11 +1810,11 @@ i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
> >  	if (IS_ERR(error))
> >  		return error;
> >  
> > -	error = i915_gpu_coredump_alloc(i915, ALLOW_FAIL);
> > +	error = i915_gpu_coredump_alloc(i915, ATOMIC_MAYFAIL);
> >  	if (!error)
> >  		return ERR_PTR(-ENOMEM);
> >  
> > -	error->gt = intel_gt_coredump_alloc(gt, ALLOW_FAIL);
> > +	error->gt = intel_gt_coredump_alloc(gt, ATOMIC_MAYFAIL);
> >  	if (error->gt) {
> >  		struct i915_vma_compress *compress;
> >  
> > -- 
> > 2.32.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc
  2021-08-17 11:11   ` Daniel Vetter
@ 2021-08-17 16:36     ` Matthew Brost
  2021-08-17 17:20       ` Daniel Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: Matthew Brost @ 2021-08-17 16:36 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 01:11:41PM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:39AM -0700, Matthew Brost wrote:
> > Add GuC kernel doc for all structures added thus far for GuC submission
> > and update the main GuC submission section with the new interface
> > details.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> There's quite a bit more, e.g. intel_guc_ct, which has its own world of
> locking design that also doesn't feel too consistent.
>

That is a different layer than GuC submission, so I don't think we should
mention anything about that layer here. I didn't really write that layer
and it's super painful to touch that code, so I'm going to stay out of
any rework you think we need to do there.
 
> > ---
> >  drivers/gpu/drm/i915/gt/intel_context_types.h |  42 +++++---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  19 +++-
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 101 ++++++++++++++----
> >  drivers/gpu/drm/i915/i915_request.h           |  18 ++--
> >  4 files changed, 131 insertions(+), 49 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index f6989e6807f7..75d609a1bc33 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -156,44 +156,56 @@ struct intel_context {
> >  	u8 wa_bb_page; /* if set, page num reserved for context workarounds */
> >  
> >  	struct {
> > -		/** lock: protects everything in guc_state */
> > +		/** @lock: protects everything in guc_state */
> >  		spinlock_t lock;
> >  		/**
> > -		 * sched_state: scheduling state of this context using GuC
> > +		 * @sched_state: scheduling state of this context using GuC
> >  		 * submission
> >  		 */
> >  		u32 sched_state;
> >  		/*
> > -		 * fences: maintains of list of requests that have a submit
> > -		 * fence related to GuC submission
> > +		 * @fences: maintains a list of requests that are currently
> > +		 * being fenced until a GuC operation completes
> >  		 */
> >  		struct list_head fences;
> > -		/* GuC context blocked fence */
> > +		/**
> > +		 * @blocked_fence: fence used to signal when the blocking of a
> > +		 * context's submissions is complete.
> > +		 */
> >  		struct i915_sw_fence blocked_fence;
> > -		/* GuC committed requests */
> > +		/** @number_committed_requests: number of committed requests */
> >  		int number_committed_requests;
> >  	} guc_state;
> >  
> >  	struct {
> > -		/** lock: protects everything in guc_active */
> > +		/** @lock: protects everything in guc_active */
> >  		spinlock_t lock;
> 
> Why do we have two locks spinlocks to protect guc context state?
> 
> I do understand the need for a spinlock (at least for now) because of how
> i915-scheduler runs in tasklet context. But beyond that we really
> shouldn't need more than two locks to protect context state. You still
> have an entire pile here, plus some atomics, plus more.
>

Yea, I actually thought about this after I sent this out: guc_active &
guc_state should be combined into a single lock. Originally I had two
different locks because of an old hierarchy that is no longer needed. Can
fix.
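
Roughly, the combined layout would look like the below (just a sketch
merging the two structs from this patch; final field names may differ):

	struct {
		/** @lock: protects everything in guc_state */
		spinlock_t lock;
		/** @sched_state: scheduling state of this context */
		u32 sched_state;
		/** @fences: requests stalled on an in-flight GuC operation */
		struct list_head fences;
		/** @blocked_fence: signaled when a submission block completes */
		struct i915_sw_fence blocked_fence;
		/** @number_committed_requests: number of committed requests */
		int number_committed_requests;
		/** @requests: list of active requests on this context */
		struct list_head requests;
		/** @prio: the context's current guc priority */
		u8 prio;
		/** @prio_count: requests inflight per priority bucket */
		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
	} guc_state;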
 
> And this is on a single context, where concurrently submitting stuff
> really isn't a thing. I'd expect actual benchmarking would show a perf
> hit, since all these locks and atomics aren't free. This is at least the
> case with execbuf and the various i915_vma locks we currently have.
> 
> What I expect intel_context locking to be is roughly:
> 
> - One lock to protect all intel_context state. This probably should be a
>   dma_resv_lock for a few reasons, least so we can pin state objects
>   underneath that lock.
> 
> - A separate lock if there's anything you need to coordinate with the
>   backend scheduler while that's running, to avoid dma_fence inversions.
>   Right now this separate lock might need to be a spinlock because our
>   scheduler runs in tasklets, and that might mean we need both a mutex and
>   a spinlock here.
>
> Anything that goes beyond that is premature optimization and kills us
> code-complexity-wise. I'd be _extremely_ surprised if an IA core cannot keep up
> with GuC, and therefore anything that goes beyond "one lock per object",
> plus/minus execution context issues like the above tasklet issue, is
> likely just going to slow everything down.

If I combine the above spin locks, isn't that basically what you
describe: one lock for the context state as it relates to GuC submission?

Also, I'm thinking that when we move to the DRM scheduler we can likely
get rid of all the atomic contexts in the GuC submission backend.

> 
> > -		/** requests: active requests on this context */
> > +		/** @requests: list of active requests on this context */
> >  		struct list_head requests;
> > -		/*
> > -		 * GuC priority management
> > -		 */
> > +		/** @guc_prio: the context's current guc priority */
> >  		u8 guc_prio;
> > +		/**
> > +		 * @guc_prio_count: a counter of the number of requests
> > +		 * inflight in each priority bucket
> > +		 */
> >  		u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> >  	} guc_active;
> >  
> > -	/* GuC LRC descriptor ID */
> > +	/**
> > +	 * @guc_id: unique handle which is used to communicate information with
> > +	 * the GuC about this context, protected by guc->contexts_lock
> > +	 */
> >  	u16 guc_id;
> >  
> > -	/* GuC LRC descriptor reference count */
> > +	/**
> > +	 * @guc_id_ref: the number of references to the guc_id; transitions
> > +	 * in and out of zero are protected by guc->contexts_lock
> > +	 */
> >  	atomic_t guc_id_ref;
> 
> All this guc_id related stuff (including the guc->context_lookup xarray I
> guess) also has quite a pile of atomics and locks.
>
> >  
> > -	/*
> > -	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
> > +	/**
> > +	 * @guc_id_link: in guc->guc_id_list when the guc_id has no refs but is
> > +	 * still valid, protected by guc->contexts_lock
> >  	 */
> >  	struct list_head guc_id_link;
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 2e27fe59786b..c0b3fdb601f0 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -41,6 +41,10 @@ struct intel_guc {
> >  	spinlock_t irq_lock;
> >  	unsigned int msg_enabled_mask;
> >  
> > +	/**
> > +	 * @outstanding_submission_g2h: number of outstanding G2H related to GuC
> > +	 * submission, used to determine if the GT is idle
> > +	 */
> >  	atomic_t outstanding_submission_g2h;
> 
> atomic_t is good for statistics, but not for code flow control. If you use
> it for that you either need a lot of barriers and comments, which means
> there needs to be some real perf numbers showing that this is required in
> a workload we care about.

This is kind of a stat too, it is connected to debugfs and is typically
non-zero only if something has gone horribly wrong (e.g. you lose a G2H).
I'm confused by the flow control comment, this is basically just saying
the GT isn't idle if the GuC is processing messages and we expect a G2H
response. A counter here makes sense to me and I don't see why we'd need
barriers for this.
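
The usage pattern is just increment-on-send / decrement-on-receipt plus a
read for the idle check, i.e. roughly (a sketch, with a hypothetical
helper name):

	/* before emitting a H2G that expects a G2H reply */
	atomic_inc(&guc->outstanding_submission_g2h);

	/* in the G2H handler, once the reply has been processed */
	atomic_dec(&guc->outstanding_submission_g2h);

	/* GT idle / debugfs check */
	static bool guc_submission_busy(struct intel_guc *guc)
	{
		return atomic_read(&guc->outstanding_submission_g2h) != 0;
	}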

> 
> Or you stuff this into a related lock. E.g. from high-level view stuff
> this into intel_guc_ct (which also has definitely way more locks than it
> needs) could make sense?
> 

I had it in that layer at one point but got pushback, thus it lives
here now. The way it is used now, it probably makes sense to keep it
here.

> >  
> >  	struct {
> > @@ -49,12 +53,16 @@ struct intel_guc {
> >  		void (*disable)(struct intel_guc *guc);
> >  	} interrupts;
> >  
> > -	/*
> > -	 * contexts_lock protects the pool of free guc ids and a linked list of
> > -	 * guc ids available to be stolen
> > +	/**
> > +	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id, and
> > +	 * ce->guc_id_ref when transitioning in and out of zero
> >  	 */
> >  	spinlock_t contexts_lock;
> > +	/** @guc_ids: used to allocate new guc_ids */
> >  	struct ida guc_ids;
> > +	/**
> > +	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> > +	 */
> >  	struct list_head guc_id_list;
> >  
> >  	bool submission_supported;
> > @@ -70,7 +78,10 @@ struct intel_guc {
> >  	struct i915_vma *lrc_desc_pool;
> >  	void *lrc_desc_pool_vaddr;
> >  
> > -	/* guc_id to intel_context lookup */
> > +	/**
> > +	 * @context_lookup: used to look up an intel_context from a guc_id;
> > +	 * if a context is present in this structure it is registered with
> > +	 * the GuC
> > +	 */
> >  	struct xarray context_lookup;
> >  
> >  	/* Control params for fw initialization */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index eb06a4c7534e..18ef363c6e5d 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -28,21 +28,6 @@
> >  /**
> >   * DOC: GuC-based command submission
> >   *
> > - * IMPORTANT NOTE: GuC submission is currently not supported in i915. The GuC
> > - * firmware is moving to an updated submission interface and we plan to
> > - * turn submission back on when that lands. The below documentation (and related
> > - * code) matches the old submission model and will be updated as part of the
> > - * upgrade to the new flow.
> > - *
> > - * GuC stage descriptor:
> > - * During initialization, the driver allocates a static pool of 1024 such
> > - * descriptors, and shares them with the GuC. Currently, we only use one
> > - * descriptor. This stage descriptor lets the GuC know about the workqueue and
> > - * process descriptor. Theoretically, it also lets the GuC know about our HW
> > - * contexts (context ID, etc...), but we actually employ a kind of submission
> > - * where the GuC uses the LRCA sent via the work item instead. This is called
> > - * a "proxy" submission.
> > - *
> >   * The Scratch registers:
> >   * There are 16 MMIO-based registers start from 0xC180. The kernel driver writes
> >   * a value to the action register (SOFT_SCRATCH_0) along with any data. It then
> > @@ -51,14 +36,86 @@
> >   * processes the request. The kernel driver polls waiting for this update and
> >   * then proceeds.
> >   *
> > - * Work Items:
> > - * There are several types of work items that the host may place into a
> > - * workqueue, each with its own requirements and limitations. Currently only
> > - * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
> > - * represents in-order queue. The kernel driver packs ring tail pointer and an
> > - * ELSP context descriptor dword into Work Item.
> > - * See guc_add_request()
> > + * Command Transport buffers (CTBs):
> > + * Covered in detail in other sections, but CTBs (host-to-guc = H2G,
> > + * guc-to-host = G2H) are a message interface between the i915 and the GuC
> > + * used to control submissions.
> > + *
> > + * Context registration:
> > + * Before a context can be submitted it must be registered with the GuC via a
> > + * H2G. A unique guc_id is associated with each context. The context is either
> > + * registered at request creation time (normal operation) or at submission time
> > + * (abnormal operation, e.g. after a reset).
> > + *
> > + * Context submission:
> > + * The i915 updates the LRC tail value in memory. Either a schedule enable H2G
> > + * or context submit H2G is used to submit a context.
> > + *
> > + * Context unpin:
> > + * To unpin a context a H2G is used to disable scheduling and when the
> > + * corresponding G2H returns indicating the scheduling disable operation has
> > + * completed it is safe to unpin the context. While a disable is in flight it
> > + * isn't safe to resubmit the context so a fence is used to stall all future
> > + * requests until the G2H is returned.
> > + *
> > + * Context deregistration:
> > + * Before a context can be destroyed or we steal its guc_id we must deregister
> > + * the context with the GuC via H2G. If stealing the guc_id it isn't safe to
> > + * submit anything to this guc_id until the deregister completes so a fence is
> > + * used to stall all requests associated with this guc_id until the
> > + * corresponding G2H returns indicating the guc_id has been deregistered.
> > + *
> > + * guc_ids:
> > + * Unique number associated with private GuC context data passed in during
> > + * context registration / submission / deregistration. 64k available. Simple ida
> > + * is used for allocation.
> > + *
> > + * Stealing guc_ids:
> > + * If no guc_ids are available they can be stolen from another context at
> > + * request creation time if that context is unpinned. If a guc_id can't be found
> > + * we punt this problem to the user as we believe this is near impossible to hit
> > + * during normal use cases.
> > + *
> > + * Locking:
> > + * In the GuC submission code we have 4 basic spin locks which protect
> > + * everything. Details about each below.
> > + *
> > + * sched_engine->lock
> > + * This is the submission lock for all contexts that share an i915 schedule
> > + * engine (sched_engine), thus only 1 of the contexts which share a
> > + * sched_engine can be submitting at a time. Currently only 1 sched_engine is
> > + * used for all of GuC submission but that could change in the future.
> 
> There's at least 3 more spinlocks for intel_guc_ct ...
> 

Different layer that I'd like to stay out of.

> > + *
> > + * guc->contexts_lock
> > + * Protects guc_id allocation. Global lock, i.e. only 1 context that uses
> > + * GuC submission can hold this at a time.
> 
> Plus you forgot the spinlock of the xarray, which is also used in the
> code with this patch set, not just internally in the xarray, so we have to
> think about that one too.
>
> Iow still way too many locks.
>

Well we can delete one pretty easily, that's something, right?

> > + *
> > + * ce->guc_state.lock
> > + * Protects everything under ce->guc_state. Ensures that a context is in the
> > + * correct state before issuing a H2G, e.g. we don't issue a schedule disable
> > + * on a disabled context (bad idea), we don't issue a schedule enable when a
> > + * schedule disable is in flight, etc... Lock is individual to each context.
> > + *
> > + * ce->guc_active.lock
> > + * Protects everything under ce->guc_active, i.e. the requests currently
> > + * inflight on the context and its priority management. Lock is individual
> > + * to each context.
> > + *
> > + * Lock ordering rules:
> > + * sched_engine->lock -> ce->guc_active.lock
> > + * sched_engine->lock -> ce->guc_state.lock
> > + * guc->contexts_lock -> ce->guc_state.lock
> >   *
> > + * Reset races:
> > + * When a GPU full reset is triggered it is assumed that some G2H responses to
> > + * a H2G can be lost as the GuC is likely toast. Losing these G2H can prove
> > + * fatal as we do certain operations upon receiving a G2H (e.g. destroy
> > + * contexts, release guc_ids, etc...). Luckily when this occurs we can scrub
> > + * the context state and clean up appropriately, however this is quite racy.
> > + * To avoid races the rule is to check for submission being disabled (i.e.
> > + * check for a reset in progress) with the appropriate lock held. If
> > + * submission is disabled don't send the H2G or update the context state. The
> > + * reset code must disable submission and grab all these locks before
> > + * scrubbing for the missing G2H.
> 
> Can we make this all a lot less racy? Instead of a huge state machinery
> can't we just do all that under a context lock, i.e.
> 

Well, we can't sleep inside the context lock, so I'm confused by this
suggestion.

> 1. take context lock
> 2. send guc message that is tricky, like register or deregister or
> whatever

A schedule enable is tricky too and that can be sent from the submission
path.

> 3. wait for that reply, our context is blocked anyway, no harm holding a

We don't need to block while a schedule enable is in flight. Also, we
only block the submissions while the other messages are in flight; we
are allowed to continue working on the context (e.g. we can prep
another request while the operation is in flight). A rough sketch of
this fencing pattern follows at the end of this exchange.

> lock, other contexts can keep processing
> 4. the lower-level guc_ct code guarantees that we either get the reply, or

Again, I don't really want to touch the guc_ct code.

> a -ERESET or whatever indicating that we raced with a reset, in which case
> we can just restart whatever it is we wanted to do (or for deregister, do
> nothing since the guc reset has solved that problem)

You'd still have to have a list of waiters which get woken on a reset
too. I'm not at all convinced this would be simpler than checking if a
reset is in flight within a lock.

> 5. unlock
> 
> Massively lockless state machines are cool, but also very hard to maintain
> and keep correct.

Not saying we can't try to make this simpler over time, but a huge
structural change at this point is way more likely to break things than
cleaning up what we have in place.
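
For reference, the fencing mentioned above is the per-context list from
this series; a rough sketch of the pattern, with the locking simplified
and helper names as used elsewhere in the series:

	/* request creation: stall the submit fence while a G2H is pending */
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	if (context_wait_for_deregister_to_register(ce) ||
	    context_pending_disable(ce)) {
		i915_sw_fence_await(&rq->submit);
		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
	}
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

	/* G2H handler: release all stalled requests for this context */
	list_for_each_entry_safe(rq, rn, &ce->guc_state.fences, guc_fence_link) {
		list_del_init(&rq->guc_fence_link);
		i915_sw_fence_complete(&rq->submit);
	}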

Matt

> -Daniel
> 
> >   */
> >  
> >  /* GuC Virtual Engine */
> > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > index d818cfbfc41d..177eaf55adff 100644
> > --- a/drivers/gpu/drm/i915/i915_request.h
> > +++ b/drivers/gpu/drm/i915/i915_request.h
> > @@ -290,18 +290,20 @@ struct i915_request {
> >  		struct hrtimer timer;
> >  	} watchdog;
> >  
> > -	/*
> > -	 * Requests may need to be stalled when using GuC submission waiting for
> > -	 * certain GuC operations to complete. If that is the case, stalled
> > -	 * requests are added to a per context list of stalled requests. The
> > -	 * below list_head is the link in that list.
> > +	/**
> > +	 * @guc_fence_link: Requests may need to be stalled when using GuC
> > +	 * submission waiting for certain GuC operations to complete. If that is
> > +	 * the case, stalled requests are added to a per context list of stalled
> > +	 * requests. The below list_head is the link in that list. Protected by
> > +	 * ce->guc_state.lock.
> >  	 */
> >  	struct list_head guc_fence_link;
> >  
> >  	/**
> > -	 * Priority level while the request is inflight. Differs from i915
> > -	 * scheduler priority. See comment above
> > -	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details.
> > +	 * @guc_prio: Priority level while the request is inflight. Differs from
> > +	 * i915 scheduler priority. See comment above
> > +	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details. Protected by
> > +	 * ce->guc_active.lock.
> >  	 */
> >  #define	GUC_PRIO_INIT	0xff
> >  #define	GUC_PRIO_FINI	0xfe
> > -- 
> > 2.32.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 08/22] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
  2021-08-17  9:47   ` Daniel Vetter
  2021-08-17  9:57     ` Daniel Vetter
@ 2021-08-17 16:44     ` Matthew Brost
  1 sibling, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-17 16:44 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 11:47:53AM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:25AM -0700, Matthew Brost wrote:
> > When unblocking a context, do not enable scheduling if the context is
> > banned, guc_id invalid, or not registered.
> > 
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index c3b7bf7319dd..353899634fa8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1579,6 +1579,9 @@ static void guc_context_unblock(struct intel_context *ce)
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  
> >  	if (unlikely(submission_disabled(guc) ||
> > +		     intel_context_is_banned(ce) ||
> > +		     context_guc_id_invalid(ce) ||
> > +		     !lrc_desc_registered(guc, ce->guc_id) ||
> >  		     !intel_context_is_pinned(ce) ||
> >  		     context_pending_disable(ce) ||
> >  		     context_blocked(ce) > 1)) {
> 
> I think this entire if condition here is screaming that our intel_context
> state machinery for guc is way too complex, and on the wrong side of
> incomprehensible.
> 
> Also some of these check state outside of the context, and we don't seem
> to hold spinlocks for those, or anything else.
> 
> In general I have no idea which of these are defensive programming and
> cannot ever happen, and which actually can happen. There's for sure way
> too many races going on given that this is all context-local stuff.

A lot of this is guarding against a full GT reset while trying to
cancel a request. Full GT resets make everything really hard and in
practice should never really happen because the GuC does per-engine /
per-context resets. Unfortunately IGTs do weird things like turn off
per-engine / per-context resets, leaving a full GT reset as the only way
to recover, so the IGTs will expose all the races around GT reset,
especially when we run IGTs on pre-production HW that tends to hang for
whatever reason.
Matt 

> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 19/22] drm/i915/guc: Proper xarray usage for contexts_lookup
  2021-08-17 15:26     ` Matthew Brost
@ 2021-08-17 17:13       ` Daniel Vetter
  2021-08-17 17:13         ` Matthew Brost
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 17:13 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 08:26:28AM -0700, Matthew Brost wrote:
> On Tue, Aug 17, 2021 at 12:27:29PM +0200, Daniel Vetter wrote:
> > On Mon, Aug 16, 2021 at 06:51:36AM -0700, Matthew Brost wrote:
> > > Lock the xarray and take ref to the context if needed.
> > > 
> > > v2:
> > >  (Checkpatch)
> > >   - Add new line after declaration
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 84 ++++++++++++++++---
> > >  1 file changed, 73 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index ba19b99173fc..2ecb2f002bed 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -599,8 +599,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > >  	unsigned long index, flags;
> > >  	bool pending_disable, pending_enable, deregister, destroyed, banned;
> > >  
> > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > -		spin_lock_irqsave(&ce->guc_state.lock, flags);
> > > +		/*
> > > +		 * Corner case where the ref count on the object is zero but a
> > > +		 * deregister G2H was lost. In this case we don't touch the ref
> > > +		 * count and finish the destroy of the context.
> > > +		 */
> > > +		bool do_put = kref_get_unless_zero(&ce->ref);
> > 
> > This looks really scary, because in another loop below you have an
> > unconditional refcount increase. This means sometimes guc->context_lookup
> 
> Yea, good catch, those loops need something like this too.
> 
> > xarray guarantees we hold a full reference on the context, sometimes we
> > don't. So we're right back in "protect the code" O(N^2) review complexity
> > instead of invariant rules about the datastructure, which is linear.
> > 
> > Essentially anytime you feel like you have to add a comment to explain
> > what's going on about concurrent stuff you're racing with, you're
> > protecting code, not data.
> > 
> > Since guc can't do a whole lot without the guc_id registered and all that,
> > I kinda expected you'd always have a full reference here. If there's
> 
> The deregister is triggered by the ref count going to zero and we can't
> fully release the guc_id until that operation completes, hence why it is
> still in the xarray. I think the solution here is to use an iterator, like
> you mention below, that ref counts this correctly.

Hm but if the refcount drops to zero while we have a guc_id, how does that
work? Do we delay the guc_context_destroy until that's done, or is the
context handed off internally somehow to a worker?

Afaik intel_context_put is called from all kinds of nasty contexts, so
waiting is not an option as-is ...
-Daniel

> > intermediate stages (e.g. around unregister) where this is currently not
> > always the case, then those should make sure a full reference is held.
> > 
> > Another option would be to treat ->context_lookup as a weak reference that
> > we lazily clean up when the context is finalized. That works too, but
> > probably not with a spinlock (since you most likely have to wait for all
> > pending guc transactions to complete), but it's another option.
> > 
> > Either way I think standard process is needed here for locking design,
> > i.e.
> > 1. come up with the right invariants ("we always have a full reference
> > when a context is on the guc->context_lookup xarray")
> > 2. come up with the locks. From the guc side the xa_lock is maybe good
> > enough, but from the context side this doesn't protect against a
> > re-registering racing against a deregistering. So probably needs more
> > rules on top, and then you have a nice lock inversion in a few places like
> > here.
> > 3. document it and roll it out.
> > 
> > The other thing is that this is a very tricky iterator, and there's a few
> > copies of it. That is, if this is the right solution. As-is this should be
> > abstracted away into guc_context_iter_begin/next/end() helpers, e.g. like
> > we have for drm_connector_list_iter_begin/next/end as an example.
> >
> 
> I can check this out.
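> 
> Something like this minimal sketch is what I have in mind (hypothetical
> names, modeled on the drm_connector_list_iter helpers; each step takes a
> full reference and skips contexts racing with their final put):
> 
> 	struct guc_context_iter {
> 		struct intel_guc *guc;
> 		unsigned long index;
> 	};
> 
> 	static void guc_context_iter_begin(struct intel_guc *guc,
> 					   struct guc_context_iter *it)
> 	{
> 		it->guc = guc;
> 		it->index = 0;
> 	}
> 
> 	static struct intel_context *
> 	guc_context_iter_next(struct guc_context_iter *it)
> 	{
> 		struct intel_context *ce;
> 
> 		xa_lock_irq(&it->guc->context_lookup);
> 		xa_for_each_start(&it->guc->context_lookup, it->index, ce,
> 				  it->index) {
> 			/* skip contexts racing with their final put */
> 			if (kref_get_unless_zero(&ce->ref)) {
> 				it->index++;
> 				xa_unlock_irq(&it->guc->context_lookup);
> 				return ce;
> 			}
> 		}
> 		xa_unlock_irq(&it->guc->context_lookup);
> 
> 		return NULL;
> 	}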
> 
> Matt
>  
> > Cheers, Daniel
> > 
> > > +
> > > +		xa_unlock(&guc->context_lookup);
> > > +
> > > +		spin_lock(&ce->guc_state.lock);
> > >  
> > >  		/*
> > >  		 * Once we are at this point submission_disabled() is guaranteed
> > > @@ -616,7 +626,9 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > >  		banned = context_banned(ce);
> > >  		init_sched_state(ce);
> > >  
> > > -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > +		spin_unlock(&ce->guc_state.lock);
> > > +
> > > +		GEM_BUG_ON(!do_put && !destroyed);
> > >  
> > >  		if (pending_enable || destroyed || deregister) {
> > >  			atomic_dec(&guc->outstanding_submission_g2h);
> > > @@ -645,7 +657,12 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > >  
> > >  			intel_context_put(ce);
> > >  		}
> > > +
> > > +		if (do_put)
> > > +			intel_context_put(ce);
> > > +		xa_lock(&guc->context_lookup);
> > >  	}
> > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > >  }
> > >  
> > >  static inline bool
> > > @@ -866,16 +883,26 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> > >  {
> > >  	struct intel_context *ce;
> > >  	unsigned long index;
> > > +	unsigned long flags;
> > >  
> > >  	if (unlikely(!guc_submission_initialized(guc))) {
> > >  		/* Reset called during driver load? GuC not yet initialised! */
> > >  		return;
> > >  	}
> > >  
> > > -	xa_for_each(&guc->context_lookup, index, ce)
> > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > +	xa_for_each(&guc->context_lookup, index, ce) {
> > > +		intel_context_get(ce);
> > > +		xa_unlock(&guc->context_lookup);
> > > +
> > >  		if (intel_context_is_pinned(ce))
> > >  			__guc_reset_context(ce, stalled);
> > >  
> > > +		intel_context_put(ce);
> > > +		xa_lock(&guc->context_lookup);
> > > +	}
> > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > > +
> > >  	/* GuC is blown away, drop all references to contexts */
> > >  	xa_destroy(&guc->context_lookup);
> > >  }
> > > @@ -950,11 +977,21 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
> > >  {
> > >  	struct intel_context *ce;
> > >  	unsigned long index;
> > > +	unsigned long flags;
> > > +
> > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > +	xa_for_each(&guc->context_lookup, index, ce) {
> > > +		intel_context_get(ce);
> > > +		xa_unlock(&guc->context_lookup);
> > >  
> > > -	xa_for_each(&guc->context_lookup, index, ce)
> > >  		if (intel_context_is_pinned(ce))
> > >  			guc_cancel_context_requests(ce);
> > >  
> > > +		intel_context_put(ce);
> > > +		xa_lock(&guc->context_lookup);
> > > +	}
> > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > > +
> > >  	guc_cancel_sched_engine_requests(guc->sched_engine);
> > >  
> > >  	/* GuC is blown away, drop all references to contexts */
> > > @@ -2848,21 +2885,26 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
> > >  	struct intel_context *ce;
> > >  	struct i915_request *rq;
> > >  	unsigned long index;
> > > +	unsigned long flags;
> > >  
> > >  	/* Reset called during driver load? GuC not yet initialised! */
> > >  	if (unlikely(!guc_submission_initialized(guc)))
> > >  		return;
> > >  
> > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > +		intel_context_get(ce);
> > > +		xa_unlock(&guc->context_lookup);
> > > +
> > >  		if (!intel_context_is_pinned(ce))
> > > -			continue;
> > > +			goto next;
> > >  
> > >  		if (intel_engine_is_virtual(ce->engine)) {
> > >  			if (!(ce->engine->mask & engine->mask))
> > > -				continue;
> > > +				goto next;
> > >  		} else {
> > >  			if (ce->engine != engine)
> > > -				continue;
> > > +				goto next;
> > >  		}
> > >  
> > >  		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
> > > @@ -2872,9 +2914,17 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
> > >  			intel_engine_set_hung_context(engine, ce);
> > >  
> > >  			/* Can only cope with one hang at a time... */
> > > -			return;
> > > +			intel_context_put(ce);
> > > +			xa_lock(&guc->context_lookup);
> > > +			goto done;
> > >  		}
> > > +next:
> > > +		intel_context_put(ce);
> > > +		xa_lock(&guc->context_lookup);
> > > +
> > >  	}
> > > +done:
> > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > >  }
> > >  
> > >  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> > > @@ -2890,23 +2940,32 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> > >  	if (unlikely(!guc_submission_initialized(guc)))
> > >  		return;
> > >  
> > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > +		intel_context_get(ce);
> > > +		xa_unlock(&guc->context_lookup);
> > > +
> > >  		if (!intel_context_is_pinned(ce))
> > > -			continue;
> > > +			goto next;
> > >  
> > >  		if (intel_engine_is_virtual(ce->engine)) {
> > >  			if (!(ce->engine->mask & engine->mask))
> > > -				continue;
> > > +				goto next;
> > >  		} else {
> > >  			if (ce->engine != engine)
> > > -				continue;
> > > +				goto next;
> > >  		}
> > >  
> > >  		spin_lock_irqsave(&ce->guc_active.lock, flags);
> > >  		intel_engine_dump_active_requests(&ce->guc_active.requests,
> > >  						  hung_rq, m);
> > >  		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
> > > +
> > > +next:
> > > +		intel_context_put(ce);
> > > +		xa_lock(&guc->context_lookup);
> > >  	}
> > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > >  }
> > >  
> > >  void intel_guc_submission_print_info(struct intel_guc *guc,
> > > @@ -2960,7 +3019,9 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > >  {
> > >  	struct intel_context *ce;
> > >  	unsigned long index;
> > > +	unsigned long flags;
> > >  
> > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > >  		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > >  		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > @@ -2979,6 +3040,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > >  
> > >  		guc_log_context_priority(p, ce);
> > >  	}
> > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > >  }
> > >  
> > >  static struct intel_context *
> > > -- 
> > > 2.32.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 19/22] drm/i915/guc: Proper xarray usage for contexts_lookup
  2021-08-17 17:13       ` Daniel Vetter
@ 2021-08-17 17:13         ` Matthew Brost
  0 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-17 17:13 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 07:13:33PM +0200, Daniel Vetter wrote:
> On Tue, Aug 17, 2021 at 08:26:28AM -0700, Matthew Brost wrote:
> > On Tue, Aug 17, 2021 at 12:27:29PM +0200, Daniel Vetter wrote:
> > > On Mon, Aug 16, 2021 at 06:51:36AM -0700, Matthew Brost wrote:
> > > > Lock the xarray and take ref to the context if needed.
> > > > 
> > > > v2:
> > > >  (Checkpatch)
> > > >   - Add new line after declaration
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 84 ++++++++++++++++---
> > > >  1 file changed, 73 insertions(+), 11 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index ba19b99173fc..2ecb2f002bed 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -599,8 +599,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > > >  	unsigned long index, flags;
> > > >  	bool pending_disable, pending_enable, deregister, destroyed, banned;
> > > >  
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > > -		spin_lock_irqsave(&ce->guc_state.lock, flags);
> > > > +		/*
> > > > +		 * Corner case where the ref count on the object is zero but a
> > > > +		 * deregister G2H was lost. In this case we don't touch the ref
> > > > +		 * count and finish the destroy of the context.
> > > > +		 */
> > > > +		bool do_put = kref_get_unless_zero(&ce->ref);
> > > 
> > > This looks really scary, because in another loop below you have an
> > > unconditional refcount increase. This means sometimes guc->context_lookup
> > 
> > Yea, good catch those loops need something like this too.
> > 
> > > xarray guarantees we hold a full reference on the context, sometimes we
> > > don't. So we're right back in "protect the code" O(N^2) review complexity
> > > instead of invariant rules about the datastructure, which is linear.
> > > 
> > > Essentially anytime you feel like you have to add a comment to explain
> > > what's going on about concurrent stuff you're racing with, you're
> > > protecting code, not data.
> > > 
> > > Since guc can't do a whole lot without the guc_id registered and all that,
> > > I kinda expected you'd always have a full reference here. If there's
> > 
> > The deregister is triggered by the ref count going to zero and we can't
> > fully release the guc_id until that operation completes, hence why it is
> > still in the xarray. I think the solution here is to use an iterator, like
> > you mention below, that ref counts this correctly.
> 
> Hm but if the refcount drops to zero while we have a guc_id, how does that
> work? Do we delay the guc_context_destroy until that's done, or is the

Yes, we don't want to release the guc_id and deregister the context with
the GuC until the i915 is done with the context (no refs). We issue the
deregister when we have no refs (done directly now, with a worker to do
this added in an upcoming patch). We release the guc_id, remove the
context from the xarray, and destroy it when the deregister completes.

> context handed off internally somehow to a worker?
> 
> Afaik intel_context_put is called from all kinds of nasty contexts, so
> waiting is not an option as-is ...

Right, it definitely can be called from nasty contexts, hence why we
move this to a worker in an upcoming patch; a rough sketch of the idea
is below.
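
Roughly the idea, as a sketch (ce->destroyed_link, guc->destroyed_contexts
and guc->destroyed_worker are made-up names; the actual patch may differ):

	/* final intel_context_put() for a GuC context */
	static void guc_context_destroy(struct kref *kref)
	{
		struct intel_context *ce = container_of(kref, typeof(*ce), ref);
		struct intel_guc *guc = ce_to_guc(ce);
		unsigned long flags;

		/*
		 * The final put can happen in atomic context, so just queue
		 * the context for deregistration; the worker issues the H2G
		 * and, once the G2H returns, releases the guc_id, removes the
		 * context from the xarray, and frees it.
		 */
		spin_lock_irqsave(&guc->contexts_lock, flags);
		list_add_tail(&ce->destroyed_link, &guc->destroyed_contexts);
		spin_unlock_irqrestore(&guc->contexts_lock, flags);

		queue_work(system_unbound_wq, &guc->destroyed_worker);
	}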

Matt

> -Daniel
> 
> > > intermediate stages (e.g. around unregister) where this is currently not
> > > always the case, then those should make sure a full reference is held.
> > > 
> > > Another option would be to treat ->context_lookup as a weak reference that
> > > we lazily clean up when the context is finalized. That works too, but
> > > probably not with a spinlock (since you most likely have to wait for all
> > > pending guc transactions to complete), but it's another option.
> > > 
> > > Either way I think standard process is needed here for locking design,
> > > i.e.
> > > 1. come up with the right invariants ("we always have a full reference
> > > when a context is on the guc->context_lookup xarray")
> > > 2. come up with the locks. From the guc side the xa_lock is maybe good
> > > enough, but from the context side this doesn't protect against a
> > > re-registering racing against a deregistering. So probably needs more
> > > rules on top, and then you have a nice lock inversion in a few places like
> > > here.
> > > 3. document it and roll it out.
> > > 
> > > The other thing is that this is a very tricky iterator, and there's a few
> > > copies of it. That is, if this is the right solution. As-is this should be
> > > abstracted away into guc_context_iter_begin/next/end() helpers, e.g. like
> > > we have for drm_connector_list_iter_begin/next/end as an example.
> > >
> > 
> > I can check this out.
> > 
> > Matt
> >  
> > > Cheers, Daniel
> > > 
> > > > +
> > > > +		xa_unlock(&guc->context_lookup);
> > > > +
> > > > +		spin_lock(&ce->guc_state.lock);
> > > >  
> > > >  		/*
> > > >  		 * Once we are at this point submission_disabled() is guaranteed
> > > > @@ -616,7 +626,9 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > > >  		banned = context_banned(ce);
> > > >  		init_sched_state(ce);
> > > >  
> > > > -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > > +		spin_unlock(&ce->guc_state.lock);
> > > > +
> > > > +		GEM_BUG_ON(!do_put && !destroyed);
> > > >  
> > > >  		if (pending_enable || destroyed || deregister) {
> > > >  			atomic_dec(&guc->outstanding_submission_g2h);
> > > > @@ -645,7 +657,12 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > > >  
> > > >  			intel_context_put(ce);
> > > >  		}
> > > > +
> > > > +		if (do_put)
> > > > +			intel_context_put(ce);
> > > > +		xa_lock(&guc->context_lookup);
> > > >  	}
> > > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > > >  }
> > > >  
> > > >  static inline bool
> > > > @@ -866,16 +883,26 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> > > >  {
> > > >  	struct intel_context *ce;
> > > >  	unsigned long index;
> > > > +	unsigned long flags;
> > > >  
> > > >  	if (unlikely(!guc_submission_initialized(guc))) {
> > > >  		/* Reset called during driver load? GuC not yet initialised! */
> > > >  		return;
> > > >  	}
> > > >  
> > > > -	xa_for_each(&guc->context_lookup, index, ce)
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > > +	xa_for_each(&guc->context_lookup, index, ce) {
> > > > +		intel_context_get(ce);
> > > > +		xa_unlock(&guc->context_lookup);
> > > > +
> > > >  		if (intel_context_is_pinned(ce))
> > > >  			__guc_reset_context(ce, stalled);
> > > >  
> > > > +		intel_context_put(ce);
> > > > +		xa_lock(&guc->context_lookup);
> > > > +	}
> > > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > > > +
> > > >  	/* GuC is blown away, drop all references to contexts */
> > > >  	xa_destroy(&guc->context_lookup);
> > > >  }
> > > > @@ -950,11 +977,21 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
> > > >  {
> > > >  	struct intel_context *ce;
> > > >  	unsigned long index;
> > > > +	unsigned long flags;
> > > > +
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > > +	xa_for_each(&guc->context_lookup, index, ce) {
> > > > +		intel_context_get(ce);
> > > > +		xa_unlock(&guc->context_lookup);
> > > >  
> > > > -	xa_for_each(&guc->context_lookup, index, ce)
> > > >  		if (intel_context_is_pinned(ce))
> > > >  			guc_cancel_context_requests(ce);
> > > >  
> > > > +		intel_context_put(ce);
> > > > +		xa_lock(&guc->context_lookup);
> > > > +	}
> > > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > > > +
> > > >  	guc_cancel_sched_engine_requests(guc->sched_engine);
> > > >  
> > > >  	/* GuC is blown away, drop all references to contexts */
> > > > @@ -2848,21 +2885,26 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
> > > >  	struct intel_context *ce;
> > > >  	struct i915_request *rq;
> > > >  	unsigned long index;
> > > > +	unsigned long flags;
> > > >  
> > > >  	/* Reset called during driver load? GuC not yet initialised! */
> > > >  	if (unlikely(!guc_submission_initialized(guc)))
> > > >  		return;
> > > >  
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > > +		intel_context_get(ce);
> > > > +		xa_unlock(&guc->context_lookup);
> > > > +
> > > >  		if (!intel_context_is_pinned(ce))
> > > > -			continue;
> > > > +			goto next;
> > > >  
> > > >  		if (intel_engine_is_virtual(ce->engine)) {
> > > >  			if (!(ce->engine->mask & engine->mask))
> > > > -				continue;
> > > > +				goto next;
> > > >  		} else {
> > > >  			if (ce->engine != engine)
> > > > -				continue;
> > > > +				goto next;
> > > >  		}
> > > >  
> > > >  		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
> > > > @@ -2872,9 +2914,17 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
> > > >  			intel_engine_set_hung_context(engine, ce);
> > > >  
> > > >  			/* Can only cope with one hang at a time... */
> > > > -			return;
> > > > +			intel_context_put(ce);
> > > > +			xa_lock(&guc->context_lookup);
> > > > +			goto done;
> > > >  		}
> > > > +next:
> > > > +		intel_context_put(ce);
> > > > +		xa_lock(&guc->context_lookup);
> > > > +
> > > >  	}
> > > > +done:
> > > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > > >  }
> > > >  
> > > >  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> > > > @@ -2890,23 +2940,32 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> > > >  	if (unlikely(!guc_submission_initialized(guc)))
> > > >  		return;
> > > >  
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > > +		intel_context_get(ce);
> > > > +		xa_unlock(&guc->context_lookup);
> > > > +
> > > >  		if (!intel_context_is_pinned(ce))
> > > > -			continue;
> > > > +			goto next;
> > > >  
> > > >  		if (intel_engine_is_virtual(ce->engine)) {
> > > >  			if (!(ce->engine->mask & engine->mask))
> > > > -				continue;
> > > > +				goto next;
> > > >  		} else {
> > > >  			if (ce->engine != engine)
> > > > -				continue;
> > > > +				goto next;
> > > >  		}
> > > >  
> > > >  		spin_lock_irqsave(&ce->guc_active.lock, flags);
> > > >  		intel_engine_dump_active_requests(&ce->guc_active.requests,
> > > >  						  hung_rq, m);
> > > >  		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
> > > > +
> > > > +next:
> > > > +		intel_context_put(ce);
> > > > +		xa_lock(&guc->context_lookup);
> > > >  	}
> > > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > > >  }
> > > >  
> > > >  void intel_guc_submission_print_info(struct intel_guc *guc,
> > > > @@ -2960,7 +3019,9 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > > >  {
> > > >  	struct intel_context *ce;
> > > >  	unsigned long index;
> > > > +	unsigned long flags;
> > > >  
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > >  		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > >  		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > > @@ -2979,6 +3040,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > > >  
> > > >  		guc_log_context_priority(p, ce);
> > > >  	}
> > > > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > > >  }
> > > >  
> > > >  static struct intel_context *
> > > > -- 
> > > > 2.32.0
> > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc
  2021-08-17 16:36     ` Matthew Brost
@ 2021-08-17 17:20       ` Daniel Vetter
  2021-08-17 17:27         ` Michal Wajdeczko
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 17:20 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 09:36:49AM -0700, Matthew Brost wrote:
> On Tue, Aug 17, 2021 at 01:11:41PM +0200, Daniel Vetter wrote:
> > On Mon, Aug 16, 2021 at 06:51:39AM -0700, Matthew Brost wrote:
> > > Add GuC kernel doc for all structures added thus far for GuC submission
> > > and update the main GuC submission section with the new interface
> > > details.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > 
> > There's quite a bit more, e.g. intel_guc_ct, which has its own world of
> > locking design that also doesn't feel too consistent.
> >
> 
> That is a different layer than GuC submission, so I don't think we should
> mention anything about that layer here. I didn't really write that layer
> and it's super painful to touch that code, so I'm going to stay out of
> any rework you think we need to do there.

Well there's three locks there, plus it leaks out (you have your
outstanding_submission_g2h atomic_t which is very closely tied to, well,
outstanding guc transmissions), so I guess I need someone else for that?

> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_context_types.h |  42 +++++---
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  19 +++-
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 101 ++++++++++++++----
> > >  drivers/gpu/drm/i915/i915_request.h           |  18 ++--
> > >  4 files changed, 131 insertions(+), 49 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > index f6989e6807f7..75d609a1bc33 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > @@ -156,44 +156,56 @@ struct intel_context {
> > >  	u8 wa_bb_page; /* if set, page num reserved for context workarounds */
> > >  
> > >  	struct {
> > > -		/** lock: protects everything in guc_state */
> > > +		/** @lock: protects everything in guc_state */
> > >  		spinlock_t lock;
> > >  		/**
> > > -		 * sched_state: scheduling state of this context using GuC
> > > +		 * @sched_state: scheduling state of this context using GuC
> > >  		 * submission
> > >  		 */
> > >  		u32 sched_state;
> > >  		/*
> > > -		 * fences: maintains of list of requests that have a submit
> > > -		 * fence related to GuC submission
> > > +		 * @fences: maintains a list of requests that are currently
> > > +		 * being fenced until a GuC operation completes
> > >  		 */
> > >  		struct list_head fences;
> > > -		/* GuC context blocked fence */
> > > +		/**
> > > +		 * @blocked_fence: fence used to signal when the blocking of a
> > > +		 * context's submissions is complete.
> > > +		 */
> > >  		struct i915_sw_fence blocked_fence;
> > > -		/* GuC committed requests */
> > > +		/** @number_committed_requests: number of committed requests */
> > >  		int number_committed_requests;
> > >  	} guc_state;
> > >  
> > >  	struct {
> > > -		/** lock: protects everything in guc_active */
> > > +		/** @lock: protects everything in guc_active */
> > >  		spinlock_t lock;
> > 
> > Why do we have two locks spinlocks to protect guc context state?
> > 
> > I do understand the need for a spinlock (at least for now) because of how
> > i915-scheduler runs in tasklet context. But beyond that we really
> > shouldn't need more than two locks to protect context state. You still
> > have an entire pile here, plus some atomics, plus more.
> >
> 
> Yea, I actually thought about this after I sent this out: guc_active &
> guc_state should be combined into a single lock. Originally I had two
> different locks because of an old hierarchy that is no longer needed. Can
> fix.
>  
> > And this is on a single context, where concurrently submitting stuff
> > really isn't a thing. I'd expect actual benchmarking would show a perf
> > hit, since all these locks and atomics aren't free. This is at least the
> > case with execbuf and the various i915_vma locks we currently have.
> > 
> > What I expect intel_context locking to be is roughly:
> > 
> > - One lock to protect all intel_context state. This probably should be a
> >   dma_resv_lock for a few reasons, least so we can pin state objects
> >   underneath that lock.
> > 
> > - A separate lock if there's anything you need to coordinate with the
> >   backend scheduler while that's running, to avoid dma_fence inversions.
> >   Right now this separate lock might need to be a spinlock because our
> >   scheduler runs in tasklets, and that might mean we need both a mutex and
> >   a spinlock here.
> >
> > Anything that goes beyond that is premature optimization and kills us
> > code-complexity-wise. I'd be _extremely_ surprised if an IA core cannot keep up
> > with GuC, and therefore anything that goes beyond "one lock per object",
> > plus/minus execution context issues like the above tasklet issue, is
> > likely just going to slow everything down.
> 
> If I combine the above spin locks, isn't that basically what you
> describe: one lock for the context state as it relates to GuC submission?
> 
> Also, I'm thinking that when we move to the DRM scheduler we can likely
> get rid of all the atomic contexts in the GuC submission backend.
> 
> > 
> > > -		/** requests: active requests on this context */
> > > +		/** @requests: list of active requests on this context */
> > >  		struct list_head requests;
> > > -		/*
> > > -		 * GuC priority management
> > > -		 */
> > > +		/** @guc_prio: the context's current guc priority */
> > >  		u8 guc_prio;
> > > +		/**
> > > +		 * @guc_prio_count: a counter of the number of requests
> > > +		 * inflight in each priority bucket
> > > +		 */
> > >  		u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> > >  	} guc_active;
> > >  
> > > -	/* GuC LRC descriptor ID */
> > > +	/**
> > > +	 * @guc_id: unique handle which is used to communicate information with
> > > +	 * the GuC about this context, protected by guc->contexts_lock
> > > +	 */
> > >  	u16 guc_id;
> > >  
> > > -	/* GuC LRC descriptor reference count */
> > > +	/**
> > > +	 * @guc_id_ref: the number of references to the guc_id; transitions
> > > +	 * in and out of zero are protected by guc->contexts_lock
> > > +	 */
> > >  	atomic_t guc_id_ref;
> > 
> > All this guc_id related stuff (including the guc->context_lookup xarray I
> > guess) also has quite a pile of atomics and locks.
> >
> > >  
> > > -	/*
> > > -	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
> > > +	/**
> > > +	 * @guc_id_link: in guc->guc_id_list when the guc_id has no refs but is
> > > +	 * still valid, protected by guc->contexts_lock
> > >  	 */
> > >  	struct list_head guc_id_link;
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > index 2e27fe59786b..c0b3fdb601f0 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > @@ -41,6 +41,10 @@ struct intel_guc {
> > >  	spinlock_t irq_lock;
> > >  	unsigned int msg_enabled_mask;
> > >  
> > > +	/**
> > > +	 * @outstanding_submission_g2h: number of outstanding G2H related to GuC
> > > +	 * submission, used to determine if the GT is idle
> > > +	 */
> > >  	atomic_t outstanding_submission_g2h;
> > 
> > atomic_t is good for statistics, but not for code flow control. If you use
> > it for that you either need a lot of barriers and comments, which means
> > there needs to be some real perf numbers showing that this is required in
> > a workload we care about.
> 
> This is kind of a stat too, it is connected to debugfs and is typically
> non-zero only if something has gone horribly wrong (e.g. you lose a G2H).
> I'm confused by the flow control comment, this is basically just saying
> the GT isn't idle if the GuC is processing messages and we expect a G2H
> response. A counter here makes sense to me and I don't see why we'd need
> barriers for this.

Yeah but if you actually use it in the code (which I thought you do), and
not just for debugfs stats, then you have control flow that depends upon
this counter. Which means you need memory barriers, or that stuff will be
unordered against other code that's running.

> > Or you stuff this into a related lock. E.g. from high-level view stuff
> > this into intel_guc_ct (which also has definitely way more locks than it
> > needs) could make sense?
> > 
> 
> I had it in that layer at one point but got pushback, thus it lives
> here now. The way it is used now, it probably makes sense to keep it
> here.

Hm, can you dig out that pushback? Often when locking rolls in it makes it
pretty clear that the layering needs to be rethought. Not changing that
code because it's already written, or something like that, and then working
around that suboptimal locking in your area isn't great. Minimally we need
a plan (like we have one for drm/scheduler for the frontend stuff) and
someone who'll do it.
-Daniel

> 
> > >  
> > >  	struct {
> > > @@ -49,12 +53,16 @@ struct intel_guc {
> > >  		void (*disable)(struct intel_guc *guc);
> > >  	} interrupts;
> > >  
> > > -	/*
> > > -	 * contexts_lock protects the pool of free guc ids and a linked list of
> > > -	 * guc ids available to be stolen
> > > +	/**
> > > +	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id, and
> > > +	 * ce->guc_id_ref when transitioning in and out of zero
> > >  	 */
> > >  	spinlock_t contexts_lock;
> > > +	/** @guc_ids: used to allocate new guc_ids */
> > >  	struct ida guc_ids;
> > > +	/**
> > > +	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> > > +	 */
> > >  	struct list_head guc_id_list;
> > >  
> > >  	bool submission_supported;
> > > @@ -70,7 +78,10 @@ struct intel_guc {
> > >  	struct i915_vma *lrc_desc_pool;
> > >  	void *lrc_desc_pool_vaddr;
> > >  
> > > -	/* guc_id to intel_context lookup */
> > > +	/**
> > > +	 * @context_lookup: used to look up an intel_context from a guc_id;
> > > +	 * if a context is present in this structure it is registered with
> > > +	 * the GuC
> > > +	 */
> > >  	struct xarray context_lookup;
> > >  
> > >  	/* Control params for fw initialization */
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index eb06a4c7534e..18ef363c6e5d 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -28,21 +28,6 @@
> > >  /**
> > >   * DOC: GuC-based command submission
> > >   *
> > > - * IMPORTANT NOTE: GuC submission is currently not supported in i915. The GuC
> > > - * firmware is moving to an updated submission interface and we plan to
> > > - * turn submission back on when that lands. The below documentation (and related
> > > - * code) matches the old submission model and will be updated as part of the
> > > - * upgrade to the new flow.
> > > - *
> > > - * GuC stage descriptor:
> > > - * During initialization, the driver allocates a static pool of 1024 such
> > > - * descriptors, and shares them with the GuC. Currently, we only use one
> > > - * descriptor. This stage descriptor lets the GuC know about the workqueue and
> > > - * process descriptor. Theoretically, it also lets the GuC know about our HW
> > > - * contexts (context ID, etc...), but we actually employ a kind of submission
> > > - * where the GuC uses the LRCA sent via the work item instead. This is called
> > > - * a "proxy" submission.
> > > - *
> > >   * The Scratch registers:
> > >   * There are 16 MMIO-based registers start from 0xC180. The kernel driver writes
> > >   * a value to the action register (SOFT_SCRATCH_0) along with any data. It then
> > > @@ -51,14 +36,86 @@
> > >   * processes the request. The kernel driver polls waiting for this update and
> > >   * then proceeds.
> > >   *
> > > - * Work Items:
> > > - * There are several types of work items that the host may place into a
> > > - * workqueue, each with its own requirements and limitations. Currently only
> > > - * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
> > > - * represents in-order queue. The kernel driver packs ring tail pointer and an
> > > - * ELSP context descriptor dword into Work Item.
> > > - * See guc_add_request()
> > > + * Command Transport buffers (CTBs):
> > > + * Covered in detail in other sections but CTBs (host-to-GuC, H2G, and
> > > + * GuC-to-host, G2H) are a message interface between the i915 and the GuC
> > > + * used to control submissions.
> > > + *
> > > + * Context registration:
> > > + * Before a context can be submitted it must be registered with the GuC via a
> > > + * H2G. A unique guc_id is associated with each context. The context is either
> > > + * registered at request creation time (normal operation) or at submission time
> > > + * (abnormal operation, e.g. after a reset).
> > > + *
> > > + * Context submission:
> > > + * The i915 updates the LRC tail value in memory. Either a schedule enable H2G
> > > + * or context submit H2G is used to submit a context.
> > > + *
> > > + * Context unpin:
> > > + * To unpin a context a H2G is used to disable scheduling and when the
> > > + * corresponding G2H returns indicating the scheduling disable operation has
> > > + * completed it is safe to unpin the context. While a disable is in flight it
> > > + * isn't safe to resubmit the context so a fence is used to stall all future
> > > + * requests until the G2H is returned.
> > > + *
> > > + * Context deregistration:
> > > + * Before a context can be destroyed or we steal its guc_id we must deregister
> > > + * the context with the GuC via H2G. If stealing the guc_id it isn't safe to
> > > + * submit anything to this guc_id until the deregister completes so a fence is
> > > + * used to stall all requests associated with this guc_id until the
> > > + * corresponding G2H returns indicating the guc_id has been deregistered.
> > > + *
> > > + * guc_ids:
> > > + * Unique number associated with private GuC context data passed in during
> > > + * context registration / submission / deregistration. 64k available. Simple ida
> > > + * is used for allocation.
> > > + *
> > > + * Stealing guc_ids:
> > > + * If no guc_ids are available they can be stolen from another context at
> > > + * request creation time if that context is unpinned. If a guc_id can't be found
> > > + * we punt this problem to the user as we believe this is near impossible to hit
> > > + * during normal use cases.
> > > + *
> > > + * Locking:
> > > + * In the GuC submission code we have 4 basic spinlocks which protect
> > > + * everything. Details about each below.
> > > + *
> > > + * sched_engine->lock
> > > + * This is the submission lock for all contexts that share an i915 scheduling
> > > + * engine (sched_engine), thus only 1 context which shares a sched_engine can
> > > + * be submitting at a time. Currently only 1 sched_engine is used for all of
> > > + * GuC submission, but that could change in the future.
> > 
> > There's at least 3 more spinlocks for intel_guc_ct ...
> > 
> 
> Different layer that I'd like to stay out of.
> 
> > > + *
> > > + * guc->contexts_lock
> > > + * Protects guc_id allocation. Global lock, i.e. only 1 context that uses
> > > + * GuC submission can hold it at a time.
> > 
> > Plus you forgot the spinlock of the xarray, which is also used in the
> > code with this patch set, not just internally in the xarray, so we have to
> > think about that one too.
> >
> > Iow still way too many locks.
> >
> 
> Well we can delete one pretty easily, that's something, right?
> 
> > > + *
> > > + * ce->guc_state.lock
> > > + * Protects everything under ce->guc_state. Ensures that a context is in the
> > > + * correct state before issuing a H2G, e.g. we don't issue a schedule disable
> > > + * on a disabled context (bad idea), we don't issue a schedule enable when a
> > > + * schedule disable is inflight, etc... Lock is individual to each context.
> > > + *
> > > + * ce->guc_active.lock
> > > + * Protects everything under ce->guc_active, which is the list of requests
> > > + * currently inflight on the context plus its priority management. Lock is
> > > + * individual to each context.
> > > + *
> > > + * Lock ordering rules:
> > > + * sched_engine->lock -> ce->guc_active.lock
> > > + * sched_engine->lock -> ce->guc_state.lock
> > > + * guc->contexts_lock -> ce->guc_state.lock
> > >   *
> > > + * Reset races:
> > > + * When a GPU full reset is triggered it is assumed that some G2H responses to
> > > + * a H2G can be lost as the GuC is likely toast. Losing these G2H can prove
> > > + * fatal as we do certain operations upon receiving a G2H (e.g. destroy
> > > + * contexts, release guc_ids, etc...). Luckily when this occurs we can scrub
> > > + * the context state and clean up appropriately, however this is quite racy.
> > > + * To avoid races the rule is: check for submission being disabled (i.e.
> > > + * check for a reset in progress) with the appropriate lock held. If
> > > + * submission is disabled, don't send the H2G or update the context state.
> > > + * The reset code must disable submission and grab all these locks before
> > > + * scrubbing for the missing G2H.
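> > > + *
> > > + * A sketch of that rule in pseudo-code (helper names illustrative, not
> > > + * the literal functions); if submission is disabled we bail out and let
> > > + * the reset code scrub:
> > > + *
> > > + *	spin_lock(&ce->guc_state.lock);
> > > + *	if (submission_disabled(guc)) {
> > > + *		spin_unlock(&ce->guc_state.lock);
> > > + *		return;
> > > + *	}
> > > + *	update_context_state(ce);
> > > + *	send_h2g(guc, action);
> > > + *	spin_unlock(&ce->guc_state.lock);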
> > 
> > Can we make this all a lot less racy? Instead of a huge state machinery
> > can't we just do all that under a context lock, i.e.
> > 
> 
> Well, we can't sleep inside the context lock, so I'm confused by this suggestion.
> 
> > 1. take context lock
> > 2. send guc message that is tricky, like register or deregister or
> > whatever
> 
> A schedule enable is tricky too and that can be sent from the submission
> path.
> 
> > 3. wait for that reply, our context is blocked anyway, no harm holding a
> 
> We don't need to block while a schedule enable is in flight. Also we
> only block submissions while the other messages are in flight; we are
> allowed to continue working on the context (e.g. we can prep another
> request while the operation is inflight).
> 
> > lock, other contexts can keep processing
> > 4. the lower-level guc_ct code guarantees that we either get the reply, or
> 
> Again don't really want to touch the guc_ct code.
> 
> > a -ERESET or whatever indicating that we raced with a reset, in which case
> > we can just restart whatever it is we wanted to do (or for deregister, do
> > nothing since the guc reset has solved that problem)
> 
> You'd still have to have a list of waiters which get woken on a reset
> too. Not at all convinced this would be simpler than checking whether a
> reset is in flight while holding a lock.
> 
> > 5. unlock
> > 
> > Massively lockless state machines are cool, but also very hard to maintain
> > and keep correct.
> 
> Not saying we can't try to make this simpler over time, but a huge
> structural change at this point is way more likely to break things than
> cleaning up what we have in place.
> 
> Matt
> 
> > -Daniel
> > 
> > >   */
> > >  
> > >  /* GuC Virtual Engine */
> > > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > > index d818cfbfc41d..177eaf55adff 100644
> > > --- a/drivers/gpu/drm/i915/i915_request.h
> > > +++ b/drivers/gpu/drm/i915/i915_request.h
> > > @@ -290,18 +290,20 @@ struct i915_request {
> > >  		struct hrtimer timer;
> > >  	} watchdog;
> > >  
> > > -	/*
> > > -	 * Requests may need to be stalled when using GuC submission waiting for
> > > -	 * certain GuC operations to complete. If that is the case, stalled
> > > -	 * requests are added to a per context list of stalled requests. The
> > > -	 * below list_head is the link in that list.
> > > +	/**
> > > +	 * @guc_fence_link: Requests may need to be stalled when using GuC
> > > +	 * submission waiting for certain GuC operations to complete. If that is
> > > +	 * the case, stalled requests are added to a per context list of stalled
> > > +	 * requests. The below list_head is the link in that list. Protected by
> > > +	 * ce->guc_state.lock.
> > >  	 */
> > >  	struct list_head guc_fence_link;
> > >  
> > >  	/**
> > > -	 * Priority level while the request is inflight. Differs from i915
> > > -	 * scheduler priority. See comment above
> > > -	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details.
> > > +	 * @guc_prio: Priority level while the request is inflight. Differs from
> > > +	 * i915 scheduler priority. See comment above
> > > +	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details. Protected by
> > > +	 * ce->guc_active.lock.
> > >  	 */
> > >  #define	GUC_PRIO_INIT	0xff
> > >  #define	GUC_PRIO_FINI	0xfe
> > > -- 
> > > 2.32.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc
  2021-08-17 17:20       ` Daniel Vetter
@ 2021-08-17 17:27         ` Michal Wajdeczko
  2021-08-17 17:34           ` Daniel Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Wajdeczko @ 2021-08-17 17:27 UTC (permalink / raw)
  To: Daniel Vetter, Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter



On 17.08.2021 19:20, Daniel Vetter wrote:
> On Tue, Aug 17, 2021 at 09:36:49AM -0700, Matthew Brost wrote:
>> On Tue, Aug 17, 2021 at 01:11:41PM +0200, Daniel Vetter wrote:
>>> On Mon, Aug 16, 2021 at 06:51:39AM -0700, Matthew Brost wrote:
>>>> Add GuC kernel doc for all structures added thus far for GuC submission
>>>> and update the main GuC submission section with the new interface
>>>> details.
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>
>>> There's quite a bit more, e.g. intel_guc_ct, which has its own world of
>>> locking design that also doesn't feel too consistent.
>>>
>>
>> That is a different layer than GuC submission so I don't think we should
>> mention anything about that layer here. I didn't really write that layer,
>> and it's super painful to touch that code, so I'm going to stay out of any
>> rework you think we need to do there.
> 
> Well there's three locks 

It's likely me.

There is one lock for the recv CTB, one for the send CTB, and one for the
list of read messages ready to post-process - do you want to use a single
lock for both CTBs, or a single lock for all cases in CT?

Michal

disclaimer: outstanding_g2h are not part of the CTB layer


> there plus it leaks out (you have your
> outstanding_submission_g2h atomic_t which is very closely tied to, well,
> outstanding guc transmissions), so I guess I need someone else for that?
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc
  2021-08-17 17:27         ` Michal Wajdeczko
@ 2021-08-17 17:34           ` Daniel Vetter
  2021-08-17 20:41             ` Michal Wajdeczko
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 17:34 UTC (permalink / raw)
  To: Michal Wajdeczko
  Cc: Daniel Vetter, Matthew Brost, intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 07:27:18PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 17.08.2021 19:20, Daniel Vetter wrote:
> > On Tue, Aug 17, 2021 at 09:36:49AM -0700, Matthew Brost wrote:
> >> On Tue, Aug 17, 2021 at 01:11:41PM +0200, Daniel Vetter wrote:
> >>> On Mon, Aug 16, 2021 at 06:51:39AM -0700, Matthew Brost wrote:
> >>>> Add GuC kernel doc for all structures added thus far for GuC submission
> >>>> and update the main GuC submission section with the new interface
> >>>> details.
> >>>>
> >>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >>>
> >>> There's quite a bit more, e.g. intel_guc_ct, which has its own world of
> >>> locking design that also doesn't feel too consistent.
> >>>
> >>
> >> That is a different layer than GuC submission so I don't think we should
> >> mention anything about that layer here. I didn't really write that layer,
> >> and it's super painful to touch that code, so I'm going to stay out of any
> >> rework you think we need to do there.
> > 
> > Well there's three locks 
> 
> It's likely me.
> 
> There is one lock for the recv CTB, one for the send CTB, and one for the
> list of read messages ready to post-process - do you want to use a single
> lock for both CTBs, or a single lock for all cases in CT?
> 
> Michal
> 
> disclaimer: outstanding_g2h are not part of the CTB layer

Why? Like apparently there's not enough provided by that right now, so
Matt is now papering over that gap with more book-keeping in the next
layer. If the layer is not doing a good job it's either the wrong layer,
or shouldn't be a layer.

And yeah the locking looks like serious amounts of overkill, was it
benchmarked that we need the 3 separate locks for this?

While reading the ctb code I also noticed that a bunch of stuff is checked
before we grab the relevant spinlocks, and it's not
- wrapped in a WARN_ON or GEM_BUG_ON or similar to just check everything
  works as expected
- protected by any other lock

So it's either racy, buggy, or playing some extremely clever tricks. None
of which is very good.
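
Concretely, the pattern in question looks roughly like this (simplified
pseudo-C, not a verbatim quote of the ctb code):

	if (!h2g_has_room(ctb, len_dw))		/* shared state read, no lock */
		return -EBUSY;

	spin_lock_irqsave(&ctb->lock, flags);
	write_h2g(ctb, msg, len_dw);		/* room may have changed by now */
	spin_unlock_irqrestore(&ctb->lock, flags);

If the lock-free check is actually safe (e.g. because only one path ever
shrinks the space), that's the clever-trick case and it needs a comment;
otherwise the check belongs under the lock.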
-Daniel

> 
> 
> > there plus it leaks out (you have your
> > outstanding_submission_g2h atomic_t which is very closely tied to, well,
> > outstanding guc transmissions), so I guess I need someone else for that?
> > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 02/22] drm/i915/guc: Fix outstanding G2H accounting
  2021-08-17  9:39   ` Daniel Vetter
@ 2021-08-17 18:17     ` Matthew Brost
  0 siblings, 0 replies; 56+ messages in thread
From: Matthew Brost @ 2021-08-17 18:17 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 17, 2021 at 11:39:29AM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:19AM -0700, Matthew Brost wrote:
> > A small race could result in incorrect accounting of the number of
> > outstanding G2H. Basically, prior to this patch we did not increment the
> > number of outstanding G2H if we encountered a GT reset while sending a
> > H2G. This was incorrect as the context state had already been updated to
> > anticipate a G2H response, thus the counter should be incremented.
> > 
> > Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 69faa39da178..b5d3972ae164 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -360,11 +360,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
> >  {
> >  	int err;
> >  
> > -	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > -
> > -	if (!err && g2h_len_dw)
> > +	if (g2h_len_dw)
> >  		atomic_inc(&guc->outstanding_submission_g2h);
> >  
> > +	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> 
> I'm majorly confused by the _busy_loop naming scheme, especially here.
> Like "why do we want to send a busy loop comand to guc, this doesn't make
> sense".
> 
> It seems like you're using _busy_loop as a suffix for "this is ok to be
> called in atomic context". The linux kernel bikeshed for this is generally
> _atomic() (or _in_atomic() or something like that).  Would be good to
> rename to make this slightly less confusing.

I'd like to save the bikeshedding for follow-ups if we can, as we should
get the functional fixes in to stabilize the stack + clean up the locking
to a somewhat sane state ASAP. Everyone has their favorite color of
paint...

> -Daniel
> 
> > +	if (err == -EBUSY && g2h_len_dw)
> > +		atomic_dec(&guc->outstanding_submission_g2h);
> > +

Also here is an example of why this really should be owned by the
submission code: it wants to increment this here even if the send failed
due to -ENODEV (GT reset in flight), as this is an internal counter of
how many G2H will need to be scrubbed.
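
I.e. the intended flow from the diff above, with the reasoning spelled
out (sketch only, comments added for illustration):

	if (g2h_len_dw)
		atomic_inc(&guc->outstanding_submission_g2h);

	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
	if (err == -EBUSY && g2h_len_dw)
		/*
		 * -EBUSY means the H2G was never queued, so no G2H will
		 * ever arrive or be scrubbed for it: drop the count. On
		 * -ENODEV (reset in flight) keep it, the reset path will
		 * scrub the lost G2H and decrement for us.
		 */
		atomic_dec(&guc->outstanding_submission_g2h);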

Matt

> >  	return err;
> >  }
> >  
> > -- 
> > 2.32.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc
  2021-08-17 17:34           ` Daniel Vetter
@ 2021-08-17 20:41             ` Michal Wajdeczko
  2021-08-17 21:49               ` Daniel Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Wajdeczko @ 2021-08-17 20:41 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Matthew Brost, intel-gfx, dri-devel, daniel.vetter



On 17.08.2021 19:34, Daniel Vetter wrote:
> On Tue, Aug 17, 2021 at 07:27:18PM +0200, Michal Wajdeczko wrote:
>>
>>
>> On 17.08.2021 19:20, Daniel Vetter wrote:
>>> On Tue, Aug 17, 2021 at 09:36:49AM -0700, Matthew Brost wrote:
>>>> On Tue, Aug 17, 2021 at 01:11:41PM +0200, Daniel Vetter wrote:
>>>>> On Mon, Aug 16, 2021 at 06:51:39AM -0700, Matthew Brost wrote:
>>>>>> Add GuC kernel doc for all structures added thus far for GuC submission
>>>>>> and update the main GuC submission section with the new interface
>>>>>> details.
>>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>
>>>>> There's quite a bit more, e.g. intel_guc_ct, which has its own world of
>>>>> locking design that also doesn't feel too consistent.
>>>>>
>>>>
>>>> That is a different layer than GuC submission so I don't think we should
>>>> mention anything about that layer here. I didn't really write that layer,
>>>> and it's super painful to touch that code, so I'm going to stay out of any
>>>> rework you think we need to do there.
>>>
>>> Well there's three locks 
>>
>> It's likely me.
>>
>> There is one lock for the recv CTB, one for the send CTB, and one for the
>> list of read messages ready to post-process - do you want to use a single
>> lock for both CTBs, or a single lock for all cases in CT?
>>
>> Michal
>>
>> disclaimer: outstanding_g2h are not part of the CTB layer
> 
> Why? Like apparently there's not enough provided by that right now, so
> Matt is now papering over that gap with more book-keeping in the next
> layer. If the layer is not doing a good job it's either the wrong layer,
> or shouldn't be a layer.

Note that all "outstanding g2h" used by Matt are kind of unsolicited
"event" messages received from the GuC, that CTB layer is unable
correlate. CTB only tracks "requests" messages for which "response" (or
"error") reply is expected. Thus if CTB client is expecting some extra
message for its previous communication with GuC, it must track it on its
own, as only client knows where in the CTB message payload, actual
correlation data (like context ID) is stored.
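
As a sketch, the client side looks roughly like this (simplified, names
approximate):

	/* client: H2G that will later be answered by an event G2H */
	atomic_inc(&guc->outstanding_submission_g2h);
	err = ct_send(ct, action, len, ...);

	/* client: G2H event handler, only here can the payload be decoded */
	ce = xa_load(&guc->context_lookup, ctx_id);
	...
	atomic_dec(&guc->outstanding_submission_g2h);

The CT layer just delivers the event message; it has no way to know that
this particular G2H balances that particular H2G.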

> 
> And yeah the locking looks like serious amounts of overkill, was it
> benchmarked that we need the 3 separate locks for this?

I'm not aware of any (micro)benchmarking, but we definitely need some;
we were just gradually moving from single-threaded blocking CTB calls
(waiting for CTB descriptor updates under a mutex) to non-blocking calls
(protecting only reads/writes to CTB descriptors with a spinlock, to
allow CTB usage from tasklet/irq context).

And I was just assuming that we can sacrifice a few more integers [1],
have dedicated spinlocks, and avoid early over-optimization.

> 
> While reading the ctb code I also noticed that a bunch of stuff is checked
> before we grab the relevant spinlocks, and it's not
> - wrapped in a WARN_ON or GEM_BUG_ON or similar to just check everything
>   works as expected
> - protected by any other lock
> 
> So it's either racy, buggy, or playing some extremely clever tricks. None
> of which is very good.

I'm open to improving that code as needed, but maybe in exchange, and to
increase motivation, please provide feedback on the already posted fixes [2] ;)

Michal

[1]
https://elixir.bootlin.com/linux/latest/source/arch/ia64/include/asm/spinlock_types.h#L10
[2] https://patchwork.freedesktop.org/series/92118/

> -Daniel
> 
>>
>>
>>> there plus it leaks out (you have your
>>> outstanding_submission_g2h atomic_t which is very closely tied to, well,
>>> outstanding guc transmissions), so I guess I need someone else for that?
>>>
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc
  2021-08-17 20:41             ` Michal Wajdeczko
@ 2021-08-17 21:49               ` Daniel Vetter
  0 siblings, 0 replies; 56+ messages in thread
From: Daniel Vetter @ 2021-08-17 21:49 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: Matthew Brost, intel-gfx, dri-devel

On Tue, Aug 17, 2021 at 10:41 PM Michal Wajdeczko
<michal.wajdeczko@intel.com> wrote:
> On 17.08.2021 19:34, Daniel Vetter wrote:
> > On Tue, Aug 17, 2021 at 07:27:18PM +0200, Michal Wajdeczko wrote:
> >> On 17.08.2021 19:20, Daniel Vetter wrote:
> >>> On Tue, Aug 17, 2021 at 09:36:49AM -0700, Matthew Brost wrote:
> >>>> On Tue, Aug 17, 2021 at 01:11:41PM +0200, Daniel Vetter wrote:
> >>>>> On Mon, Aug 16, 2021 at 06:51:39AM -0700, Matthew Brost wrote:
> >>>>>> Add GuC kernel doc for all structures added thus far for GuC submission
> >>>>>> and update the main GuC submission section with the new interface
> >>>>>> details.
> >>>>>>
> >>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >>>>>
> >>>>> There's quite a bit more, e.g. intel_guc_ct, which has its own world of
> >>>>> locking design that also doesn't feel too consistent.
> >>>>>
> >>>>
> >>>> That is a different layer than GuC submission so I don't think we should
> >>>> mention anything about that layer here. I didn't really write that layer,
> >>>> and it's super painful to touch that code, so I'm going to stay out of any
> >>>> rework you think we need to do there.
> >>>
> >>> Well there's three locks
> >>
> >> It's likely me.
> >>
> >> There is one lock for the recv CTB, one for the send CTB, and one for the
> >> list of read messages ready to post-process - do you want to use a single
> >> lock for both CTBs, or a single lock for all cases in CT?
> >>
> >> Michal
> >>
> >> disclaimer: outstanding_g2h are not part of the CTB layer
> >
> > Why? Like apparently there's not enough provided by that right now, so
> > Matt is now papering over that gap with more book-keeping in the next
> > layer. If the layer is not doing a good job it's either the wrong layer,
> > or shouldn't be a layer.
>
> Note that all "outstanding g2h" used by Matt are kind of unsolicited
> "event" messages received from the GuC, that CTB layer is unable
> correlate. CTB only tracks "requests" messages for which "response" (or
> "error") reply is expected. Thus if CTB client is expecting some extra
> message for its previous communication with GuC, it must track it on its
> own, as only client knows where in the CTB message payload, actual
> correlation data (like context ID) is stored.

I thought there are already some patches to reserve g2h space, because
the guc dies if there's none left? Which would mean ctb should already
know when there's more coming.

The problem is if every user of guc has to track this themselves we
get a pretty bad spaghetti monster around guc reset. Currently it's
only guc submission, so we could fix it there by wrapping a lock
around all guc submissions it does, but already on the wakeup side
it's more tricky. That really feels like working around issues that live
somewhere else.

> > And yeah the locking looks like serious amounts of overkill, was it
> > benchmarked that we need the 3 separate locks for this?
>
> I'm not aware of any (micro)benchmarking, but we definitely need some;
> we were just gradually moving from single-threaded blocking CTB calls
> (waiting for CTB descriptor updates under a mutex) to non-blocking calls
> (protecting only reads/writes to CTB descriptors with a spinlock, to
> allow CTB usage from tasklet/irq context).

A spinlock is fine, if it really protects everything (I've found a bunch
of checks outside of these locks that leave me wondering). Multiple
spinlocks need justification since, at least to my understanding, there's
a pile of overlapping stuff you need to protect, like the reservations
of g2h space.
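
As a sketch of what I'd expect instead (hypothetical, not a patch, helper
names made up):

	spin_lock_irqsave(&ct->lock, flags);
	if (!g2h_reserve_space(ct, g2h_len_dw)) {
		spin_unlock_irqrestore(&ct->lock, flags);
		return -EBUSY;
	}
	err = ct_write(ct, action, len, fence);
	spin_unlock_irqrestore(&ct->lock, flags);

One lock makes the g2h space reservation and the send a single critical
section; with three locks they are ordered against different writers,
which is exactly the overlap I'm worried about.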

> And I was just assuming that we can sacrifice a few more integers [1],
> have dedicated spinlocks, and avoid early over-optimization.

None of this has anything to do with saving memory (that's entirely
irrelevant here), it's about complexity. Any lock you add makes the
complexity worse, and I don't understand why ctb needs 3 spinlocks
instead of just one.

If the only justification for this is that maybe it makes things
faster, and it was not properly benchmarked first (microbenchmarks
don't count if it's not a relevant end use case that umds actually
care about), then it has to go and be simplified. It really should never
have landed, because taking locking complexity out is much harder
than adding it in the first place.

And the current overall i915-gem code is definitely on the wrong side
of "too complex locking design", so there's no wiggle room here for
exceptions.

> > While reading the ctb code I also noticed that a bunch of stuff is checked
> > before we grab the relevant spinlocks, and it's not
> > - wrapped in a WARN_ON or GEM_BUG_ON or similar to just check everything
> >   works as expected
> > - protected by any other lock
> >
> > So it's either racy, buggy, or playing some extremely clever tricks. None
> > of which is very good.
>
> I'm open to improving that code as needed, but maybe in exchange, and to
> increase motivation, please provide feedback on the already posted fixes [2] ;)

Sure, I can try, but also these patches have been sitting on the list for
almost 7 weeks now with absolutely nothing happening. It's your job as
submitter to make sure these move forward, or to escalate if necessary.
Not to wait for some kind of miracle; those don't happen.
-Daniel

> Michal
>
> [1]
> https://elixir.bootlin.com/linux/latest/source/arch/ia64/include/asm/spinlock_types.h#L10
> [2] https://patchwork.freedesktop.org/series/92118/
>
> > -Daniel
> >
> >>
> >>
> >>> there plus it leaks out (you have your
> >>> outstanding_submission_g2h atomic_t which is very closely tied to, well,
> >>> outstanding guc transmissions), so I guess I need someone else for that?
> >>>
> >



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2021-08-17 21:49 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-16 13:51 [Intel-gfx] [PATCH 00/22] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 01/22] drm/i915/guc: Fix blocked context accounting Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 02/22] drm/i915/guc: Fix outstanding G2H accounting Matthew Brost
2021-08-17  9:39   ` Daniel Vetter
2021-08-17 18:17     ` Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 03/22] drm/i915/guc: Unwind context requests in reverse order Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 04/22] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 05/22] drm/i915/guc: Workaround reset G2H is received after schedule done G2H Matthew Brost
2021-08-17  9:32   ` Daniel Vetter
2021-08-17 15:03     ` Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 06/22] drm/i915/execlists: Do not propagate errors to dependent fences Matthew Brost
2021-08-17  9:21   ` Daniel Vetter
2021-08-17 15:08     ` Matthew Brost
2021-08-17 15:49       ` Daniel Vetter
2021-08-16 13:51 ` [Intel-gfx] [PATCH 07/22] drm/i915/selftests: Add a cancel request selftest that triggers a reset Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 08/22] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered Matthew Brost
2021-08-17  9:47   ` Daniel Vetter
2021-08-17  9:57     ` Daniel Vetter
2021-08-17 16:44     ` Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 09/22] drm/i915/selftests: Fix memory corruption in live_lrc_isolation Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 10/22] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 11/22] drm/i915/guc: Take context ref when cancelling request Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 12/22] drm/i915/guc: Don't touch guc_state.sched_state without a lock Matthew Brost
2021-08-17  7:21   ` kernel test robot
2021-08-16 13:51 ` [Intel-gfx] [PATCH 13/22] drm/i915/guc: Reset LRC descriptor if register returns -ENODEV Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 14/22] drm/i915: Allocate error capture in atomic context Matthew Brost
2021-08-17 10:06   ` Daniel Vetter
2021-08-17 16:12     ` Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 15/22] drm/i915/guc: Flush G2H work queue during reset Matthew Brost
2021-08-17 10:06   ` Daniel Vetter
2021-08-16 13:51 ` [Intel-gfx] [PATCH 16/22] drm/i915/guc: Release submit fence from an IRQ Matthew Brost
2021-08-17 10:08   ` Daniel Vetter
2021-08-16 13:51 ` [Intel-gfx] [PATCH 17/22] drm/i915/guc: Move guc_blocked fence to struct guc_state Matthew Brost
2021-08-17 10:10   ` Daniel Vetter
2021-08-16 13:51 ` [Intel-gfx] [PATCH 18/22] drm/i915/guc: Rework and simplify locking Matthew Brost
2021-08-17 10:15   ` Daniel Vetter
2021-08-17 15:30     ` Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 19/22] drm/i915/guc: Proper xarray usage for contexts_lookup Matthew Brost
2021-08-17 10:27   ` Daniel Vetter
2021-08-17 15:26     ` Matthew Brost
2021-08-17 17:13       ` Daniel Vetter
2021-08-17 17:13         ` Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 20/22] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 21/22] drm/i915/guc: Move GuC priority fields in context under guc_active Matthew Brost
2021-08-16 13:51 ` [Intel-gfx] [PATCH 22/22] drm/i915/guc: Add GuC kernel doc Matthew Brost
2021-08-17 11:11   ` Daniel Vetter
2021-08-17 16:36     ` Matthew Brost
2021-08-17 17:20       ` Daniel Vetter
2021-08-17 17:27         ` Michal Wajdeczko
2021-08-17 17:34           ` Daniel Vetter
2021-08-17 20:41             ` Michal Wajdeczko
2021-08-17 21:49               ` Daniel Vetter
2021-08-17 12:49 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev2) Patchwork
2021-08-17 12:51 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2021-08-17 13:22 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-08-17 14:39 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).