intel-gfx.lists.freedesktop.org archive mirror
* [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Daniel Vetter pointed out that locking in the GuC submission code was
overly complicated; let's clean this up a bit before introducing more
features in the GuC submission backend.

Also fix some CI failures, port fixes from our internal tree, and add a
few more selftests for coverage.

Lastly, add some kernel DOC explaining how the GuC submission backend
works.

v2: Fix logic error in 'Workaround reset G2H is received after schedule
done G2H', don't propagate errors to dependent fences in execlists
submission, resolve checkpatch issues, resend to correct lists
v3: Fix issue kicking tasklet, drop guc_active, fix ref counting in
xarray, add guc_id sub structure, drop inline functions, and various
other cleanups suggested by Daniel

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Matthew Brost (27):
  drm/i915/guc: Fix blocked context accounting
  drm/i915/guc: Fix outstanding G2H accounting
  drm/i915/guc: Unwind context requests in reverse order
  drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
  drm/i915/guc: Process all G2H message at once in work queue
  drm/i915/guc: Workaround reset G2H is received after schedule done G2H
  Revert "drm/i915/gt: Propagate change in error status to children on
    unhold"
  drm/i915/selftests: Add a cancel request selftest that triggers a
    reset
  drm/i915/guc: Kick tasklet after queuing a request
  drm/i915/guc: Don't enable scheduling on a banned context, guc_id
    invalid, not registered
  drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  drm/i915/guc: Take context ref when cancelling request
  drm/i915/guc: Don't touch guc_state.sched_state without a lock
  drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
  drm/i915: Allocate error capture in nowait context
  drm/i915/guc: Flush G2H work queue during reset
  drm/i915/guc: Release submit fence from an irq_work
  drm/i915/guc: Move guc_blocked fence to struct guc_state
  drm/i915/guc: Rework and simplify locking
  drm/i915/guc: Proper xarray usage for contexts_lookup
  drm/i915/guc: Drop pin count check trick between sched_disable and
    re-pin
  drm/i915/guc: Move GuC priority fields in context under guc_active
  drm/i915/guc: Move fields protected by guc->contexts_lock into sub
    structure
  drm/i915/guc: Drop guc_active move everything into guc_state
  drm/i915/guc: Add GuC kernel doc
  drm/i915/guc: Drop static inline functions intel_guc_submission.c

 drivers/gpu/drm/i915/gt/intel_context.c       |  19 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  81 +-
 .../drm/i915/gt/intel_execlists_submission.c  |   4 -
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   6 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  29 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  19 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |   6 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 955 +++++++++++-------
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 +++
 drivers/gpu/drm/i915/i915_gpu_error.c         |  39 +-
 drivers/gpu/drm/i915/i915_request.h           |  23 +-
 drivers/gpu/drm/i915/i915_trace.h             |  12 +-
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 drivers/gpu/drm/i915/selftests/i915_request.c | 100 ++
 .../i915/selftests/intel_scheduler_helpers.c  |  12 +
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 16 files changed, 968 insertions(+), 466 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

-- 
2.32.0


* [Intel-gfx] [PATCH 01/27] drm/i915/guc: Fix blocked context accounting
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Prior to this patch, the blocked context counter was cleared in
init_sched_state (used when registering a context and during resets),
which is incorrect. This state needs to be persistent, otherwise the
counter can report an incorrect value, resulting in scheduling never
being enabled again.
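
As an illustration, here is a minimal userspace model of the fix; the
shift value and bit layout are invented for this sketch and are not the
driver's actual assignment:

  #include <assert.h>
  #include <stdint.h>

  /* hypothetical layout: blocked count lives in bits [15:4] */
  #define SCHED_STATE_BLOCKED_SHIFT	4
  #define SCHED_STATE_BLOCKED		(1 << SCHED_STATE_BLOCKED_SHIFT)
  #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)

  int main(void)
  {
  	uint32_t sched_state = 0;

  	/* two blocks are pending on the context */
  	sched_state += SCHED_STATE_BLOCKED;
  	sched_state += SCHED_STATE_BLOCKED;

  	/* the old code did sched_state = 0, losing the count and with
  	 * it any chance of scheduling being re-enabled */

  	/* the new code clears everything except the blocked counter */
  	sched_state &= SCHED_STATE_BLOCKED_MASK;

  	assert(sched_state >> SCHED_STATE_BLOCKED_SHIFT == 2);
  	return 0;
  }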

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 87d8dc8f51b9..69faa39da178 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
 {
 	/* Only should be called from guc_lrc_desc_pin() */
 	atomic_set(&ce->guc_sched_state_no_lock, 0);
-	ce->guc_state.sched_state = 0;
+	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
 static inline bool
-- 
2.32.0


* [Intel-gfx] [PATCH 02/27] drm/i915/guc: Fix outstanding G2H accounting
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Fix a small race that could result in incorrect accounting of the
number of outstanding G2H. Prior to this patch we did not increment the
number of outstanding G2H if we encountered a GT reset while sending an
H2G. This was incorrect, as the context state had already been updated
to anticipate a G2H response, thus the counter should be incremented.

Also, always use the helper when decrementing this value.
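
For illustration, a small userspace model of the new ordering; the
function names and the transport stub are invented for this sketch and
are not the driver's API:

  #include <errno.h>
  #include <stdatomic.h>
  #include <stdbool.h>

  static atomic_int outstanding_g2h;

  /* stand-in for the CT send; a GT reset would make it fail -EBUSY */
  static int xmit(void) { return 0; }

  static int send_expecting_g2h(bool expects_g2h)
  {
  	int err;

  	/*
  	 * Bump the counter before sending: the context state already
  	 * anticipates a G2H reply, so a reset racing with the send
  	 * must observe the counter incremented.
  	 */
  	if (expects_g2h)
  		atomic_fetch_add(&outstanding_g2h, 1);

  	err = xmit();
  	/* roll back only if the send definitively failed */
  	if (err == -EBUSY && expects_g2h)
  		atomic_fetch_sub(&outstanding_g2h, 1);

  	return err;
  }

  int main(void) { return send_expecting_g2h(true); }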

Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 ++++++++++---------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 69faa39da178..32c414aa9009 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -352,6 +352,12 @@ static inline void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
 	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
+static void decr_outstanding_submission_g2h(struct intel_guc *guc)
+{
+	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
+		wake_up_all(&guc->ct.wq);
+}
+
 static int guc_submission_send_busy_loop(struct intel_guc *guc,
 					 const u32 *action,
 					 u32 len,
@@ -360,11 +366,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
 {
 	int err;
 
-	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
-
-	if (!err && g2h_len_dw)
+	if (g2h_len_dw)
 		atomic_inc(&guc->outstanding_submission_g2h);
 
+	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
+	if (err == -EBUSY && g2h_len_dw)
+		decr_outstanding_submission_g2h(guc);
+
 	return err;
 }
 
@@ -616,7 +624,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 		init_sched_state(ce);
 
 		if (pending_enable || destroyed || deregister) {
-			atomic_dec(&guc->outstanding_submission_g2h);
+			decr_outstanding_submission_g2h(guc);
 			if (deregister)
 				guc_signal_context_fence(ce);
 			if (destroyed) {
@@ -635,7 +643,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 				intel_engine_signal_breadcrumbs(ce->engine);
 			}
 			intel_context_sched_disable_unpin(ce);
-			atomic_dec(&guc->outstanding_submission_g2h);
+			decr_outstanding_submission_g2h(guc);
 			spin_lock_irqsave(&ce->guc_state.lock, flags);
 			guc_blocked_fence_complete(ce);
 			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
@@ -2583,12 +2591,6 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 	return ce;
 }
 
-static void decr_outstanding_submission_g2h(struct intel_guc *guc)
-{
-	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
-		wake_up_all(&guc->ct.wq);
-}
-
 int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 					  const u32 *msg,
 					  u32 len)
-- 
2.32.0


* [Intel-gfx] [PATCH 03/27] drm/i915/guc: Unwind context requests in reverse order
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

When unwinding requests on a reset context, if other requests in the
context are in the priority list, the requests could be resubmitted out
of seqno order. Traverse the list of active requests in reverse and
append to the head of the priority list to fix this.
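
A quick userspace demonstration (not i915 code) of why reverse
traversal plus head insertion preserves the original order:

  #include <stdio.h>

  int main(void)
  {
  	int seqno[] = { 1, 2, 3, 4 };	/* requests in submission order */
  	int pl[4];			/* models the priority list */
  	int n = 0, i, j;

  	/* walk the active list in reverse, list_add() each at the head */
  	for (i = 3; i >= 0; i--) {
  		for (j = n; j > 0; j--)	/* make room at the head */
  			pl[j] = pl[j - 1];
  		pl[0] = seqno[i];
  		n++;
  	}

  	for (i = 0; i < 4; i++)
  		printf("%d ", pl[i]);	/* prints: 1 2 3 4 */
  	printf("\n");
  	return 0;
  }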

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 32c414aa9009..9ca0ba4ea85a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -805,9 +805,9 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
 	spin_lock(&ce->guc_active.lock);
-	list_for_each_entry_safe(rq, rn,
-				 &ce->guc_active.requests,
-				 sched.link) {
+	list_for_each_entry_safe_reverse(rq, rn,
+					 &ce->guc_active.requests,
+					 sched.link) {
 		if (i915_request_completed(rq))
 			continue;
 
@@ -824,7 +824,7 @@ __unwind_incomplete_requests(struct intel_context *ce)
 		}
 		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
 
-		list_add_tail(&rq->sched.link, pl);
+		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 
 		spin_lock(&ce->guc_active.lock);
-- 
2.32.0


* [Intel-gfx] [PATCH 04/27] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Don't drop ce->guc_active.lock when unwinding a context after reset.
At one point we had to drop this because of a lock inversion, but that
is no longer the case. It is much safer to hold the lock, so let's do
that.

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9ca0ba4ea85a..e4a099f8f820 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -812,8 +812,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
 			continue;
 
 		list_del_init(&rq->sched.link);
-		spin_unlock(&ce->guc_active.lock);
-
 		__i915_request_unsubmit(rq);
 
 		/* Push the request back into the queue for later resubmission. */
@@ -826,8 +824,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
-
-		spin_lock(&ce->guc_active.lock);
 	}
 	spin_unlock(&ce->guc_active.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
-- 
2.32.0


* [Intel-gfx] [PATCH 05/27] drm/i915/guc: Process all G2H message at once in work queue
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Rather than processing one G2H at a time and re-queuing the work item
if more messages exist, process all the G2H messages in a single pass
of the work queue.
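
The shape of the change, modeled as a minimal userspace sketch (the
pending counter is a stand-in for the CT buffer contents):

  #include <stdbool.h>
  #include <stdio.h>

  static int pending = 5;	/* pretend five G2H messages are queued */

  /* returns true once no more messages remain */
  static bool process_incoming_requests(void)
  {
  	if (pending > 0)
  		--pending;
  	return pending == 0;
  }

  int main(void)
  {
  	bool done;

  	/* one worker invocation now drains everything, instead of
  	 * re-queuing itself once per message */
  	do {
  		done = process_incoming_requests();
  	} while (!done);

  	printf("pending = %d\n", pending);	/* pending = 0 */
  	return 0;
  }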

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 22b4733b55e2..20c710a74498 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -1042,9 +1042,9 @@ static void ct_incoming_request_worker_func(struct work_struct *w)
 		container_of(w, struct intel_guc_ct, requests.worker);
 	bool done;
 
-	done = ct_process_incoming_requests(ct);
-	if (!done)
-		queue_work(system_unbound_wq, &ct->requests.worker);
+	do {
+		done = ct_process_incoming_requests(ct);
+	} while (!done);
 }
 
 static int ct_handle_event(struct intel_guc_ct *ct, struct ct_incoming_msg *request)
-- 
2.32.0


* [Intel-gfx] [PATCH 06/27] drm/i915/guc: Workaround reset G2H is received after schedule done G2H
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

If the context is reset as a result of request cancellation, the
context reset G2H is received after the schedule disable done G2H,
which is likely the wrong order. The schedule disable done G2H releases
the waiting request cancellation code, which resubmits the context.
This races with the context reset G2H, which also wants to resubmit the
context, but in this case it really should be a NOP as the request
cancellation code owns the resubmit. Use some clever tricks of checking
the context state to seal this race until if / when the GuC firmware is
fixed.

v2:
 (Checkpatch)
  - Fix typos

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++++++---
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e4a099f8f820..8f7a11e65ef5 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -832,17 +832,35 @@ __unwind_incomplete_requests(struct intel_context *ce)
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
 	struct i915_request *rq;
+	unsigned long flags;
 	u32 head;
+	bool skip = false;
 
 	intel_context_get(ce);
 
 	/*
-	 * GuC will implicitly mark the context as non-schedulable
-	 * when it sends the reset notification. Make sure our state
-	 * reflects this change. The context will be marked enabled
-	 * on resubmission.
+	 * GuC will implicitly mark the context as non-schedulable when it sends
+	 * the reset notification. Make sure our state reflects this change. The
+	 * context will be marked enabled on resubmission.
+	 *
+	 * XXX: If the context is reset as a result of the request cancellation
+	 * this G2H is received after the schedule disable complete G2H which is
+	 * likely wrong as this creates a race between the request cancellation
+	 * code re-submitting the context and this G2H handler. This likely
+	 * should be fixed in the GuC but until if / when that gets fixed we
+	 * need to workaround this. Convert this function to a NOP if a pending
+	 * enable is in flight as this indicates that a request cancellation has
+	 * occurred.
 	 */
-	clr_context_enabled(ce);
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	if (likely(!context_pending_enable(ce))) {
+		clr_context_enabled(ce);
+	} else {
+		skip = true;
+	}
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	if (unlikely(skip))
+		goto out_put;
 
 	rq = intel_context_find_active_request(ce);
 	if (!rq) {
@@ -861,6 +879,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
 out_replay:
 	guc_reset_state(ce, head, stalled);
 	__unwind_incomplete_requests(ce);
+out_put:
 	intel_context_put(ce);
 }
 
@@ -1605,6 +1624,13 @@ static void guc_context_cancel_request(struct intel_context *ce,
 			guc_reset_state(ce, intel_ring_wrap(ce->ring, rq->head),
 					true);
 		}
+
+		/*
+		 * XXX: Racey if context is reset, see comment in
+		 * __guc_reset_context().
+		 */
+		flush_work(&ce_to_guc(ce)->ct.requests.worker);
+
 		guc_context_unblock(ce);
 	}
 }
@@ -2719,7 +2745,12 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
-	if (likely(!intel_context_is_banned(ce))) {
+	/*
+	 * XXX: Racey if request cancellation has occurred, see comment in
+	 * __guc_reset_context().
+	 */
+	if (likely(!intel_context_is_banned(ce) &&
+		   !context_blocked(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
 	}
-- 
2.32.0


* [Intel-gfx] [PATCH 07/27] Revert "drm/i915/gt: Propagate change in error status to children on unhold"
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Propagating errors to dependent fences is wrong, don't do it. A
selftest in the following patch exposed the propagation of an error to
a dependent fence after an engine reset.

This reverts commit 8e9f84cf5cac248a1c6a5daa4942879c8b765058.

v2:
 (Daniel Vetter)
  - Use revert

References: 3761baae908a (Revert "drm/i915: Propagate errors on awaiting already signaled fences")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index de5f9c86b9a4..cafb0608ffb4 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2140,10 +2140,6 @@ static void __execlists_unhold(struct i915_request *rq)
 			if (p->flags & I915_DEPENDENCY_WEAK)
 				continue;
 
-			/* Propagate any change in error status */
-			if (rq->fence.error)
-				i915_request_set_error_once(w, rq->fence.error);
-
 			if (w->engine != rq->engine)
 				continue;
 
-- 
2.32.0


* [Intel-gfx] [PATCH 08/27] drm/i915/selftests: Add a cancel request selftest that triggers a reset
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Add a cancel request selftest that results in an engine reset to cancel
the request, as it is non-preemptible. Also insert a NOP request after
the cancelled request and confirm that it completes successfully.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/selftests/i915_request.c | 100 ++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index d67710d10615..e2c5db77f087 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -772,6 +772,98 @@ static int __cancel_completed(struct intel_engine_cs *engine)
 	return err;
 }
 
+static int __cancel_reset(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq, *nop;
+	unsigned long preempt_timeout_ms;
+	int err = 0;
+
+	preempt_timeout_ms = engine->props.preempt_timeout_ms;
+	engine->props.preempt_timeout_ms = 100;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		goto out_restore;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_NOOP);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling active request\n", engine->name);
+	i915_request_get(rq);
+	i915_request_add(rq);
+	if (!igt_wait_for_spinner(&spin, rq)) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("Failed to start spinner on %s\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	nop = intel_context_create_request(ce);
+	if (IS_ERR(nop))
+		goto out_nop;
+	i915_request_get(nop);
+	i915_request_add(nop);
+
+	i915_request_cancel(rq, -EINTR);
+
+	if (i915_request_wait(rq, 0, HZ) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel hung request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_nop;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+		goto out_nop;
+	}
+
+	if (i915_request_wait(nop, 0, HZ) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to complete nop request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_nop;
+	}
+
+	if (nop->fence.error != 0) {
+		pr_err("%s: Nop request errored (%u)\n",
+		       engine->name, nop->fence.error);
+		err = -EINVAL;
+	}
+
+out_nop:
+	i915_request_put(nop);
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+out_restore:
+	engine->props.preempt_timeout_ms = preempt_timeout_ms;
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
 static int live_cancel_request(void *arg)
 {
 	struct drm_i915_private *i915 = arg;
@@ -804,6 +896,14 @@ static int live_cancel_request(void *arg)
 			return err;
 		if (err2)
 			return err2;
+
+		/* Expects reset so call outside of igt_live_test_* */
+		err = __cancel_reset(engine);
+		if (err)
+			return err;
+
+		if (igt_flush_test(i915))
+			return -EIO;
 	}
 
 	return 0;
-- 
2.32.0


* [Intel-gfx] [PATCH 09/27] drm/i915/guc: Kick tasklet after queuing a request
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Kick the tasklet after queuing a request so it is submitted in a timely
manner.

Fixes: 3a4cdf1982f0 ("drm/i915/guc: Implement GuC context operations for new inteface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 8f7a11e65ef5..d61f906105ef 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1050,6 +1050,7 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
 	list_add_tail(&rq->sched.link,
 		      i915_sched_lookup_priolist(sched_engine, prio));
 	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+	tasklet_hi_schedule(&sched_engine->tasklet);
 }
 
 static int guc_bypass_tasklet_submit(struct intel_guc *guc,
-- 
2.32.0


* [Intel-gfx] [PATCH 10/27] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

When unblocking a context, do not enable scheduling if the context is
banned, its guc_id is invalid, or it is not registered.

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index d61f906105ef..e53a4ef7d442 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1586,6 +1586,9 @@ static void guc_context_unblock(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	if (unlikely(submission_disabled(guc) ||
+		     intel_context_is_banned(ce) ||
+		     context_guc_id_invalid(ce) ||
+		     !lrc_desc_registered(guc, ce->guc_id) ||
 		     !intel_context_is_pinned(ce) ||
 		     context_pending_disable(ce) ||
 		     context_blocked(ce) > 1)) {
-- 
2.32.0


* [Intel-gfx] [PATCH 11/27] drm/i915/selftests: Fix memory corruption in live_lrc_isolation
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

GuC submission has exposed an existing memory corruption in
live_lrc_isolation. We believe that some writes to the watchdog offsets
in the LRC (0x178 & 0x17c) can result in trashing of portions of the
address space. With GuC submission there are additional objects which
can move the context redzone into the space that is trashed. To work
around this, avoid poisoning the watchdog.

v2:
 (Daniel Vetter)
  - Add VLK ref in code to workaround

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index b0977a3b699b..cdc6ae48a1e1 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
 	goto err_after;
 }
 
+static u32 safe_offset(u32 offset, u32 reg)
+{
+	/* XXX skip testing of watchdog - VLK-22772 */
+	if (offset == 0x178 || offset == 0x17c)
+		reg = 0;
+
+	return reg;
+}
+
+static int get_offset_mask(struct intel_engine_cs *engine)
+{
+	if (GRAPHICS_VER(engine->i915) < 12)
+		return 0xfff;
+
+	switch (engine->class) {
+	default:
+	case RENDER_CLASS:
+		return 0x07ff;
+	case COPY_ENGINE_CLASS:
+		return 0x0fff;
+	case VIDEO_DECODE_CLASS:
+	case VIDEO_ENHANCEMENT_CLASS:
+		return 0x3fff;
+	}
+}
+
 static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 {
 	struct i915_vma *batch;
@@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 		len = (len + 1) / 2;
 		*cs++ = MI_LOAD_REGISTER_IMM(len);
 		while (len--) {
-			*cs++ = hw[dw];
+			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
+					    hw[dw]);
 			*cs++ = poison;
 			dw += 2;
 		}
-- 
2.32.0


* [Intel-gfx] [PATCH 12/27] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

While debugging an issue with full GT resets, I went down a rabbit hole
thinking the scrubbing of lost G2H wasn't working correctly. This
proved to be incorrect, as it was working just fine, but the chase
inspired me to write a selftest to prove that it works. This simple
selftest injects errors dropping various G2H and then issues a full GT
reset, proving that the scrubbing of these G2H doesn't blow up.

v2:
 (Daniel Vetter)
  - Use ifdef instead of macros for selftests
v3:
 (Checkpatch)
  - A space after 'switch' statement

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  18 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  25 ++++
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 6 files changed, 184 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index e54351a170e2..3a73f3117873 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -198,6 +198,24 @@ struct intel_context {
 	 */
 	u8 guc_prio;
 	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
+
+#ifdef CONFIG_DRM_I915_SELFTEST
+	/**
+	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
+	 */
+	bool drop_schedule_enable;
+
+	/**
+	 * @drop_schedule_disable: Force drop of schedule disable G2H for
+	 * selftest
+	 */
+	bool drop_schedule_disable;
+
+	/**
+	 * @drop_deregister: Force drop of deregister G2H for selftest
+	 */
+	bool drop_deregister;
+#endif
 };
 
 #endif /* __INTEL_CONTEXT_TYPES__ */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e53a4ef7d442..e0e85e4ad512 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2635,6 +2635,13 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 
 	trace_intel_context_deregister_done(ce);
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+	if (unlikely(ce->drop_deregister)) {
+		ce->drop_deregister = false;
+		return 0;
+	}
+#endif
+
 	if (context_wait_for_deregister_to_register(ce)) {
 		struct intel_runtime_pm *runtime_pm =
 			&ce->engine->gt->i915->runtime_pm;
@@ -2689,10 +2696,24 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 	trace_intel_context_sched_done(ce);
 
 	if (context_pending_enable(ce)) {
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_enable)) {
+			ce->drop_schedule_enable = false;
+			return 0;
+		}
+#endif
+
 		clr_context_pending_enable(ce);
 	} else if (context_pending_disable(ce)) {
 		bool banned;
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_disable)) {
+			ce->drop_schedule_disable = false;
+			return 0;
+		}
+#endif
+
 		/*
 		 * Unpin must be done before __guc_signal_context_fence,
 		 * otherwise a race exists between the requests getting
@@ -3069,3 +3090,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 
 	return false;
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftest_guc.c"
+#endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
new file mode 100644
index 000000000000..264e2f705c17
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include "selftests/intel_scheduler_helpers.h"
+
+static struct i915_request *nop_user_request(struct intel_context *ce,
+					     struct i915_request *from)
+{
+	struct i915_request *rq;
+	int ret;
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	if (from) {
+		ret = i915_sw_fence_await_dma_fence(&rq->submit,
+						    &from->fence, 0,
+						    I915_FENCE_GFP);
+		if (ret < 0) {
+			i915_request_put(rq);
+			return ERR_PTR(ret);
+		}
+	}
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static int intel_guc_scrub_ctbs(void *arg)
+{
+	struct intel_gt *gt = arg;
+	int ret = 0;
+	int i;
+	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct intel_context *ce;
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+	engine = intel_selftest_find_any_engine(gt);
+
+	/* Submit requests and inject errors forcing G2H to be dropped */
+	for (i = 0; i < 3; ++i) {
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce)) {
+			ret = PTR_ERR(ce);
+			pr_err("Failed to create context, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		switch (i) {
+		case 0:
+			ce->drop_schedule_enable = true;
+			break;
+		case 1:
+			ce->drop_schedule_disable = true;
+			break;
+		case 2:
+			ce->drop_deregister = true;
+			break;
+		}
+
+		rq = nop_user_request(ce, NULL);
+		intel_context_put(ce);
+
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			pr_err("Failed to create request, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		last[i] = rq;
+	}
+
+	for (i = 0; i < 3; ++i) {
+		ret = i915_request_wait(last[i], 0, HZ);
+		if (ret < 0) {
+			pr_err("Last request failed to complete: %d\n", ret);
+			goto err;
+		}
+		i915_request_put(last[i]);
+		last[i] = NULL;
+	}
+
+	/* Force all H2G / G2H to be submitted / processed */
+	intel_gt_retire_requests(gt);
+	msleep(500);
+
+	/* Scrub missing G2H */
+	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
+
+	ret = intel_gt_wait_for_idle(gt, HZ);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err;
+	}
+
+err:
+	for (i = 0; i < 3; ++i)
+		if (last[i])
+			i915_request_put(last[i]);
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+
+	return ret;
+}
+
+int intel_guc_live_selftests(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_scrub_ctbs),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index cfa5c4165a4f..3cf6758931f9 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
 selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
+selftest(guc, intel_guc_live_selftests)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
index 4b328346b48a..310fb83c527e 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
@@ -14,6 +14,18 @@
 #define REDUCED_PREEMPT		10
 #define WAIT_FOR_RESET_TIME	10000
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, gt, id)
+		return engine;
+
+	pr_err("No valid engine found!\n");
+	return NULL;
+}
+
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 u32 modify_type)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
index 35c098601ac0..ae60bb507f45 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
@@ -10,6 +10,7 @@
 
 struct i915_request;
 struct intel_engine_cs;
+struct intel_gt;
 
 struct intel_selftest_saved_policy {
 	u32 flags;
@@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
 	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
 };
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 enum selftest_scheduler_modify modify_type);
-- 
2.32.0


* [Intel-gfx] [PATCH 13/27] drm/i915/guc: Take context ref when cancelling request
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

A context can get destroyed after a request is cancelled, so take a
reference to the context when cancelling a request.

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e0e85e4ad512..85f96d325048 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1620,8 +1620,10 @@ static void guc_context_cancel_request(struct intel_context *ce,
 				       struct i915_request *rq)
 {
 	if (i915_sw_fence_signaled(&rq->submit)) {
-		struct i915_sw_fence *fence = guc_context_block(ce);
+		struct i915_sw_fence *fence;
 
+		intel_context_get(ce);
+		fence = guc_context_block(ce);
 		i915_sw_fence_wait(fence);
 		if (!i915_request_completed(rq)) {
 			__i915_request_skip(rq);
@@ -1636,6 +1638,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
 		flush_work(&ce_to_guc(ce)->ct.requests.worker);
 
 		guc_context_unblock(ce);
+		intel_context_put(ce);
 	}
 }
 
-- 
2.32.0


* [Intel-gfx] [PATCH 14/27] drm/i915/guc: Don't touch guc_state.sched_state without a lock
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Previously we used some clever tricks to avoid taking a lock when
touching guc_state.sched_state in certain cases. Don't do that; enforce
the use of the lock.

Part of this is removing a dead code path from guc_lrc_desc_pin where a
context could be deregistered when the aforementioned function was
called from the submission path. Remove this dead code and add a
GEM_BUG_ON in case this path is ever attempted.
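
The enforcement pattern, as a hedged kernel-style sketch (the struct
and function names here are illustrative, not the actual driver code):

  #include <linux/lockdep.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  struct ctx_state {
  	spinlock_t lock;
  	u32 sched_state;
  };

  static void set_state_bit(struct ctx_state *s, u32 bit)
  {
  	/* splats under CONFIG_PROVE_LOCKING if a caller touches
  	 * sched_state without holding the lock */
  	lockdep_assert_held(&s->lock);
  	s->sched_state |= bit;
  }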

v2:
 (kernel test robot)
  - Add __maybe_unused to sched_state_is_init()

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 58 ++++++++++---------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 85f96d325048..fa87470ea576 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -150,11 +150,23 @@ static inline void clr_context_registered(struct intel_context *ce)
 #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
 static inline void init_sched_state(struct intel_context *ce)
 {
-	/* Only should be called from guc_lrc_desc_pin() */
+	lockdep_assert_held(&ce->guc_state.lock);
 	atomic_set(&ce->guc_sched_state_no_lock, 0);
 	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
+__maybe_unused
+static bool sched_state_is_init(struct intel_context *ce)
+{
+	/*
+	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
+	 * suspend.
+	 */
+	return !(atomic_read(&ce->guc_sched_state_no_lock) &
+		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
+		!(ce->guc_state.sched_state & ~SCHED_STATE_BLOCKED_MASK);
+}
+
 static inline bool
 context_wait_for_deregister_to_register(struct intel_context *ce)
 {
@@ -165,7 +177,7 @@ context_wait_for_deregister_to_register(struct intel_context *ce)
 static inline void
 set_context_wait_for_deregister_to_register(struct intel_context *ce)
 {
-	/* Only should be called from guc_lrc_desc_pin() without lock */
+	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state |=
 		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
@@ -605,9 +617,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 	bool pending_disable, pending_enable, deregister, destroyed, banned;
 
 	xa_for_each(&guc->context_lookup, index, ce) {
-		/* Flush context */
 		spin_lock_irqsave(&ce->guc_state.lock, flags);
-		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 		/*
 		 * Once we are at this point submission_disabled() is guaranteed
@@ -623,6 +633,8 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 		banned = context_banned(ce);
 		init_sched_state(ce);
 
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+
 		if (pending_enable || destroyed || deregister) {
 			decr_outstanding_submission_g2h(guc);
 			if (deregister)
@@ -1325,6 +1337,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	int ret = 0;
 
 	GEM_BUG_ON(!engine->mask);
+	GEM_BUG_ON(!sched_state_is_init(ce));
 
 	/*
 	 * Ensure LRC + CT vmas are is same region as write barrier is done
@@ -1353,7 +1366,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	desc->priority = ce->guc_prio;
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 	guc_context_policy_init(engine, desc);
-	init_sched_state(ce);
 
 	/*
 	 * The context_lookup xarray is used to determine if the hardware
@@ -1364,26 +1376,23 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	 * registering this context.
 	 */
 	if (context_registered) {
+		bool disabled;
+		unsigned long flags;
+
 		trace_intel_context_steal_guc_id(ce);
-		if (!loop) {
+		GEM_BUG_ON(!loop);
+
+		/* Seal race with Reset */
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
+		disabled = submission_disabled(guc);
+		if (likely(!disabled)) {
 			set_context_wait_for_deregister_to_register(ce);
 			intel_context_get(ce);
-		} else {
-			bool disabled;
-			unsigned long flags;
-
-			/* Seal race with Reset */
-			spin_lock_irqsave(&ce->guc_state.lock, flags);
-			disabled = submission_disabled(guc);
-			if (likely(!disabled)) {
-				set_context_wait_for_deregister_to_register(ce);
-				intel_context_get(ce);
-			}
-			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-			if (unlikely(disabled)) {
-				reset_lrc_desc(guc, desc_idx);
-				return 0;	/* Will get registered later */
-			}
+		}
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		if (unlikely(disabled)) {
+			reset_lrc_desc(guc, desc_idx);
+			return 0;	/* Will get registered later */
 		}
 
 		/*
@@ -1392,10 +1401,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 		 */
 		with_intel_runtime_pm(runtime_pm, wakeref)
 			ret = deregister_context(ce, ce->guc_id, loop);
-		if (unlikely(ret == -EBUSY)) {
-			clr_context_wait_for_deregister_to_register(ce);
-			intel_context_put(ce);
-		} else if (unlikely(ret == -ENODEV)) {
+		if (unlikely(ret == -ENODEV)) {
 			ret = 0;	/* Will get registered later */
 		}
 	} else {
-- 
2.32.0


* [Intel-gfx] [PATCH 15/27] drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Reset the LRC descriptor if a context registration returns -ENODEV, as
this means we are mid-reset.

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index fa87470ea576..4cf5a565f08e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1407,10 +1407,12 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	} else {
 		with_intel_runtime_pm(runtime_pm, wakeref)
 			ret = register_context(ce, loop);
-		if (unlikely(ret == -EBUSY))
+		if (unlikely(ret == -EBUSY)) {
+			reset_lrc_desc(guc, desc_idx);
+		} else if (unlikely(ret == -ENODEV)) {
 			reset_lrc_desc(guc, desc_idx);
-		else if (unlikely(ret == -ENODEV))
 			ret = 0;	/* Will get registered later */
+		}
 	}
 
 	return ret;
-- 
2.32.0


* [Intel-gfx] [PATCH 16/27] drm/i915: Allocate error capture in nowait context
From: Matthew Brost @ 2021-08-19  6:16 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Error captures can now be done in a work queue that processes G2H
messages. Those messages must be fully processed in the reset path to
avoid races in the missing-G2H cleanup, which creates a dependency on
memory allocations and dma fences (i915_requests). Requests depend on
resets, so we now have a circular dependency. To work around this,
allocate the error capture in a nowait context.
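
A minimal kernel-style sketch of the idea; the helper name is invented
here, the real patch simply redefines the existing allocation flags:

  #include <linux/slab.h>

  /*
   * GFP_NOWAIT never sleeps and never enters direct reclaim, so an
   * allocation in the reset path cannot end up waiting on dma fences
   * (i915_requests) that themselves wait on this reset to finish.
   */
  static void *capture_alloc(size_t size)
  {
  	return kmalloc(size, GFP_NOWAIT | __GFP_NOWARN);
  }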

v2:
 (Daniel Vetter)
  - Use GFP_NOWAIT instead GFP_ATOMIC

Fixes: dc0dad365c5e ("Fix for error capture after full GPU reset with GuC")
Fixes: 573ba126aef3 ("Capture error state on context reset")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 39 +++++++++++++--------------
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 0f08bcfbe964..2819b69fbb72 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -49,8 +49,7 @@
 #include "i915_memcpy.h"
 #include "i915_scatterlist.h"
 
-#define ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
-#define ATOMIC_MAYFAIL (GFP_ATOMIC | __GFP_NOWARN)
+#define ATOMIC_MAYFAIL (GFP_NOWAIT | __GFP_NOWARN)
 
 static void __sg_set_buf(struct scatterlist *sg,
 			 void *addr, unsigned int len, loff_t it)
@@ -79,7 +78,7 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
 	if (e->cur == e->end) {
 		struct scatterlist *sgl;
 
-		sgl = (typeof(sgl))__get_free_page(ALLOW_FAIL);
+		sgl = (typeof(sgl))__get_free_page(ATOMIC_MAYFAIL);
 		if (!sgl) {
 			e->err = -ENOMEM;
 			return false;
@@ -99,10 +98,10 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
 	}
 
 	e->size = ALIGN(len + 1, SZ_64K);
-	e->buf = kmalloc(e->size, ALLOW_FAIL);
+	e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
 	if (!e->buf) {
 		e->size = PAGE_ALIGN(len + 1);
-		e->buf = kmalloc(e->size, GFP_KERNEL);
+		e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
 	}
 	if (!e->buf) {
 		e->err = -ENOMEM;
@@ -243,12 +242,12 @@ static bool compress_init(struct i915_vma_compress *c)
 {
 	struct z_stream_s *zstream = &c->zstream;
 
-	if (pool_init(&c->pool, ALLOW_FAIL))
+	if (pool_init(&c->pool, ATOMIC_MAYFAIL))
 		return false;
 
 	zstream->workspace =
 		kmalloc(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL),
-			ALLOW_FAIL);
+			ATOMIC_MAYFAIL);
 	if (!zstream->workspace) {
 		pool_fini(&c->pool);
 		return false;
@@ -256,7 +255,7 @@ static bool compress_init(struct i915_vma_compress *c)
 
 	c->tmp = NULL;
 	if (i915_has_memcpy_from_wc())
-		c->tmp = pool_alloc(&c->pool, ALLOW_FAIL);
+		c->tmp = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 
 	return true;
 }
@@ -280,7 +279,7 @@ static void *compress_next_page(struct i915_vma_compress *c,
 	if (dst->page_count >= dst->num_pages)
 		return ERR_PTR(-ENOSPC);
 
-	page = pool_alloc(&c->pool, ALLOW_FAIL);
+	page = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 	if (!page)
 		return ERR_PTR(-ENOMEM);
 
@@ -376,7 +375,7 @@ struct i915_vma_compress {
 
 static bool compress_init(struct i915_vma_compress *c)
 {
-	return pool_init(&c->pool, ALLOW_FAIL) == 0;
+	return pool_init(&c->pool, ATOMIC_MAYFAIL) == 0;
 }
 
 static bool compress_start(struct i915_vma_compress *c)
@@ -391,7 +390,7 @@ static int compress_page(struct i915_vma_compress *c,
 {
 	void *ptr;
 
-	ptr = pool_alloc(&c->pool, ALLOW_FAIL);
+	ptr = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 	if (!ptr)
 		return -ENOMEM;
 
@@ -997,7 +996,7 @@ i915_vma_coredump_create(const struct intel_gt *gt,
 
 	num_pages = min_t(u64, vma->size, vma->obj->base.size) >> PAGE_SHIFT;
 	num_pages = DIV_ROUND_UP(10 * num_pages, 8); /* worstcase zlib growth */
-	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ALLOW_FAIL);
+	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ATOMIC_MAYFAIL);
 	if (!dst)
 		return NULL;
 
@@ -1433,7 +1432,7 @@ capture_engine(struct intel_engine_cs *engine,
 	struct i915_request *rq = NULL;
 	unsigned long flags;
 
-	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
+	ee = intel_engine_coredump_alloc(engine, ATOMIC_MAYFAIL);
 	if (!ee)
 		return NULL;
 
@@ -1481,7 +1480,7 @@ gt_record_engines(struct intel_gt_coredump *gt,
 		struct intel_engine_coredump *ee;
 
 		/* Refill our page pool before entering atomic section */
-		pool_refill(&compress->pool, ALLOW_FAIL);
+		pool_refill(&compress->pool, ATOMIC_MAYFAIL);
 
 		ee = capture_engine(engine, compress);
 		if (!ee)
@@ -1507,7 +1506,7 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	const struct intel_uc *uc = &gt->_gt->uc;
 	struct intel_uc_coredump *error_uc;
 
-	error_uc = kzalloc(sizeof(*error_uc), ALLOW_FAIL);
+	error_uc = kzalloc(sizeof(*error_uc), ATOMIC_MAYFAIL);
 	if (!error_uc)
 		return NULL;
 
@@ -1518,8 +1517,8 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	 * As modparams are generally accesible from the userspace make
 	 * explicit copies of the firmware paths.
 	 */
-	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ALLOW_FAIL);
-	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ALLOW_FAIL);
+	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ATOMIC_MAYFAIL);
+	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ATOMIC_MAYFAIL);
 	error_uc->guc_log =
 		i915_vma_coredump_create(gt->_gt,
 					 uc->guc.log.vma, "GuC log buffer",
@@ -1778,7 +1777,7 @@ i915_vma_capture_prepare(struct intel_gt_coredump *gt)
 {
 	struct i915_vma_compress *compress;
 
-	compress = kmalloc(sizeof(*compress), ALLOW_FAIL);
+	compress = kmalloc(sizeof(*compress), ATOMIC_MAYFAIL);
 	if (!compress)
 		return NULL;
 
@@ -1811,11 +1810,11 @@ i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
 	if (IS_ERR(error))
 		return error;
 
-	error = i915_gpu_coredump_alloc(i915, ALLOW_FAIL);
+	error = i915_gpu_coredump_alloc(i915, ATOMIC_MAYFAIL);
 	if (!error)
 		return ERR_PTR(-ENOMEM);
 
-	error->gt = intel_gt_coredump_alloc(gt, ALLOW_FAIL);
+	error->gt = intel_gt_coredump_alloc(gt, ATOMIC_MAYFAIL);
 	if (error->gt) {
 		struct i915_vma_compress *compress;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 17/27] drm/i915/guc: Flush G2H work queue during reset
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (15 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 16/27] drm/i915: Allocate error capture in nowait context Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-21  0:25   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 18/27] drm/i915/guc: Release submit fence from an irq_work Matthew Brost
                   ` (13 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

It isn't safe to scrub for missing G2H or continue with the reset
until all G2H processing is complete. Flush the G2H work queue during
reset to ensure it has finished running.
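
The key primitive is flush_work(), which returns only once any queued
or currently-running instance of the worker has finished. A sketch of
the resulting reset-prepare flow (simplified; the real change is in
the diff below):

  #include <linux/workqueue.h>

  static void reset_prepare_sketch(struct intel_guc *guc)
  {
          /* After this returns, no G2H message is still being
           * processed concurrently with the reset. */
          flush_work(&guc->ct.requests.worker);

          /* Only now is it safe to scrub for lost G2H. */
          scrub_guc_desc_for_outstanding_g2h(guc);
  }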

Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 4cf5a565f08e..9a53bae367b1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -714,8 +714,6 @@ static void guc_flush_submissions(struct intel_guc *guc)
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -731,20 +729,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	guc_flush_submissions(guc);
 
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
+	flush_work(&guc->ct.requests.worker);
+
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 18/27] drm/i915/guc: Release submit fence from an irq_work
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (16 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 17/27] drm/i915/guc: Flush G2H work queue during reset Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-25  1:44   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 19/27] drm/i915/guc: Move guc_blocked fence to struct guc_state Matthew Brost
                   ` (12 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

A subsequent patch will flip the locking hierarchy from
ce->guc_state.lock -> sched_engine->lock to sched_engine->lock ->
ce->guc_state.lock. As such we need to release the submit fence for a
request from an irq_work to break a lock inversion - i.e. the fence
is released while holding ce->guc_state.lock, and releasing it can
acquire sched_engine->lock.
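
The pattern, roughly (illustrative types, not the real i915
structures): the irq_work callback runs later with no spinlocks held,
so it is free to take the outer lock even though the queueing path
held the inner one.

  #include <linux/irq_work.h>
  #include <linux/spinlock.h>

  struct req {
          struct irq_work submit_work;    /* init_irq_work() at setup */
          spinlock_t *outer;              /* sched_engine->lock stand-in */
  };

  static void submit_cb(struct irq_work *wrk)
  {
          struct req *rq = container_of(wrk, struct req, submit_work);

          /* No locks held here, so taking the outer lock cannot
           * invert against the inner lock held at queueing time. */
          spin_lock(rq->outer);
          /* ... complete the submit fence ... */
          spin_unlock(rq->outer);
  }

  static void signal_fence(struct req *rq)
  {
          /* Called with the inner (guc_state) lock held. */
          irq_work_queue(&rq->submit_work);
  }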

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
 drivers/gpu/drm/i915/i915_request.h               |  5 +++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9a53bae367b1..deb2e821e441 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2025,6 +2025,14 @@ static const struct intel_context_ops guc_context_ops = {
 	.create_virtual = guc_create_virtual,
 };
 
+static void submit_work_cb(struct irq_work *wrk)
+{
+	struct i915_request *rq = container_of(wrk, typeof(*rq), submit_work);
+
+	might_lock(&rq->engine->sched_engine->lock);
+	i915_sw_fence_complete(&rq->submit);
+}
+
 static void __guc_signal_context_fence(struct intel_context *ce)
 {
 	struct i915_request *rq;
@@ -2034,8 +2042,12 @@ static void __guc_signal_context_fence(struct intel_context *ce)
 	if (!list_empty(&ce->guc_state.fences))
 		trace_intel_context_fence_release(ce);
 
+	/*
+	 * Use an IRQ to ensure locking order of sched_engine->lock ->
+	 * ce->guc_state.lock is preserved.
+	 */
 	list_for_each_entry(rq, &ce->guc_state.fences, guc_fence_link)
-		i915_sw_fence_complete(&rq->submit);
+		irq_work_queue(&rq->submit_work);
 
 	INIT_LIST_HEAD(&ce->guc_state.fences);
 }
@@ -2145,6 +2157,7 @@ static int guc_request_alloc(struct i915_request *rq)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	if (context_wait_for_deregister_to_register(ce) ||
 	    context_pending_disable(ce)) {
+		init_irq_work(&rq->submit_work, submit_work_cb);
 		i915_sw_fence_await(&rq->submit);
 
 		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 1bc1349ba3c2..d818cfbfc41d 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -218,6 +218,11 @@ struct i915_request {
 	};
 	struct llist_head execute_cb;
 	struct i915_sw_fence semaphore;
+	/**
+	 * @submit_work: complete submit fence from an IRQ if needed for
+	 * locking hierarchy reasons.
+	 */
+	struct irq_work submit_work;
 
 	/*
 	 * A list of everyone we wait upon, and everyone who waits upon us.
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 19/27] drm/i915/guc: Move guc_blocked fence to struct guc_state
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (17 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 18/27] drm/i915/guc: Release submit fence from an irq_work Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-21  0:30   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 20/27] drm/i915/guc: Rework and simplify locking Matthew Brost
                   ` (11 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Move the guc_blocked fence into struct guc_state, since the lock
which protects the fence lives there.

s/ce->guc_blocked/ce->guc_state.blocked_fence/g

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c        |  5 +++--
 drivers/gpu/drm/i915/gt/intel_context_types.h  |  5 ++---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +++++++++---------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 745e84c72c90..0e48939ec85f 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -405,8 +405,9 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
 	 */
-	i915_sw_fence_init(&ce->guc_blocked, sw_fence_dummy_notify);
-	i915_sw_fence_commit(&ce->guc_blocked);
+	i915_sw_fence_init(&ce->guc_state.blocked_fence,
+			   sw_fence_dummy_notify);
+	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
 
 	i915_active_init(&ce->active,
 			 __intel_context_active, __intel_context_retire, 0);
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 3a73f3117873..c06171ee8792 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -167,6 +167,8 @@ struct intel_context {
 		 * fence related to GuC submission
 		 */
 		struct list_head fences;
+		/* GuC context blocked fence */
+		struct i915_sw_fence blocked_fence;
 	} guc_state;
 
 	struct {
@@ -190,9 +192,6 @@ struct intel_context {
 	 */
 	struct list_head guc_id_link;
 
-	/* GuC context blocked fence */
-	struct i915_sw_fence guc_blocked;
-
 	/*
 	 * GuC priority management
 	 */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index deb2e821e441..053f4485d6e9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1490,24 +1490,24 @@ static void guc_blocked_fence_complete(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 
-	if (!i915_sw_fence_done(&ce->guc_blocked))
-		i915_sw_fence_complete(&ce->guc_blocked);
+	if (!i915_sw_fence_done(&ce->guc_state.blocked_fence))
+		i915_sw_fence_complete(&ce->guc_state.blocked_fence);
 }
 
 static void guc_blocked_fence_reinit(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_blocked));
+	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_state.blocked_fence));
 
 	/*
 	 * This fence is always complete unless a pending schedule disable is
 	 * outstanding. We arm the fence here and complete it when we receive
 	 * the pending schedule disable complete message.
 	 */
-	i915_sw_fence_fini(&ce->guc_blocked);
-	i915_sw_fence_reinit(&ce->guc_blocked);
-	i915_sw_fence_await(&ce->guc_blocked);
-	i915_sw_fence_commit(&ce->guc_blocked);
+	i915_sw_fence_fini(&ce->guc_state.blocked_fence);
+	i915_sw_fence_reinit(&ce->guc_state.blocked_fence);
+	i915_sw_fence_await(&ce->guc_state.blocked_fence);
+	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
 }
 
 static u16 prep_context_pending_disable(struct intel_context *ce)
@@ -1547,7 +1547,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 		if (enabled)
 			clr_context_enabled(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-		return &ce->guc_blocked;
+		return &ce->guc_state.blocked_fence;
 	}
 
 	/*
@@ -1563,7 +1563,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 	with_intel_runtime_pm(runtime_pm, wakeref)
 		__guc_context_sched_disable(guc, ce, guc_id);
 
-	return &ce->guc_blocked;
+	return &ce->guc_state.blocked_fence;
 }
 
 static void guc_context_unblock(struct intel_context *ce)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 20/27] drm/i915/guc: Rework and simplify locking
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (18 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 19/27] drm/i915/guc: Move guc_blocked fence to struct guc_state Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-25 16:52   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 21/27] drm/i915/guc: Proper xarray usage for contexts_lookup Matthew Brost
                   ` (10 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Rework and simplify the locking in GuC submission. Drop
sched_state_no_lock and move all fields under guc_state.sched_state,
protecting them with guc_state.lock. This requires changing the
locking hierarchy from guc_state.lock -> sched_engine.lock to
sched_engine.lock -> guc_state.lock.
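
The resulting nesting rule, as a sketch (illustrative function, not
part of the diff):

  static void nesting_example(struct i915_sched_engine *se,
                              struct intel_context *ce)
  {
          unsigned long flags;

          spin_lock_irqsave(&se->lock, flags);    /* outer */
          spin_lock(&ce->guc_state.lock);         /* inner */

          /* All sched_state manipulation happens here, under
           * guc_state.lock. */

          spin_unlock(&ce->guc_state.lock);
          spin_unlock_irqrestore(&se->lock, flags);
  }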

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |   5 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 186 ++++++++----------
 drivers/gpu/drm/i915/i915_trace.h             |   6 +-
 3 files changed, 89 insertions(+), 108 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index c06171ee8792..d5d643b04d54 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -161,7 +161,7 @@ struct intel_context {
 		 * sched_state: scheduling state of this context using GuC
 		 * submission
 		 */
-		u16 sched_state;
+		u32 sched_state;
 		/*
 		 * fences: maintains of list of requests that have a submit
 		 * fence related to GuC submission
@@ -178,9 +178,6 @@ struct intel_context {
 		struct list_head requests;
 	} guc_active;
 
-	/* GuC scheduling state flags that do not require a lock. */
-	atomic_t guc_sched_state_no_lock;
-
 	/* GuC LRC descriptor ID */
 	u16 guc_id;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 053f4485d6e9..509b298e7cf3 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -72,86 +72,23 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
-/*
- * Below is a set of functions which control the GuC scheduling state which do
- * not require a lock as all state transitions are mutually exclusive. i.e. It
- * is not possible for the context pinning code and submission, for the same
- * context, to be executing simultaneously. We still need an atomic as it is
- * possible for some of the bits to changing at the same time though.
- */
-#define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
-#define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
-#define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
-static inline bool context_enabled(struct intel_context *ce)
-{
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_ENABLED);
-}
-
-static inline void set_context_enabled(struct intel_context *ce)
-{
-	atomic_or(SCHED_STATE_NO_LOCK_ENABLED, &ce->guc_sched_state_no_lock);
-}
-
-static inline void clr_context_enabled(struct intel_context *ce)
-{
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
-		   &ce->guc_sched_state_no_lock);
-}
-
-static inline bool context_pending_enable(struct intel_context *ce)
-{
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_PENDING_ENABLE);
-}
-
-static inline void set_context_pending_enable(struct intel_context *ce)
-{
-	atomic_or(SCHED_STATE_NO_LOCK_PENDING_ENABLE,
-		  &ce->guc_sched_state_no_lock);
-}
-
-static inline void clr_context_pending_enable(struct intel_context *ce)
-{
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_PENDING_ENABLE,
-		   &ce->guc_sched_state_no_lock);
-}
-
-static inline bool context_registered(struct intel_context *ce)
-{
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_REGISTERED);
-}
-
-static inline void set_context_registered(struct intel_context *ce)
-{
-	atomic_or(SCHED_STATE_NO_LOCK_REGISTERED,
-		  &ce->guc_sched_state_no_lock);
-}
-
-static inline void clr_context_registered(struct intel_context *ce)
-{
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_REGISTERED,
-		   &ce->guc_sched_state_no_lock);
-}
-
 /*
  * Below is a set of functions which control the GuC scheduling state which
- * require a lock, aside from the special case where the functions are called
- * from guc_lrc_desc_pin(). In that case it isn't possible for any other code
- * path to be executing on the context.
+ * require a lock.
  */
 #define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
 #define SCHED_STATE_DESTROYED				BIT(1)
 #define SCHED_STATE_PENDING_DISABLE			BIT(2)
 #define SCHED_STATE_BANNED				BIT(3)
-#define SCHED_STATE_BLOCKED_SHIFT			4
+#define SCHED_STATE_ENABLED				BIT(4)
+#define SCHED_STATE_PENDING_ENABLE			BIT(5)
+#define SCHED_STATE_REGISTERED				BIT(6)
+#define SCHED_STATE_BLOCKED_SHIFT			7
 #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
 #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
 static inline void init_sched_state(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	atomic_set(&ce->guc_sched_state_no_lock, 0);
 	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
@@ -162,9 +99,8 @@ static bool sched_state_is_init(struct intel_context *ce)
 	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
 	 * suspend.
 	 */
-	return !(atomic_read(&ce->guc_sched_state_no_lock) &
-		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
-		!(ce->guc_state.sched_state &= ~SCHED_STATE_BLOCKED_MASK);
+	return !(ce->guc_state.sched_state &
+		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
 }
 
 static inline bool
@@ -237,6 +173,57 @@ static inline void clr_context_banned(struct intel_context *ce)
 	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
 }
 
+static inline bool context_enabled(struct intel_context *ce)
+{
+	return ce->guc_state.sched_state & SCHED_STATE_ENABLED;
+}
+
+static inline void set_context_enabled(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_ENABLED;
+}
+
+static inline void clr_context_enabled(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
+}
+
+static inline bool context_pending_enable(struct intel_context *ce)
+{
+	return ce->guc_state.sched_state & SCHED_STATE_PENDING_ENABLE;
+}
+
+static inline void set_context_pending_enable(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_PENDING_ENABLE;
+}
+
+static inline void clr_context_pending_enable(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_ENABLE;
+}
+
+static inline bool context_registered(struct intel_context *ce)
+{
+	return ce->guc_state.sched_state & SCHED_STATE_REGISTERED;
+}
+
+static inline void set_context_registered(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_REGISTERED;
+}
+
+static inline void clr_context_registered(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_REGISTERED;
+}
+
 static inline u32 context_blocked(struct intel_context *ce)
 {
 	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
@@ -245,7 +232,6 @@ static inline u32 context_blocked(struct intel_context *ce)
 
 static inline void incr_context_blocked(struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->engine->sched_engine->lock);
 	lockdep_assert_held(&ce->guc_state.lock);
 
 	ce->guc_state.sched_state += SCHED_STATE_BLOCKED;
@@ -255,7 +241,6 @@ static inline void incr_context_blocked(struct intel_context *ce)
 
 static inline void decr_context_blocked(struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->engine->sched_engine->lock);
 	lockdep_assert_held(&ce->guc_state.lock);
 
 	GEM_BUG_ON(!context_blocked(ce));	/* Underflow check */
@@ -450,6 +435,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	u32 g2h_len_dw = 0;
 	bool enabled;
 
+	lockdep_assert_held(&rq->engine->sched_engine->lock);
+
 	/*
 	 * Corner case where requests were sitting in the priority list or a
 	 * request resubmitted after the context was banned.
@@ -457,7 +444,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	if (unlikely(intel_context_is_banned(ce))) {
 		i915_request_put(i915_request_mark_eio(rq));
 		intel_engine_signal_breadcrumbs(ce->engine);
-		goto out;
+		return 0;
 	}
 
 	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
@@ -470,9 +457,11 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
 		err = guc_lrc_desc_pin(ce, false);
 		if (unlikely(err))
-			goto out;
+			return err;
 	}
 
+	spin_lock(&ce->guc_state.lock);
+
 	/*
 	 * The request / context will be run on the hardware when scheduling
 	 * gets enabled in the unblock.
@@ -507,6 +496,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		trace_i915_request_guc_submit(rq);
 
 out:
+	spin_unlock(&ce->guc_state.lock);
 	return err;
 }
 
@@ -727,8 +717,6 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	spin_lock_irq(&guc_to_gt(guc)->irq_lock);
 	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
 
-	guc_flush_submissions(guc);
-
 	flush_work(&guc->ct.requests.worker);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
@@ -1133,7 +1121,11 @@ static int steal_guc_id(struct intel_guc *guc)
 
 		list_del_init(&ce->guc_id_link);
 		guc_id = ce->guc_id;
+
+		spin_lock(&ce->guc_state.lock);
 		clr_context_registered(ce);
+		spin_unlock(&ce->guc_state.lock);
+
 		set_context_guc_id_invalid(ce);
 		return guc_id;
 	} else {
@@ -1169,6 +1161,8 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 try_again:
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 
+	might_lock(&ce->guc_state.lock);
+
 	if (context_guc_id_invalid(ce)) {
 		ret = assign_guc_id(guc, &ce->guc_id);
 		if (ret)
@@ -1248,8 +1242,13 @@ static int register_context(struct intel_context *ce, bool loop)
 	trace_intel_context_register(ce);
 
 	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
-	if (likely(!ret))
+	if (likely(!ret)) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		set_context_registered(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	}
 
 	return ret;
 }
@@ -1525,7 +1524,6 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
 static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
-	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
 	unsigned long flags;
 	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	intel_wakeref_t wakeref;
@@ -1534,13 +1532,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
-	/*
-	 * Sync with submission path, increment before below changes to context
-	 * state.
-	 */
-	spin_lock(&sched_engine->lock);
 	incr_context_blocked(ce);
-	spin_unlock(&sched_engine->lock);
 
 	enabled = context_enabled(ce);
 	if (unlikely(!enabled || submission_disabled(guc))) {
@@ -1569,7 +1561,6 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 static void guc_context_unblock(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
-	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
 	unsigned long flags;
 	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	intel_wakeref_t wakeref;
@@ -1594,13 +1585,7 @@ static void guc_context_unblock(struct intel_context *ce)
 		intel_context_get(ce);
 	}
 
-	/*
-	 * Sync with submission path, decrement after above changes to context
-	 * state.
-	 */
-	spin_lock(&sched_engine->lock);
 	decr_context_blocked(ce);
-	spin_unlock(&sched_engine->lock);
 
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
@@ -1710,7 +1695,9 @@ static void guc_context_sched_disable(struct intel_context *ce)
 
 	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
 	    !lrc_desc_registered(guc, ce->guc_id)) {
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		clr_context_enabled(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		goto unpin;
 	}
 
@@ -1760,7 +1747,6 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
 	GEM_BUG_ON(context_enabled(ce));
 
-	clr_context_registered(ce);
 	deregister_context(ce, ce->guc_id, true);
 }
 
@@ -1833,8 +1819,10 @@ static void guc_context_destroy(struct kref *kref)
 	/* Seal race with Reset */
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	disabled = submission_disabled(guc);
-	if (likely(!disabled))
+	if (likely(!disabled)) {
 		set_context_destroyed(ce);
+		clr_context_registered(ce);
+	}
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	if (unlikely(disabled)) {
 		release_guc_id(guc, ce);
@@ -2697,8 +2685,7 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 		     (!context_pending_enable(ce) &&
 		     !context_pending_disable(ce)))) {
 		drm_err(&guc_to_gt(guc)->i915->drm,
-			"Bad context sched_state 0x%x, 0x%x, desc_idx %u",
-			atomic_read(&ce->guc_sched_state_no_lock),
+			"Bad context sched_state 0x%x, desc_idx %u",
 			ce->guc_state.sched_state, desc_idx);
 		return -EPROTO;
 	}
@@ -2713,7 +2700,9 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 		}
 #endif
 
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		clr_context_pending_enable(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	} else if (context_pending_disable(ce)) {
 		bool banned;
 
@@ -2987,9 +2976,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 			   atomic_read(&ce->pin_count));
 		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
 			   atomic_read(&ce->guc_id_ref));
-		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
-			   ce->guc_state.sched_state,
-			   atomic_read(&ce->guc_sched_state_no_lock));
+		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
+			   ce->guc_state.sched_state);
 
 		guc_log_context_priority(p, ce);
 	}
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 806ad688274b..0a77eb2944b5 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -903,7 +903,6 @@ DECLARE_EVENT_CLASS(intel_context,
 			     __field(u32, guc_id)
 			     __field(int, pin_count)
 			     __field(u32, sched_state)
-			     __field(u32, guc_sched_state_no_lock)
 			     __field(u8, guc_prio)
 			     ),
 
@@ -911,15 +910,12 @@ DECLARE_EVENT_CLASS(intel_context,
 			   __entry->guc_id = ce->guc_id;
 			   __entry->pin_count = atomic_read(&ce->pin_count);
 			   __entry->sched_state = ce->guc_state.sched_state;
-			   __entry->guc_sched_state_no_lock =
-			   atomic_read(&ce->guc_sched_state_no_lock);
 			   __entry->guc_prio = ce->guc_prio;
 			   ),
 
-		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x,0x%x, guc_prio=%u",
+		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
 			      __entry->guc_id, __entry->pin_count,
 			      __entry->sched_state,
-			      __entry->guc_sched_state_no_lock,
 			      __entry->guc_prio)
 );
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 21/27] drm/i915/guc: Proper xarray usage for contexts_lookup
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (19 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 20/27] drm/i915/guc: Rework and simplify locking Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-26  0:44   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 22/27] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin Matthew Brost
                   ` (9 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Lock the xarray and take a reference to the context if needed.

v2:
 (Checkpatch)
  - Add new line after declaration
 (Daniel Vetter)
  - Correct put / get accounting in xa_for_each loops
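
The iteration pattern applied at every call site, in sketch form
(declarations abbreviated; see the diff for the real users):

  struct intel_context *ce;
  unsigned long index, flags;

  xa_lock_irqsave(&guc->context_lookup, flags);
  xa_for_each(&guc->context_lookup, index, ce) {
          if (!kref_get_unless_zero(&ce->ref))
                  continue;       /* context is already being freed */

          xa_unlock(&guc->context_lookup);

          /* ... work that may take other locks ... */

          intel_context_put(ce);
          xa_lock(&guc->context_lookup);
  }
  xa_unlock_irqrestore(&guc->context_lookup, flags);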

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 103 +++++++++++++++---
 1 file changed, 88 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 509b298e7cf3..5f77f25322ca 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -606,8 +606,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 	unsigned long index, flags;
 	bool pending_disable, pending_enable, deregister, destroyed, banned;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		spin_lock_irqsave(&ce->guc_state.lock, flags);
+		/*
+		 * Corner case where the ref count on the object is zero but a
+		 * deregister G2H was lost. In this case we don't touch the ref
+		 * count and finish the destroy of the context.
+		 */
+		bool do_put = kref_get_unless_zero(&ce->ref);
+
+		xa_unlock(&guc->context_lookup);
+
+		spin_lock(&ce->guc_state.lock);
 
 		/*
 		 * Once we are at this point submission_disabled() is guaranteed
@@ -623,7 +633,9 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 		banned = context_banned(ce);
 		init_sched_state(ce);
 
-		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		spin_unlock(&ce->guc_state.lock);
+
+		GEM_BUG_ON(!do_put && !destroyed);
 
 		if (pending_enable || destroyed || deregister) {
 			decr_outstanding_submission_g2h(guc);
@@ -646,13 +658,19 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 			}
 			intel_context_sched_disable_unpin(ce);
 			decr_outstanding_submission_g2h(guc);
-			spin_lock_irqsave(&ce->guc_state.lock, flags);
+
+			spin_lock(&ce->guc_state.lock);
 			guc_blocked_fence_complete(ce);
-			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+			spin_unlock(&ce->guc_state.lock);
 
 			intel_context_put(ce);
 		}
+
+		if (do_put)
+			intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 static inline bool
@@ -873,16 +891,29 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
 
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
 	}
 
-	xa_for_each(&guc->context_lookup, index, ce)
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	xa_for_each(&guc->context_lookup, index, ce) {
+		if (!kref_get_unless_zero(&ce->ref))
+			continue;
+
+		xa_unlock(&guc->context_lookup);
+
 		if (intel_context_is_pinned(ce))
 			__guc_reset_context(ce, stalled);
 
+		intel_context_put(ce);
+
+		xa_lock(&guc->context_lookup);
+	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	/* GuC is blown away, drop all references to contexts */
 	xa_destroy(&guc->context_lookup);
 }
@@ -957,11 +988,24 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
+
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	xa_for_each(&guc->context_lookup, index, ce) {
+		if (!kref_get_unless_zero(&ce->ref))
+			continue;
+
+		xa_unlock(&guc->context_lookup);
 
-	xa_for_each(&guc->context_lookup, index, ce)
 		if (intel_context_is_pinned(ce))
 			guc_cancel_context_requests(ce);
 
+		intel_context_put(ce);
+
+		xa_lock(&guc->context_lookup);
+	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	guc_cancel_sched_engine_requests(guc->sched_engine);
 
 	/* GuC is blown away, drop all references to contexts */
@@ -2850,21 +2894,28 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 	struct intel_context *ce;
 	struct i915_request *rq;
 	unsigned long index;
+	unsigned long flags;
 
 	/* Reset called during driver load? GuC not yet initialised! */
 	if (unlikely(!guc_submission_initialized(guc)))
 		return;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		if (!intel_context_is_pinned(ce))
+		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
+		xa_unlock(&guc->context_lookup);
+
+		if (!intel_context_is_pinned(ce))
+			goto next;
+
 		if (intel_engine_is_virtual(ce->engine)) {
 			if (!(ce->engine->mask & engine->mask))
-				continue;
+				goto next;
 		} else {
 			if (ce->engine != engine)
-				continue;
+				goto next;
 		}
 
 		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
@@ -2874,9 +2925,17 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 			intel_engine_set_hung_context(engine, ce);
 
 			/* Can only cope with one hang at a time... */
-			return;
+			intel_context_put(ce);
+			xa_lock(&guc->context_lookup);
+			goto done;
 		}
+next:
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
+
 	}
+done:
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
@@ -2892,23 +2951,34 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
 	if (unlikely(!guc_submission_initialized(guc)))
 		return;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		if (!intel_context_is_pinned(ce))
+		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
+		xa_unlock(&guc->context_lookup);
+
+		if (!intel_context_is_pinned(ce))
+			goto next;
+
 		if (intel_engine_is_virtual(ce->engine)) {
 			if (!(ce->engine->mask & engine->mask))
-				continue;
+				goto next;
 		} else {
 			if (ce->engine != engine)
-				continue;
+				goto next;
 		}
 
-		spin_lock_irqsave(&ce->guc_active.lock, flags);
+		spin_lock(&ce->guc_active.lock);
 		intel_engine_dump_active_requests(&ce->guc_active.requests,
 						  hung_rq, m);
-		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
+		spin_unlock(&ce->guc_active.lock);
+
+next:
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 void intel_guc_submission_print_info(struct intel_guc *guc,
@@ -2962,7 +3032,9 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
 		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
 		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
@@ -2981,6 +3053,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 
 		guc_log_context_priority(p, ce);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 static struct intel_context *
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 22/27] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (20 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 21/27] drm/i915/guc: Proper xarray usage for contexts_lookup Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-26  0:50   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 23/27] drm/i915/guc: Move GuC priority fields in context under guc_active Matthew Brost
                   ` (8 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Drop the pin count check trick between a sched_disable and re-pin;
instead rely on the lock and a count of the committed requests to
determine whether scheduling should be disabled on the context.
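
In sketch form, the new check in guc_context_sched_disable()
(simplified from the diff below):

  spin_lock_irqsave(&ce->guc_state.lock, flags);
  if (ce->guc_state.number_committed_requests) {
          /* A request raced in; keep scheduling enabled. */
          intel_context_sched_disable_unpin(ce);
          spin_unlock_irqrestore(&ce->guc_state.lock, flags);
          return;
  }
  /* ... otherwise safe to issue the schedule disable H2G ... */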

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  2 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 49 ++++++++++++-------
 2 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index d5d643b04d54..524a35a78bf4 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -169,6 +169,8 @@ struct intel_context {
 		struct list_head fences;
 		/* GuC context blocked fence */
 		struct i915_sw_fence blocked_fence;
+		/* GuC committed requests */
+		int number_committed_requests;
 	} guc_state;
 
 	struct {
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 5f77f25322ca..3e90985b0c1b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -248,6 +248,25 @@ static inline void decr_context_blocked(struct intel_context *ce)
 	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
 }
 
+static inline bool context_has_committed_requests(struct intel_context *ce)
+{
+	return !!ce->guc_state.number_committed_requests;
+}
+
+static inline void incr_context_committed_requests(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	++ce->guc_state.number_committed_requests;
+	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
+}
+
+static inline void decr_context_committed_requests(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	--ce->guc_state.number_committed_requests;
+	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
+}
+
 static inline bool context_guc_id_invalid(struct intel_context *ce)
 {
 	return ce->guc_id == GUC_INVALID_LRC_ID;
@@ -1751,14 +1770,11 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	/*
-	 * We have to check if the context has been disabled by another thread.
-	 * We also have to check if the context has been pinned again as another
-	 * pin operation is allowed to pass this function. Checking the pin
-	 * count, within ce->guc_state.lock, synchronizes this function with
-	 * guc_request_alloc ensuring a request doesn't slip through the
-	 * 'context_pending_disable' fence. Checking within the spin lock (can't
-	 * sleep) ensures another process doesn't pin this context and generate
-	 * a request before we set the 'context_pending_disable' flag here.
+	 * We have to check if the context has been disabled by another thread,
+	 * check if submission has been disabled to seal a race with reset and
+	 * finally check if any more requests have been committed to the
+	 * context, ensuring that a request doesn't slip through the
+	 * 'context_pending_disable' fence.
 	 */
 	enabled = context_enabled(ce);
 	if (unlikely(!enabled || submission_disabled(guc))) {
@@ -1767,7 +1783,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		goto unpin;
 	}
-	if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
+	if (unlikely(context_has_committed_requests(ce))) {
+		intel_context_sched_disable_unpin(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		return;
 	}
@@ -1800,6 +1817,7 @@ static void __guc_context_destroy(struct intel_context *ce)
 		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
 		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
 		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
+	GEM_BUG_ON(ce->guc_state.number_committed_requests);
 
 	lrc_fini(ce);
 	intel_context_fini(ce);
@@ -2030,6 +2048,10 @@ static void remove_from_context(struct i915_request *rq)
 
 	spin_unlock_irq(&ce->guc_active.lock);
 
+	spin_lock_irq(&ce->guc_state.lock);
+	decr_context_committed_requests(ce);
+	spin_unlock_irq(&ce->guc_state.lock);
+
 	atomic_dec(&ce->guc_id_ref);
 	i915_request_notify_execute_cb_imm(rq);
 }
@@ -2177,15 +2199,7 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * schedule enable or context registration if either G2H is pending
 	 * respectfully. Once a G2H returns, the fence is released that is
 	 * blocking these requests (see guc_signal_context_fence).
-	 *
-	 * We can safely check the below fields outside of the lock as it isn't
-	 * possible for these fields to transition from being clear to set but
-	 * converse is possible, hence the need for the check within the lock.
 	 */
-	if (likely(!context_wait_for_deregister_to_register(ce) &&
-		   !context_pending_disable(ce)))
-		return 0;
-
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	if (context_wait_for_deregister_to_register(ce) ||
 	    context_pending_disable(ce)) {
@@ -2194,6 +2208,7 @@ static int guc_request_alloc(struct i915_request *rq)
 
 		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
 	}
+	incr_context_committed_requests(ce);
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	return 0;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 23/27] drm/i915/guc: Move GuC priority fields in context under guc_active
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (21 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 22/27] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-25 21:51   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 24/27] drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure Matthew Brost
                   ` (7 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Move the GuC priority management fields in the context under the
guc_active struct as this is where the lock that protects these
fields lives. Also only set the guc_prio field once during context
init.
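
The one-time initialisation is keyed off a new CONTEXT_GUC_INIT flag;
in sketch form (assuming the flag is set at the end of
guc_context_init(), which the hunk below elides):

  /* In guc_request_alloc(), before the request is committed: */
  if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))
          guc_context_init(ce);   /* sets ce->guc_active.prio once,
                                   * from the GEM context priority */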

Fixes: ee242ca704d3 ("drm/i915/guc: Implement GuC priority management")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 68 +++++++++++--------
 drivers/gpu/drm/i915/i915_trace.h             |  2 +-
 3 files changed, 45 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 524a35a78bf4..9fb0480ccf3b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -112,6 +112,7 @@ struct intel_context {
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
 #define CONTEXT_NOPREEMPT		8
 #define CONTEXT_LRCA_DIRTY		9
+#define CONTEXT_GUC_INIT		10
 
 	struct {
 		u64 timeout_us;
@@ -178,6 +179,11 @@ struct intel_context {
 		spinlock_t lock;
 		/** requests: active requests on this context */
 		struct list_head requests;
+		/*
+		 * GuC priority management
+		 */
+		u8 prio;
+		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
 	} guc_active;
 
 	/* GuC LRC descriptor ID */
@@ -191,12 +197,6 @@ struct intel_context {
 	 */
 	struct list_head guc_id_link;
 
-	/*
-	 * GuC priority management
-	 */
-	u8 guc_prio;
-	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
-
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
 	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3e90985b0c1b..bb90bedb1305 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1369,8 +1369,6 @@ static void guc_context_policy_init(struct intel_engine_cs *engine,
 	desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
 }
 
-static inline u8 map_i915_prio_to_guc_prio(int prio);
-
 static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 {
 	struct intel_engine_cs *engine = ce->engine;
@@ -1378,8 +1376,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	struct intel_guc *guc = &engine->gt->uc.guc;
 	u32 desc_idx = ce->guc_id;
 	struct guc_lrc_desc *desc;
-	const struct i915_gem_context *ctx;
-	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
 	bool context_registered;
 	intel_wakeref_t wakeref;
 	int ret = 0;
@@ -1396,12 +1392,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 
 	context_registered = lrc_desc_registered(guc, desc_idx);
 
-	rcu_read_lock();
-	ctx = rcu_dereference(ce->gem_context);
-	if (ctx)
-		prio = ctx->sched.priority;
-	rcu_read_unlock();
-
 	reset_lrc_desc(guc, desc_idx);
 	set_lrc_desc_registered(guc, desc_idx, ce);
 
@@ -1410,8 +1400,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	desc->engine_submit_mask = adjust_engine_mask(engine->class,
 						      engine->mask);
 	desc->hw_context_desc = ce->lrc.lrca;
-	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
-	desc->priority = ce->guc_prio;
+	desc->priority = ce->guc_active.prio;
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 	guc_context_policy_init(engine, desc);
 
@@ -1813,10 +1802,10 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 
 static void __guc_context_destroy(struct intel_context *ce)
 {
-	GEM_BUG_ON(ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
+	GEM_BUG_ON(ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
+		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
+		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
+		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
 	GEM_BUG_ON(ce->guc_state.number_committed_requests);
 
 	lrc_fini(ce);
@@ -1926,14 +1915,17 @@ static void guc_context_set_prio(struct intel_guc *guc,
 
 	GEM_BUG_ON(prio < GUC_CLIENT_PRIORITY_KMD_HIGH ||
 		   prio > GUC_CLIENT_PRIORITY_NORMAL);
+	lockdep_assert_held(&ce->guc_active.lock);
 
-	if (ce->guc_prio == prio || submission_disabled(guc) ||
-	    !context_registered(ce))
+	if (ce->guc_active.prio == prio || submission_disabled(guc) ||
+	    !context_registered(ce)) {
+		ce->guc_active.prio = prio;
 		return;
+	}
 
 	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action), 0, true);
 
-	ce->guc_prio = prio;
+	ce->guc_active.prio = prio;
 	trace_intel_context_set_prio(ce);
 }
 
@@ -1953,24 +1945,24 @@ static inline void add_context_inflight_prio(struct intel_context *ce,
 					     u8 guc_prio)
 {
 	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
 
-	++ce->guc_prio_count[guc_prio];
+	++ce->guc_active.prio_count[guc_prio];
 
 	/* Overflow protection */
-	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
 }
 
 static inline void sub_context_inflight_prio(struct intel_context *ce,
 					     u8 guc_prio)
 {
 	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
 
 	/* Underflow protection */
-	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
 
-	--ce->guc_prio_count[guc_prio];
+	--ce->guc_active.prio_count[guc_prio];
 }
 
 static inline void update_context_prio(struct intel_context *ce)
@@ -1983,8 +1975,8 @@ static inline void update_context_prio(struct intel_context *ce)
 
 	lockdep_assert_held(&ce->guc_active.lock);
 
-	for (i = 0; i < ARRAY_SIZE(ce->guc_prio_count); ++i) {
-		if (ce->guc_prio_count[i]) {
+	for (i = 0; i < ARRAY_SIZE(ce->guc_active.prio_count); ++i) {
+		if (ce->guc_active.prio_count[i]) {
 			guc_context_set_prio(guc, ce, i);
 			break;
 		}
@@ -2123,6 +2115,20 @@ static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
 		!submission_disabled(ce_to_guc(ce));
 }
 
+static void guc_context_init(struct intel_context *ce)
+{
+	const struct i915_gem_context *ctx;
+	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
+
+	rcu_read_lock();
+	ctx = rcu_dereference(ce->gem_context);
+	if (ctx)
+		prio = ctx->sched.priority;
+	rcu_read_unlock();
+
+	ce->guc_active.prio = map_i915_prio_to_guc_prio(prio);
+}
+
 static int guc_request_alloc(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
@@ -2154,6 +2160,9 @@ static int guc_request_alloc(struct i915_request *rq)
 
 	rq->reserved_space -= GUC_REQUEST_SIZE;
 
+	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))
+		guc_context_init(ce);
+
 	/*
 	 * Call pin_guc_id here rather than in the pinning step as with
 	 * dma_resv, contexts can be repeatedly pinned / unpinned trashing the
@@ -3031,13 +3040,12 @@ static inline void guc_log_context_priority(struct drm_printer *p,
 {
 	int i;
 
-	drm_printf(p, "\t\tPriority: %d\n",
-		   ce->guc_prio);
+	drm_printf(p, "\t\tPriority: %d\n", ce->guc_active.prio);
 	drm_printf(p, "\t\tNumber Requests (lower index == higher priority)\n");
 	for (i = GUC_CLIENT_PRIORITY_KMD_HIGH;
 	     i < GUC_CLIENT_PRIORITY_NUM; ++i) {
 		drm_printf(p, "\t\tNumber requests in priority band[%d]: %d\n",
-			   i, ce->guc_prio_count[i]);
+			   i, ce->guc_active.prio_count[i]);
 	}
 	drm_printf(p, "\n");
 }
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 0a77eb2944b5..6f882e72ed11 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -910,7 +910,7 @@ DECLARE_EVENT_CLASS(intel_context,
 			   __entry->guc_id = ce->guc_id;
 			   __entry->pin_count = atomic_read(&ce->pin_count);
 			   __entry->sched_state = ce->guc_state.sched_state;
-			   __entry->guc_prio = ce->guc_prio;
+			   __entry->guc_prio = ce->guc_active.prio;
 			   ),
 
 		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 24/27] drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (22 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 23/27] drm/i915/guc: Move GuC priority fields in context under guc_active Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-25  2:00   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 25/27] drm/i915/guc: Drop guc_active move everything into guc_state Matthew Brost
                   ` (6 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

To make ownership of the locking clear, move the fields (guc_id,
guc_id_ref, guc_id_link) into a guc_id sub-structure in
intel_context.
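
Purely mechanical; the substitutions, in the same spirit as the
earlier rename patches:

s/ce->guc_id_ref/ce->guc_id.ref/g
s/ce->guc_id_link/ce->guc_id.link/g
s/ce->guc_id/ce->guc_id.id/g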

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |   4 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  18 +--
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   6 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 106 +++++++++---------
 drivers/gpu/drm/i915/i915_trace.h             |   4 +-
 5 files changed, 70 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 0e48939ec85f..87b84c1d5393 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -398,8 +398,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 	spin_lock_init(&ce->guc_active.lock);
 	INIT_LIST_HEAD(&ce->guc_active.requests);
 
-	ce->guc_id = GUC_INVALID_LRC_ID;
-	INIT_LIST_HEAD(&ce->guc_id_link);
+	ce->guc_id.id = GUC_INVALID_LRC_ID;
+	INIT_LIST_HEAD(&ce->guc_id.link);
 
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 9fb0480ccf3b..7a1d1537cf67 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -186,16 +186,18 @@ struct intel_context {
 		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
 	} guc_active;
 
-	/* GuC LRC descriptor ID */
-	u16 guc_id;
+	struct {
+		/* GuC LRC descriptor ID */
+		u16 id;
 
-	/* GuC LRC descriptor reference count */
-	atomic_t guc_id_ref;
+		/* GuC LRC descriptor reference count */
+		atomic_t ref;
 
-	/*
-	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
-	 */
-	struct list_head guc_id_link;
+		/*
+		 * GuC ID link - in list when unpinned but guc_id still valid in GuC
+		 */
+		struct list_head link;
+	} guc_id;
 
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 08f011f893b2..bf43bed905db 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -789,7 +789,7 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
 				if (err)
 					pr_err("[%s] Wait for request %lld:%lld [0x%04X] failed: %d!\n",
 					       engine->name, rq->fence.context,
-					       rq->fence.seqno, rq->context->guc_id, err);
+					       rq->fence.seqno, rq->context->guc_id.id, err);
 			}
 
 skip:
@@ -1098,7 +1098,7 @@ static int __igt_reset_engines(struct intel_gt *gt,
 				if (err)
 					pr_err("[%s] Wait for request %lld:%lld [0x%04X] failed: %d!\n",
 					       engine->name, rq->fence.context,
-					       rq->fence.seqno, rq->context->guc_id, err);
+					       rq->fence.seqno, rq->context->guc_id.id, err);
 			}
 
 			count++;
@@ -1108,7 +1108,7 @@ static int __igt_reset_engines(struct intel_gt *gt,
 					pr_err("i915_reset_engine(%s:%s): failed to reset request %lld:%lld [0x%04X]\n",
 					       engine->name, test_name,
 					       rq->fence.context,
-					       rq->fence.seqno, rq->context->guc_id);
+					       rq->fence.seqno, rq->context->guc_id.id);
 					i915_request_put(rq);
 
 					GEM_TRACE_DUMP();
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index bb90bedb1305..c4c018348ac0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -269,12 +269,12 @@ static inline void decr_context_committed_requests(struct intel_context *ce)
 
 static inline bool context_guc_id_invalid(struct intel_context *ce)
 {
-	return ce->guc_id == GUC_INVALID_LRC_ID;
+	return ce->guc_id.id == GUC_INVALID_LRC_ID;
 }
 
 static inline void set_context_guc_id_invalid(struct intel_context *ce)
 {
-	ce->guc_id = GUC_INVALID_LRC_ID;
+	ce->guc_id.id = GUC_INVALID_LRC_ID;
 }
 
 static inline struct intel_guc *ce_to_guc(struct intel_context *ce)
@@ -466,14 +466,14 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		return 0;
 	}
 
-	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
+	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
 	GEM_BUG_ON(context_guc_id_invalid(ce));
 
 	/*
 	 * Corner case where the GuC firmware was blown away and reloaded while
 	 * this context was pinned.
 	 */
-	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
+	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
 		err = guc_lrc_desc_pin(ce, false);
 		if (unlikely(err))
 			return err;
@@ -492,14 +492,14 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 
 	if (!enabled) {
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
-		action[len++] = ce->guc_id;
+		action[len++] = ce->guc_id.id;
 		action[len++] = GUC_CONTEXT_ENABLE;
 		set_context_pending_enable(ce);
 		intel_context_get(ce);
 		g2h_len_dw = G2H_LEN_DW_SCHED_CONTEXT_MODE_SET;
 	} else {
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT;
-		action[len++] = ce->guc_id;
+		action[len++] = ce->guc_id.id;
 	}
 
 	err = intel_guc_send_nb(guc, action, len, g2h_len_dw);
@@ -1150,12 +1150,12 @@ static int new_guc_id(struct intel_guc *guc)
 static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	if (!context_guc_id_invalid(ce)) {
-		ida_simple_remove(&guc->guc_ids, ce->guc_id);
-		reset_lrc_desc(guc, ce->guc_id);
+		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
+		reset_lrc_desc(guc, ce->guc_id.id);
 		set_context_guc_id_invalid(ce);
 	}
-	if (!list_empty(&ce->guc_id_link))
-		list_del_init(&ce->guc_id_link);
+	if (!list_empty(&ce->guc_id.link))
+		list_del_init(&ce->guc_id.link);
 }
 
 static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
@@ -1177,13 +1177,13 @@ static int steal_guc_id(struct intel_guc *guc)
 	if (!list_empty(&guc->guc_id_list)) {
 		ce = list_first_entry(&guc->guc_id_list,
 				      struct intel_context,
-				      guc_id_link);
+				      guc_id.link);
 
-		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
+		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
 		GEM_BUG_ON(context_guc_id_invalid(ce));
 
-		list_del_init(&ce->guc_id_link);
-		guc_id = ce->guc_id;
+		list_del_init(&ce->guc_id.link);
+		guc_id = ce->guc_id.id;
 
 		spin_lock(&ce->guc_state.lock);
 		clr_context_registered(ce);
@@ -1219,7 +1219,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	int ret = 0;
 	unsigned long flags, tries = PIN_GUC_ID_TRIES;
 
-	GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
+	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
 
 try_again:
 	spin_lock_irqsave(&guc->contexts_lock, flags);
@@ -1227,20 +1227,20 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	might_lock(&ce->guc_state.lock);
 
 	if (context_guc_id_invalid(ce)) {
-		ret = assign_guc_id(guc, &ce->guc_id);
+		ret = assign_guc_id(guc, &ce->guc_id.id);
 		if (ret)
 			goto out_unlock;
 		ret = 1;	/* Indicates newly assigned guc_id */
 	}
-	if (!list_empty(&ce->guc_id_link))
-		list_del_init(&ce->guc_id_link);
-	atomic_inc(&ce->guc_id_ref);
+	if (!list_empty(&ce->guc_id.link))
+		list_del_init(&ce->guc_id.link);
+	atomic_inc(&ce->guc_id.ref);
 
 out_unlock:
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 
 	/*
-	 * -EAGAIN indicates no guc_ids are available, let's retire any
+	 * -EAGAIN indicates no guc_id is available, let's retire any
 	 * outstanding requests to see if that frees up a guc_id. If the first
 	 * retire didn't help, insert a sleep with the timeslice duration before
 	 * attempting to retire more requests. Double the sleep period each
@@ -1268,15 +1268,15 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	unsigned long flags;
 
-	GEM_BUG_ON(atomic_read(&ce->guc_id_ref) < 0);
+	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
 
 	if (unlikely(context_guc_id_invalid(ce)))
 		return;
 
 	spin_lock_irqsave(&guc->contexts_lock, flags);
-	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id_link) &&
-	    !atomic_read(&ce->guc_id_ref))
-		list_add_tail(&ce->guc_id_link, &guc->guc_id_list);
+	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
+	    !atomic_read(&ce->guc_id.ref))
+		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 }
 
@@ -1299,12 +1299,12 @@ static int register_context(struct intel_context *ce, bool loop)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 	u32 offset = intel_guc_ggtt_offset(guc, guc->lrc_desc_pool) +
-		ce->guc_id * sizeof(struct guc_lrc_desc);
+		ce->guc_id.id * sizeof(struct guc_lrc_desc);
 	int ret;
 
 	trace_intel_context_register(ce);
 
-	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
+	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
 	if (likely(!ret)) {
 		unsigned long flags;
 
@@ -1374,7 +1374,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	struct intel_engine_cs *engine = ce->engine;
 	struct intel_runtime_pm *runtime_pm = engine->uncore->rpm;
 	struct intel_guc *guc = &engine->gt->uc.guc;
-	u32 desc_idx = ce->guc_id;
+	u32 desc_idx = ce->guc_id.id;
 	struct guc_lrc_desc *desc;
 	bool context_registered;
 	intel_wakeref_t wakeref;
@@ -1437,7 +1437,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 		 * context whose guc_id was stolen.
 		 */
 		with_intel_runtime_pm(runtime_pm, wakeref)
-			ret = deregister_context(ce, ce->guc_id, loop);
+			ret = deregister_context(ce, ce->guc_id.id, loop);
 		if (unlikely(ret == -ENODEV)) {
 			ret = 0;	/* Will get registered later */
 		}
@@ -1509,7 +1509,7 @@ static void __guc_context_sched_enable(struct intel_guc *guc,
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET,
-		ce->guc_id,
+		ce->guc_id.id,
 		GUC_CONTEXT_ENABLE
 	};
 
@@ -1525,7 +1525,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET,
-		guc_id,	/* ce->guc_id not stable */
+		guc_id,	/* ce->guc_id.id not stable */
 		GUC_CONTEXT_DISABLE
 	};
 
@@ -1570,7 +1570,7 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
 	guc_blocked_fence_reinit(ce);
 	intel_context_get(ce);
 
-	return ce->guc_id;
+	return ce->guc_id.id;
 }
 
 static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
@@ -1625,7 +1625,7 @@ static void guc_context_unblock(struct intel_context *ce)
 	if (unlikely(submission_disabled(guc) ||
 		     intel_context_is_banned(ce) ||
 		     context_guc_id_invalid(ce) ||
-		     !lrc_desc_registered(guc, ce->guc_id) ||
+		     !lrc_desc_registered(guc, ce->guc_id.id) ||
 		     !intel_context_is_pinned(ce) ||
 		     context_pending_disable(ce) ||
 		     context_blocked(ce) > 1)) {
@@ -1730,7 +1730,7 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
 		if (!context_guc_id_invalid(ce))
 			with_intel_runtime_pm(runtime_pm, wakeref)
 				__guc_context_set_preemption_timeout(guc,
-								     ce->guc_id,
+								     ce->guc_id.id,
 								     1);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	}
@@ -1746,7 +1746,7 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	bool enabled;
 
 	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
-	    !lrc_desc_registered(guc, ce->guc_id)) {
+	    !lrc_desc_registered(guc, ce->guc_id.id)) {
 		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		clr_context_enabled(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
@@ -1793,11 +1793,11 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
-	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id));
-	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
+	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
+	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
 	GEM_BUG_ON(context_enabled(ce));
 
-	deregister_context(ce, ce->guc_id, true);
+	deregister_context(ce, ce->guc_id.id, true);
 }
 
 static void __guc_context_destroy(struct intel_context *ce)
@@ -1842,7 +1842,7 @@ static void guc_context_destroy(struct kref *kref)
 		__guc_context_destroy(ce);
 		return;
 	} else if (submission_disabled(guc) ||
-		   !lrc_desc_registered(guc, ce->guc_id)) {
+		   !lrc_desc_registered(guc, ce->guc_id.id)) {
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 		return;
@@ -1851,10 +1851,10 @@ static void guc_context_destroy(struct kref *kref)
 	/*
 	 * We have to acquire the context spinlock and check guc_id again, if it
 	 * is valid it hasn't been stolen and needs to be deregistered. We
-	 * delete this context from the list of unpinned guc_ids available to
+	 * delete this context from the list of unpinned guc_id available to
 	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
 	 * returns indicating this context has been deregistered the guc_id is
-	 * returned to the pool of available guc_ids.
+	 * returned to the pool of available guc_id.
 	 */
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 	if (context_guc_id_invalid(ce)) {
@@ -1863,8 +1863,8 @@ static void guc_context_destroy(struct kref *kref)
 		return;
 	}
 
-	if (!list_empty(&ce->guc_id_link))
-		list_del_init(&ce->guc_id_link);
+	if (!list_empty(&ce->guc_id.link))
+		list_del_init(&ce->guc_id.link);
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 
 	/* Seal race with Reset */
@@ -1909,7 +1909,7 @@ static void guc_context_set_prio(struct intel_guc *guc,
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_SET_CONTEXT_PRIORITY,
-		ce->guc_id,
+		ce->guc_id.id,
 		prio,
 	};
 
@@ -2044,7 +2044,7 @@ static void remove_from_context(struct i915_request *rq)
 	decr_context_committed_requests(ce);
 	spin_unlock_irq(&ce->guc_state.lock);
 
-	atomic_dec(&ce->guc_id_ref);
+	atomic_dec(&ce->guc_id.ref);
 	i915_request_notify_execute_cb_imm(rq);
 }
 
@@ -2111,7 +2111,7 @@ static void guc_signal_context_fence(struct intel_context *ce)
 static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
 {
 	return (new_guc_id || test_bit(CONTEXT_LRCA_DIRTY, &ce->flags) ||
-		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id)) &&
+		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id)) &&
 		!submission_disabled(ce_to_guc(ce));
 }
 
@@ -2166,11 +2166,11 @@ static int guc_request_alloc(struct i915_request *rq)
 	/*
 	 * Call pin_guc_id here rather than in the pinning step as with
 	 * dma_resv, contexts can be repeatedly pinned / unpinned thrashing the
-	 * guc_ids and creating horrible race conditions. This is especially bad
-	 * when guc_ids are being stolen due to over subscription. By the time
+	 * guc_id and creating horrible race conditions. This is especially bad
+	 * when guc_id are being stolen due to over subscription. By the time
 	 * this function is reached, it is guaranteed that the guc_id will be
 	 * persistent until the generated request is retired. Thus, sealing these
-	 * race conditions. It is still safe to fail here if guc_ids are
+	 * race conditions. It is still safe to fail here if guc_id are
 	 * exhausted and return -EAGAIN to the user indicating that they can try
 	 * again in the future.
 	 *
@@ -2180,7 +2180,7 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * decremented on each retire. When it is zero, a lock around the
 	 * increment (in pin_guc_id) is needed to seal a race with unpin_guc_id.
 	 */
-	if (atomic_add_unless(&ce->guc_id_ref, 1, 0))
+	if (atomic_add_unless(&ce->guc_id.ref, 1, 0))
 		goto out;
 
 	ret = pin_guc_id(guc, ce);	/* returns 1 if new guc_id assigned */
@@ -2193,7 +2193,7 @@ static int guc_request_alloc(struct i915_request *rq)
 				disable_submission(guc);
 				goto out;	/* GPU will be reset */
 			}
-			atomic_dec(&ce->guc_id_ref);
+			atomic_dec(&ce->guc_id.ref);
 			unpin_guc_id(guc, ce);
 			return ret;
 		}
@@ -3028,7 +3028,7 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 
 		priolist_for_each_request(rq, pl)
 			drm_printf(p, "guc_id=%u, seqno=%llu\n",
-				   rq->context->guc_id,
+				   rq->context->guc_id.id,
 				   rq->fence.seqno);
 	}
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
@@ -3059,7 +3059,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 
 	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
+		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
 		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
 		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
 			   ce->ring->head,
@@ -3070,7 +3070,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 		drm_printf(p, "\t\tContext Pin Count: %u\n",
 			   atomic_read(&ce->pin_count));
 		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
-			   atomic_read(&ce->guc_id_ref));
+			   atomic_read(&ce->guc_id.ref));
 		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
 			   ce->guc_state.sched_state);
 
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 6f882e72ed11..0574f5c7a985 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -805,7 +805,7 @@ DECLARE_EVENT_CLASS(i915_request,
 			   __entry->dev = rq->engine->i915->drm.primary->index;
 			   __entry->class = rq->engine->uabi_class;
 			   __entry->instance = rq->engine->uabi_instance;
-			   __entry->guc_id = rq->context->guc_id;
+			   __entry->guc_id = rq->context->guc_id.id;
 			   __entry->ctx = rq->fence.context;
 			   __entry->seqno = rq->fence.seqno;
 			   __entry->tail = rq->tail;
@@ -907,7 +907,7 @@ DECLARE_EVENT_CLASS(intel_context,
 			     ),
 
 		    TP_fast_assign(
-			   __entry->guc_id = ce->guc_id;
+			   __entry->guc_id = ce->guc_id.id;
 			   __entry->pin_count = atomic_read(&ce->pin_count);
 			   __entry->sched_state = ce->guc_state.sched_state;
 			   __entry->guc_prio = ce->guc_active.prio;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 25/27] drm/i915/guc: Drop guc_active move everything into guc_state
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (23 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 24/27] drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-26  0:54   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 26/27] drm/i915/guc: Add GuC kernel doc Matthew Brost
                   ` (5 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Now that we have a locking hierarchy of sched_engine->lock ->
ce->guc_state.lock, everything in guc_active can be moved into guc_state
and protected by guc_state.lock.
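
As an illustration, the nesting this hierarchy permits looks roughly like
the sketch below, modeled on __unwind_incomplete_requests() in the diff.
The pthread stand-ins and function names only keep the sketch
self-contained; the real code uses spin_lock_irqsave() and friends:

#include <pthread.h>

/* Stand-ins for sched_engine->lock and ce->guc_state.lock. */
static pthread_spinlock_t sched_engine_lock;
static pthread_spinlock_t guc_state_lock;

static void locks_init(void)
{
	pthread_spin_init(&sched_engine_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&guc_state_lock, PTHREAD_PROCESS_PRIVATE);
}

/*
 * The documented order: sched_engine->lock is always taken before
 * ce->guc_state.lock and never the other way around, so nesting the two
 * cannot deadlock.
 */
static void unwind_sketch(void)
{
	pthread_spin_lock(&sched_engine_lock);
	pthread_spin_lock(&guc_state_lock);
	/* ... walk the context's request list and move entries back ... */
	pthread_spin_unlock(&guc_state_lock);
	pthread_spin_unlock(&sched_engine_lock);
}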

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 10 +--
 drivers/gpu/drm/i915/gt/intel_context_types.h |  7 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 88 +++++++++----------
 drivers/gpu/drm/i915/i915_trace.h             |  2 +-
 4 files changed, 49 insertions(+), 58 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 87b84c1d5393..adfe49b53b1b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -394,9 +394,7 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 	spin_lock_init(&ce->guc_state.lock);
 	INIT_LIST_HEAD(&ce->guc_state.fences);
-
-	spin_lock_init(&ce->guc_active.lock);
-	INIT_LIST_HEAD(&ce->guc_active.requests);
+	INIT_LIST_HEAD(&ce->guc_state.requests);
 
 	ce->guc_id.id = GUC_INVALID_LRC_ID;
 	INIT_LIST_HEAD(&ce->guc_id.link);
@@ -521,15 +519,15 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 
 	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
 
-	spin_lock_irqsave(&ce->guc_active.lock, flags);
-	list_for_each_entry_reverse(rq, &ce->guc_active.requests,
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
 				    sched.link) {
 		if (i915_request_completed(rq))
 			break;
 
 		active = rq;
 	}
-	spin_unlock_irqrestore(&ce->guc_active.lock, flags);
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	return active;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 7a1d1537cf67..66286ce36c84 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -172,11 +172,6 @@ struct intel_context {
 		struct i915_sw_fence blocked_fence;
 		/* GuC committed requests */
 		int number_committed_requests;
-	} guc_state;
-
-	struct {
-		/** lock: protects everything in guc_active */
-		spinlock_t lock;
 		/** requests: active requests on this context */
 		struct list_head requests;
 		/*
@@ -184,7 +179,7 @@ struct intel_context {
 		 */
 		u8 prio;
 		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
-	} guc_active;
+	} guc_state;
 
 	struct {
 		/* GuC LRC descriptor ID */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index c4c018348ac0..4b9a2f3774d5 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -827,9 +827,9 @@ __unwind_incomplete_requests(struct intel_context *ce)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
-	spin_lock(&ce->guc_active.lock);
+	spin_lock(&ce->guc_state.lock);
 	list_for_each_entry_safe_reverse(rq, rn,
-					 &ce->guc_active.requests,
+					 &ce->guc_state.requests,
 					 sched.link) {
 		if (i915_request_completed(rq))
 			continue;
@@ -848,7 +848,7 @@ __unwind_incomplete_requests(struct intel_context *ce)
 		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 	}
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
@@ -945,10 +945,10 @@ static void guc_cancel_context_requests(struct intel_context *ce)
 
 	/* Mark all executing requests as skipped. */
 	spin_lock_irqsave(&sched_engine->lock, flags);
-	spin_lock(&ce->guc_active.lock);
-	list_for_each_entry(rq, &ce->guc_active.requests, sched.link)
+	spin_lock(&ce->guc_state.lock);
+	list_for_each_entry(rq, &ce->guc_state.requests, sched.link)
 		i915_request_put(i915_request_mark_eio(rq));
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
@@ -1400,7 +1400,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	desc->engine_submit_mask = adjust_engine_mask(engine->class,
 						      engine->mask);
 	desc->hw_context_desc = ce->lrc.lrca;
-	desc->priority = ce->guc_active.prio;
+	desc->priority = ce->guc_state.prio;
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 	guc_context_policy_init(engine, desc);
 
@@ -1802,10 +1802,10 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 
 static void __guc_context_destroy(struct intel_context *ce)
 {
-	GEM_BUG_ON(ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
-		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
-		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
-		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
+	GEM_BUG_ON(ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
+		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
+		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
+		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
 	GEM_BUG_ON(ce->guc_state.number_committed_requests);
 
 	lrc_fini(ce);
@@ -1915,17 +1915,17 @@ static void guc_context_set_prio(struct intel_guc *guc,
 
 	GEM_BUG_ON(prio < GUC_CLIENT_PRIORITY_KMD_HIGH ||
 		   prio > GUC_CLIENT_PRIORITY_NORMAL);
-	lockdep_assert_held(&ce->guc_active.lock);
+	lockdep_assert_held(&ce->guc_state.lock);
 
-	if (ce->guc_active.prio == prio || submission_disabled(guc) ||
+	if (ce->guc_state.prio == prio || submission_disabled(guc) ||
 	    !context_registered(ce)) {
-		ce->guc_active.prio = prio;
+		ce->guc_state.prio = prio;
 		return;
 	}
 
 	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action), 0, true);
 
-	ce->guc_active.prio = prio;
+	ce->guc_state.prio = prio;
 	trace_intel_context_set_prio(ce);
 }
 
@@ -1944,25 +1944,25 @@ static inline u8 map_i915_prio_to_guc_prio(int prio)
 static inline void add_context_inflight_prio(struct intel_context *ce,
 					     u8 guc_prio)
 {
-	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
+	lockdep_assert_held(&ce->guc_state.lock);
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
 
-	++ce->guc_active.prio_count[guc_prio];
+	++ce->guc_state.prio_count[guc_prio];
 
 	/* Overflow protection */
-	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_state.prio_count[guc_prio]);
 }
 
 static inline void sub_context_inflight_prio(struct intel_context *ce,
 					     u8 guc_prio)
 {
-	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
+	lockdep_assert_held(&ce->guc_state.lock);
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
 
 	/* Underflow protection */
-	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_state.prio_count[guc_prio]);
 
-	--ce->guc_active.prio_count[guc_prio];
+	--ce->guc_state.prio_count[guc_prio];
 }
 
 static inline void update_context_prio(struct intel_context *ce)
@@ -1973,10 +1973,10 @@ static inline void update_context_prio(struct intel_context *ce)
 	BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH != 0);
 	BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH > GUC_CLIENT_PRIORITY_NORMAL);
 
-	lockdep_assert_held(&ce->guc_active.lock);
+	lockdep_assert_held(&ce->guc_state.lock);
 
-	for (i = 0; i < ARRAY_SIZE(ce->guc_active.prio_count); ++i) {
-		if (ce->guc_active.prio_count[i]) {
+	for (i = 0; i < ARRAY_SIZE(ce->guc_state.prio_count); ++i) {
+		if (ce->guc_state.prio_count[i]) {
 			guc_context_set_prio(guc, ce, i);
 			break;
 		}
@@ -1996,8 +1996,8 @@ static void add_to_context(struct i915_request *rq)
 
 	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
 
-	spin_lock(&ce->guc_active.lock);
-	list_move_tail(&rq->sched.link, &ce->guc_active.requests);
+	spin_lock(&ce->guc_state.lock);
+	list_move_tail(&rq->sched.link, &ce->guc_state.requests);
 
 	if (rq->guc_prio == GUC_PRIO_INIT) {
 		rq->guc_prio = new_guc_prio;
@@ -2009,12 +2009,12 @@ static void add_to_context(struct i915_request *rq)
 	}
 	update_context_prio(ce);
 
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 }
 
 static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->guc_active.lock);
+	lockdep_assert_held(&ce->guc_state.lock);
 
 	if (rq->guc_prio != GUC_PRIO_INIT &&
 	    rq->guc_prio != GUC_PRIO_FINI) {
@@ -2028,7 +2028,7 @@ static void remove_from_context(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
 
-	spin_lock_irq(&ce->guc_active.lock);
+	spin_lock_irq(&ce->guc_state.lock);
 
 	list_del_init(&rq->sched.link);
 	clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
@@ -2038,10 +2038,8 @@ static void remove_from_context(struct i915_request *rq)
 
 	guc_prio_fini(rq, ce);
 
-	spin_unlock_irq(&ce->guc_active.lock);
-
-	spin_lock_irq(&ce->guc_state.lock);
 	decr_context_committed_requests(ce);
+
 	spin_unlock_irq(&ce->guc_state.lock);
 
 	atomic_dec(&ce->guc_id.ref);
@@ -2126,7 +2124,7 @@ static void guc_context_init(struct intel_context *ce)
 		prio = ctx->sched.priority;
 	rcu_read_unlock();
 
-	ce->guc_active.prio = map_i915_prio_to_guc_prio(prio);
+	ce->guc_state.prio = map_i915_prio_to_guc_prio(prio);
 }
 
 static int guc_request_alloc(struct i915_request *rq)
@@ -2359,7 +2357,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 	     !new_guc_prio_higher(rq->guc_prio, new_guc_prio)))
 		return;
 
-	spin_lock(&ce->guc_active.lock);
+	spin_lock(&ce->guc_state.lock);
 	if (rq->guc_prio != GUC_PRIO_FINI) {
 		if (rq->guc_prio != GUC_PRIO_INIT)
 			sub_context_inflight_prio(ce, rq->guc_prio);
@@ -2367,16 +2365,16 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 		add_context_inflight_prio(ce, rq->guc_prio);
 		update_context_prio(ce);
 	}
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 }
 
 static void guc_retire_inflight_request_prio(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
 
-	spin_lock(&ce->guc_active.lock);
+	spin_lock(&ce->guc_state.lock);
 	guc_prio_fini(rq, ce);
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 }
 
 static void sanitize_hwsp(struct intel_engine_cs *engine)
@@ -2942,7 +2940,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 				goto next;
 		}
 
-		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
+		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
 			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
 				continue;
 
@@ -2993,10 +2991,10 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
 				goto next;
 		}
 
-		spin_lock(&ce->guc_active.lock);
-		intel_engine_dump_active_requests(&ce->guc_active.requests,
+		spin_lock(&ce->guc_state.lock);
+		intel_engine_dump_active_requests(&ce->guc_state.requests,
 						  hung_rq, m);
-		spin_unlock(&ce->guc_active.lock);
+		spin_unlock(&ce->guc_state.lock);
 
 next:
 		intel_context_put(ce);
@@ -3040,12 +3038,12 @@ static inline void guc_log_context_priority(struct drm_printer *p,
 {
 	int i;
 
-	drm_printf(p, "\t\tPriority: %d\n", ce->guc_active.prio);
+	drm_printf(p, "\t\tPriority: %d\n", ce->guc_state.prio);
 	drm_printf(p, "\t\tNumber Requests (lower index == higher priority)\n");
 	for (i = GUC_CLIENT_PRIORITY_KMD_HIGH;
 	     i < GUC_CLIENT_PRIORITY_NUM; ++i) {
 		drm_printf(p, "\t\tNumber requests in priority band[%d]: %d\n",
-			   i, ce->guc_active.prio_count[i]);
+			   i, ce->guc_state.prio_count[i]);
 	}
 	drm_printf(p, "\n");
 }
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 0574f5c7a985..ec7fe12b94aa 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -910,7 +910,7 @@ DECLARE_EVENT_CLASS(intel_context,
 			   __entry->guc_id = ce->guc_id.id;
 			   __entry->pin_count = atomic_read(&ce->pin_count);
 			   __entry->sched_state = ce->guc_state.sched_state;
-			   __entry->guc_prio = ce->guc_active.prio;
+			   __entry->guc_prio = ce->guc_state.prio;
 			   ),
 
 		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 26/27] drm/i915/guc: Add GuC kernel doc
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (24 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 25/27] drm/i915/guc: Drop guc_active move everything into guc_state Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-26  1:03   ` Daniele Ceraolo Spurio
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 27/27] drm/i915/guc: Drop static inline functions intel_guc_submission.c Matthew Brost
                   ` (4 subsequent siblings)
  30 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

Add GuC kernel doc for all structures added thus far for GuC submission
and update the main GuC submission section with the new interface
details.

v2:
 - Drop guc_active.lock DOC
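
For reference, kernel-doc only picks up member descriptions written as
/** ... */ comments, either in the structure's header block or inline
above each member; the inline form is what this patch uses. A minimal
made-up example (struct example_state exists only to show the syntax):

/* Plain comments like this one are invisible to kernel-doc. */

/**
 * struct example_state - made-up structure, shown only for the doc syntax
 * @count: a member documented in the header block
 */
struct example_state {
	int count;
	/** @extra: a member documented inline, as the diff below does */
	int extra;
};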

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h | 44 ++++++---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        | 19 +++-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 97 ++++++++++++++-----
 drivers/gpu/drm/i915/i915_request.h           | 18 ++--
 4 files changed, 128 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 66286ce36c84..80bbdc7810f6 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -156,40 +156,52 @@ struct intel_context {
 	u8 wa_bb_page; /* if set, page num reserved for context workarounds */
 
 	struct {
-		/** lock: protects everything in guc_state */
+		/** @lock: protects everything in guc_state */
 		spinlock_t lock;
 		/**
-		 * sched_state: scheduling state of this context using GuC
+		 * @sched_state: scheduling state of this context using GuC
 		 * submission
 		 */
 		u32 sched_state;
 		/*
-		 * fences: maintains of list of requests that have a submit
-		 * fence related to GuC submission
+		 * @fences: maintains a list of requests that are currently
+		 * being fenced until a GuC operation completes
 		 */
 		struct list_head fences;
-		/* GuC context blocked fence */
+		/**
+		 * @blocked_fence: fence used to signal when the blocking of a
+		 * context's submissions is complete.
+		 */
 		struct i915_sw_fence blocked_fence;
-		/* GuC committed requests */
+		/** @number_committed_requests: number of committed requests */
 		int number_committed_requests;
-		/** requests: active requests on this context */
+		/** @requests: list of active requests on this context */
 		struct list_head requests;
-		/*
-		 * GuC priority management
-		 */
+		/** @prio: the context's current GuC priority */
 		u8 prio;
+		/**
+		 * @prio_count: a count of the number of requests inflight in
+		 * each priority bucket
+		 */
 		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
 	} guc_state;
 
 	struct {
-		/* GuC LRC descriptor ID */
+		/**
+		 * @id: unique handle which is used to communicate information
+		 * with the GuC about this context, protected by
+		 * guc->contexts_lock
+		 */
 		u16 id;
-
-		/* GuC LRC descriptor reference count */
+		/**
+		 * @ref: the number of references to the guc_id; protected by
+		 * guc->contexts_lock when transitioning in and out of
+		 * zero
+		 */
 		atomic_t ref;
-
-		/*
-		 * GuC ID link - in list when unpinned but guc_id still valid in GuC
+		/**
+		 * @link: in guc->guc_id_list when the guc_id has no refs but is
+		 * still valid, protected by guc->contexts_lock
 		 */
 		struct list_head link;
 	} guc_id;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 2e27fe59786b..112dd29a63fe 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -41,6 +41,10 @@ struct intel_guc {
 	spinlock_t irq_lock;
 	unsigned int msg_enabled_mask;
 
+	/**
+	 * @outstanding_submission_g2h: number of outstanding G2H related to GuC
+	 * submission, used to determine if the GT is idle
+	 */
 	atomic_t outstanding_submission_g2h;
 
 	struct {
@@ -49,12 +53,16 @@ struct intel_guc {
 		void (*disable)(struct intel_guc *guc);
 	} interrupts;
 
-	/*
-	 * contexts_lock protects the pool of free guc ids and a linked list of
-	 * guc ids available to be stolen
+	/**
+	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
+	 * ce->guc_id.ref when transitioning in and out of zero
 	 */
 	spinlock_t contexts_lock;
+	/** @guc_ids: used to allocate new guc_ids */
 	struct ida guc_ids;
+	/**
+	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
+	 */
 	struct list_head guc_id_list;
 
 	bool submission_supported;
@@ -70,7 +78,10 @@ struct intel_guc {
 	struct i915_vma *lrc_desc_pool;
 	void *lrc_desc_pool_vaddr;
 
-	/* guc_id to intel_context lookup */
+	/**
+	 * @context_lookup: used to resolve intel_context from guc_id; if a
+	 * context is present in this structure, it is registered with the GuC
+	 */
 	struct xarray context_lookup;
 
 	/* Control params for fw initialization */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 4b9a2f3774d5..7e0a32e729c2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -28,21 +28,6 @@
 /**
  * DOC: GuC-based command submission
  *
- * IMPORTANT NOTE: GuC submission is currently not supported in i915. The GuC
- * firmware is moving to an updated submission interface and we plan to
- * turn submission back on when that lands. The below documentation (and related
- * code) matches the old submission model and will be updated as part of the
- * upgrade to the new flow.
- *
- * GuC stage descriptor:
- * During initialization, the driver allocates a static pool of 1024 such
- * descriptors, and shares them with the GuC. Currently, we only use one
- * descriptor. This stage descriptor lets the GuC know about the workqueue and
- * process descriptor. Theoretically, it also lets the GuC know about our HW
- * contexts (context ID, etc...), but we actually employ a kind of submission
- * where the GuC uses the LRCA sent via the work item instead. This is called
- * a "proxy" submission.
- *
  * The Scratch registers:
  * There are 16 MMIO-based registers start from 0xC180. The kernel driver writes
  * a value to the action register (SOFT_SCRATCH_0) along with any data. It then
@@ -51,14 +36,82 @@
  * processes the request. The kernel driver polls waiting for this update and
  * then proceeds.
  *
- * Work Items:
- * There are several types of work items that the host may place into a
- * workqueue, each with its own requirements and limitations. Currently only
- * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
- * represents in-order queue. The kernel driver packs ring tail pointer and an
- * ELSP context descriptor dword into Work Item.
- * See guc_add_request()
+ * Command Transport buffers (CTBs):
+ * Covered in detail in other sections, but CTBs (host-to-GuC, H2G, and
+ * GuC-to-host, G2H) are a message interface between the i915 and the GuC,
+ * used to control submissions.
+ *
+ * Context registration:
+ * Before a context can be submitted it must be registered with the GuC via a
+ * H2G. A unique guc_id is associated with each context. The context is either
+ * registered at request creation time (normal operation) or at submission time
+ * (abnormal operation, e.g. after a reset).
+ *
+ * Context submission:
+ * The i915 updates the LRC tail value in memory. Either a schedule enable H2G
+ * or context submit H2G is used to submit a context.
+ *
+ * Context unpin:
+ * To unpin a context, an H2G is used to disable scheduling; when the
+ * corresponding G2H returns indicating the scheduling disable operation has
+ * completed, it is safe to unpin the context. While a disable is in flight it
+ * isn't safe to resubmit the context, so a fence is used to stall all future
+ * requests until the G2H is returned.
+ *
+ * Context deregistration:
+ * Before a context can be destroyed, or before we steal its guc_id, we must
+ * deregister the context with the GuC via an H2G. If stealing the guc_id, it
+ * isn't safe to submit anything to that guc_id until the deregister completes,
+ * so a fence is used to stall all requests associated with the guc_id until
+ * the corresponding G2H returns indicating the guc_id has been deregistered.
+ *
+ * guc_ids:
+ * A unique number associated with private GuC context data, passed in during
+ * context registration / submission / deregistration. 64k are available; a
+ * simple ida is used for allocation.
+ *
+ * Stealing guc_ids:
+ * If no guc_ids are available they can be stolen from another context at
+ * request creation time if that context is unpinned. If a guc_id can't be found
+ * we punt this problem to the user as we believe this is nearly impossible
+ * to hit during normal use cases.
+ *
+ * Locking:
+ * In the GuC submission code we have 3 basic spin locks which protect
+ * everything. Details about each below.
+ *
+ * sched_engine->lock
+ * This is the submission lock for all contexts that share an i915 schedule
+ * engine (sched_engine); thus only one context per sched_engine can be
+ * submitting at a time. Currently only one sched_engine is used for all of GuC
+ * submission, but that could change in the future.
+ *
+ * guc->contexts_lock
+ * Protects guc_id allocation. A global lock, i.e. only one context that uses
+ * GuC submission can hold it at a time.
+ *
+ * ce->guc_state.lock
+ * Protects everything under ce->guc_state. Ensures that a context is in the
+ * correct state before issuing an H2G, e.g. we don't issue a schedule disable
+ * on a disabled context (a bad idea), and we don't issue a schedule enable
+ * while a schedule disable is inflight. Also protects the list of inflight
+ * requests on the context and the priority management state. This lock is
+ * individual to each context.
+ *
+ * Lock ordering rules:
+ * sched_engine->lock -> ce->guc_state.lock
+ * guc->contexts_lock -> ce->guc_state.lock
  *
+ * Reset races:
+ * When a full GPU reset is triggered it is assumed that some G2H responses to
+ * an H2G can be lost, as the GuC is likely toast. Losing these G2H can prove
+ * fatal, as we do certain operations upon receiving a G2H (e.g. destroy
+ * contexts, release guc_ids, etc...). Luckily, when this occurs we can scrub
+ * context state and clean up appropriately; however, this is quite racy. To
+ * avoid races the rule is: check for submission being disabled (i.e. a reset
+ * in progress) with the appropriate lock held. If submission is disabled,
+ * don't send the H2G or update the context state. The reset code must disable
+ * submission and grab all these locks before scrubbing for the missing G2H.
  */
 
 /* GuC Virtual Engine */
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index d818cfbfc41d..177eaf55adff 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -290,18 +290,20 @@ struct i915_request {
 		struct hrtimer timer;
 	} watchdog;
 
-	/*
-	 * Requests may need to be stalled when using GuC submission waiting for
-	 * certain GuC operations to complete. If that is the case, stalled
-	 * requests are added to a per context list of stalled requests. The
-	 * below list_head is the link in that list.
+	/**
+	 * @guc_fence_link: Requests may need to be stalled when using GuC
+	 * submission waiting for certain GuC operations to complete. If that is
+	 * the case, stalled requests are added to a per context list of stalled
+	 * requests. The below list_head is the link in that list. Protected by
+	 * ce->guc_state.lock.
 	 */
 	struct list_head guc_fence_link;
 
 	/**
-	 * Priority level while the request is inflight. Differs from i915
-	 * scheduler priority. See comment above
-	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details.
+	 * @guc_prio: Priority level while the request is inflight. Differs from
+	 * i915 scheduler priority. See comment above
+	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details. Protected by
+	 * ce->guc_state.lock.
 	 */
 #define	GUC_PRIO_INIT	0xff
 #define	GUC_PRIO_FINI	0xfe
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 27/27] drm/i915/guc: Drop static inline functions intel_guc_submission.c
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (25 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 26/27] drm/i915/guc: Add GuC kernel doc Matthew Brost
@ 2021-08-19  6:16 ` Matthew Brost
  2021-08-19  7:18 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev3) Patchwork
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-19  6:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter

s/static inline/static/g + fix function argument alignment to make
checkpatch happy.
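
For context, "inline" on a function with internal linkage in a .c file is
only a hint; the compiler already decides inlining for small static
functions on its own, so dropping the keyword should not change codegen in
practice. A made-up before/after sketch (the function names are not from
the driver):

/* Before: the hint buys nothing in a translation unit the compiler
 * already sees in full. */
static inline int add_one_before(int x)
{
	return x + 1;
}

/* After: same behavior in practice. */
static int add_one_after(int x)
{
	return x + 1;
}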

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 116 +++++++++---------
 1 file changed, 57 insertions(+), 59 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 7e0a32e729c2..ad4420100908 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -139,7 +139,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 #define SCHED_STATE_BLOCKED_SHIFT			7
 #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
 #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
-static inline void init_sched_state(struct intel_context *ce)
+static void init_sched_state(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
@@ -156,14 +156,14 @@ static bool sched_state_is_init(struct intel_context *ce)
 		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
 }
 
-static inline bool
+static bool
 context_wait_for_deregister_to_register(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state &
 		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
 
-static inline void
+static void
 set_context_wait_for_deregister_to_register(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
@@ -171,7 +171,7 @@ set_context_wait_for_deregister_to_register(struct intel_context *ce)
 		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
 
-static inline void
+static void
 clr_context_wait_for_deregister_to_register(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
@@ -179,111 +179,111 @@ clr_context_wait_for_deregister_to_register(struct intel_context *ce)
 		~SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
 
-static inline bool
+static bool
 context_destroyed(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state & SCHED_STATE_DESTROYED;
 }
 
-static inline void
+static void
 set_context_destroyed(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
 }
 
-static inline bool context_pending_disable(struct intel_context *ce)
+static bool context_pending_disable(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
 }
 
-static inline void set_context_pending_disable(struct intel_context *ce)
+static void set_context_pending_disable(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state |= SCHED_STATE_PENDING_DISABLE;
 }
 
-static inline void clr_context_pending_disable(struct intel_context *ce)
+static void clr_context_pending_disable(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_DISABLE;
 }
 
-static inline bool context_banned(struct intel_context *ce)
+static bool context_banned(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state & SCHED_STATE_BANNED;
 }
 
-static inline void set_context_banned(struct intel_context *ce)
+static void set_context_banned(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state |= SCHED_STATE_BANNED;
 }
 
-static inline void clr_context_banned(struct intel_context *ce)
+static void clr_context_banned(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
 }
 
-static inline bool context_enabled(struct intel_context *ce)
+static bool context_enabled(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state & SCHED_STATE_ENABLED;
 }
 
-static inline void set_context_enabled(struct intel_context *ce)
+static void set_context_enabled(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state |= SCHED_STATE_ENABLED;
 }
 
-static inline void clr_context_enabled(struct intel_context *ce)
+static void clr_context_enabled(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
 }
 
-static inline bool context_pending_enable(struct intel_context *ce)
+static bool context_pending_enable(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state & SCHED_STATE_PENDING_ENABLE;
 }
 
-static inline void set_context_pending_enable(struct intel_context *ce)
+static void set_context_pending_enable(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state |= SCHED_STATE_PENDING_ENABLE;
 }
 
-static inline void clr_context_pending_enable(struct intel_context *ce)
+static void clr_context_pending_enable(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_ENABLE;
 }
 
-static inline bool context_registered(struct intel_context *ce)
+static bool context_registered(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state & SCHED_STATE_REGISTERED;
 }
 
-static inline void set_context_registered(struct intel_context *ce)
+static void set_context_registered(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state |= SCHED_STATE_REGISTERED;
 }
 
-static inline void clr_context_registered(struct intel_context *ce)
+static void clr_context_registered(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	ce->guc_state.sched_state &= ~SCHED_STATE_REGISTERED;
 }
 
-static inline u32 context_blocked(struct intel_context *ce)
+static u32 context_blocked(struct intel_context *ce)
 {
 	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
 		SCHED_STATE_BLOCKED_SHIFT;
 }
 
-static inline void incr_context_blocked(struct intel_context *ce)
+static void incr_context_blocked(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 
@@ -292,7 +292,7 @@ static inline void incr_context_blocked(struct intel_context *ce)
 	GEM_BUG_ON(!context_blocked(ce));	/* Overflow check */
 }
 
-static inline void decr_context_blocked(struct intel_context *ce)
+static void decr_context_blocked(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 
@@ -301,41 +301,41 @@ static inline void decr_context_blocked(struct intel_context *ce)
 	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
 }
 
-static inline bool context_has_committed_requests(struct intel_context *ce)
+static bool context_has_committed_requests(struct intel_context *ce)
 {
 	return !!ce->guc_state.number_committed_requests;
 }
 
-static inline void incr_context_committed_requests(struct intel_context *ce)
+static void incr_context_committed_requests(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	++ce->guc_state.number_committed_requests;
 	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
 }
 
-static inline void decr_context_committed_requests(struct intel_context *ce)
+static void decr_context_committed_requests(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	--ce->guc_state.number_committed_requests;
 	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
 }
 
-static inline bool context_guc_id_invalid(struct intel_context *ce)
+static bool context_guc_id_invalid(struct intel_context *ce)
 {
 	return ce->guc_id.id == GUC_INVALID_LRC_ID;
 }
 
-static inline void set_context_guc_id_invalid(struct intel_context *ce)
+static void set_context_guc_id_invalid(struct intel_context *ce)
 {
 	ce->guc_id.id = GUC_INVALID_LRC_ID;
 }
 
-static inline struct intel_guc *ce_to_guc(struct intel_context *ce)
+static struct intel_guc *ce_to_guc(struct intel_context *ce)
 {
 	return &ce->engine->gt->uc.guc;
 }
 
-static inline struct i915_priolist *to_priolist(struct rb_node *rb)
+static struct i915_priolist *to_priolist(struct rb_node *rb)
 {
 	return rb_entry(rb, struct i915_priolist, node);
 }
@@ -349,7 +349,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 	return &base[index];
 }
 
-static inline struct intel_context *__get_context(struct intel_guc *guc, u32 id)
+static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
 {
 	struct intel_context *ce = xa_load(&guc->context_lookup, id);
 
@@ -379,12 +379,12 @@ static void guc_lrc_desc_pool_destroy(struct intel_guc *guc)
 	i915_vma_unpin_and_release(&guc->lrc_desc_pool, I915_VMA_RELEASE_MAP);
 }
 
-static inline bool guc_submission_initialized(struct intel_guc *guc)
+static bool guc_submission_initialized(struct intel_guc *guc)
 {
 	return !!guc->lrc_desc_pool_vaddr;
 }
 
-static inline void reset_lrc_desc(struct intel_guc *guc, u32 id)
+static void reset_lrc_desc(struct intel_guc *guc, u32 id)
 {
 	if (likely(guc_submission_initialized(guc))) {
 		struct guc_lrc_desc *desc = __get_lrc_desc(guc, id);
@@ -402,13 +402,13 @@ static inline void reset_lrc_desc(struct intel_guc *guc, u32 id)
 	}
 }
 
-static inline bool lrc_desc_registered(struct intel_guc *guc, u32 id)
+static bool lrc_desc_registered(struct intel_guc *guc, u32 id)
 {
 	return __get_context(guc, id);
 }
 
-static inline void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
-					   struct intel_context *ce)
+static void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
+				    struct intel_context *ce)
 {
 	unsigned long flags;
 
@@ -572,13 +572,13 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	return err;
 }
 
-static inline void guc_set_lrc_tail(struct i915_request *rq)
+static void guc_set_lrc_tail(struct i915_request *rq)
 {
 	rq->context->lrc_reg_state[CTX_RING_TAIL] =
 		intel_ring_set_tail(rq->ring, rq->tail);
 }
 
-static inline int rq_prio(const struct i915_request *rq)
+static int rq_prio(const struct i915_request *rq)
 {
 	return rq->sched.attr.priority;
 }
@@ -745,7 +745,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
-static inline bool
+static bool
 submission_disabled(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
@@ -826,7 +826,7 @@ guc_virtual_get_sibling(struct intel_engine_cs *ve, unsigned int sibling)
 	return NULL;
 }
 
-static inline struct intel_engine_cs *
+static struct intel_engine_cs *
 __context_to_physical_engine(struct intel_context *ce)
 {
 	struct intel_engine_cs *engine = ce->engine;
@@ -1144,9 +1144,9 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 	i915_sched_engine_put(guc->sched_engine);
 }
 
-static inline void queue_request(struct i915_sched_engine *sched_engine,
-				 struct i915_request *rq,
-				 int prio)
+static void queue_request(struct i915_sched_engine *sched_engine,
+			  struct i915_request *rq,
+			  int prio)
 {
 	GEM_BUG_ON(!list_empty(&rq->sched.link));
 	list_add_tail(&rq->sched.link,
@@ -1842,7 +1842,7 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	intel_context_sched_disable_unpin(ce);
 }
 
-static inline void guc_lrc_desc_unpin(struct intel_context *ce)
+static void guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
@@ -1982,7 +1982,7 @@ static void guc_context_set_prio(struct intel_guc *guc,
 	trace_intel_context_set_prio(ce);
 }
 
-static inline u8 map_i915_prio_to_guc_prio(int prio)
+static u8 map_i915_prio_to_guc_prio(int prio)
 {
 	if (prio == I915_PRIORITY_NORMAL)
 		return GUC_CLIENT_PRIORITY_KMD_NORMAL;
@@ -1994,8 +1994,7 @@ static inline u8 map_i915_prio_to_guc_prio(int prio)
 		return GUC_CLIENT_PRIORITY_KMD_HIGH;
 }
 
-static inline void add_context_inflight_prio(struct intel_context *ce,
-					     u8 guc_prio)
+static void add_context_inflight_prio(struct intel_context *ce, u8 guc_prio)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
@@ -2006,8 +2005,7 @@ static inline void add_context_inflight_prio(struct intel_context *ce,
 	GEM_WARN_ON(!ce->guc_state.prio_count[guc_prio]);
 }
 
-static inline void sub_context_inflight_prio(struct intel_context *ce,
-					     u8 guc_prio)
+static void sub_context_inflight_prio(struct intel_context *ce, u8 guc_prio)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
@@ -2018,7 +2016,7 @@ static inline void sub_context_inflight_prio(struct intel_context *ce,
 	--ce->guc_state.prio_count[guc_prio];
 }
 
-static inline void update_context_prio(struct intel_context *ce)
+static void update_context_prio(struct intel_context *ce)
 {
 	struct intel_guc *guc = &ce->engine->gt->uc.guc;
 	int i;
@@ -2036,7 +2034,7 @@ static inline void update_context_prio(struct intel_context *ce)
 	}
 }
 
-static inline bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
+static bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
 {
 	/* Lower value is higher priority */
 	return new_guc_prio < old_guc_prio;
@@ -2506,15 +2504,15 @@ static void guc_set_default_submission(struct intel_engine_cs *engine)
 	engine->submit_request = guc_submit_request;
 }
 
-static inline void guc_kernel_context_pin(struct intel_guc *guc,
-					  struct intel_context *ce)
+static void guc_kernel_context_pin(struct intel_guc *guc,
+				   struct intel_context *ce)
 {
 	if (context_guc_id_invalid(ce))
 		pin_guc_id(guc, ce);
 	guc_lrc_desc_pin(ce, true);
 }
 
-static inline void guc_init_lrc_mapping(struct intel_guc *guc)
+static void guc_init_lrc_mapping(struct intel_guc *guc)
 {
 	struct intel_gt *gt = guc_to_gt(guc);
 	struct intel_engine_cs *engine;
@@ -2617,7 +2615,7 @@ static void rcs_submission_override(struct intel_engine_cs *engine)
 	}
 }
 
-static inline void guc_default_irqs(struct intel_engine_cs *engine)
+static void guc_default_irqs(struct intel_engine_cs *engine)
 {
 	engine->irq_keep_mask = GT_RENDER_USER_INTERRUPT;
 	intel_engine_set_irq_handler(engine, cs_irq_handler);
@@ -2713,7 +2711,7 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
 	guc->submission_selected = __guc_submission_selected(guc);
 }
 
-static inline struct intel_context *
+static struct intel_context *
 g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 {
 	struct intel_context *ce;
@@ -3086,8 +3084,8 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 	drm_printf(p, "\n");
 }
 
-static inline void guc_log_context_priority(struct drm_printer *p,
-					    struct intel_context *ce)
+static void guc_log_context_priority(struct drm_printer *p,
+				     struct intel_context *ce)
 {
 	int i;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev3)
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (26 preceding siblings ...)
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 27/27] drm/i915/guc: Drop static inline functions intel_guc_submission.c Matthew Brost
@ 2021-08-19  7:18 ` Patchwork
  2021-08-19  7:20 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 76+ messages in thread
From: Patchwork @ 2021-08-19  7:18 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Clean up GuC CI failures, simplify locking, and kernel DOC (rev3)
URL   : https://patchwork.freedesktop.org/series/93704/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
cdd86549dd1d drm/i915/guc: Fix blocked context accounting
de712a42478e drm/i915/guc: Fix outstanding G2H accounting
75fb55198651 drm/i915/guc: Unwind context requests in reverse order
ef46dfb56828 drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
c2e59d2d1528 drm/i915/guc: Process all G2H message at once in work queue
54c8cf84e527 drm/i915/guc: Workaround reset G2H is received after schedule done G2H
-:7: WARNING:TYPO_SPELLING: 'cancelation' may be misspelled - perhaps 'cancellation'?
#7: 
If the context is reset as a result of the request cancelation the
                                                   ^^^^^^^^^^^

-:10: WARNING:TYPO_SPELLING: 'cancelation' may be misspelled - perhaps 'cancellation'?
#10: 
waiting request cancelation code which resubmits the context. This races
                ^^^^^^^^^^^

-:12: WARNING:TYPO_SPELLING: 'cancelation' may be misspelled - perhaps 'cancellation'?
#12: 
in this case it really should be a NOP as request cancelation code owns
                                                  ^^^^^^^^^^^

-:58: WARNING:BRACES: braces {} are not necessary for any arm of this statement
#58: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:856:
+	if (likely(!context_pending_enable(ce))) {
[...]
+	} else {
[...]

total: 0 errors, 4 warnings, 0 checks, 73 lines checked
7cc5849adb41 Revert "drm/i915/gt: Propagate change in error status to children on unhold"
-:17: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#17: 
References: 3761baae908a (Revert "drm/i915: Propagate errors on awaiting already signaled fences")

-:17: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 3761baae908a ("Revert "drm/i915: Propagate errors on awaiting already signaled fences"")'
#17: 
References: 3761baae908a (Revert "drm/i915: Propagate errors on awaiting already signaled fences")

total: 1 errors, 1 warnings, 0 checks, 10 lines checked
b13cbbf2cc00 drm/i915/selftests: Add a cancel request selftest that triggers a reset
72e6f5ea745f drm/i915/guc: Kick tasklet after queuing a request
-:8: WARNING:TYPO_SPELLING: 'inteface' may be misspelled - perhaps 'interface'?
#8: 
Fixes: 3a4cdf1982f0 ("drm/i915/guc: Implement GuC context operations for new inteface")
                                                                             ^^^^^^^^

total: 0 errors, 1 warnings, 0 checks, 7 lines checked
be92b59039f4 drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
c328800aa348 drm/i915/selftests: Fix memory corruption in live_lrc_isolation
24f6154536c7 drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
-:104: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#104: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 232 lines checked
95c83c9a3e05 drm/i915/guc: Take context ref when cancelling request
fe52b71dec8a drm/i915/guc: Don't touch guc_state.sched_state without a lock
23d73efe3398 drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
f739dd54a7ee drm/i915: Allocate error capture in nowait context
8081613c600f drm/i915/guc: Flush G2H work queue during reset
f7a536406581 drm/i915/guc: Release submit fence from an irq_work
1debca6b1bd8 drm/i915/guc: Move guc_blocked fence to struct guc_state
c8b2f80e4529 drm/i915/guc: Rework and simplify locking
3bf483be619f drm/i915/guc: Proper xarray usage for contexts_lookup
2fd6d9d96159 drm/i915/guc: Drop pin count check trick between sched_disable and re-pin
9b10873534b2 drm/i915/guc: Move GuC priority fields in context under guc_active
0e84142a2b56 drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure
a0ff703dedb1 drm/i915/guc: Drop guc_active move everything into guc_state
16fdda30af6b drm/i915/guc: Add GuC kernel doc
516f3fcdd0dd drm/i915/guc: Drop static inline functions intel_guc_submission.c




* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev3)
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (27 preceding siblings ...)
  2021-08-19  7:18 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev3) Patchwork
@ 2021-08-19  7:20 ` Patchwork
  2021-08-19  7:51 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
  2021-08-19  9:08 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
  30 siblings, 0 replies; 76+ messages in thread
From: Patchwork @ 2021-08-19  7:20 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Clean up GuC CI failures, simplify locking, and kernel DOC (rev3)
URL   : https://patchwork.freedesktop.org/series/93704/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
+drivers/gpu/drm/i915/display/intel_display.c:1901:21:    expected struct i915_vma *[assigned] vma
+drivers/gpu/drm/i915/display/intel_display.c:1901:21:    got void [noderef] __iomem *[assigned] iomem
+drivers/gpu/drm/i915/display/intel_display.c:1901:21: warning: incorrect type in assignment (different address spaces)
+drivers/gpu/drm/i915/gem/i915_gem_context.c:1374:34:    expected struct i915_address_space *vm
+drivers/gpu/drm/i915/gem/i915_gem_context.c:1374:34:    got struct i915_address_space [noderef] __rcu *vm
+drivers/gpu/drm/i915/gem/i915_gem_context.c:1374:34: warning: incorrect type in argument 1 (different address spaces)
+drivers/gpu/drm/i915/gem/selftests/mock_context.c:43:25:    expected struct i915_address_space [noderef] __rcu *vm
+drivers/gpu/drm/i915/gem/selftests/mock_context.c:43:25:    got struct i915_address_space *
+drivers/gpu/drm/i915/gem/selftests/mock_context.c:43:25: warning: incorrect type in assignment (different address spaces)
+drivers/gpu/drm/i915/gem/selftests/mock_context.c:60:34:    expected struct i915_address_space *vm
+drivers/gpu/drm/i915/gem/selftests/mock_context.c:60:34:    got struct i915_address_space [noderef] __rcu *vm
+drivers/gpu/drm/i915/gem/selftests/mock_context.c:60:34: warning: incorrect type in argument 1 (different address spaces)
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:27:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:27:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:27:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:32:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:32:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:49:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:49:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:49:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:56:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_engine_stats.h:56:9: warning: trying to copy expression type 31
+drivers/gpu/drm/i915/gt/intel_reset.c:1392:5: warning: context imbalance in 'intel_gt_reset_trylock' - different lock contexts for basic block
+drivers/gpu/drm/i915/gt/intel_ring_submission.c:1268:24: warning: Using plain integer as NULL pointer
+drivers/gpu/drm/i915/i915_perf.c:1442:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/i915_perf.c:1496:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/selftests/i915_syncmap.c:80:54: warning: dubious: x | !y
+./include/asm-generic/bitops/find.h:112:45: warning: shift count is negative (-262080)
+./include/asm-generic/bitops/find.h:32:31: warning: shift count is negative (-262080)
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen8_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen8_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen8_write8' - different lock contexts for basic block
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined




* [Intel-gfx] ✓ Fi.CI.BAT: success for Clean up GuC CI failures, simplify locking, and kernel DOC (rev3)
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (28 preceding siblings ...)
  2021-08-19  7:20 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
@ 2021-08-19  7:51 ` Patchwork
  2021-08-19  9:08 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
  30 siblings, 0 replies; 76+ messages in thread
From: Patchwork @ 2021-08-19  7:51 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx


== Series Details ==

Series: Clean up GuC CI failures, simplify locking, and kernel DOC (rev3)
URL   : https://patchwork.freedesktop.org/series/93704/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_10498 -> Patchwork_20851
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/index.html

New tests
---------

  New tests have been introduced between CI_DRM_10498 and Patchwork_20851:

### New IGT tests (1) ###

  * igt@i915_selftest@live@guc:
    - Statuses : 29 pass(s)
    - Exec time: [0.41, 5.16] s

  

Known issues
------------

  Here are the changes found in Patchwork_20851 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_cursor_legacy@basic-flip-after-cursor-varying-size:
    - fi-kbl-soraka:      [PASS][1] -> [FAIL][2] ([i915#2346])
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/fi-kbl-soraka/igt@kms_cursor_legacy@basic-flip-after-cursor-varying-size.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/fi-kbl-soraka/igt@kms_cursor_legacy@basic-flip-after-cursor-varying-size.html

  * igt@runner@aborted:
    - fi-bdw-5557u:       NOTRUN -> [FAIL][3] ([i915#1602] / [i915#2029])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/fi-bdw-5557u/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@i915_module_load@reload:
    - fi-tgl-1115g4:      [DMESG-WARN][4] -> [PASS][5]
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/fi-tgl-1115g4/igt@i915_module_load@reload.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/fi-tgl-1115g4/igt@i915_module_load@reload.html

  * igt@i915_selftest@live@gtt:
    - {fi-tgl-dsi}:       [DMESG-WARN][6] -> [PASS][7]
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/fi-tgl-dsi/igt@i915_selftest@live@gtt.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/fi-tgl-dsi/igt@i915_selftest@live@gtt.html

  * igt@i915_selftest@live@perf:
    - {fi-tgl-dsi}:       [DMESG-WARN][8] ([i915#2867]) -> [PASS][9] +6 similar issues
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/fi-tgl-dsi/igt@i915_selftest@live@perf.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/fi-tgl-dsi/igt@i915_selftest@live@perf.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [i915#1602]: https://gitlab.freedesktop.org/drm/intel/issues/1602
  [i915#2029]: https://gitlab.freedesktop.org/drm/intel/issues/2029
  [i915#2346]: https://gitlab.freedesktop.org/drm/intel/issues/2346
  [i915#2867]: https://gitlab.freedesktop.org/drm/intel/issues/2867


Participating hosts (35 -> 34)
------------------------------

  Missing    (1): fi-bsw-cyan 


Build changes
-------------

  * Linux: CI_DRM_10498 -> Patchwork_20851

  CI-20190529: 20190529
  CI_DRM_10498: b66f2ed13db3f8f7bcf616cea0e59ebe8728b131 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6178: 146260200f9a6d4536e48a195e2ab49a07d4f0c1 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_20851: 516f3fcdd0dda3d21d60a2644a099d50d7933835 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

516f3fcdd0dd drm/i915/guc: Drop static inline functions intel_guc_submission.c
16fdda30af6b drm/i915/guc: Add GuC kernel doc
a0ff703dedb1 drm/i915/guc: Drop guc_active move everything into guc_state
0e84142a2b56 drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure
9b10873534b2 drm/i915/guc: Move GuC priority fields in context under guc_active
2fd6d9d96159 drm/i915/guc: Drop pin count check trick between sched_disable and re-pin
3bf483be619f drm/i915/guc: Proper xarray usage for contexts_lookup
c8b2f80e4529 drm/i915/guc: Rework and simplify locking
1debca6b1bd8 drm/i915/guc: Move guc_blocked fence to struct guc_state
f7a536406581 drm/i915/guc: Release submit fence from an irq_work
8081613c600f drm/i915/guc: Flush G2H work queue during reset
f739dd54a7ee drm/i915: Allocate error capture in nowait context
23d73efe3398 drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
fe52b71dec8a drm/i915/guc: Don't touch guc_state.sched_state without a lock
95c83c9a3e05 drm/i915/guc: Take context ref when cancelling request
24f6154536c7 drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
c328800aa348 drm/i915/selftests: Fix memory corruption in live_lrc_isolation
be92b59039f4 drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
72e6f5ea745f drm/i915/guc: Kick tasklet after queuing a request
b13cbbf2cc00 drm/i915/selftests: Add a cancel request selftest that triggers a reset
7cc5849adb41 Revert "drm/i915/gt: Propagate change in error status to children on unhold"
54c8cf84e527 drm/i915/guc: Workaround reset G2H is received after schedule done G2H
c2e59d2d1528 drm/i915/guc: Process all G2H message at once in work queue
ef46dfb56828 drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
75fb55198651 drm/i915/guc: Unwind context requests in reverse order
de712a42478e drm/i915/guc: Fix outstanding G2H accounting
cdd86549dd1d drm/i915/guc: Fix blocked context accounting

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/index.html



* [Intel-gfx] ✗ Fi.CI.IGT: failure for Clean up GuC CI failures, simplify locking, and kernel DOC (rev3)
  2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
                   ` (29 preceding siblings ...)
  2021-08-19  7:51 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
@ 2021-08-19  9:08 ` Patchwork
  30 siblings, 0 replies; 76+ messages in thread
From: Patchwork @ 2021-08-19  9:08 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx


== Series Details ==

Series: Clean up GuC CI failures, simplify locking, and kernel DOC (rev3)
URL   : https://patchwork.freedesktop.org/series/93704/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_10498_full -> Patchwork_20851_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_20851_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_20851_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_20851_full:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_eio@unwedge-stress:
    - shard-skl:          [PASS][1] -> [FAIL][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-skl3/igt@gem_eio@unwedge-stress.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl9/igt@gem_eio@unwedge-stress.html

  
Known issues
------------

  Here are the changes found in Patchwork_20851_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@feature_discovery@display-4x:
    - shard-iclb:         NOTRUN -> [SKIP][3] ([i915#1839])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@feature_discovery@display-4x.html

  * igt@gem_create@create-massive:
    - shard-snb:          NOTRUN -> [DMESG-WARN][4] ([i915#3002])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-snb7/igt@gem_create@create-massive.html

  * igt@gem_ctx_persistence@legacy-engines-queued:
    - shard-snb:          NOTRUN -> [SKIP][5] ([fdo#109271] / [i915#1099]) +1 similar issue
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-snb7/igt@gem_ctx_persistence@legacy-engines-queued.html

  * igt@gem_eio@unwedge-stress:
    - shard-snb:          NOTRUN -> [FAIL][6] ([i915#3354])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-snb7/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_fair@basic-pace@bcs0:
    - shard-iclb:         [PASS][7] -> [FAIL][8] ([i915#2842])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-iclb8/igt@gem_exec_fair@basic-pace@bcs0.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb6/igt@gem_exec_fair@basic-pace@bcs0.html

  * igt@gem_exec_fair@basic-pace@vcs1:
    - shard-kbl:          [PASS][9] -> [FAIL][10] ([i915#2842])
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl4/igt@gem_exec_fair@basic-pace@vcs1.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl6/igt@gem_exec_fair@basic-pace@vcs1.html
    - shard-tglb:         [PASS][11] -> [FAIL][12] ([i915#2842]) +2 similar issues
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-tglb3/igt@gem_exec_fair@basic-pace@vcs1.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-tglb7/igt@gem_exec_fair@basic-pace@vcs1.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-kbl:          [PASS][13] -> [SKIP][14] ([fdo#109271]) +1 similar issue
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl4/igt@gem_exec_fair@basic-pace@vecs0.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl6/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@gem_exec_fair@basic-throttle@rcs0:
    - shard-glk:          [PASS][15] -> [FAIL][16] ([i915#2842])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-glk5/igt@gem_exec_fair@basic-throttle@rcs0.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-glk4/igt@gem_exec_fair@basic-throttle@rcs0.html

  * igt@gem_exec_flush@basic-batch-kernel-default-cmd:
    - shard-iclb:         NOTRUN -> [SKIP][17] ([fdo#109313])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@gem_exec_flush@basic-batch-kernel-default-cmd.html

  * igt@gem_exec_whisper@basic-queues-priority:
    - shard-glk:          [PASS][18] -> [DMESG-WARN][19] ([i915#118] / [i915#95])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-glk9/igt@gem_exec_whisper@basic-queues-priority.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-glk3/igt@gem_exec_whisper@basic-queues-priority.html

  * igt@gem_mmap_gtt@cpuset-big-copy-xy:
    - shard-iclb:         [PASS][20] -> [FAIL][21] ([i915#307])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-iclb6/igt@gem_mmap_gtt@cpuset-big-copy-xy.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb1/igt@gem_mmap_gtt@cpuset-big-copy-xy.html

  * igt@gem_pread@exhaustion:
    - shard-apl:          NOTRUN -> [WARN][22] ([i915#2658])
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl6/igt@gem_pread@exhaustion.html

  * igt@gem_pwrite@basic-exhaustion:
    - shard-skl:          NOTRUN -> [WARN][23] ([i915#2658])
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl6/igt@gem_pwrite@basic-exhaustion.html

  * igt@gen3_render_tiledy_blits:
    - shard-iclb:         NOTRUN -> [SKIP][24] ([fdo#109289])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@gen3_render_tiledy_blits.html

  * igt@kms_big_fb@x-tiled-max-hw-stride-32bpp-rotate-180-hflip:
    - shard-kbl:          NOTRUN -> [SKIP][25] ([fdo#109271] / [i915#3777])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl3/igt@kms_big_fb@x-tiled-max-hw-stride-32bpp-rotate-180-hflip.html

  * igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-hflip:
    - shard-apl:          NOTRUN -> [SKIP][26] ([fdo#109271] / [i915#3777])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl7/igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-hflip.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-180-async-flip:
    - shard-skl:          NOTRUN -> [FAIL][27] ([i915#3722]) +1 similar issue
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl8/igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-180-async-flip.html

  * igt@kms_big_fb@yf-tiled-8bpp-rotate-270:
    - shard-iclb:         NOTRUN -> [SKIP][28] ([fdo#110723])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_big_fb@yf-tiled-8bpp-rotate-270.html

  * igt@kms_ccs@pipe-a-crc-primary-rotation-180-y_tiled_gen12_rc_ccs_cc:
    - shard-skl:          NOTRUN -> [SKIP][29] ([fdo#109271] / [i915#3886]) +6 similar issues
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl1/igt@kms_ccs@pipe-a-crc-primary-rotation-180-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-bad-rotation-90-y_tiled_gen12_rc_ccs_cc:
    - shard-apl:          NOTRUN -> [SKIP][30] ([fdo#109271] / [i915#3886]) +3 similar issues
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl6/igt@kms_ccs@pipe-b-bad-rotation-90-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-c-bad-pixel-format-y_tiled_gen12_mc_ccs:
    - shard-kbl:          NOTRUN -> [SKIP][31] ([fdo#109271] / [i915#3886])
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl3/igt@kms_ccs@pipe-c-bad-pixel-format-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-c-missing-ccs-buffer-y_tiled_gen12_rc_ccs_cc:
    - shard-iclb:         NOTRUN -> [SKIP][32] ([fdo#109278] / [i915#3886]) +1 similar issue
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_ccs@pipe-c-missing-ccs-buffer-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-c-random-ccs-data-y_tiled_gen12_rc_ccs:
    - shard-apl:          NOTRUN -> [SKIP][33] ([fdo#109271]) +81 similar issues
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl7/igt@kms_ccs@pipe-c-random-ccs-data-y_tiled_gen12_rc_ccs.html

  * igt@kms_chamelium@dp-mode-timings:
    - shard-apl:          NOTRUN -> [SKIP][34] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl6/igt@kms_chamelium@dp-mode-timings.html

  * igt@kms_chamelium@hdmi-hpd-enable-disable-mode:
    - shard-kbl:          NOTRUN -> [SKIP][35] ([fdo#109271] / [fdo#111827])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl3/igt@kms_chamelium@hdmi-hpd-enable-disable-mode.html

  * igt@kms_chamelium@hdmi-hpd-fast:
    - shard-snb:          NOTRUN -> [SKIP][36] ([fdo#109271] / [fdo#111827]) +17 similar issues
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-snb5/igt@kms_chamelium@hdmi-hpd-fast.html

  * igt@kms_chamelium@hdmi-hpd-storm:
    - shard-iclb:         NOTRUN -> [SKIP][37] ([fdo#109284] / [fdo#111827])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_chamelium@hdmi-hpd-storm.html

  * igt@kms_chamelium@vga-hpd-after-suspend:
    - shard-skl:          NOTRUN -> [SKIP][38] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl8/igt@kms_chamelium@vga-hpd-after-suspend.html

  * igt@kms_color@pipe-a-ctm-0-5:
    - shard-skl:          [PASS][39] -> [DMESG-WARN][40] ([i915#1982])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-skl8/igt@kms_color@pipe-a-ctm-0-5.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl7/igt@kms_color@pipe-a-ctm-0-5.html

  * igt@kms_color@pipe-d-ctm-max:
    - shard-iclb:         NOTRUN -> [SKIP][41] ([fdo#109278] / [i915#1149])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_color@pipe-d-ctm-max.html

  * igt@kms_cursor_crc@pipe-a-cursor-512x512-offscreen:
    - shard-iclb:         NOTRUN -> [SKIP][42] ([fdo#109278] / [fdo#109279]) +2 similar issues
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_cursor_crc@pipe-a-cursor-512x512-offscreen.html

  * igt@kms_cursor_crc@pipe-a-cursor-max-size-offscreen:
    - shard-iclb:         NOTRUN -> [SKIP][43] ([fdo#109278]) +3 similar issues
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_cursor_crc@pipe-a-cursor-max-size-offscreen.html

  * igt@kms_cursor_crc@pipe-a-cursor-suspend:
    - shard-kbl:          [PASS][44] -> [DMESG-WARN][45] ([i915#180]) +5 similar issues
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl7/igt@kms_cursor_crc@pipe-a-cursor-suspend.html
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl4/igt@kms_cursor_crc@pipe-a-cursor-suspend.html

  * igt@kms_cursor_edge_walk@pipe-d-128x128-right-edge:
    - shard-snb:          NOTRUN -> [SKIP][46] ([fdo#109271]) +304 similar issues
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-snb7/igt@kms_cursor_edge_walk@pipe-d-128x128-right-edge.html

  * igt@kms_cursor_legacy@pipe-d-torture-move:
    - shard-skl:          NOTRUN -> [SKIP][47] ([fdo#109271]) +89 similar issues
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl8/igt@kms_cursor_legacy@pipe-d-torture-move.html

  * igt@kms_fbcon_fbt@fbc-suspend:
    - shard-apl:          [PASS][48] -> [INCOMPLETE][49] ([i915#180])
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-apl6/igt@kms_fbcon_fbt@fbc-suspend.html
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl3/igt@kms_fbcon_fbt@fbc-suspend.html
    - shard-kbl:          [PASS][50] -> [INCOMPLETE][51] ([i915#155] / [i915#180] / [i915#636])
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl6/igt@kms_fbcon_fbt@fbc-suspend.html
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl6/igt@kms_fbcon_fbt@fbc-suspend.html

  * igt@kms_flip@2x-flip-vs-fences-interruptible:
    - shard-iclb:         NOTRUN -> [SKIP][52] ([fdo#109274])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_flip@2x-flip-vs-fences-interruptible.html

  * igt@kms_flip@flip-vs-suspend-interruptible@a-dp1:
    - shard-apl:          [PASS][53] -> [DMESG-WARN][54] ([i915#180]) +1 similar issue
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-apl1/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl1/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs:
    - shard-skl:          NOTRUN -> [SKIP][55] ([fdo#109271] / [i915#2672])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl6/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-32bpp-ytilercccs:
    - shard-apl:          NOTRUN -> [SKIP][56] ([fdo#109271] / [i915#2672])
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl7/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-32bpp-ytilercccs.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-mmap-cpu:
    - shard-iclb:         NOTRUN -> [SKIP][57] ([fdo#109280]) +2 similar issues
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-mmap-cpu.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-cur-indfb-draw-blt:
    - shard-kbl:          NOTRUN -> [SKIP][58] ([fdo#109271]) +17 similar issues
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl3/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-cur-indfb-draw-blt.html

  * igt@kms_hdr@bpc-switch:
    - shard-apl:          [PASS][59] -> [FAIL][60] ([i915#1188])
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-apl7/igt@kms_hdr@bpc-switch.html
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl1/igt@kms_hdr@bpc-switch.html
    - shard-skl:          [PASS][61] -> [FAIL][62] ([i915#1188])
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-skl1/igt@kms_hdr@bpc-switch.html
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl3/igt@kms_hdr@bpc-switch.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-d:
    - shard-apl:          NOTRUN -> [SKIP][63] ([fdo#109271] / [i915#533])
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl7/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-d.html

  * igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes:
    - shard-apl:          NOTRUN -> [DMESG-WARN][64] ([i915#180]) +1 similar issue
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl6/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes.html

  * igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb:
    - shard-apl:          NOTRUN -> [FAIL][65] ([fdo#108145] / [i915#265])
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl7/igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb.html

  * igt@kms_plane_alpha_blend@pipe-c-constant-alpha-max:
    - shard-skl:          NOTRUN -> [FAIL][66] ([fdo#108145] / [i915#265])
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl8/igt@kms_plane_alpha_blend@pipe-c-constant-alpha-max.html

  * igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-5:
    - shard-skl:          NOTRUN -> [SKIP][67] ([fdo#109271] / [i915#658])
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl8/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-5.html

  * igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-2:
    - shard-apl:          NOTRUN -> [SKIP][68] ([fdo#109271] / [i915#658])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl6/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-2.html

  * igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-5:
    - shard-iclb:         NOTRUN -> [SKIP][69] ([i915#658])
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-5.html

  * igt@kms_psr@psr2_primary_page_flip:
    - shard-iclb:         NOTRUN -> [SKIP][70] ([fdo#109441])
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@kms_psr@psr2_primary_page_flip.html

  * igt@kms_psr@psr2_sprite_render:
    - shard-iclb:         [PASS][71] -> [SKIP][72] ([fdo#109441]) +1 similar issue
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-iclb2/igt@kms_psr@psr2_sprite_render.html
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb8/igt@kms_psr@psr2_sprite_render.html

  * igt@kms_setmode@basic:
    - shard-snb:          NOTRUN -> [FAIL][73] ([i915#31])
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-snb7/igt@kms_setmode@basic.html

  * igt@kms_vblank@pipe-b-ts-continuation-suspend:
    - shard-skl:          [PASS][74] -> [INCOMPLETE][75] ([i915#198] / [i915#2828])
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-skl6/igt@kms_vblank@pipe-b-ts-continuation-suspend.html
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl6/igt@kms_vblank@pipe-b-ts-continuation-suspend.html

  * igt@prime_nv_pcopy@test_semaphore:
    - shard-iclb:         NOTRUN -> [SKIP][76] ([fdo#109291])
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@prime_nv_pcopy@test_semaphore.html

  * igt@sysfs_clients@create:
    - shard-skl:          NOTRUN -> [SKIP][77] ([fdo#109271] / [i915#2994])
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl8/igt@sysfs_clients@create.html

  * igt@sysfs_clients@pidname:
    - shard-apl:          NOTRUN -> [SKIP][78] ([fdo#109271] / [i915#2994])
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl7/igt@sysfs_clients@pidname.html

  * igt@sysfs_clients@sema-25:
    - shard-iclb:         NOTRUN -> [SKIP][79] ([i915#2994])
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb5/igt@sysfs_clients@sema-25.html

  
#### Possible fixes ####

  * igt@gem_exec_fair@basic-deadline:
    - shard-glk:          [FAIL][80] ([i915#2846]) -> [PASS][81]
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-glk6/igt@gem_exec_fair@basic-deadline.html
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-glk9/igt@gem_exec_fair@basic-deadline.html

  * igt@gem_exec_fair@basic-pace-share@rcs0:
    - shard-glk:          [FAIL][82] ([i915#2842]) -> [PASS][83] +1 similar issue
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-glk9/igt@gem_exec_fair@basic-pace-share@rcs0.html
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-glk3/igt@gem_exec_fair@basic-pace-share@rcs0.html

  * igt@gem_exec_fair@basic-pace@vcs0:
    - shard-kbl:          [SKIP][84] ([fdo#109271]) -> [PASS][85]
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl4/igt@gem_exec_fair@basic-pace@vcs0.html
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl6/igt@gem_exec_fair@basic-pace@vcs0.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-tglb:         [FAIL][86] ([i915#2842]) -> [PASS][87]
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-tglb3/igt@gem_exec_fair@basic-pace@vecs0.html
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-tglb7/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@kms_dither@fb-8bpc-vs-panel-8bpc@edp-1-pipe-a:
    - shard-iclb:         [SKIP][88] ([i915#3788]) -> [PASS][89]
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-iclb2/igt@kms_dither@fb-8bpc-vs-panel-8bpc@edp-1-pipe-a.html
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb8/igt@kms_dither@fb-8bpc-vs-panel-8bpc@edp-1-pipe-a.html

  * igt@kms_flip@flip-vs-expired-vblank@a-edp1:
    - shard-skl:          [FAIL][90] ([i915#2122]) -> [PASS][91] +2 similar issues
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-skl1/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl3/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html

  * igt@kms_flip@flip-vs-suspend-interruptible@a-dp1:
    - shard-kbl:          [DMESG-WARN][92] ([i915#180]) -> [PASS][93] +2 similar issues
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl6/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl3/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html

  * igt@kms_hdr@bpc-switch-suspend:
    - shard-skl:          [FAIL][94] ([i915#1188]) -> [PASS][95]
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-skl7/igt@kms_hdr@bpc-switch-suspend.html
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl5/igt@kms_hdr@bpc-switch-suspend.html

  * igt@kms_plane_alpha_blend@pipe-a-coverage-7efc:
    - shard-skl:          [FAIL][96] ([fdo#108145] / [i915#265]) -> [PASS][97]
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-skl8/igt@kms_plane_alpha_blend@pipe-a-coverage-7efc.html
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl7/igt@kms_plane_alpha_blend@pipe-a-coverage-7efc.html

  * igt@kms_psr@psr2_cursor_render:
    - shard-iclb:         [SKIP][98] ([fdo#109441]) -> [PASS][99] +1 similar issue
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-iclb6/igt@kms_psr@psr2_cursor_render.html
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb2/igt@kms_psr@psr2_cursor_render.html

  * igt@kms_vblank@pipe-b-ts-continuation-suspend:
    - shard-apl:          [DMESG-WARN][100] ([i915#180]) -> [PASS][101] +1 similar issue
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-apl7/igt@kms_vblank@pipe-b-ts-continuation-suspend.html
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl7/igt@kms_vblank@pipe-b-ts-continuation-suspend.html

  
#### Warnings ####

  * igt@i915_pm_rc6_residency@rc6-idle:
    - shard-iclb:         [WARN][102] ([i915#1804] / [i915#2684]) -> [WARN][103] ([i915#2684])
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-iclb6/igt@i915_pm_rc6_residency@rc6-idle.html
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb1/igt@i915_pm_rc6_residency@rc6-idle.html

  * igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-2:
    - shard-iclb:         [SKIP][104] ([i915#2920]) -> [SKIP][105] ([i915#658])
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-iclb2/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-2.html
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb8/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-2.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-4:
    - shard-iclb:         [SKIP][106] ([i915#658]) -> [SKIP][107] ([i915#2920]) +1 similar issue
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-iclb6/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-4.html
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-iclb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-4.html

  * igt@runner@aborted:
    - shard-kbl:          ([FAIL][108], [FAIL][109], [FAIL][110], [FAIL][111]) ([i915#180] / [i915#2505] / [i915#3002] / [i915#3363]) -> ([FAIL][112], [FAIL][113], [FAIL][114], [FAIL][115], [FAIL][116], [FAIL][117], [FAIL][118]) ([i915#180] / [i915#1814] / [i915#2505] / [i915#3002] / [i915#3363] / [i915#92])
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl6/igt@runner@aborted.html
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl6/igt@runner@aborted.html
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl4/igt@runner@aborted.html
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-kbl6/igt@runner@aborted.html
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl3/igt@runner@aborted.html
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl2/igt@runner@aborted.html
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl4/igt@runner@aborted.html
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl4/igt@runner@aborted.html
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl4/igt@runner@aborted.html
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl6/igt@runner@aborted.html
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-kbl6/igt@runner@aborted.html
    - shard-apl:          ([FAIL][119], [FAIL][120], [FAIL][121], [FAIL][122], [FAIL][123]) ([i915#180] / [i915#1814] / [i915#3002] / [i915#3363]) -> ([FAIL][124], [FAIL][125], [FAIL][126], [FAIL][127], [FAIL][128], [FAIL][129]) ([fdo#109271] / [i915#1610] / [i915#180] / [i915#1814] / [i915#2292] / [i915#3002] / [i915#3363])
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-apl8/igt@runner@aborted.html
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-apl1/igt@runner@aborted.html
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-apl3/igt@runner@aborted.html
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-apl6/igt@runner@aborted.html
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-apl7/igt@runner@aborted.html
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl6/igt@runner@aborted.html
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl2/igt@runner@aborted.html
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl3/igt@runner@aborted.html
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl6/igt@runner@aborted.html
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl1/igt@runner@aborted.html
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-apl3/igt@runner@aborted.html
    - shard-skl:          ([FAIL][130], [FAIL][131]) ([i915#3002] / [i915#3363]) -> ([FAIL][132], [FAIL][133], [FAIL][134]) ([i915#1814] / [i915#2029] / [i915#3002] / [i915#3363])
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-skl5/igt@runner@aborted.html
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10498/shard-skl8/igt@runner@aborted.html
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl9/igt@runner@aborted.html
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl4/igt@runner@aborted.html
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/shard-skl5/igt@runner@aborted.html

  
  [fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
  [fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
  [fdo#109279]: https://bugs.freedesktop.org/show_bug.cgi?id=109279
  [fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
  [fdo#109284]: https://bugs.freedesktop.org/show_bug.cgi?id=109284
  [fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
  [fdo#109291]: https://bugs.freedesktop.org/show_bug.cgi?id=109291
  [fdo#109313]: https://bugs.freedesktop.org/show_bug.cgi?id=109313
  [fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
  [fdo#110723]: https://bugs.freedesktop.org/show_bug.cgi?id=110723
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1099]: https://gitlab.freedesktop.org/drm/intel/issues/1099
  [i915#1149]: https://gitlab.freedesktop.org/drm/intel/issues/1149
  [i915#118]: https://gitlab.freedesktop.org/drm/intel/issues/118
  [i915#1188]: https://gitlab.freedesktop.org/drm/intel/issues/1188
  [i915#155]: https://gitlab.freedesktop.org/drm/intel/issues/155
  [i915#1610]: https://gitlab.freedesktop.org/drm/intel/issues/1610
  [i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
  [i915#1804]: https://gitlab.freedesktop.org/drm/intel/issues/1804
  [i915#1814]: https://gitlab.freedesktop.org/drm/intel/issues/1814
  [i915#1839]: https://gitlab.freedesktop.org/drm/intel/issues/1839
  [i915#198]: https://gitlab.freedesktop.org/drm/intel/issues/198
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#2029]: https://gitlab.freedesktop.org/drm/intel/issues/2029
  [i915#2122]: https://gitlab.freedesktop.org/drm/intel/issues/2122
  [i915#2292]: https://gitlab.freedesktop.org/drm/intel/issues/2292
  [i915#2505]: https://gitlab.freedesktop.org/drm/intel/issues/2505
  [i915#265]: https://gitlab.freedesktop.org/drm/intel/issues/265
  [i915#2658]: https://gitlab.freedesktop.org/drm/intel/issues/2658
  [i915#2672]: https://gitlab.freedesktop.org/drm/intel/issues/2672
  [i915#2684]: https://gitlab.freedesktop.org/drm/intel/issues/2684
  [i915#2828]: https://gitlab.freedesktop.org/drm/intel/issues/2828
  [i915#2842]: https://gitlab.freedesktop.org/drm/intel/issues/2842
  [i915#2846]: https://gitlab.freedesktop.org/drm/intel/issues/2846
  [i915#2920]: https://gitlab.freedesktop.org/drm/intel/issues/2920
  [i915#2994]: https://gitlab.freedesktop.org/drm/intel/issues/2994
  [i915#3002]: https://gitlab.freedesktop.org/drm/intel/issues/3002
  [i915#307]: https://gitlab.freedesktop.org/drm/intel/issues/307
  [i915#31]: https://gitlab.freedesktop.org/drm/intel/issues/31
  [i915#3354]: https://gitlab.freedesktop.org/drm/intel/issues/3354
  [i915#3363]: https://gitlab.freedesktop.org/drm/intel/issues/3363
  [i915#3722]: https://gitlab.freedesktop.org/drm/intel/issues/3722
  [i915#3777]: https://gitlab.freedesktop.org/drm/intel/issues/3777
  [i915#3788]: https://gitlab.freedesktop.org/drm/intel/issues/3788
  [i915#3886]: https://gitlab.freedesktop.org/drm/intel/issues/3886

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20851/index.html



* Re: [Intel-gfx] [PATCH 02/27] drm/i915/guc: Fix outstanding G2H accounting
  2021-08-19 21:31   ` Daniele Ceraolo Spurio
@ 2021-08-19 21:30     ` Matthew Brost
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-19 21:30 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Thu, Aug 19, 2021 at 02:31:51PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > There is a small race that could result in incorrect accounting of the
> > number of outstanding G2H. Basically, prior to this patch we did not
> > increment the number of outstanding G2H if we encountered a GT reset
> > while sending an H2G. This was incorrect, as the context state had
> > already been updated to anticipate a G2H response, so the counter
> > should be incremented.
> > 
> > Also, always use the helper when decrementing this value.
> > 
> > Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 ++++++++++---------
> >   1 file changed, 13 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 69faa39da178..32c414aa9009 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -352,6 +352,12 @@ static inline void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
> >   	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >   }
> > +static void decr_outstanding_submission_g2h(struct intel_guc *guc)
> > +{
> > +	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
> > +		wake_up_all(&guc->ct.wq);
> > +}
> > +
> >   static int guc_submission_send_busy_loop(struct intel_guc *guc,
> >   					 const u32 *action,
> >   					 u32 len,
> > @@ -360,11 +366,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
> >   {
> >   	int err;
> > -	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > -
> > -	if (!err && g2h_len_dw)
> > +	if (g2h_len_dw)
> >   		atomic_inc(&guc->outstanding_submission_g2h);
> > +	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> > +	if (err == -EBUSY && g2h_len_dw)
> > +		decr_outstanding_submission_g2h(guc);
> > +
> 
> Here you're special-casing -EBUSY, which implies that the caller needs to
> handle this differently, but most callers seem to ignore the return code.
> Is the counter handled somewhere else in the -EBUSY case? If so, please
> add a comment about it.
> 

Good catch, this is a dead code path. Will delete.

Matt

> Daniele
> 
> >   	return err;
> >   }
> > @@ -616,7 +624,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   		init_sched_state(ce);
> >   		if (pending_enable || destroyed || deregister) {
> > -			atomic_dec(&guc->outstanding_submission_g2h);
> > +			decr_outstanding_submission_g2h(guc);
> >   			if (deregister)
> >   				guc_signal_context_fence(ce);
> >   			if (destroyed) {
> > @@ -635,7 +643,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   				intel_engine_signal_breadcrumbs(ce->engine);
> >   			}
> >   			intel_context_sched_disable_unpin(ce);
> > -			atomic_dec(&guc->outstanding_submission_g2h);
> > +			decr_outstanding_submission_g2h(guc);
> >   			spin_lock_irqsave(&ce->guc_state.lock, flags);
> >   			guc_blocked_fence_complete(ce);
> >   			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > @@ -2583,12 +2591,6 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> >   	return ce;
> >   }
> > -static void decr_outstanding_submission_g2h(struct intel_guc *guc)
> > -{
> > -	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
> > -		wake_up_all(&guc->ct.wq);
> > -}
> > -
> >   int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
> >   					  const u32 *msg,
> >   					  u32 len)
> 


* Re: [Intel-gfx] [PATCH 02/27] drm/i915/guc: Fix outstanding G2H accounting
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 02/27] drm/i915/guc: Fix outstanding G2H accounting Matthew Brost
@ 2021-08-19 21:31   ` Daniele Ceraolo Spurio
  2021-08-19 21:30     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-19 21:31 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> There is a small race that could result in incorrect accounting of the
> number of outstanding G2H. Basically, prior to this patch we did not
> increment the number of outstanding G2H if we encountered a GT reset
> while sending an H2G. This was incorrect, as the context state had
> already been updated to anticipate a G2H response, so the counter
> should be incremented.
>
> Also, always use the helper when decrementing this value.
>
> Fixes: f4eb1f3fe946 ("drm/i915/guc: Ensure G2H response has space in buffer")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 ++++++++++---------
>   1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 69faa39da178..32c414aa9009 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -352,6 +352,12 @@ static inline void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
>   	xa_unlock_irqrestore(&guc->context_lookup, flags);
>   }
>   
> +static void decr_outstanding_submission_g2h(struct intel_guc *guc)
> +{
> +	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
> +		wake_up_all(&guc->ct.wq);
> +}
> +
>   static int guc_submission_send_busy_loop(struct intel_guc *guc,
>   					 const u32 *action,
>   					 u32 len,
> @@ -360,11 +366,13 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
>   {
>   	int err;
>   
> -	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> -
> -	if (!err && g2h_len_dw)
> +	if (g2h_len_dw)
>   		atomic_inc(&guc->outstanding_submission_g2h);
>   
> +	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> +	if (err == -EBUSY && g2h_len_dw)
> +		decr_outstanding_submission_g2h(guc);
> +

here you're special casing  -EBUSY, which kind of implies that the 
caller needs to handle this differently, but most callers seem to ignore 
the return code. Is the counter handled somewhere else in case of EBUSY? 
if so, please add a comment about it.

Daniele

>   	return err;
>   }
>   
> @@ -616,7 +624,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   		init_sched_state(ce);
>   
>   		if (pending_enable || destroyed || deregister) {
> -			atomic_dec(&guc->outstanding_submission_g2h);
> +			decr_outstanding_submission_g2h(guc);
>   			if (deregister)
>   				guc_signal_context_fence(ce);
>   			if (destroyed) {
> @@ -635,7 +643,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   				intel_engine_signal_breadcrumbs(ce->engine);
>   			}
>   			intel_context_sched_disable_unpin(ce);
> -			atomic_dec(&guc->outstanding_submission_g2h);
> +			decr_outstanding_submission_g2h(guc);
>   			spin_lock_irqsave(&ce->guc_state.lock, flags);
>   			guc_blocked_fence_complete(ce);
>   			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> @@ -2583,12 +2591,6 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>   	return ce;
>   }
>   
> -static void decr_outstanding_submission_g2h(struct intel_guc *guc)
> -{
> -	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
> -		wake_up_all(&guc->ct.wq);
> -}
> -
>   int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
>   					  const u32 *msg,
>   					  u32 len)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 03/27] drm/i915/guc: Unwind context requests in reverse order
  2021-08-19 23:54   ` Daniele Ceraolo Spurio
@ 2021-08-19 23:53     ` Matthew Brost
  2021-08-20  0:03       ` Daniele Ceraolo Spurio
  0 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-19 23:53 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Thu, Aug 19, 2021 at 04:54:00PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > When unwinding requests on a reset context, if other requests in the
> > context are in the priority list, the requests could be resubmitted out
> > of seqno order. Traverse the list of active requests in reverse and
> > append to the head of the priority list to fix this.
> > 
> > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 ++++----
> >   1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 32c414aa9009..9ca0ba4ea85a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -805,9 +805,9 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >   	spin_lock_irqsave(&sched_engine->lock, flags);
> >   	spin_lock(&ce->guc_active.lock);
> > -	list_for_each_entry_safe(rq, rn,
> > -				 &ce->guc_active.requests,
> > -				 sched.link) {
> > +	list_for_each_entry_safe_reverse(rq, rn,
> > +					 &ce->guc_active.requests,
> > +					 sched.link) {
> >   		if (i915_request_completed(rq))
> 
> The execlists unwind function has a list_del if the request is completed.
> Any reason not to do that here?
> 

Definitely isn't needed here as this is done in remove_from_context();
it's probably not needed in execlists mode either.


> >   			continue;
> > @@ -824,7 +824,7 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >   		}
> >   		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
> > -		list_add_tail(&rq->sched.link, pl);
> > +		list_add(&rq->sched.link, pl);
> 
> Since you always do both list_del and list_add and it doesn't look like you
> use the fact that the list is empty between the 2 calls, you can merge them
> in a list_move.
>

Can't use a list_move here because we drop
spin_lock(&ce->guc_active.lock); that gets fixed later in the series, and
at that point we likely can use a list_move.
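For reference, the shape at this point in the series is roughly (condensed
from the diff in the next patch; sched_engine->lock stays held throughout):

	list_del_init(&rq->sched.link);
	spin_unlock(&ce->guc_active.lock);

	__i915_request_unsubmit(rq);

	/* pl is protected by sched_engine->lock, still held here */
	list_add(&rq->sched.link, pl);

	spin_lock(&ce->guc_active.lock);

With the del and the add on opposite sides of that unlock/lock pair, a
single list_move only becomes safe once the lock is held across the loop.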

Matt

> Apart from these nits, the change to navigate the list in reverse and append
> here at the top LGTM.
> 
> Daniele
> 
> >   		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> >   		spin_lock(&ce->guc_active.lock);
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 03/27] drm/i915/guc: Unwind context requests in reverse order
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 03/27] drm/i915/guc: Unwind context requests in reverse order Matthew Brost
@ 2021-08-19 23:54   ` Daniele Ceraolo Spurio
  2021-08-19 23:53     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-19 23:54 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> When unwinding requests on a reset context, if other requests in the
> context are in the priority list, the requests could be resubmitted out
> of seqno order. Traverse the list of active requests in reverse and
> append to the head of the priority list to fix this.
>
> Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 32c414aa9009..9ca0ba4ea85a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -805,9 +805,9 @@ __unwind_incomplete_requests(struct intel_context *ce)
>   
>   	spin_lock_irqsave(&sched_engine->lock, flags);
>   	spin_lock(&ce->guc_active.lock);
> -	list_for_each_entry_safe(rq, rn,
> -				 &ce->guc_active.requests,
> -				 sched.link) {
> +	list_for_each_entry_safe_reverse(rq, rn,
> +					 &ce->guc_active.requests,
> +					 sched.link) {
>   		if (i915_request_completed(rq))

The execlists unwind function has a list_del if the request is 
completed. Any reason not to do that here?

>   			continue;
>   
> @@ -824,7 +824,7 @@ __unwind_incomplete_requests(struct intel_context *ce)
>   		}
>   		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
>   
> -		list_add_tail(&rq->sched.link, pl);
> +		list_add(&rq->sched.link, pl);

Since you always do both list_del and list_add, and it doesn't look like
you use the fact that the list is empty between the 2 calls, you can
merge them into a list_move.
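Something like this (just a sketch, assuming the locking allows holding
ce->guc_active.lock across both halves):

	/* unlink from ce->guc_active.requests and splice to the head of pl */
	list_move(&rq->sched.link, pl);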

Apart from these nits, the change to navigate the list in reverse and 
append here at the top LGTM.

Daniele

>   		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>   
>   		spin_lock(&ce->guc_active.lock);


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 04/27] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
  2021-08-20  0:01   ` Daniele Ceraolo Spurio
@ 2021-08-19 23:58     ` Matthew Brost
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-19 23:58 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Thu, Aug 19, 2021 at 05:01:03PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > Don't drop ce->guc_active.lock when unwinding a context after reset.
> > At one point we had to drop this because of a lock inversion but that is
> > no longer the case. It is much safer to hold the lock so let's do that.
> > 
> > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> 
> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> 
> Do we have a trybot of this series with GuC enabled? I've checked the
> functions called in the previously unlocked chunk and didn't spot anything
> requiring the lock to be dropped, but I'd feel safer if we had lockdep
> results as well.
> 

RKL uses GuC submission with BAT. This has been thoroughly tested by me
too, with no lockdep splats. Can kick off a trybot on the next rev of this
series too.

Matt

> Daniele
> 
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ----
> >   1 file changed, 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 9ca0ba4ea85a..e4a099f8f820 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -812,8 +812,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >   			continue;
> >   		list_del_init(&rq->sched.link);
> > -		spin_unlock(&ce->guc_active.lock);
> > -
> >   		__i915_request_unsubmit(rq);
> >   		/* Push the request back into the queue for later resubmission. */
> > @@ -826,8 +824,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >   		list_add(&rq->sched.link, pl);
> >   		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> > -
> > -		spin_lock(&ce->guc_active.lock);
> >   	}
> >   	spin_unlock(&ce->guc_active.lock);
> >   	spin_unlock_irqrestore(&sched_engine->lock, flags);
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 04/27] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 04/27] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context Matthew Brost
@ 2021-08-20  0:01   ` Daniele Ceraolo Spurio
  2021-08-19 23:58     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-20  0:01 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Don't drop ce->guc_active.lock when unwinding a context after reset.
> At one point we had to drop this because of a lock inversion but that is
> no longer the case. It is much safer to hold the lock so let's do that.
>
> Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Do we have a trybot of this series with GuC enabled? I've checked the 
functions called in the previously unlocked chunk and didn't spot 
anything requiring the lock to be dropped, but I'd feel safer if we had 
lockdep results as well.

Daniele

> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ----
>   1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 9ca0ba4ea85a..e4a099f8f820 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -812,8 +812,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
>   			continue;
>   
>   		list_del_init(&rq->sched.link);
> -		spin_unlock(&ce->guc_active.lock);
> -
>   		__i915_request_unsubmit(rq);
>   
>   		/* Push the request back into the queue for later resubmission. */
> @@ -826,8 +824,6 @@ __unwind_incomplete_requests(struct intel_context *ce)
>   
>   		list_add(&rq->sched.link, pl);
>   		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> -
> -		spin_lock(&ce->guc_active.lock);
>   	}
>   	spin_unlock(&ce->guc_active.lock);
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 03/27] drm/i915/guc: Unwind context requests in reverse order
  2021-08-19 23:53     ` Matthew Brost
@ 2021-08-20  0:03       ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-20  0:03 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter



On 8/19/2021 4:53 PM, Matthew Brost wrote:
> On Thu, Aug 19, 2021 at 04:54:00PM -0700, Daniele Ceraolo Spurio wrote:
>>
>> On 8/18/2021 11:16 PM, Matthew Brost wrote:
>>> When unwinding requests on a reset context, if other requests in the
>>> context are in the priority list, the requests could be resubmitted out
>>> of seqno order. Traverse the list of active requests in reverse and
>>> append to the head of the priority list to fix this.
>>>
>>> Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Cc: <stable@vger.kernel.org>
>>> ---
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 ++++----
>>>    1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 32c414aa9009..9ca0ba4ea85a 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -805,9 +805,9 @@ __unwind_incomplete_requests(struct intel_context *ce)
>>>    	spin_lock_irqsave(&sched_engine->lock, flags);
>>>    	spin_lock(&ce->guc_active.lock);
>>> -	list_for_each_entry_safe(rq, rn,
>>> -				 &ce->guc_active.requests,
>>> -				 sched.link) {
>>> +	list_for_each_entry_safe_reverse(rq, rn,
>>> +					 &ce->guc_active.requests,
>>> +					 sched.link) {
>>>    		if (i915_request_completed(rq))
>> The execlists unwind function has a list_del if the request is completed.
>> Any reason not to do that here?
>>
> Definitely isn't needed here as this is done in remove_from_context();
> it's probably not needed in execlists mode either.
>
>
>>>    			continue;
>>> @@ -824,7 +824,7 @@ __unwind_incomplete_requests(struct intel_context *ce)
>>>    		}
>>>    		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
>>> -		list_add_tail(&rq->sched.link, pl);
>>> +		list_add(&rq->sched.link, pl);
>> Since you always do both list_del and list_add, and it doesn't look like you
>> use the fact that the list is empty between the 2 calls, you can merge them
>> into a list_move.
>>
> Can't use a list_move here because we drop
> spin_lock(&ce->guc_active.lock); that gets fixed later in the series, and
> at that point we likely can use a list_move.

Fair enough. I'll leave it to you to decide if it is worth moving this
patch after the next one and using a list_move. With or without that,
this is:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>
> Matt
>
>> Apart from these nits, the change to navigate the list in reverse and append
>> here at the top LGTM.
>>
>> Daniele
>>
>>>    		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>>>    		spin_lock(&ce->guc_active.lock);


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 05/27] drm/i915/guc: Process all G2H message at once in work queue
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 05/27] drm/i915/guc: Process all G2H message at once in work queue Matthew Brost
@ 2021-08-20  0:06   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-20  0:06 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Rather than processing 1 G2H at a time and re-queuing the work queue if
> more messages exist, process all the G2H in a single pass of the work
> queue.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index 22b4733b55e2..20c710a74498 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -1042,9 +1042,9 @@ static void ct_incoming_request_worker_func(struct work_struct *w)
>   		container_of(w, struct intel_guc_ct, requests.worker);
>   	bool done;
>   
> -	done = ct_process_incoming_requests(ct);
> -	if (!done)
> -		queue_work(system_unbound_wq, &ct->requests.worker);
> +	do {
> +		done = ct_process_incoming_requests(ct);
> +	} while (!done);
>   }
>   
>   static int ct_handle_event(struct intel_guc_ct *ct, struct ct_incoming_msg *request)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 09/27] drm/i915/guc: Kick tasklet after queuing a request
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 09/27] drm/i915/guc: Kick tasklet after queuing a request Matthew Brost
@ 2021-08-20 18:31   ` Daniele Ceraolo Spurio
  2021-08-20 18:36     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-20 18:31 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Kick tasklet after queuing a request so it is submitted in a timely manner.
>
> Fixes: 3a4cdf1982f0 ("drm/i915/guc: Implement GuC context operations for new inteface")

Is this actually a bug or just a performance issue? In the latter case I
don't think we need a fixes tag.

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 8f7a11e65ef5..d61f906105ef 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1050,6 +1050,7 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
>   	list_add_tail(&rq->sched.link,
>   		      i915_sched_lookup_priolist(sched_engine, prio));
>   	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> +	tasklet_hi_schedule(&sched_engine->tasklet);

The caller of queue_request() already has a tasklet_hi_schedule in
another branch of the if/else statement. Maybe we can have the caller
own the kick to keep it in one place? Not a blocker.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>   }
>   
>   static int guc_bypass_tasklet_submit(struct intel_guc *guc,


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 09/27] drm/i915/guc: Kick tasklet after queuing a request
  2021-08-20 18:31   ` Daniele Ceraolo Spurio
@ 2021-08-20 18:36     ` Matthew Brost
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-20 18:36 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Fri, Aug 20, 2021 at 11:31:56AM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > Kick tasklet after queuing a request so it is submitted in a timely manner.
> > 
> > Fixes: 3a4cdf1982f0 ("drm/i915/guc: Implement GuC context operations for new inteface")
> 
> Is this actually a bug or just a performance issue? In the latter case I
> don't think we need a fixes tag.
> 

Basically the tasklet won't get queued in certain situations until the
heartbeat ping. Didn't notice it as the tasklet is only used during flow
control or after a full GT reset, which are both rather rare. We can
probably drop the fixes tag as GuC submission isn't on by default and
still works without this fix.

> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 8f7a11e65ef5..d61f906105ef 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1050,6 +1050,7 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
> >   	list_add_tail(&rq->sched.link,
> >   		      i915_sched_lookup_priolist(sched_engine, prio));
> >   	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> > +	tasklet_hi_schedule(&sched_engine->tasklet);
> 
> The caller of queue_request() already has a tasklet_hi_schedule in another
> branch of the if/else statement. Maybe we can have the caller own the kick
> to keep it in one place? Not a blocker.
>

I guess it could be:

bool kick = need_tasklet();

if (kick)
	queue_request();
else
	kick = bypass();
if (kick)
	kick_tasklet();

Idk if that is better. I'll think on this and decide before the next
post.
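Fleshed out against the current code it might look something like this
(purely a sketch; need_tasklet() is the hypothetical predicate above and
the exact signatures are assumed):

static void guc_submit_request(struct i915_request *rq)
{
	struct intel_guc *guc = &rq->engine->gt->uc.guc;
	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
	unsigned long flags;
	bool kick;

	spin_lock_irqsave(&sched_engine->lock, flags);

	kick = need_tasklet(guc, rq);
	if (kick)
		queue_request(sched_engine, rq, rq_prio(rq));
	else
		/* bypass failed with -EBUSY: fall back to the tasklet */
		kick = guc_bypass_tasklet_submit(guc, rq) == -EBUSY;

	if (kick)
		tasklet_hi_schedule(&sched_engine->tasklet);

	spin_unlock_irqrestore(&sched_engine->lock, flags);
}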

Matt

> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> 
> Daniele
> 
> >   }
> >   static int guc_bypass_tasklet_submit(struct intel_guc *guc,
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 10/27] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
  2021-08-20 18:42   ` Daniele Ceraolo Spurio
@ 2021-08-20 18:42     ` Matthew Brost
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-20 18:42 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Fri, Aug 20, 2021 at 11:42:38AM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > When unblocking a context, do not enable scheduling if the context is
> > banned, guc_id invalid, or not registered.
> > 
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index d61f906105ef..e53a4ef7d442 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1586,6 +1586,9 @@ static void guc_context_unblock(struct intel_context *ce)
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >   	if (unlikely(submission_disabled(guc) ||
> > +		     intel_context_is_banned(ce) ||
> > +		     context_guc_id_invalid(ce) ||
> > +		     !lrc_desc_registered(guc, ce->guc_id) ||
> >   		     !intel_context_is_pinned(ce) ||
> >   		     context_pending_disable(ce) ||
> >   		     context_blocked(ce) > 1)) {
> 
> This is getting to be a lot of conditions. Maybe we can simplify it a bit? E.g.

Yea, this is some defensive programming to cover all the bases if another
async operation (ban, reset, unpin) happens while this op is in flight.
Probably some of the conditions are not needed, but I'm being extra safe
here.

> it should be possible to check context_blocked, context_banned and
> context_pending_disable as a single op:
> 
> /* 2 or more blocks */
> #define SCHED_STATE_MULTI_BLOCKED_MASK \
>     (SCHED_STATE_BLOCKED_MASK & ~SCHED_STATE_BLOCKED)
> #define SCHED_STATE_NO_UNBLOCK \
>     (SCHED_STATE_MULTI_BLOCKED_MASK | \
>      SCHED_STATE_PENDING_DISABLE | \
>      SCHED_STATE_BANNED)

Good idea, let me move this to a helper in the next spin.
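Rough sketch of what that helper could look like (name illustrative only,
and assuming the banned/blocked/pending-disable bits all end up in
ce->guc_state.sched_state as you suggest):

static bool context_cant_unblock(struct intel_context *ce)
{
	lockdep_assert_held(&ce->guc_state.lock);

	return (ce->guc_state.sched_state & SCHED_STATE_NO_UNBLOCK) ||
		context_guc_id_invalid(ce) ||
		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id) ||
		!intel_context_is_pinned(ce);
}

which would collapse the check in guc_context_unblock() to:

	if (unlikely(submission_disabled(guc) || context_cant_unblock(ce))) {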

Matt

> 
> Not a blocker.
> 
> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> 
> Daniele
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 10/27] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 10/27] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered Matthew Brost
@ 2021-08-20 18:42   ` Daniele Ceraolo Spurio
  2021-08-20 18:42     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-20 18:42 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> When unblocking a context, do not enable scheduling if the context is
> banned, guc_id invalid, or not registered.
>
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index d61f906105ef..e53a4ef7d442 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1586,6 +1586,9 @@ static void guc_context_unblock(struct intel_context *ce)
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   
>   	if (unlikely(submission_disabled(guc) ||
> +		     intel_context_is_banned(ce) ||
> +		     context_guc_id_invalid(ce) ||
> +		     !lrc_desc_registered(guc, ce->guc_id) ||
>   		     !intel_context_is_pinned(ce) ||
>   		     context_pending_disable(ce) ||
>   		     context_blocked(ce) > 1)) {

This is getting to be a lot of conditions. Maybe we can simplify it a bit?
E.g. it should be possible to check context_blocked, context_banned and
context_pending_disable as a single op:

/* 2 or more blocks */
#define SCHED_STATE_MULTI_BLOCKED_MASK \
     (SCHED_STATE_BLOCKED_MASK & ~SCHED_STATE_BLOCKED)
#define SCHED_STATE_NO_UNBLOCK \
     (SCHED_STATE_MULTI_BLOCKED_MASK | \
      SCHED_STATE_PENDING_DISABLE | \
      SCHED_STATE_BANNED)

Not a blocker.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 07/27] Revert "drm/i915/gt: Propagate change in error status to children on unhold"
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 07/27] Revert "drm/i915/gt: Propagate change in error status to children on unhold" Matthew Brost
@ 2021-08-20 19:47   ` Jason Ekstrand
  0 siblings, 0 replies; 76+ messages in thread
From: Jason Ekstrand @ 2021-08-20 19:47 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Intel GFX, Maling list - DRI developers, Daniel Vetter

On Thu, Aug 19, 2021 at 1:22 AM Matthew Brost <matthew.brost@intel.com> wrote:
>
> Propagating errors to dependent fences is wrong, don't do it. A selftest
> later in this series exposed the propagation of an error to a dependent
> fence after an engine reset.

I feel like we could still have a bit of a better message.  Maybe
something like this:

Propagating errors to dependent fences is broken and can lead to
errors from one client ending up in another.  In 3761baae908a (Revert
"drm/i915: Propagate errors on awaiting already signaled fences"), we
attempted to get rid of fence error propagation but missed the case
added in 8e9f84cf5cac ("drm/i915/gt: Propagate change in error status
to children on unhold").  Revert that one too.  This error was found
by an up-and-coming selftest which <salient information here>.

Otherwise, looks good to me.

--Jason

>
> This reverts commit 8e9f84cf5cac248a1c6a5daa4942879c8b765058.
>
> v2:
>  (Daniel Vetter)
>   - Use revert
>
> References: 3761baae908a (Revert "drm/i915: Propagate errors on awaiting already signaled fences")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index de5f9c86b9a4..cafb0608ffb4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -2140,10 +2140,6 @@ static void __execlists_unhold(struct i915_request *rq)
>                         if (p->flags & I915_DEPENDENCY_WEAK)
>                                 continue;
>
> -                       /* Propagate any change in error status */
> -                       if (rq->fence.error)
> -                               i915_request_set_error_once(w, rq->fence.error);
> -
>                         if (w->engine != rq->engine)
>                                 continue;
>
> --
> 2.32.0
>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 13/27] drm/i915/guc: Take context ref when cancelling request
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 13/27] drm/i915/guc: Take context ref when cancelling request Matthew Brost
@ 2021-08-21  0:07   ` Daniele Ceraolo Spurio
  2021-08-24 15:42     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-21  0:07 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> A context can get destroyed after cancelling a request, so take a
> reference to the context when cancelling a request.

What's the exact race? AFAICS __i915_request_skip does not have a 
context_put().

Daniele

>
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index e0e85e4ad512..85f96d325048 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1620,8 +1620,10 @@ static void guc_context_cancel_request(struct intel_context *ce,
>   				       struct i915_request *rq)
>   {
>   	if (i915_sw_fence_signaled(&rq->submit)) {
> -		struct i915_sw_fence *fence = guc_context_block(ce);
> +		struct i915_sw_fence *fence;
>   
> +		intel_context_get(ce);
> +		fence = guc_context_block(ce);
>   		i915_sw_fence_wait(fence);
>   		if (!i915_request_completed(rq)) {
>   			__i915_request_skip(rq);
> @@ -1636,6 +1638,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
>   		flush_work(&ce_to_guc(ce)->ct.requests.worker);
>   
>   		guc_context_unblock(ce);
> +		intel_context_put(ce);
>   	}
>   }
>   


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 15/27] drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 15/27] drm/i915/guc: Reset LRC descriptor if register returns -ENODEV Matthew Brost
@ 2021-08-21  0:14   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-21  0:14 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Reset the LRC descriptor if registering a context returns -ENODEV, as
> this means we are mid-reset.
>
> Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index fa87470ea576..4cf5a565f08e 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1407,10 +1407,12 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	} else {
>   		with_intel_runtime_pm(runtime_pm, wakeref)
>   			ret = register_context(ce, loop);
> -		if (unlikely(ret == -EBUSY))
> +		if (unlikely(ret == -EBUSY)) {
> +			reset_lrc_desc(guc, desc_idx);
> +		} else if (unlikely(ret == -ENODEV)) {
>   			reset_lrc_desc(guc, desc_idx);
> -		else if (unlikely(ret == -ENODEV))
>   			ret = 0;	/* Will get registered later */
> +		}
>   	}
>   
>   	return ret;


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 17/27] drm/i915/guc: Flush G2H work queue during reset
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 17/27] drm/i915/guc: Flush G2H work queue during reset Matthew Brost
@ 2021-08-21  0:25   ` Daniele Ceraolo Spurio
  2021-08-24 15:44     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-21  0:25 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> It isn't safe to scrub for missing G2H or continue with the reset until
> all G2H processing is complete. Flush the G2H work queue during reset to
> ensure it is done running.

Might be worth moving this patch closer to "drm/i915/guc: Process all 
G2H message at once in work queue".

> Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 ++----------------
>   1 file changed, 2 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 4cf5a565f08e..9a53bae367b1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -714,8 +714,6 @@ static void guc_flush_submissions(struct intel_guc *guc)
>   
>   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   {
> -	int i;
> -
>   	if (unlikely(!guc_submission_initialized(guc))) {
>   		/* Reset called during driver load? GuC not yet initialised! */
>   		return;
> @@ -731,20 +729,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   
>   	guc_flush_submissions(guc);
>   
> -	/*
> -	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> -	 * each pass as interrupt have been disabled. We always scrub for
> -	 * outstanding G2H as it is possible for outstanding_submission_g2h to
> -	 * be incremented after the context state update.
> -	 */
> -	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
> -		intel_guc_to_host_event_handler(guc);
> -#define wait_for_reset(guc, wait_var) \
> -		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
> -		do {
> -			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> -		} while (!list_empty(&guc->ct.requests.incoming));
> -	}
> +	flush_work(&guc->ct.requests.worker);
> +

We're now not waiting on the requests anymore, just ensuring that the
processing of the ones we already received is done. Is this intended? We
do still handle the remaining outstanding submissions in the scrub so it's
functionally correct, but the commit message doesn't state the change in
waiting behavior, so I wanted to double check it was planned.
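To make the difference concrete, a sketch of the two behaviours (the old
loop condensed from the hunk removed above, minus its retry bound and
timeout):

	/* before: actively pump the handler until outstanding G2H drain */
	while (atomic_read(&guc->outstanding_submission_g2h))
		intel_guc_to_host_event_handler(guc);

	/* after: only wait for already-received G2H to finish processing */
	flush_work(&guc->ct.requests.worker);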

Daniele

>   	scrub_guc_desc_for_outstanding_g2h(guc);
>   }
>   


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 19/27] drm/i915/guc: Move guc_blocked fence to struct guc_state
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 19/27] drm/i915/guc: Move guc_blocked fence to struct guc_state Matthew Brost
@ 2021-08-21  0:30   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-21  0:30 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Move guc_blocked fence to struct guc_state as the lock which protects
> the fence lives there.
>
> s/ce->guc_blocked/ce->guc_state.blocked_fence/g

Could also call it just ce->guc_state.blocked; blocked_fence sounds to
me like the fence itself is blocked.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c        |  5 +++--
>   drivers/gpu/drm/i915/gt/intel_context_types.h  |  5 ++---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +++++++++---------
>   3 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 745e84c72c90..0e48939ec85f 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -405,8 +405,9 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   	 * Initialize fence to be complete as this is expected to be complete
>   	 * unless there is a pending schedule disable outstanding.
>   	 */
> -	i915_sw_fence_init(&ce->guc_blocked, sw_fence_dummy_notify);
> -	i915_sw_fence_commit(&ce->guc_blocked);
> +	i915_sw_fence_init(&ce->guc_state.blocked_fence,
> +			   sw_fence_dummy_notify);
> +	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
>   
>   	i915_active_init(&ce->active,
>   			 __intel_context_active, __intel_context_retire, 0);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 3a73f3117873..c06171ee8792 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -167,6 +167,8 @@ struct intel_context {
>   		 * fence related to GuC submission
>   		 */
>   		struct list_head fences;
> +		/* GuC context blocked fence */
> +		struct i915_sw_fence blocked_fence;
>   	} guc_state;
>   
>   	struct {
> @@ -190,9 +192,6 @@ struct intel_context {
>   	 */
>   	struct list_head guc_id_link;
>   
> -	/* GuC context blocked fence */
> -	struct i915_sw_fence guc_blocked;
> -
>   	/*
>   	 * GuC priority management
>   	 */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index deb2e821e441..053f4485d6e9 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1490,24 +1490,24 @@ static void guc_blocked_fence_complete(struct intel_context *ce)
>   {
>   	lockdep_assert_held(&ce->guc_state.lock);
>   
> -	if (!i915_sw_fence_done(&ce->guc_blocked))
> -		i915_sw_fence_complete(&ce->guc_blocked);
> +	if (!i915_sw_fence_done(&ce->guc_state.blocked_fence))
> +		i915_sw_fence_complete(&ce->guc_state.blocked_fence);
>   }
>   
>   static void guc_blocked_fence_reinit(struct intel_context *ce)
>   {
>   	lockdep_assert_held(&ce->guc_state.lock);
> -	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_blocked));
> +	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_state.blocked_fence));
>   
>   	/*
>   	 * This fence is always complete unless a pending schedule disable is
>   	 * outstanding. We arm the fence here and complete it when we receive
>   	 * the pending schedule disable complete message.
>   	 */
> -	i915_sw_fence_fini(&ce->guc_blocked);
> -	i915_sw_fence_reinit(&ce->guc_blocked);
> -	i915_sw_fence_await(&ce->guc_blocked);
> -	i915_sw_fence_commit(&ce->guc_blocked);
> +	i915_sw_fence_fini(&ce->guc_state.blocked_fence);
> +	i915_sw_fence_reinit(&ce->guc_state.blocked_fence);
> +	i915_sw_fence_await(&ce->guc_state.blocked_fence);
> +	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
>   }
>   
>   static u16 prep_context_pending_disable(struct intel_context *ce)
> @@ -1547,7 +1547,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>   		if (enabled)
>   			clr_context_enabled(ce);
>   		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> -		return &ce->guc_blocked;
> +		return &ce->guc_state.blocked_fence;
>   	}
>   
>   	/*
> @@ -1563,7 +1563,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>   	with_intel_runtime_pm(runtime_pm, wakeref)
>   		__guc_context_sched_disable(guc, ce, guc_id);
>   
> -	return &ce->guc_blocked;
> +	return &ce->guc_state.blocked_fence;
>   }
>   
>   static void guc_context_unblock(struct intel_context *ce)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 13/27] drm/i915/guc: Take context ref when cancelling request
  2021-08-21  0:07   ` Daniele Ceraolo Spurio
@ 2021-08-24 15:42     ` Matthew Brost
  2021-08-25  1:21       ` Daniele Ceraolo Spurio
  0 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-24 15:42 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Fri, Aug 20, 2021 at 05:07:27PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > A context can get destroyed after cancelling a request, so take a
> > reference to the context when cancelling a request.
> 
> What's the exact race? AFAICS __i915_request_skip does not have a
> context_put().

This commit message isn't quite right; it is really a context reset or a
GT reset which could result in the context getting destroyed. I haven't
actually seen this happen, but this is just being paranoid about ref
counting. Can fix up the commit message.

Matt

> 
> Daniele
> 
> > 
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index e0e85e4ad512..85f96d325048 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1620,8 +1620,10 @@ static void guc_context_cancel_request(struct intel_context *ce,
> >   				       struct i915_request *rq)
> >   {
> >   	if (i915_sw_fence_signaled(&rq->submit)) {
> > -		struct i915_sw_fence *fence = guc_context_block(ce);
> > +		struct i915_sw_fence *fence;
> > +		intel_context_get(ce);
> > +		fence = guc_context_block(ce);
> >   		i915_sw_fence_wait(fence);
> >   		if (!i915_request_completed(rq)) {
> >   			__i915_request_skip(rq);
> > @@ -1636,6 +1638,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
> >   		flush_work(&ce_to_guc(ce)->ct.requests.worker);
> >   		guc_context_unblock(ce);
> > +		intel_context_put(ce);
> >   	}
> >   }
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 17/27] drm/i915/guc: Flush G2H work queue during reset
  2021-08-21  0:25   ` Daniele Ceraolo Spurio
@ 2021-08-24 15:44     ` Matthew Brost
  2021-08-25  1:22       ` Daniele Ceraolo Spurio
  0 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-24 15:44 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Fri, Aug 20, 2021 at 05:25:41PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > It isn't safe to scrub for missing G2H or continue with the reset until
> > all G2H processing is complete. Flush the G2H work queue during reset to
> > ensure it is done running.
> 
> Might be worth moving this patch closer to "drm/i915/guc: Process all G2H
> message at once in work queue".
> 

Sure.

> > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 ++----------------
> >   1 file changed, 2 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 4cf5a565f08e..9a53bae367b1 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -714,8 +714,6 @@ static void guc_flush_submissions(struct intel_guc *guc)
> >   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   {
> > -	int i;
> > -
> >   	if (unlikely(!guc_submission_initialized(guc))) {
> >   		/* Reset called during driver load? GuC not yet initialised! */
> >   		return;
> > @@ -731,20 +729,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   	guc_flush_submissions(guc);
> > -	/*
> > -	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> > -	 * each pass as interrupt have been disabled. We always scrub for
> > -	 * outstanding G2H as it is possible for outstanding_submission_g2h to
> > -	 * be incremented after the context state update.
> > -	 */
> > -	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
> > -		intel_guc_to_host_event_handler(guc);
> > -#define wait_for_reset(guc, wait_var) \
> > -		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
> > -		do {
> > -			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > -		} while (!list_empty(&guc->ct.requests.incoming));
> > -	}
> > +	flush_work(&guc->ct.requests.worker);
> > +
> 
> We're now not waiting on the requests anymore, just ensuring that the
> processing of the ones we already received is done. Is this intended? We do
> still handle the remaining outstanding submissions in the scrub so it's
> functionally correct, but the commit message doesn't state the change in
> waiting behavior, so I wanted to double check it was planned.
> 

Yes, it is planned, as the scrub code should be able to cope with any
missing G2H. Will update the commit message to reflect that.

Matt

> Daniele
> 
> >   	scrub_guc_desc_for_outstanding_g2h(guc);
> >   }
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 01/27] drm/i915/guc: Fix blocked context accounting
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 01/27] drm/i915/guc: Fix blocked context accounting Matthew Brost
@ 2021-08-24 23:24   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-24 23:24 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Prior to this patch the blocked context counter was cleared on
> init_sched_state (used during registering a context & resets), which is
> incorrect. This state needs to be persistent, or the counter can read an
> incorrect value, resulting in scheduling never getting enabled again.

Can you elaborate a bit more here on the conditions in which we hit this 
issue?
After a GT reset the GuC state is cleared so we need to re-enable 
everything no matter what the old enable status was, so I don't think we 
can hit the described error there, unless your aim is to keep the 
context blocked across the reset (in which case the commit message needs 
rewording). On the registration side, if a context is not registered, it 
will be enabled on the submission that is causing the registration, so 
again we should be covered.

Daniele

> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: <stable@vger.kernel.org>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 87d8dc8f51b9..69faa39da178 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
>   {
>   	/* Only should be called from guc_lrc_desc_pin() */
>   	atomic_set(&ce->guc_sched_state_no_lock, 0);
> -	ce->guc_state.sched_state = 0;
> +	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
>   }
>   
>   static inline bool


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 06/27] drm/i915/guc: Workaround reset G2H is received after schedule done G2H
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 06/27] drm/i915/guc: Workaround reset G2H is received after schedule done G2H Matthew Brost
@ 2021-08-24 23:31   ` Daniele Ceraolo Spurio
  2021-08-25  4:05     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-24 23:31 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> If the context is reset as a result of the request cancellation, the
> context reset G2H is received after the schedule disable done G2H, which
> is likely the wrong order. The schedule disable done G2H releases the
> waiting request cancellation code, which resubmits the context. This
> races with the context reset G2H, which also wants to resubmit the
> context, but in this case it really should be a NOP as the request
> cancellation code owns the resubmit. Use some clever tricks of checking
> the context state to seal this race until the GuC firmware is fixed,
> if / when that happens.

Did you raise this with the GuC team? If it's a GuC issue we definitely 
want a fix there ASAP so we can drop any i915-side WAs.

>
> v2:
>   (Checkpatch)
>    - Fix typos
>
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++++++---
>   1 file changed, 37 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index e4a099f8f820..8f7a11e65ef5 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -832,17 +832,35 @@ __unwind_incomplete_requests(struct intel_context *ce)
>   static void __guc_reset_context(struct intel_context *ce, bool stalled)
>   {
>   	struct i915_request *rq;
> +	unsigned long flags;
>   	u32 head;
> +	bool skip = false;
>   
>   	intel_context_get(ce);
>   
>   	/*
> -	 * GuC will implicitly mark the context as non-schedulable
> -	 * when it sends the reset notification. Make sure our state
> -	 * reflects this change. The context will be marked enabled
> -	 * on resubmission.
> +	 * GuC will implicitly mark the context as non-schedulable when it sends
> +	 * the reset notification. Make sure our state reflects this change. The
> +	 * context will be marked enabled on resubmission.
> +	 *
> +	 * XXX: If the context is reset as a result of the request cancellation
> +	 * this G2H is received after the schedule disable complete G2H which is
> +	 * likely wrong as this creates a race between the request cancellation
> +	 * code re-submitting the context and this G2H handler. This likely
> +	 * should be fixed in the GuC but until if / when that gets fixed we
> +	 * need to workaround this. Convert this function to a NOP if a pending
> +	 * enable is in flight as this indicates that a request cancellation has
> +	 * occurred.
>   	 */

IMO this comment sounds like we're not clear on expected behavior. 
Either the ordering is wrong, in which case we have a GuC bug and this 
is a temporary WA, or the ordering is allowed and we need to cope with 
it. The way the comment is written sounds like we're not sure.

Code changes look ok.

Daniele

> -	clr_context_enabled(ce);
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	if (likely(!context_pending_enable(ce))) {
> +		clr_context_enabled(ce);
> +	} else {
> +		skip = true;
> +	}
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	if (unlikely(skip))
> +		goto out_put;
>   
>   	rq = intel_context_find_active_request(ce);
>   	if (!rq) {
> @@ -861,6 +879,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
>   out_replay:
>   	guc_reset_state(ce, head, stalled);
>   	__unwind_incomplete_requests(ce);
> +out_put:
>   	intel_context_put(ce);
>   }
>   
> @@ -1605,6 +1624,13 @@ static void guc_context_cancel_request(struct intel_context *ce,
>   			guc_reset_state(ce, intel_ring_wrap(ce->ring, rq->head),
>   					true);
>   		}
> +
> +		/*
> +		 * XXX: Racey if context is reset, see comment in
> +		 * __guc_reset_context().
> +		 */
> +		flush_work(&ce_to_guc(ce)->ct.requests.worker);
> +
>   		guc_context_unblock(ce);
>   	}
>   }
> @@ -2719,7 +2745,12 @@ static void guc_handle_context_reset(struct intel_guc *guc,
>   {
>   	trace_intel_context_reset(ce);
>   
> -	if (likely(!intel_context_is_banned(ce))) {
> +	/*
> +	 * XXX: Racey if request cancellation has occurred, see comment in
> +	 * __guc_reset_context().
> +	 */
> +	if (likely(!intel_context_is_banned(ce) &&
> +		   !context_blocked(ce))) {
>   		capture_error_state(guc, ce);
>   		guc_context_replay(ce);
>   	}


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 11/27] drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 11/27] drm/i915/selftests: Fix memory corruption in live_lrc_isolation Matthew Brost
@ 2021-08-25  0:07   ` Daniele Ceraolo Spurio
  2021-08-25 20:03     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25  0:07 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> GuC submission has exposed an existing memory corruption in
> live_lrc_isolation. We believe that some writes to the watchdog offsets
> in the LRC (0x178 & 0x17c) can result in trashing of portions of the
> address space. With GuC submission there are additional objects which
> can move the context redzone into the space that is trashed. To
> work around this, avoid poisoning the watchdog.

This is kind of a worrying explanation, as it implies an HW issue.
AFAICS we no longer increase the context size with GuC submission, so
the redzone should be in the same place relative to the base address of
the context; although it is true that we have more objects in memory due
to supporting the GuC, hitting the redzone consistently feels too much
like a coincidence. When we write the watchdog regs there is a risk we're
triggering a watchdog interrupt, which will cause the GuC to handle
that; on a media reset, the GuC overwrites the context with the golden
context in the ADS. Are we sure that's not what is causing this problem?
Looking at the ADS, we set the context memcpy size to:

real_size = intel_engine_context_size(gt, engine_class);

but then we only initialize real_size - SKIP_SIZE(gt->i915), which IMO 
could be the real cause of the bug as the GuC memcpy starts at SKIP_SIZE().
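i.e. the suspicion, sketched out (sizes as in the snippet above; the copy
geometry is from memory and unverified):

	/* what the ADS tells the GuC to copy on a media reset */
	copy_size = intel_engine_context_size(gt, engine_class);

	/* what we actually initialise, starting at SKIP_SIZE() */
	init_size = copy_size - SKIP_SIZE(gt->i915);

	/*
	 * If the GuC memcpy starts at SKIP_SIZE() but is told to copy
	 * copy_size bytes, it runs SKIP_SIZE() bytes past the initialised
	 * golden context, which is exactly where a redzone could sit.
	 */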

Daniele

>
> v2:
>   (Daniel Vetter)
>    - Add VLK ref in code to workaround
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
>   1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> index b0977a3b699b..cdc6ae48a1e1 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> @@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
>   	goto err_after;
>   }
>   
> +static u32 safe_offset(u32 offset, u32 reg)
> +{
> +	/* XXX skip testing of watchdog - VLK-22772 */
> +	if (offset == 0x178 || offset == 0x17c)
> +		reg = 0;
> +
> +	return reg;
> +}
> +
> +static int get_offset_mask(struct intel_engine_cs *engine)
> +{
> +	if (GRAPHICS_VER(engine->i915) < 12)
> +		return 0xfff;
> +
> +	switch (engine->class) {
> +	default:
> +	case RENDER_CLASS:
> +		return 0x07ff;
> +	case COPY_ENGINE_CLASS:
> +		return 0x0fff;
> +	case VIDEO_DECODE_CLASS:
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		return 0x3fff;
> +	}
> +}
> +
>   static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
>   {
>   	struct i915_vma *batch;
> @@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
>   		len = (len + 1) / 2;
>   		*cs++ = MI_LOAD_REGISTER_IMM(len);
>   		while (len--) {
> -			*cs++ = hw[dw];
> +			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
> +					    hw[dw]);
>   			*cs++ = poison;
>   			dw += 2;
>   		}



* Re: [Intel-gfx] [PATCH 12/27] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 12/27] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H Matthew Brost
@ 2021-08-25  0:58   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25  0:58 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> While debugging an issue with full GT resets I went down a rabbit hole
> thinking the scrubbing of lost G2H wasn't working correctly. This proved
> to be incorrect, as the scrubbing was working just fine, but the chase
> inspired me to write a selftest to prove that it works. This simple
> selftest injects errors by dropping various G2H and then issues a full
> GT reset, proving that the scrubbing of these G2H doesn't blow up.
>
> v2:
>   (Daniel Vetter)
>    - Use ifdef instead of macros for selftests
> v3:
>   (Checkpatch)
>    - A space after 'switch' statement
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  18 +++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  25 ++++
>   drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++++++++++++++++
>   .../drm/i915/selftests/i915_live_selftests.h  |   1 +
>   .../i915/selftests/intel_scheduler_helpers.c  |  12 ++
>   .../i915/selftests/intel_scheduler_helpers.h  |   2 +
>   6 files changed, 184 insertions(+)
>   create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index e54351a170e2..3a73f3117873 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -198,6 +198,24 @@ struct intel_context {
>   	 */
>   	u8 guc_prio;
>   	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> +
> +#ifdef CONFIG_DRM_I915_SELFTEST
> +	/**
> +	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> +	 */
> +	bool drop_schedule_enable;
> +
> +	/**
> +	 * @drop_schedule_disable: Force drop of schedule disable G2H for
> +	 * selftest
> +	 */
> +	bool drop_schedule_disable;
> +
> +	/**
> +	 * @drop_deregister: Force drop of deregister G2H for selftest
> +	 */
> +	bool drop_deregister;
> +#endif
>   };
>   
>   #endif /* __INTEL_CONTEXT_TYPES__ */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index e53a4ef7d442..e0e85e4ad512 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2635,6 +2635,13 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
>   
>   	trace_intel_context_deregister_done(ce);
>   
> +#ifdef CONFIG_DRM_I915_SELFTEST
> +	if (unlikely(ce->drop_deregister)) {
> +		ce->drop_deregister = false;
> +		return 0;
> +	}
> +#endif
> +
>   	if (context_wait_for_deregister_to_register(ce)) {
>   		struct intel_runtime_pm *runtime_pm =
>   			&ce->engine->gt->i915->runtime_pm;
> @@ -2689,10 +2696,24 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
>   	trace_intel_context_sched_done(ce);
>   
>   	if (context_pending_enable(ce)) {
> +#ifdef CONFIG_DRM_I915_SELFTEST
> +		if (unlikely(ce->drop_schedule_enable)) {
> +			ce->drop_schedule_enable = false;
> +			return 0;
> +		}
> +#endif
> +
>   		clr_context_pending_enable(ce);
>   	} else if (context_pending_disable(ce)) {
>   		bool banned;
>   
> +#ifdef CONFIG_DRM_I915_SELFTEST
> +		if (unlikely(ce->drop_schedule_disable)) {
> +			ce->drop_schedule_disable = false;
> +			return 0;
> +		}
> +#endif
> +
>   		/*
>   		 * Unpin must be done before __guc_signal_context_fence,
>   		 * otherwise a race exists between the requests getting
> @@ -3069,3 +3090,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
>   
>   	return false;
>   }
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +#include "selftest_guc.c"
> +#endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> new file mode 100644
> index 000000000000..264e2f705c17
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> @@ -0,0 +1,126 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021 Intel Corporation
> + */
> +
> +#include "selftests/intel_scheduler_helpers.h"
> +
> +static struct i915_request *nop_user_request(struct intel_context *ce,
> +					     struct i915_request *from)
> +{
> +	struct i915_request *rq;
> +	int ret;
> +
> +	rq = intel_context_create_request(ce);
> +	if (IS_ERR(rq))
> +		return rq;
> +
> +	if (from) {
> +		ret = i915_sw_fence_await_dma_fence(&rq->submit,
> +						    &from->fence, 0,
> +						    I915_FENCE_GFP);
> +		if (ret < 0) {
> +			i915_request_put(rq);
> +			return ERR_PTR(ret);
> +		}
> +	}
> +
> +	i915_request_get(rq);
> +	i915_request_add(rq);
> +
> +	return rq;
> +}
> +
> +static int intel_guc_scrub_ctbs(void *arg)
> +{
> +	struct intel_gt *gt = arg;
> +	int ret = 0;
> +	int i;
> +	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
> +	intel_wakeref_t wakeref;
> +	struct intel_engine_cs *engine;
> +	struct intel_context *ce;
> +
> +	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
> +	engine = intel_selftest_find_any_engine(gt);
> +
> +	/* Submit requests and inject errors forcing G2H to be dropped */
> +	for (i = 0; i < 3; ++i) {
> +		ce = intel_context_create(engine);
> +		if (IS_ERR(ce)) {
> +			ret = PTR_ERR(ce);
> +			pr_err("Failed to create context, %d: %d\n", i, ret);
> +			goto err;
> +		}
> +
> +		switch (i) {
> +		case 0:
> +			ce->drop_schedule_enable = true;
> +			break;
> +		case 1:
> +			ce->drop_schedule_disable = true;
> +			break;
> +		case 2:
> +			ce->drop_deregister = true;
> +			break;
> +		}

Would it be worth making the drop flags a bitmask?

#ifdef CONFIG_DRM_I915_SELFTEST
	/**
	 * @drop_g2h: Force drop of selected G2H for selftest
	 */
	u32 drop_g2h;
#define I915_SELFTEST_DROP_GUC_SCHED_ENABLE BIT(0)
....
#endif


So in the test instead of a switch you can use:

     ce->drop_g2h = BIT(i);


Not a blocker.
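
For the handler side, the check would then collapse to something like
this (sketch only; the bit name is just a suggestion):

#ifdef CONFIG_DRM_I915_SELFTEST
	if (unlikely(ce->drop_g2h & I915_SELFTEST_DROP_GUC_DEREGISTER)) {
		/* one-shot: clear the bit so only the first G2H is dropped */
		ce->drop_g2h &= ~I915_SELFTEST_DROP_GUC_DEREGISTER;
		return 0;
	}
#endif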

> +
> +		rq = nop_user_request(ce, NULL);
> +		intel_context_put(ce);
> +
> +		if (IS_ERR(rq)) {
> +			ret = PTR_ERR(rq);
> +			pr_err("Failed to create request, %d: %d\n", i, ret);
> +			goto err;
> +		}
> +
> +		last[i] = rq;
> +	}
> +
> +	for (i = 0; i < 3; ++i) {
> +		ret = i915_request_wait(last[i], 0, HZ);
> +		if (ret < 0) {
> +			pr_err("Last request failed to complete: %d\n", ret);
> +			goto err;
> +		}
> +		i915_request_put(last[i]);
> +		last[i] = NULL;
> +	}
> +
> +	/* Force all H2G / G2H to be submitted / processed */
> +	intel_gt_retire_requests(gt);
> +	msleep(500);
> +
> +	/* Scrub missing G2H */
> +	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
> +
> +	ret = intel_gt_wait_for_idle(gt, HZ);

I think we could use a small comment here explaining that the GT won't
go idle if the scrubbing was not done correctly.
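Something along these lines, say (wording just a suggestion):

	/*
	 * If the missing G2H were not scrubbed correctly the affected
	 * contexts stay pinned with scheduling enabled, so a timeout
	 * here means the scrubbing is broken.
	 */
	ret = intel_gt_wait_for_idle(gt, HZ);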
With that:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> +	if (ret < 0) {
> +		pr_err("GT failed to idle: %d\n", ret);
> +		goto err;
> +	}
> +
> +err:
> +	for (i = 0; i < 3; ++i)
> +		if (last[i])
> +			i915_request_put(last[i]);
> +	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
> +
> +	return ret;
> +}
> +
> +int intel_guc_live_selftests(struct drm_i915_private *i915)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(intel_guc_scrub_ctbs),
> +	};
> +	struct intel_gt *gt = &i915->gt;
> +
> +	if (intel_gt_is_wedged(gt))
> +		return 0;
> +
> +	if (!intel_uc_uses_guc_submission(&gt->uc))
> +		return 0;
> +
> +	return intel_gt_live_subtests(tests, gt);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> index cfa5c4165a4f..3cf6758931f9 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> @@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
>   selftest(ring_submission, intel_ring_submission_live_selftests)
>   selftest(perf, i915_perf_live_selftests)
>   selftest(slpc, intel_slpc_live_selftests)
> +selftest(guc, intel_guc_live_selftests)
>   /* Here be dragons: keep last to run last! */
>   selftest(late_gt_pm, intel_gt_pm_late_selftests)
> diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> index 4b328346b48a..310fb83c527e 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
> @@ -14,6 +14,18 @@
>   #define REDUCED_PREEMPT		10
>   #define WAIT_FOR_RESET_TIME	10000
>   
> +struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
> +{
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +
> +	for_each_engine(engine, gt, id)
> +		return engine;
> +
> +	pr_err("No valid engine found!\n");
> +	return NULL;
> +}
> +
>   int intel_selftest_modify_policy(struct intel_engine_cs *engine,
>   				 struct intel_selftest_saved_policy *saved,
>   				 u32 modify_type)
> diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> index 35c098601ac0..ae60bb507f45 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> +++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
> @@ -10,6 +10,7 @@
>   
>   struct i915_request;
>   struct intel_engine_cs;
> +struct intel_gt;
>   
>   struct intel_selftest_saved_policy {
>   	u32 flags;
> @@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
>   	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
>   };
>   
> +struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
>   int intel_selftest_modify_policy(struct intel_engine_cs *engine,
>   				 struct intel_selftest_saved_policy *saved,
>   				 enum selftest_scheduler_modify modify_type);



* Re: [Intel-gfx] [PATCH 14/27] drm/i915/guc: Don't touch guc_state.sched_state without a lock
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 14/27] drm/i915/guc: Don't touch guc_state.sched_state without a lock Matthew Brost
@ 2021-08-25  1:20   ` Daniele Ceraolo Spurio
  2021-08-25  1:44     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25  1:20 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Before this patch we used some clever tricks to avoid taking a lock
> when touching guc_state.sched_state in certain cases. Don't do that;
> enforce the use of the lock.
>
> Part of this is removing a dead code path from guc_lrc_desc_pin where a
> context could be deregistered when the aforementioned function was
> called from the submission path. Remove this dead code and add a
> GEM_BUG_ON if this path is ever hit.
>
> v2:
>   (kernel test robot)
>    - Add __maybe_unused to sched_state_is_init()
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reported-by: kernel test robot <lkp@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 58 ++++++++++---------
>   1 file changed, 32 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 85f96d325048..fa87470ea576 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -150,11 +150,23 @@ static inline void clr_context_registered(struct intel_context *ce)
>   #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
>   static inline void init_sched_state(struct intel_context *ce)
>   {
> -	/* Only should be called from guc_lrc_desc_pin() */
> +	lockdep_assert_held(&ce->guc_state.lock);
>   	atomic_set(&ce->guc_sched_state_no_lock, 0);
>   	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
>   }
>   
> +__maybe_unused
> +static bool sched_state_is_init(struct intel_context *ce)
> +{
> +	/*
> +	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
> +	 * suspend.
> +	 */

This seems like something we want to fix. Not a blocker for this, but we 
can add it to the list.

> +	return !(atomic_read(&ce->guc_sched_state_no_lock) &
> +		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
> +		!(ce->guc_state.sched_state & ~SCHED_STATE_BLOCKED_MASK);
> +}
> +
>   static inline bool
>   context_wait_for_deregister_to_register(struct intel_context *ce)
>   {
> @@ -165,7 +177,7 @@ context_wait_for_deregister_to_register(struct intel_context *ce)
>   static inline void
>   set_context_wait_for_deregister_to_register(struct intel_context *ce)
>   {
> -	/* Only should be called from guc_lrc_desc_pin() without lock */
> +	lockdep_assert_held(&ce->guc_state.lock);
>   	ce->guc_state.sched_state |=
>   		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
>   }
> @@ -605,9 +617,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   	bool pending_disable, pending_enable, deregister, destroyed, banned;
>   
>   	xa_for_each(&guc->context_lookup, index, ce) {
> -		/* Flush context */
>   		spin_lock_irqsave(&ce->guc_state.lock, flags);
> -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   
>   		/*
>   		 * Once we are at this point submission_disabled() is guaranteed
> @@ -623,6 +633,8 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   		banned = context_banned(ce);
>   		init_sched_state(ce);
>   
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +
>   		if (pending_enable || destroyed || deregister) {
>   			decr_outstanding_submission_g2h(guc);
>   			if (deregister)
> @@ -1325,6 +1337,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	int ret = 0;
>   
>   	GEM_BUG_ON(!engine->mask);
> +	GEM_BUG_ON(!sched_state_is_init(ce));
>   
>   	/*
>   	 * Ensure LRC + CT vmas are is same region as write barrier is done
> @@ -1353,7 +1366,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	desc->priority = ce->guc_prio;
>   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>   	guc_context_policy_init(engine, desc);
> -	init_sched_state(ce);
>   
>   	/*
>   	 * The context_lookup xarray is used to determine if the hardware
> @@ -1364,26 +1376,23 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	 * registering this context.
>   	 */
>   	if (context_registered) {
> +		bool disabled;
> +		unsigned long flags;
> +
>   		trace_intel_context_steal_guc_id(ce);
> -		if (!loop) {
> +		GEM_BUG_ON(!loop);
> +
> +		/* Seal race with Reset */
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> +		disabled = submission_disabled(guc);
> +		if (likely(!disabled)) {
>   			set_context_wait_for_deregister_to_register(ce);
>   			intel_context_get(ce);
> -		} else {
> -			bool disabled;
> -			unsigned long flags;
> -
> -			/* Seal race with Reset */
> -			spin_lock_irqsave(&ce->guc_state.lock, flags);
> -			disabled = submission_disabled(guc);
> -			if (likely(!disabled)) {
> -				set_context_wait_for_deregister_to_register(ce);
> -				intel_context_get(ce);
> -			}
> -			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> -			if (unlikely(disabled)) {
> -				reset_lrc_desc(guc, desc_idx);
> -				return 0;	/* Will get registered later */
> -			}
> +		}
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +		if (unlikely(disabled)) {
> +			reset_lrc_desc(guc, desc_idx);
> +			return 0;	/* Will get registered later */
>   		}
>   
>   		/*
> @@ -1392,10 +1401,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   		 */
>   		with_intel_runtime_pm(runtime_pm, wakeref)
>   			ret = deregister_context(ce, ce->guc_id, loop);
> -		if (unlikely(ret == -EBUSY)) {
> -			clr_context_wait_for_deregister_to_register(ce);
> -			intel_context_put(ce);

Why is the EBUSY case not applicable anymore?

Daniele

> -		} else if (unlikely(ret == -ENODEV)) {
> +		if (unlikely(ret == -ENODEV)) {
>   			ret = 0;	/* Will get registered later */
>   		}
>   	} else {



* Re: [Intel-gfx] [PATCH 13/27] drm/i915/guc: Take context ref when cancelling request
  2021-08-24 15:42     ` Matthew Brost
@ 2021-08-25  1:21       ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25  1:21 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter



On 8/24/2021 8:42 AM, Matthew Brost wrote:
> On Fri, Aug 20, 2021 at 05:07:27PM -0700, Daniele Ceraolo Spurio wrote:
>>
>> On 8/18/2021 11:16 PM, Matthew Brost wrote:
>>> A context can get destroyed after cancelling a request so take a
>>> reference to context when cancelling a request.
>> What's the exact race? AFAICS __i915_request_skip does not have a
>> context_put().
> This commit message isn't quite right - it is really a context reset or
> a GT reset which could result in the context getting destroyed. I
> haven't actually seen this happen, but this is just being paranoid
> about ref counting. I can fix up the commit message.

ok, with an updated commit message:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>
> Matt
>
>> Daniele
>>
>>> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 ++++-
>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index e0e85e4ad512..85f96d325048 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -1620,8 +1620,10 @@ static void guc_context_cancel_request(struct intel_context *ce,
>>>    				       struct i915_request *rq)
>>>    {
>>>    	if (i915_sw_fence_signaled(&rq->submit)) {
>>> -		struct i915_sw_fence *fence = guc_context_block(ce);
>>> +		struct i915_sw_fence *fence;
>>> +		intel_context_get(ce);
>>> +		fence = guc_context_block(ce);
>>>    		i915_sw_fence_wait(fence);
>>>    		if (!i915_request_completed(rq)) {
>>>    			__i915_request_skip(rq);
>>> @@ -1636,6 +1638,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
>>>    		flush_work(&ce_to_guc(ce)->ct.requests.worker);
>>>    		guc_context_unblock(ce);
>>> +		intel_context_put(ce);
>>>    	}
>>>    }



* Re: [Intel-gfx] [PATCH 17/27] drm/i915/guc: Flush G2H work queue during reset
  2021-08-24 15:44     ` Matthew Brost
@ 2021-08-25  1:22       ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25  1:22 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter



On 8/24/2021 8:44 AM, Matthew Brost wrote:
> On Fri, Aug 20, 2021 at 05:25:41PM -0700, Daniele Ceraolo Spurio wrote:
>>
>> On 8/18/2021 11:16 PM, Matthew Brost wrote:
>>> It isn't safe to scrub for missing G2H or continue with the reset until
>>> all G2H processing is complete. Flush the G2H work queue during reset to
>>> ensure it is done running.
>> Might be worth moving this patch closer to "drm/i915/guc: Process all G2H
>> message at once in work queue".
>>
> Sure.
>
>>> Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 ++----------------
>>>    1 file changed, 2 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 4cf5a565f08e..9a53bae367b1 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -714,8 +714,6 @@ static void guc_flush_submissions(struct intel_guc *guc)
>>>    void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>>>    {
>>> -	int i;
>>> -
>>>    	if (unlikely(!guc_submission_initialized(guc))) {
>>>    		/* Reset called during driver load? GuC not yet initialised! */
>>>    		return;
>>> @@ -731,20 +729,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>>>    	guc_flush_submissions(guc);
>>> -	/*
>>> -	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
>>> -	 * each pass as interrupt have been disabled. We always scrub for
>>> -	 * outstanding G2H as it is possible for outstanding_submission_g2h to
>>> -	 * be incremented after the context state update.
>>> -	 */
>>> -	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
>>> -		intel_guc_to_host_event_handler(guc);
>>> -#define wait_for_reset(guc, wait_var) \
>>> -		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
>>> -		do {
>>> -			wait_for_reset(guc, &guc->outstanding_submission_g2h);
>>> -		} while (!list_empty(&guc->ct.requests.incoming));
>>> -	}
>>> +	flush_work(&guc->ct.requests.worker);
>>> +
>> We're now not waiting on the requests anymore, just ensuring that the
>> processing of the ones we already received is done. Is this intended? We
>> do still handle the remaining outstanding submissions in the scrub, so
>> it's functionally correct, but the commit message doesn't state the
>> change in waiting behavior, so I wanted to double check it was planned.
>>
> Yes, it is planned, as the scrub code should be able to cope with any
> missing G2H. Will update the commit message to reflect that.

With the updated commit msg:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>
> Matt
>
>> Daniele
>>
>>>    	scrub_guc_desc_for_outstanding_g2h(guc);
>>>    }



* Re: [Intel-gfx] [PATCH 14/27] drm/i915/guc: Don't touch guc_state.sched_state without a lock
  2021-08-25  1:20   ` Daniele Ceraolo Spurio
@ 2021-08-25  1:44     ` Matthew Brost
  2021-08-25  1:51       ` Daniele Ceraolo Spurio
  0 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-25  1:44 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 24, 2021 at 06:20:49PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > Before this patch we used some clever tricks to avoid taking a lock
> > when touching guc_state.sched_state in certain cases. Don't do that;
> > enforce the use of the lock.
> >
> > Part of this is removing a dead code path from guc_lrc_desc_pin where a
> > context could be deregistered when the aforementioned function was
> > called from the submission path. Remove this dead code and add a
> > GEM_BUG_ON if this path is ever hit.
> > 
> > v2:
> >   (kernel test robot)
> >    - Add __maybe_unused to sched_state_is_init()
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 58 ++++++++++---------
> >   1 file changed, 32 insertions(+), 26 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 85f96d325048..fa87470ea576 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -150,11 +150,23 @@ static inline void clr_context_registered(struct intel_context *ce)
> >   #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
> >   static inline void init_sched_state(struct intel_context *ce)
> >   {
> > -	/* Only should be called from guc_lrc_desc_pin() */
> > +	lockdep_assert_held(&ce->guc_state.lock);
> >   	atomic_set(&ce->guc_sched_state_no_lock, 0);
> >   	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
> >   }
> > +__maybe_unused
> > +static bool sched_state_is_init(struct intel_context *ce)
> > +{
> > +	/*
> > +	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
> > +	 * suspend.
> > +	 */
> 
> This seems like something we want to fix. Not a blocker for this, but we can
> add it to the list.
>

Right, hence the comment in the code.
 
> > +	return !(atomic_read(&ce->guc_sched_state_no_lock) &
> > +		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
> > +		!(ce->guc_state.sched_state & ~SCHED_STATE_BLOCKED_MASK);
> > +}
> > +
> >   static inline bool
> >   context_wait_for_deregister_to_register(struct intel_context *ce)
> >   {
> > @@ -165,7 +177,7 @@ context_wait_for_deregister_to_register(struct intel_context *ce)
> >   static inline void
> >   set_context_wait_for_deregister_to_register(struct intel_context *ce)
> >   {
> > -	/* Only should be called from guc_lrc_desc_pin() without lock */
> > +	lockdep_assert_held(&ce->guc_state.lock);
> >   	ce->guc_state.sched_state |=
> >   		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
> >   }
> > @@ -605,9 +617,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   	bool pending_disable, pending_enable, deregister, destroyed, banned;
> >   	xa_for_each(&guc->context_lookup, index, ce) {
> > -		/* Flush context */
> >   		spin_lock_irqsave(&ce->guc_state.lock, flags);
> > -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >   		/*
> >   		 * Once we are at this point submission_disabled() is guaranteed
> > @@ -623,6 +633,8 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   		banned = context_banned(ce);
> >   		init_sched_state(ce);
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +
> >   		if (pending_enable || destroyed || deregister) {
> >   			decr_outstanding_submission_g2h(guc);
> >   			if (deregister)
> > @@ -1325,6 +1337,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	int ret = 0;
> >   	GEM_BUG_ON(!engine->mask);
> > +	GEM_BUG_ON(!sched_state_is_init(ce));
> >   	/*
> >   	 * Ensure LRC + CT vmas are is same region as write barrier is done
> > @@ -1353,7 +1366,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	desc->priority = ce->guc_prio;
> >   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> >   	guc_context_policy_init(engine, desc);
> > -	init_sched_state(ce);
> >   	/*
> >   	 * The context_lookup xarray is used to determine if the hardware
> > @@ -1364,26 +1376,23 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	 * registering this context.
> >   	 */
> >   	if (context_registered) {
> > +		bool disabled;
> > +		unsigned long flags;
> > +
> >   		trace_intel_context_steal_guc_id(ce);
> > -		if (!loop) {
> > +		GEM_BUG_ON(!loop);
> > +
> > +		/* Seal race with Reset */
> > +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +		disabled = submission_disabled(guc);
> > +		if (likely(!disabled)) {
> >   			set_context_wait_for_deregister_to_register(ce);
> >   			intel_context_get(ce);
> > -		} else {
> > -			bool disabled;
> > -			unsigned long flags;
> > -
> > -			/* Seal race with Reset */
> > -			spin_lock_irqsave(&ce->guc_state.lock, flags);
> > -			disabled = submission_disabled(guc);
> > -			if (likely(!disabled)) {
> > -				set_context_wait_for_deregister_to_register(ce);
> > -				intel_context_get(ce);
> > -			}
> > -			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > -			if (unlikely(disabled)) {
> > -				reset_lrc_desc(guc, desc_idx);
> > -				return 0;	/* Will get registered later */
> > -			}
> > +		}
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +		if (unlikely(disabled)) {
> > +			reset_lrc_desc(guc, desc_idx);
> > +			return 0;	/* Will get registered later */
> >   		}
> >   		/*
> > @@ -1392,10 +1401,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   		 */
> >   		with_intel_runtime_pm(runtime_pm, wakeref)
> >   			ret = deregister_context(ce, ce->guc_id, loop);
> > -		if (unlikely(ret == -EBUSY)) {
> > -			clr_context_wait_for_deregister_to_register(ce);
> > -			intel_context_put(ce);
> 
> Why is the EBUSY case not applicable anymore?
> 

The commit message covers this - it is dead code that can't be reached
in the current code, nor can it be reached in upcoming code. Put
another way, loop is always true, thus we can't get -EBUSY from
deregister_context().
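
To illustrate, the send at the bottom of deregister_context() has
roughly this shape (simplified sketch, not the actual function body):

	if (loop)
		/* busy-loop send retries internally, never returns -EBUSY */
		ret = intel_guc_send_busy_loop(guc, action, len,
					       g2h_len_dw, true);
	else
		/* only this non-looping path can return -EBUSY */
		ret = intel_guc_send_nb(guc, action, len, g2h_len_dw);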

Matt 

> Daniele
> 
> > -		} else if (unlikely(ret == -ENODEV)) {
> > +		if (unlikely(ret == -ENODEV)) {
> >   			ret = 0;	/* Will get registered later */
> >   		}
> >   	} else {
> 


* Re: [Intel-gfx] [PATCH 18/27] drm/i915/guc: Release submit fence from an irq_work
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 18/27] drm/i915/guc: Release submit fence from an irq_work Matthew Brost
@ 2021-08-25  1:44   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25  1:44 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> A subsequent patch will flip the locking hierarchy from
> ce->guc_state.lock -> sched_engine->lock to sched_engine->lock ->
> ce->guc_state.lock. As such we need to release the submit fence for a
> request from an irq_work to break a lock inversion - i.e. the fence
> must be released while holding ce->guc_state.lock, and releasing the
> fence can acquire sched_engine->lock.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>   drivers/gpu/drm/i915/i915_request.h               |  5 +++++
>   2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 9a53bae367b1..deb2e821e441 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2025,6 +2025,14 @@ static const struct intel_context_ops guc_context_ops = {
>   	.create_virtual = guc_create_virtual,
>   };
>   
> +static void submit_work_cb(struct irq_work *wrk)
> +{
> +	struct i915_request *rq = container_of(wrk, typeof(*rq), submit_work);
> +
> +	might_lock(&rq->engine->sched_engine->lock);
> +	i915_sw_fence_complete(&rq->submit);
> +}
> +
>   static void __guc_signal_context_fence(struct intel_context *ce)
>   {
>   	struct i915_request *rq;
> @@ -2034,8 +2042,12 @@ static void __guc_signal_context_fence(struct intel_context *ce)
>   	if (!list_empty(&ce->guc_state.fences))
>   		trace_intel_context_fence_release(ce);
>   
> +	/*
> +	 * Use an IRQ to ensure locking order of sched_engine->lock ->
> +	 * ce->guc_state.lock is preserved.
> +	 */
>   	list_for_each_entry(rq, &ce->guc_state.fences, guc_fence_link)
> -		i915_sw_fence_complete(&rq->submit);
> +		irq_work_queue(&rq->submit_work);

I think we should clear rq->guc_fence_link before queueing the work,
just to make sure the work can't reach back into this list (I know we
don't do that now, it's just future-proofing paranoia).
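Something like this (untested sketch):

	struct i915_request *rq, *rn;

	list_for_each_entry_safe(rq, rn, &ce->guc_state.fences,
				 guc_fence_link) {
		/* unlink first so the irq_work never sees a linked request */
		list_del_init(&rq->guc_fence_link);
		irq_work_queue(&rq->submit_work);
	}

With that: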

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>   
>   	INIT_LIST_HEAD(&ce->guc_state.fences);
>   }
> @@ -2145,6 +2157,7 @@ static int guc_request_alloc(struct i915_request *rq)
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   	if (context_wait_for_deregister_to_register(ce) ||
>   	    context_pending_disable(ce)) {
> +		init_irq_work(&rq->submit_work, submit_work_cb);
>   		i915_sw_fence_await(&rq->submit);
>   
>   		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 1bc1349ba3c2..d818cfbfc41d 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -218,6 +218,11 @@ struct i915_request {
>   	};
>   	struct llist_head execute_cb;
>   	struct i915_sw_fence semaphore;
> +	/**
> +	 * @submit_work: complete submit fence from an IRQ if needed for
> +	 * locking hierarchy reasons.
> +	 */
> +	struct irq_work submit_work;
>   
>   	/*
>   	 * A list of everyone we wait upon, and everyone who waits upon us.



* Re: [Intel-gfx] [PATCH 14/27] drm/i915/guc: Don't touch guc_state.sched_state without a lock
  2021-08-25  1:44     ` Matthew Brost
@ 2021-08-25  1:51       ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25  1:51 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter



On 8/24/2021 6:44 PM, Matthew Brost wrote:
> On Tue, Aug 24, 2021 at 06:20:49PM -0700, Daniele Ceraolo Spurio wrote:
>>
>> On 8/18/2021 11:16 PM, Matthew Brost wrote:
>>> Before this patch we used some clever tricks to avoid taking a lock
>>> when touching guc_state.sched_state in certain cases. Don't do that;
>>> enforce the use of the lock.
>>>
>>> Part of this is removing a dead code path from guc_lrc_desc_pin where a
>>> context could be deregistered when the aforementioned function was
>>> called from the submission path. Remove this dead code and add a
>>> GEM_BUG_ON if this path is ever hit.
>>>
>>> v2:
>>>    (kernel test robot)
>>>     - Add __maybe_unused to sched_state_is_init()
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Reported-by: kernel test robot <lkp@intel.com>
>>> ---
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 58 ++++++++++---------
>>>    1 file changed, 32 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 85f96d325048..fa87470ea576 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -150,11 +150,23 @@ static inline void clr_context_registered(struct intel_context *ce)
>>>    #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
>>>    static inline void init_sched_state(struct intel_context *ce)
>>>    {
>>> -	/* Only should be called from guc_lrc_desc_pin() */
>>> +	lockdep_assert_held(&ce->guc_state.lock);
>>>    	atomic_set(&ce->guc_sched_state_no_lock, 0);
>>>    	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
>>>    }
>>> +__maybe_unused
>>> +static bool sched_state_is_init(struct intel_context *ce)
>>> +{
>>> +	/*
>>> +	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
>>> +	 * suspend.
>>> +	 */
>> This seems like something we want to fix. Not a blocker for this, but we can
>> add it to the list.
>>
> Right, hence the comment in the code.
>   
>>> +	return !(atomic_read(&ce->guc_sched_state_no_lock) &
>>> +		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
>>> +		!(ce->guc_state.sched_state & ~SCHED_STATE_BLOCKED_MASK);
>>> +}
>>> +
>>>    static inline bool
>>>    context_wait_for_deregister_to_register(struct intel_context *ce)
>>>    {
>>> @@ -165,7 +177,7 @@ context_wait_for_deregister_to_register(struct intel_context *ce)
>>>    static inline void
>>>    set_context_wait_for_deregister_to_register(struct intel_context *ce)
>>>    {
>>> -	/* Only should be called from guc_lrc_desc_pin() without lock */
>>> +	lockdep_assert_held(&ce->guc_state.lock);
>>>    	ce->guc_state.sched_state |=
>>>    		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
>>>    }
>>> @@ -605,9 +617,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>>>    	bool pending_disable, pending_enable, deregister, destroyed, banned;
>>>    	xa_for_each(&guc->context_lookup, index, ce) {
>>> -		/* Flush context */
>>>    		spin_lock_irqsave(&ce->guc_state.lock, flags);
>>> -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>>    		/*
>>>    		 * Once we are at this point submission_disabled() is guaranteed
>>> @@ -623,6 +633,8 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>>>    		banned = context_banned(ce);
>>>    		init_sched_state(ce);
>>> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> +
>>>    		if (pending_enable || destroyed || deregister) {
>>>    			decr_outstanding_submission_g2h(guc);
>>>    			if (deregister)
>>> @@ -1325,6 +1337,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>>>    	int ret = 0;
>>>    	GEM_BUG_ON(!engine->mask);
>>> +	GEM_BUG_ON(!sched_state_is_init(ce));
>>>    	/*
>>>    	 * Ensure LRC + CT vmas are is same region as write barrier is done
>>> @@ -1353,7 +1366,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>>>    	desc->priority = ce->guc_prio;
>>>    	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>>>    	guc_context_policy_init(engine, desc);
>>> -	init_sched_state(ce);
>>>    	/*
>>>    	 * The context_lookup xarray is used to determine if the hardware
>>> @@ -1364,26 +1376,23 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>>>    	 * registering this context.
>>>    	 */
>>>    	if (context_registered) {
>>> +		bool disabled;
>>> +		unsigned long flags;
>>> +
>>>    		trace_intel_context_steal_guc_id(ce);
>>> -		if (!loop) {
>>> +		GEM_BUG_ON(!loop);
>>> +
>>> +		/* Seal race with Reset */
>>> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
>>> +		disabled = submission_disabled(guc);
>>> +		if (likely(!disabled)) {
>>>    			set_context_wait_for_deregister_to_register(ce);
>>>    			intel_context_get(ce);
>>> -		} else {
>>> -			bool disabled;
>>> -			unsigned long flags;
>>> -
>>> -			/* Seal race with Reset */
>>> -			spin_lock_irqsave(&ce->guc_state.lock, flags);
>>> -			disabled = submission_disabled(guc);
>>> -			if (likely(!disabled)) {
>>> -				set_context_wait_for_deregister_to_register(ce);
>>> -				intel_context_get(ce);
>>> -			}
>>> -			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> -			if (unlikely(disabled)) {
>>> -				reset_lrc_desc(guc, desc_idx);
>>> -				return 0;	/* Will get registered later */
>>> -			}
>>> +		}
>>> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> +		if (unlikely(disabled)) {
>>> +			reset_lrc_desc(guc, desc_idx);
>>> +			return 0;	/* Will get registered later */
>>>    		}
>>>    		/*
>>> @@ -1392,10 +1401,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>>>    		 */
>>>    		with_intel_runtime_pm(runtime_pm, wakeref)
>>>    			ret = deregister_context(ce, ce->guc_id, loop);
>>> -		if (unlikely(ret == -EBUSY)) {
>>> -			clr_context_wait_for_deregister_to_register(ce);
>>> -			intel_context_put(ce);
>> Why is the EBUSY case not applicable anymore?
>>
> Commmit message cover this - this is dead code that can't be reached
> in the current code nor can be it be reached in upcoming code. Or put
> another way loop is always true thus we can't get -EBUSY from
> deregister_context().

ok, I hadn't realized that we could get -EBUSY only if loop=false.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> Matt
>
>> Daniele
>>
>>> -		} else if (unlikely(ret == -ENODEV)) {
>>> +		if (unlikely(ret == -ENODEV)) {
>>>    			ret = 0;	/* Will get registered later */
>>>    		}
>>>    	} else {



* Re: [Intel-gfx] [PATCH 24/27] drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 24/27] drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure Matthew Brost
@ 2021-08-25  2:00   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25  2:00 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> To make ownership of locking clear, move the fields (guc_id, guc_id_ref,
> guc_id_link) into the sub-structure guc_id in intel_context.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |   4 +-
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  18 +--
>   drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   6 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 106 +++++++++---------
>   drivers/gpu/drm/i915/i915_trace.h             |   4 +-
>   5 files changed, 70 insertions(+), 68 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 0e48939ec85f..87b84c1d5393 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -398,8 +398,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   	spin_lock_init(&ce->guc_active.lock);
>   	INIT_LIST_HEAD(&ce->guc_active.requests);
>   
> -	ce->guc_id = GUC_INVALID_LRC_ID;
> -	INIT_LIST_HEAD(&ce->guc_id_link);
> +	ce->guc_id.id = GUC_INVALID_LRC_ID;
> +	INIT_LIST_HEAD(&ce->guc_id.link);
>   
>   	/*
>   	 * Initialize fence to be complete as this is expected to be complete
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 9fb0480ccf3b..7a1d1537cf67 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -186,16 +186,18 @@ struct intel_context {
>   		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
>   	} guc_active;
>   
> -	/* GuC LRC descriptor ID */
> -	u16 guc_id;
> +	struct {
> +		/* GuC LRC descriptor ID */
> +		u16 id;
>   
> -	/* GuC LRC descriptor reference count */
> -	atomic_t guc_id_ref;
> +		/* GuC LRC descriptor reference count */
> +		atomic_t ref;
>   
> -	/*
> -	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
> -	 */
> -	struct list_head guc_id_link;
> +		/*
> +		 * GuC ID link - in list when unpinned but guc_id still valid in GuC
> +		 */
> +		struct list_head link;
> +	} guc_id;

Maybe add a

/* protected via guc->contexts_lock */

somewhere in the struct doc?
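e.g. something like (sketch):

	struct {
		/*
		 * everything in this sub-structure is protected by
		 * guc->contexts_lock
		 */

		/* GuC LRC descriptor ID */
		u16 id;

		/* GuC LRC descriptor reference count */
		atomic_t ref;

		/* in list when unpinned but guc_id still valid in GuC */
		struct list_head link;
	} guc_id;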

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>   
>   #ifdef CONFIG_DRM_I915_SELFTEST
>   	/**
> diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
> index 08f011f893b2..bf43bed905db 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
> @@ -789,7 +789,7 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
>   				if (err)
>   					pr_err("[%s] Wait for request %lld:%lld [0x%04X] failed: %d!\n",
>   					       engine->name, rq->fence.context,
> -					       rq->fence.seqno, rq->context->guc_id, err);
> +					       rq->fence.seqno, rq->context->guc_id.id, err);
>   			}
>   
>   skip:
> @@ -1098,7 +1098,7 @@ static int __igt_reset_engines(struct intel_gt *gt,
>   				if (err)
>   					pr_err("[%s] Wait for request %lld:%lld [0x%04X] failed: %d!\n",
>   					       engine->name, rq->fence.context,
> -					       rq->fence.seqno, rq->context->guc_id, err);
> +					       rq->fence.seqno, rq->context->guc_id.id, err);
>   			}
>   
>   			count++;
> @@ -1108,7 +1108,7 @@ static int __igt_reset_engines(struct intel_gt *gt,
>   					pr_err("i915_reset_engine(%s:%s): failed to reset request %lld:%lld [0x%04X]\n",
>   					       engine->name, test_name,
>   					       rq->fence.context,
> -					       rq->fence.seqno, rq->context->guc_id);
> +					       rq->fence.seqno, rq->context->guc_id.id);
>   					i915_request_put(rq);
>   
>   					GEM_TRACE_DUMP();
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index bb90bedb1305..c4c018348ac0 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -269,12 +269,12 @@ static inline void decr_context_committed_requests(struct intel_context *ce)
>   
>   static inline bool context_guc_id_invalid(struct intel_context *ce)
>   {
> -	return ce->guc_id == GUC_INVALID_LRC_ID;
> +	return ce->guc_id.id == GUC_INVALID_LRC_ID;
>   }
>   
>   static inline void set_context_guc_id_invalid(struct intel_context *ce)
>   {
> -	ce->guc_id = GUC_INVALID_LRC_ID;
> +	ce->guc_id.id = GUC_INVALID_LRC_ID;
>   }
>   
>   static inline struct intel_guc *ce_to_guc(struct intel_context *ce)
> @@ -466,14 +466,14 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   		return 0;
>   	}
>   
> -	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
> +	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
>   	GEM_BUG_ON(context_guc_id_invalid(ce));
>   
>   	/*
>   	 * Corner case where the GuC firmware was blown away and reloaded while
>   	 * this context was pinned.
>   	 */
> -	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
> +	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
>   		err = guc_lrc_desc_pin(ce, false);
>   		if (unlikely(err))
>   			return err;
> @@ -492,14 +492,14 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   
>   	if (!enabled) {
>   		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
> -		action[len++] = ce->guc_id;
> +		action[len++] = ce->guc_id.id;
>   		action[len++] = GUC_CONTEXT_ENABLE;
>   		set_context_pending_enable(ce);
>   		intel_context_get(ce);
>   		g2h_len_dw = G2H_LEN_DW_SCHED_CONTEXT_MODE_SET;
>   	} else {
>   		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT;
> -		action[len++] = ce->guc_id;
> +		action[len++] = ce->guc_id.id;
>   	}
>   
>   	err = intel_guc_send_nb(guc, action, len, g2h_len_dw);
> @@ -1150,12 +1150,12 @@ static int new_guc_id(struct intel_guc *guc)
>   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	if (!context_guc_id_invalid(ce)) {
> -		ida_simple_remove(&guc->guc_ids, ce->guc_id);
> -		reset_lrc_desc(guc, ce->guc_id);
> +		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
> +		reset_lrc_desc(guc, ce->guc_id.id);
>   		set_context_guc_id_invalid(ce);
>   	}
> -	if (!list_empty(&ce->guc_id_link))
> -		list_del_init(&ce->guc_id_link);
> +	if (!list_empty(&ce->guc_id.link))
> +		list_del_init(&ce->guc_id.link);
>   }
>   
>   static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> @@ -1177,13 +1177,13 @@ static int steal_guc_id(struct intel_guc *guc)
>   	if (!list_empty(&guc->guc_id_list)) {
>   		ce = list_first_entry(&guc->guc_id_list,
>   				      struct intel_context,
> -				      guc_id_link);
> +				      guc_id.link);
>   
> -		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
> +		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
>   		GEM_BUG_ON(context_guc_id_invalid(ce));
>   
> -		list_del_init(&ce->guc_id_link);
> -		guc_id = ce->guc_id;
> +		list_del_init(&ce->guc_id.link);
> +		guc_id = ce->guc_id.id;
>   
>   		spin_lock(&ce->guc_state.lock);
>   		clr_context_registered(ce);
> @@ -1219,7 +1219,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	int ret = 0;
>   	unsigned long flags, tries = PIN_GUC_ID_TRIES;
>   
> -	GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
> +	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
>   
>   try_again:
>   	spin_lock_irqsave(&guc->contexts_lock, flags);
> @@ -1227,20 +1227,20 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	might_lock(&ce->guc_state.lock);
>   
>   	if (context_guc_id_invalid(ce)) {
> -		ret = assign_guc_id(guc, &ce->guc_id);
> +		ret = assign_guc_id(guc, &ce->guc_id.id);
>   		if (ret)
>   			goto out_unlock;
>   		ret = 1;	/* Indicates newly assigned guc_id */
>   	}
> -	if (!list_empty(&ce->guc_id_link))
> -		list_del_init(&ce->guc_id_link);
> -	atomic_inc(&ce->guc_id_ref);
> +	if (!list_empty(&ce->guc_id.link))
> +		list_del_init(&ce->guc_id.link);
> +	atomic_inc(&ce->guc_id.ref);
>   
>   out_unlock:
>   	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>   
>   	/*
> -	 * -EAGAIN indicates no guc_ids are available, let's retire any
> +	 * -EAGAIN indicates no guc_id are available, let's retire any
>   	 * outstanding requests to see if that frees up a guc_id. If the first
>   	 * retire didn't help, insert a sleep with the timeslice duration before
>   	 * attempting to retire more requests. Double the sleep period each
> @@ -1268,15 +1268,15 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	unsigned long flags;
>   
> -	GEM_BUG_ON(atomic_read(&ce->guc_id_ref) < 0);
> +	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
>   
>   	if (unlikely(context_guc_id_invalid(ce)))
>   		return;
>   
>   	spin_lock_irqsave(&guc->contexts_lock, flags);
> -	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id_link) &&
> -	    !atomic_read(&ce->guc_id_ref))
> -		list_add_tail(&ce->guc_id_link, &guc->guc_id_list);
> +	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
> +	    !atomic_read(&ce->guc_id.ref))
> +		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
>   	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>   }
>   
> @@ -1299,12 +1299,12 @@ static int register_context(struct intel_context *ce, bool loop)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
>   	u32 offset = intel_guc_ggtt_offset(guc, guc->lrc_desc_pool) +
> -		ce->guc_id * sizeof(struct guc_lrc_desc);
> +		ce->guc_id.id * sizeof(struct guc_lrc_desc);
>   	int ret;
>   
>   	trace_intel_context_register(ce);
>   
> -	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
> +	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
>   	if (likely(!ret)) {
>   		unsigned long flags;
>   
> @@ -1374,7 +1374,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	struct intel_engine_cs *engine = ce->engine;
>   	struct intel_runtime_pm *runtime_pm = engine->uncore->rpm;
>   	struct intel_guc *guc = &engine->gt->uc.guc;
> -	u32 desc_idx = ce->guc_id;
> +	u32 desc_idx = ce->guc_id.id;
>   	struct guc_lrc_desc *desc;
>   	bool context_registered;
>   	intel_wakeref_t wakeref;
> @@ -1437,7 +1437,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   		 * context whose guc_id was stolen.
>   		 */
>   		with_intel_runtime_pm(runtime_pm, wakeref)
> -			ret = deregister_context(ce, ce->guc_id, loop);
> +			ret = deregister_context(ce, ce->guc_id.id, loop);
>   		if (unlikely(ret == -ENODEV)) {
>   			ret = 0;	/* Will get registered later */
>   		}
> @@ -1509,7 +1509,7 @@ static void __guc_context_sched_enable(struct intel_guc *guc,
>   {
>   	u32 action[] = {
>   		INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET,
> -		ce->guc_id,
> +		ce->guc_id.id,
>   		GUC_CONTEXT_ENABLE
>   	};
>   
> @@ -1525,7 +1525,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
>   {
>   	u32 action[] = {
>   		INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET,
> -		guc_id,	/* ce->guc_id not stable */
> +		guc_id,	/* ce->guc_id.id not stable */
>   		GUC_CONTEXT_DISABLE
>   	};
>   
> @@ -1570,7 +1570,7 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
>   	guc_blocked_fence_reinit(ce);
>   	intel_context_get(ce);
>   
> -	return ce->guc_id;
> +	return ce->guc_id.id;
>   }
>   
>   static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> @@ -1625,7 +1625,7 @@ static void guc_context_unblock(struct intel_context *ce)
>   	if (unlikely(submission_disabled(guc) ||
>   		     intel_context_is_banned(ce) ||
>   		     context_guc_id_invalid(ce) ||
> -		     !lrc_desc_registered(guc, ce->guc_id) ||
> +		     !lrc_desc_registered(guc, ce->guc_id.id) ||
>   		     !intel_context_is_pinned(ce) ||
>   		     context_pending_disable(ce) ||
>   		     context_blocked(ce) > 1)) {
> @@ -1730,7 +1730,7 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
>   		if (!context_guc_id_invalid(ce))
>   			with_intel_runtime_pm(runtime_pm, wakeref)
>   				__guc_context_set_preemption_timeout(guc,
> -								     ce->guc_id,
> +								     ce->guc_id.id,
>   								     1);
>   		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   	}
> @@ -1746,7 +1746,7 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   	bool enabled;
>   
>   	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
> -	    !lrc_desc_registered(guc, ce->guc_id)) {
> +	    !lrc_desc_registered(guc, ce->guc_id.id)) {
>   		spin_lock_irqsave(&ce->guc_state.lock, flags);
>   		clr_context_enabled(ce);
>   		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> @@ -1793,11 +1793,11 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
>   
> -	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id));
> -	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
> +	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
> +	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
>   	GEM_BUG_ON(context_enabled(ce));
>   
> -	deregister_context(ce, ce->guc_id, true);
> +	deregister_context(ce, ce->guc_id.id, true);
>   }
>   
>   static void __guc_context_destroy(struct intel_context *ce)
> @@ -1842,7 +1842,7 @@ static void guc_context_destroy(struct kref *kref)
>   		__guc_context_destroy(ce);
>   		return;
>   	} else if (submission_disabled(guc) ||
> -		   !lrc_desc_registered(guc, ce->guc_id)) {
> +		   !lrc_desc_registered(guc, ce->guc_id.id)) {
>   		release_guc_id(guc, ce);
>   		__guc_context_destroy(ce);
>   		return;
> @@ -1851,10 +1851,10 @@ static void guc_context_destroy(struct kref *kref)
>   	/*
>   	 * We have to acquire the context spinlock and check guc_id again, if it
>   	 * is valid it hasn't been stolen and needs to be deregistered. We
> -	 * delete this context from the list of unpinned guc_ids available to
> +	 * delete this context from the list of unpinned guc_id available to
>   	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
>   	 * returns indicating this context has been deregistered the guc_id is
> -	 * returned to the pool of available guc_ids.
> +	 * returned to the pool of available guc_id.
>   	 */
>   	spin_lock_irqsave(&guc->contexts_lock, flags);
>   	if (context_guc_id_invalid(ce)) {
> @@ -1863,8 +1863,8 @@ static void guc_context_destroy(struct kref *kref)
>   		return;
>   	}
>   
> -	if (!list_empty(&ce->guc_id_link))
> -		list_del_init(&ce->guc_id_link);
> +	if (!list_empty(&ce->guc_id.link))
> +		list_del_init(&ce->guc_id.link);
>   	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>   
>   	/* Seal race with Reset */
> @@ -1909,7 +1909,7 @@ static void guc_context_set_prio(struct intel_guc *guc,
>   {
>   	u32 action[] = {
>   		INTEL_GUC_ACTION_SET_CONTEXT_PRIORITY,
> -		ce->guc_id,
> +		ce->guc_id.id,
>   		prio,
>   	};
>   
> @@ -2044,7 +2044,7 @@ static void remove_from_context(struct i915_request *rq)
>   	decr_context_committed_requests(ce);
>   	spin_unlock_irq(&ce->guc_state.lock);
>   
> -	atomic_dec(&ce->guc_id_ref);
> +	atomic_dec(&ce->guc_id.ref);
>   	i915_request_notify_execute_cb_imm(rq);
>   }
>   
> @@ -2111,7 +2111,7 @@ static void guc_signal_context_fence(struct intel_context *ce)
>   static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
>   {
>   	return (new_guc_id || test_bit(CONTEXT_LRCA_DIRTY, &ce->flags) ||
> -		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id)) &&
> +		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id)) &&
>   		!submission_disabled(ce_to_guc(ce));
>   }
>   
> @@ -2166,11 +2166,11 @@ static int guc_request_alloc(struct i915_request *rq)
>   	/*
>   	 * Call pin_guc_id here rather than in the pinning step as with
>   	 * dma_resv, contexts can be repeatedly pinned / unpinned trashing the
> -	 * guc_ids and creating horrible race conditions. This is especially bad
> -	 * when guc_ids are being stolen due to over subscription. By the time
> +	 * guc_id and creating horrible race conditions. This is especially bad
> +	 * when guc_id are being stolen due to over subscription. By the time
>   	 * this function is reached, it is guaranteed that the guc_id will be
>   	 * persistent until the generated request is retired. Thus, sealing these
> -	 * race conditions. It is still safe to fail here if guc_ids are
> +	 * race conditions. It is still safe to fail here if guc_id are
>   	 * exhausted and return -EAGAIN to the user indicating that they can try
>   	 * again in the future.
>   	 *
> @@ -2180,7 +2180,7 @@ static int guc_request_alloc(struct i915_request *rq)
>   	 * decremented on each retire. When it is zero, a lock around the
>   	 * increment (in pin_guc_id) is needed to seal a race with unpin_guc_id.
>   	 */
> -	if (atomic_add_unless(&ce->guc_id_ref, 1, 0))
> +	if (atomic_add_unless(&ce->guc_id.ref, 1, 0))
>   		goto out;
>   
>   	ret = pin_guc_id(guc, ce);	/* returns 1 if new guc_id assigned */
> @@ -2193,7 +2193,7 @@ static int guc_request_alloc(struct i915_request *rq)
>   				disable_submission(guc);
>   				goto out;	/* GPU will be reset */
>   			}
> -			atomic_dec(&ce->guc_id_ref);
> +			atomic_dec(&ce->guc_id.ref);
>   			unpin_guc_id(guc, ce);
>   			return ret;
>   		}
> @@ -3028,7 +3028,7 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>   
>   		priolist_for_each_request(rq, pl)
>   			drm_printf(p, "guc_id=%u, seqno=%llu\n",
> -				   rq->context->guc_id,
> +				   rq->context->guc_id.id,
>   				   rq->fence.seqno);
>   	}
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
> @@ -3059,7 +3059,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   
>   	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> +		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
>   		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
>   		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
>   			   ce->ring->head,
> @@ -3070,7 +3070,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   		drm_printf(p, "\t\tContext Pin Count: %u\n",
>   			   atomic_read(&ce->pin_count));
>   		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> -			   atomic_read(&ce->guc_id_ref));
> +			   atomic_read(&ce->guc_id.ref));
>   		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
>   			   ce->guc_state.sched_state);
>   
> diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> index 6f882e72ed11..0574f5c7a985 100644
> --- a/drivers/gpu/drm/i915/i915_trace.h
> +++ b/drivers/gpu/drm/i915/i915_trace.h
> @@ -805,7 +805,7 @@ DECLARE_EVENT_CLASS(i915_request,
>   			   __entry->dev = rq->engine->i915->drm.primary->index;
>   			   __entry->class = rq->engine->uabi_class;
>   			   __entry->instance = rq->engine->uabi_instance;
> -			   __entry->guc_id = rq->context->guc_id;
> +			   __entry->guc_id = rq->context->guc_id.id;
>   			   __entry->ctx = rq->fence.context;
>   			   __entry->seqno = rq->fence.seqno;
>   			   __entry->tail = rq->tail;
> @@ -907,7 +907,7 @@ DECLARE_EVENT_CLASS(intel_context,
>   			     ),
>   
>   		    TP_fast_assign(
> -			   __entry->guc_id = ce->guc_id;
> +			   __entry->guc_id = ce->guc_id.id;
>   			   __entry->pin_count = atomic_read(&ce->pin_count);
>   			   __entry->sched_state = ce->guc_state.sched_state;
>   			   __entry->guc_prio = ce->guc_active.prio;
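
As an aside for anyone following the rename: after this patch the loose
guc_id fields end up grouped in a sub-structure along these lines (a
sketch inferred from the hunks above, with illustrative comments):

	struct {
		/* GuC LRC descriptor ID, formerly ce->guc_id */
		u16 id;
		/* In-flight references, formerly ce->guc_id_ref */
		atomic_t ref;
		/* Link in the list of stealable guc_ids, formerly ce->guc_id_link */
		struct list_head link;
	} guc_id;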



* Re: [Intel-gfx] [PATCH 06/27] drm/i915/guc: Workaround reset G2H is received after schedule done G2H
  2021-08-24 23:31   ` Daniele Ceraolo Spurio
@ 2021-08-25  4:05     ` Matthew Brost
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-25  4:05 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 24, 2021 at 04:31:21PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > If the context is reset as a result of the request cancelation, the
> > context reset G2H is received after the schedule disable done G2H, which
> > is likely the wrong order. The schedule disable done G2H releases the
> > waiting request cancelation code, which resubmits the context. This races
> > with the context reset G2H, which also wants to resubmit the context, but
> > in this case it really should be a NOP, as the request cancelation code
> > owns the resubmit. Use some clever tricks of checking the context state
> > to seal this race until if / when the GuC firmware is fixed.
> 
> Did you raise this with the GuC team? If it's a GuC issue we definitely want
> a fix there ASAP so we can drop any i915-side WAs.
>

Yep, definitely an issue with the GuC firmware behavior. It will get fixed,
just not sure when.
 
> > 
> > v2:
> >   (Checkpatch)
> >    - Fix typos
> > 
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 ++++++++++++++++---
> >   1 file changed, 37 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index e4a099f8f820..8f7a11e65ef5 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -832,17 +832,35 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >   static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >   {
> >   	struct i915_request *rq;
> > +	unsigned long flags;
> >   	u32 head;
> > +	bool skip = false;
> >   	intel_context_get(ce);
> >   	/*
> > -	 * GuC will implicitly mark the context as non-schedulable
> > -	 * when it sends the reset notification. Make sure our state
> > -	 * reflects this change. The context will be marked enabled
> > -	 * on resubmission.
> > +	 * GuC will implicitly mark the context as non-schedulable when it sends
> > +	 * the reset notification. Make sure our state reflects this change. The
> > +	 * context will be marked enabled on resubmission.
> > +	 *
> > +	 * XXX: If the context is reset as a result of the request cancellation
> > +	 * this G2H is received after the schedule disable complete G2H which is
> > +	 * likely wrong as this creates a race between the request cancellation
> > +	 * code re-submitting the context and this G2H handler. This likely
> > +	 * should be fixed in the GuC but until if / when that gets fixed we
> > +	 * need to workaround this. Convert this function to a NOP if a pending
> > +	 * enable is in flight as this indicates that a request cancellation has
> > +	 * occurred.
> >   	 */
> 
> IMO this comment sounds like we're not clear on expected behavior. Either
> the ordering is wrong, in which case we have a GuC bug and this is a
> temporary WA, or the ordering is allowed and we need to cope with it. The
> way the comment is written sounds like we're not sure.
> 

The comments were written prior to confirmation that the GuC behavior was
wrong; will reword.

> Code changes look ok.
>

Ty. I think we have to carry this until we upgrade the GuC firmware
with the proper behavior - until then, without this workaround,
canceling non-preemptable requests is 100% broken, hence why I added a
selftest. Will add a FIXME / XXX comment so we can remove this in the
future.

Matt

> Daniele
> 
> > -	clr_context_enabled(ce);
> > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +	if (likely(!context_pending_enable(ce))) {
> > +		clr_context_enabled(ce);
> > +	} else {
> > +		skip = true;
> > +	}
> > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	if (unlikely(skip))
> > +		goto out_put;
> >   	rq = intel_context_find_active_request(ce);
> >   	if (!rq) {
> > @@ -861,6 +879,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >   out_replay:
> >   	guc_reset_state(ce, head, stalled);
> >   	__unwind_incomplete_requests(ce);
> > +out_put:
> >   	intel_context_put(ce);
> >   }
> > @@ -1605,6 +1624,13 @@ static void guc_context_cancel_request(struct intel_context *ce,
> >   			guc_reset_state(ce, intel_ring_wrap(ce->ring, rq->head),
> >   					true);
> >   		}
> > +
> > +		/*
> > +		 * XXX: Racey if context is reset, see comment in
> > +		 * __guc_reset_context().
> > +		 */
> > +		flush_work(&ce_to_guc(ce)->ct.requests.worker);
> > +
> >   		guc_context_unblock(ce);
> >   	}
> >   }
> > @@ -2719,7 +2745,12 @@ static void guc_handle_context_reset(struct intel_guc *guc,
> >   {
> >   	trace_intel_context_reset(ce);
> > -	if (likely(!intel_context_is_banned(ce))) {
> > +	/*
> > +	 * XXX: Racey if request cancellation has occurred, see comment in
> > +	 * __guc_reset_context().
> > +	 */
> > +	if (likely(!intel_context_is_banned(ce) &&
> > +		   !context_blocked(ce))) {
> >   		capture_error_state(guc, ce);
> >   		guc_context_replay(ce);
> >   	}
> 


* Re: [Intel-gfx] [PATCH 20/27] drm/i915/guc: Rework and simplify locking
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 20/27] drm/i915/guc: Rework and simplify locking Matthew Brost
@ 2021-08-25 16:52   ` Daniele Ceraolo Spurio
  2021-08-25 19:22     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25 16:52 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Rework and simplify the locking with GuC submission. Drop
> sched_state_no_lock and move all fields under the guc_state.sched_state
> and protect all these fields with guc_state.lock. This requires
> changing the locking hierarchy from guc_state.lock -> sched_engine.lock
> to sched_engine.lock -> guc_state.lock.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   5 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 186 ++++++++----------
>   drivers/gpu/drm/i915/i915_trace.h             |   6 +-
>   3 files changed, 89 insertions(+), 108 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index c06171ee8792..d5d643b04d54 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -161,7 +161,7 @@ struct intel_context {
>   		 * sched_state: scheduling state of this context using GuC
>   		 * submission
>   		 */
> -		u16 sched_state;
> +		u32 sched_state;
>   		/*
>   		 * fences: maintains of list of requests that have a submit
>   		 * fence related to GuC submission
> @@ -178,9 +178,6 @@ struct intel_context {
>   		struct list_head requests;
>   	} guc_active;
>   
> -	/* GuC scheduling state flags that do not require a lock. */
> -	atomic_t guc_sched_state_no_lock;
> -
>   	/* GuC LRC descriptor ID */
>   	u16 guc_id;
>   
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 053f4485d6e9..509b298e7cf3 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -72,86 +72,23 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
>   
>   #define GUC_REQUEST_SIZE 64 /* bytes */
>   
> -/*
> - * Below is a set of functions which control the GuC scheduling state which do
> - * not require a lock as all state transitions are mutually exclusive. i.e. It
> - * is not possible for the context pinning code and submission, for the same
> - * context, to be executing simultaneously. We still need an atomic as it is
> - * possible for some of the bits to changing at the same time though.
> - */
> -#define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
> -#define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
> -#define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
> -static inline bool context_enabled(struct intel_context *ce)
> -{
> -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> -		SCHED_STATE_NO_LOCK_ENABLED);
> -}
> -
> -static inline void set_context_enabled(struct intel_context *ce)
> -{
> -	atomic_or(SCHED_STATE_NO_LOCK_ENABLED, &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline void clr_context_enabled(struct intel_context *ce)
> -{
> -	atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
> -		   &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline bool context_pending_enable(struct intel_context *ce)
> -{
> -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> -		SCHED_STATE_NO_LOCK_PENDING_ENABLE);
> -}
> -
> -static inline void set_context_pending_enable(struct intel_context *ce)
> -{
> -	atomic_or(SCHED_STATE_NO_LOCK_PENDING_ENABLE,
> -		  &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline void clr_context_pending_enable(struct intel_context *ce)
> -{
> -	atomic_and((u32)~SCHED_STATE_NO_LOCK_PENDING_ENABLE,
> -		   &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline bool context_registered(struct intel_context *ce)
> -{
> -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> -		SCHED_STATE_NO_LOCK_REGISTERED);
> -}
> -
> -static inline void set_context_registered(struct intel_context *ce)
> -{
> -	atomic_or(SCHED_STATE_NO_LOCK_REGISTERED,
> -		  &ce->guc_sched_state_no_lock);
> -}
> -
> -static inline void clr_context_registered(struct intel_context *ce)
> -{
> -	atomic_and((u32)~SCHED_STATE_NO_LOCK_REGISTERED,
> -		   &ce->guc_sched_state_no_lock);
> -}
> -
>   /*
>    * Below is a set of functions which control the GuC scheduling state which
> - * require a lock, aside from the special case where the functions are called
> - * from guc_lrc_desc_pin(). In that case it isn't possible for any other code
> - * path to be executing on the context.
> + * require a lock.
>    */
>   #define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
>   #define SCHED_STATE_DESTROYED				BIT(1)
>   #define SCHED_STATE_PENDING_DISABLE			BIT(2)
>   #define SCHED_STATE_BANNED				BIT(3)
> -#define SCHED_STATE_BLOCKED_SHIFT			4
> +#define SCHED_STATE_ENABLED				BIT(4)
> +#define SCHED_STATE_PENDING_ENABLE			BIT(5)
> +#define SCHED_STATE_REGISTERED				BIT(6)
> +#define SCHED_STATE_BLOCKED_SHIFT			7
>   #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
>   #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
>   static inline void init_sched_state(struct intel_context *ce)
>   {
>   	lockdep_assert_held(&ce->guc_state.lock);
> -	atomic_set(&ce->guc_sched_state_no_lock, 0);
>   	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
>   }
>   
> @@ -162,9 +99,8 @@ static bool sched_state_is_init(struct intel_context *ce)
>   	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
>   	 * suspend.
>   	 */
> -	return !(atomic_read(&ce->guc_sched_state_no_lock) &
> -		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
> -		!(ce->guc_state.sched_state &= ~SCHED_STATE_BLOCKED_MASK);
> +	return !(ce->guc_state.sched_state &=
> +		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
>   }
>   
>   static inline bool
> @@ -237,6 +173,57 @@ static inline void clr_context_banned(struct intel_context *ce)
>   	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
>   }
>   
> +static inline bool context_enabled(struct intel_context *ce)
> +{
> +	return ce->guc_state.sched_state & SCHED_STATE_ENABLED;
> +}
> +
> +static inline void set_context_enabled(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state |= SCHED_STATE_ENABLED;
> +}
> +
> +static inline void clr_context_enabled(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
> +}
> +
> +static inline bool context_pending_enable(struct intel_context *ce)
> +{
> +	return ce->guc_state.sched_state & SCHED_STATE_PENDING_ENABLE;
> +}
> +
> +static inline void set_context_pending_enable(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state |= SCHED_STATE_PENDING_ENABLE;
> +}
> +
> +static inline void clr_context_pending_enable(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_ENABLE;
> +}
> +
> +static inline bool context_registered(struct intel_context *ce)
> +{
> +	return ce->guc_state.sched_state & SCHED_STATE_REGISTERED;
> +}
> +
> +static inline void set_context_registered(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state |= SCHED_STATE_REGISTERED;
> +}
> +
> +static inline void clr_context_registered(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_REGISTERED;
> +}
> +
>   static inline u32 context_blocked(struct intel_context *ce)
>   {
>   	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
> @@ -245,7 +232,6 @@ static inline u32 context_blocked(struct intel_context *ce)
>   
>   static inline void incr_context_blocked(struct intel_context *ce)
>   {
> -	lockdep_assert_held(&ce->engine->sched_engine->lock);
>   	lockdep_assert_held(&ce->guc_state.lock);
>   
>   	ce->guc_state.sched_state += SCHED_STATE_BLOCKED;
> @@ -255,7 +241,6 @@ static inline void incr_context_blocked(struct intel_context *ce)
>   
>   static inline void decr_context_blocked(struct intel_context *ce)
>   {
> -	lockdep_assert_held(&ce->engine->sched_engine->lock);
>   	lockdep_assert_held(&ce->guc_state.lock);
>   
>   	GEM_BUG_ON(!context_blocked(ce));	/* Underflow check */
> @@ -450,6 +435,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   	u32 g2h_len_dw = 0;
>   	bool enabled;
>   
> +	lockdep_assert_held(&rq->engine->sched_engine->lock);
> +
>   	/*
>   	 * Corner case where requests were sitting in the priority list or a
>   	 * request resubmitted after the context was banned.
> @@ -457,7 +444,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   	if (unlikely(intel_context_is_banned(ce))) {
>   		i915_request_put(i915_request_mark_eio(rq));
>   		intel_engine_signal_breadcrumbs(ce->engine);
> -		goto out;
> +		return 0;
>   	}
>   
>   	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
> @@ -470,9 +457,11 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
>   		err = guc_lrc_desc_pin(ce, false);
>   		if (unlikely(err))
> -			goto out;
> +			return err;
>   	}
>   
> +	spin_lock(&ce->guc_state.lock);
> +
>   	/*
>   	 * The request / context will be run on the hardware when scheduling
>   	 * gets enabled in the unblock.
> @@ -507,6 +496,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   		trace_i915_request_guc_submit(rq);
>   
>   out:
> +	spin_unlock(&ce->guc_state.lock);
>   	return err;
>   }
>   
> @@ -727,8 +717,6 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   	spin_lock_irq(&guc_to_gt(guc)->irq_lock);
>   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
>   
> -	guc_flush_submissions(guc);
> -
>   	flush_work(&guc->ct.requests.worker);
>   
>   	scrub_guc_desc_for_outstanding_g2h(guc);
> @@ -1133,7 +1121,11 @@ static int steal_guc_id(struct intel_guc *guc)
>   
>   		list_del_init(&ce->guc_id_link);
>   		guc_id = ce->guc_id;
> +
> +		spin_lock(&ce->guc_state.lock);
>   		clr_context_registered(ce);
> +		spin_unlock(&ce->guc_state.lock);
> +
>   		set_context_guc_id_invalid(ce);
>   		return guc_id;
>   	} else {
> @@ -1169,6 +1161,8 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   try_again:
>   	spin_lock_irqsave(&guc->contexts_lock, flags);
>   
> +	might_lock(&ce->guc_state.lock);
> +
>   	if (context_guc_id_invalid(ce)) {
>   		ret = assign_guc_id(guc, &ce->guc_id);
>   		if (ret)
> @@ -1248,8 +1242,13 @@ static int register_context(struct intel_context *ce, bool loop)
>   	trace_intel_context_register(ce);
>   
>   	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
> -	if (likely(!ret))
> +	if (likely(!ret)) {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
>   		set_context_registered(ce);
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	}
>   
>   	return ret;
>   }
> @@ -1525,7 +1524,6 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
>   static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
> -	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
>   	unsigned long flags;
>   	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
>   	intel_wakeref_t wakeref;
> @@ -1534,13 +1532,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>   
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   
> -	/*
> -	 * Sync with submission path, increment before below changes to context
> -	 * state.
> -	 */
> -	spin_lock(&sched_engine->lock);
>   	incr_context_blocked(ce);
> -	spin_unlock(&sched_engine->lock);
>   
>   	enabled = context_enabled(ce);
>   	if (unlikely(!enabled || submission_disabled(guc))) {
> @@ -1569,7 +1561,6 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>   static void guc_context_unblock(struct intel_context *ce)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
> -	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
>   	unsigned long flags;
>   	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
>   	intel_wakeref_t wakeref;
> @@ -1594,13 +1585,7 @@ static void guc_context_unblock(struct intel_context *ce)
>   		intel_context_get(ce);
>   	}
>   
> -	/*
> -	 * Sync with submission path, decrement after above changes to context
> -	 * state.
> -	 */
> -	spin_lock(&sched_engine->lock);
>   	decr_context_blocked(ce);
> -	spin_unlock(&sched_engine->lock);
>   
>   	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   
> @@ -1710,7 +1695,9 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   
>   	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
>   	    !lrc_desc_registered(guc, ce->guc_id)) {
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);

We do take this lock a few lines below this. Would it be worth just
moving that up and doing everything under the lock?

Anyway, all calls to the updated set/clr functions are now correctly
locked, so:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>   		clr_context_enabled(ce);
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   		goto unpin;
>   	}
>   
> @@ -1760,7 +1747,6 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
>   	GEM_BUG_ON(context_enabled(ce));
>   
> -	clr_context_registered(ce);
>   	deregister_context(ce, ce->guc_id, true);
>   }
>   
> @@ -1833,8 +1819,10 @@ static void guc_context_destroy(struct kref *kref)
>   	/* Seal race with Reset */
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   	disabled = submission_disabled(guc);
> -	if (likely(!disabled))
> +	if (likely(!disabled)) {
>   		set_context_destroyed(ce);
> +		clr_context_registered(ce);
> +	}
>   	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   	if (unlikely(disabled)) {
>   		release_guc_id(guc, ce);
> @@ -2697,8 +2685,7 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
>   		     (!context_pending_enable(ce) &&
>   		     !context_pending_disable(ce)))) {
>   		drm_err(&guc_to_gt(guc)->i915->drm,
> -			"Bad context sched_state 0x%x, 0x%x, desc_idx %u",
> -			atomic_read(&ce->guc_sched_state_no_lock),
> +			"Bad context sched_state 0x%x, desc_idx %u",
>   			ce->guc_state.sched_state, desc_idx);
>   		return -EPROTO;
>   	}
> @@ -2713,7 +2700,9 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
>   		}
>   #endif
>   
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
>   		clr_context_pending_enable(ce);
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   	} else if (context_pending_disable(ce)) {
>   		bool banned;
>   
> @@ -2987,9 +2976,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   			   atomic_read(&ce->pin_count));
>   		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
>   			   atomic_read(&ce->guc_id_ref));
> -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> -			   ce->guc_state.sched_state,
> -			   atomic_read(&ce->guc_sched_state_no_lock));
> +		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> +			   ce->guc_state.sched_state);
>   
>   		guc_log_context_priority(p, ce);
>   	}
> diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> index 806ad688274b..0a77eb2944b5 100644
> --- a/drivers/gpu/drm/i915/i915_trace.h
> +++ b/drivers/gpu/drm/i915/i915_trace.h
> @@ -903,7 +903,6 @@ DECLARE_EVENT_CLASS(intel_context,
>   			     __field(u32, guc_id)
>   			     __field(int, pin_count)
>   			     __field(u32, sched_state)
> -			     __field(u32, guc_sched_state_no_lock)
>   			     __field(u8, guc_prio)
>   			     ),
>   
> @@ -911,15 +910,12 @@ DECLARE_EVENT_CLASS(intel_context,
>   			   __entry->guc_id = ce->guc_id;
>   			   __entry->pin_count = atomic_read(&ce->pin_count);
>   			   __entry->sched_state = ce->guc_state.sched_state;
> -			   __entry->guc_sched_state_no_lock =
> -			   atomic_read(&ce->guc_sched_state_no_lock);
>   			   __entry->guc_prio = ce->guc_prio;
>   			   ),
>   
> -		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x,0x%x, guc_prio=%u",
> +		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
>   			      __entry->guc_id, __entry->pin_count,
>   			      __entry->sched_state,
> -			      __entry->guc_sched_state_no_lock,
>   			      __entry->guc_prio)
>   );
>   



* Re: [Intel-gfx] [PATCH 20/27] drm/i915/guc: Rework and simplify locking
  2021-08-25 16:52   ` Daniele Ceraolo Spurio
@ 2021-08-25 19:22     ` Matthew Brost
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-25 19:22 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Wed, Aug 25, 2021 at 09:52:06AM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > Rework and simplify the locking with GuC submission. Drop
> > sched_state_no_lock and move all fields under the guc_state.sched_state
> > and protect all these fields with guc_state.lock. This requires
> > changing the locking hierarchy from guc_state.lock -> sched_engine.lock
> > to sched_engine.lock -> guc_state.lock.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |   5 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 186 ++++++++----------
> >   drivers/gpu/drm/i915/i915_trace.h             |   6 +-
> >   3 files changed, 89 insertions(+), 108 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index c06171ee8792..d5d643b04d54 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -161,7 +161,7 @@ struct intel_context {
> >   		 * sched_state: scheduling state of this context using GuC
> >   		 * submission
> >   		 */
> > -		u16 sched_state;
> > +		u32 sched_state;
> >   		/*
> >   		 * fences: maintains of list of requests that have a submit
> >   		 * fence related to GuC submission
> > @@ -178,9 +178,6 @@ struct intel_context {
> >   		struct list_head requests;
> >   	} guc_active;
> > -	/* GuC scheduling state flags that do not require a lock. */
> > -	atomic_t guc_sched_state_no_lock;
> > -
> >   	/* GuC LRC descriptor ID */
> >   	u16 guc_id;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 053f4485d6e9..509b298e7cf3 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -72,86 +72,23 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> >   #define GUC_REQUEST_SIZE 64 /* bytes */
> > -/*
> > - * Below is a set of functions which control the GuC scheduling state which do
> > - * not require a lock as all state transitions are mutually exclusive. i.e. It
> > - * is not possible for the context pinning code and submission, for the same
> > - * context, to be executing simultaneously. We still need an atomic as it is
> > - * possible for some of the bits to changing at the same time though.
> > - */
> > -#define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
> > -#define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
> > -#define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
> > -static inline bool context_enabled(struct intel_context *ce)
> > -{
> > -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> > -		SCHED_STATE_NO_LOCK_ENABLED);
> > -}
> > -
> > -static inline void set_context_enabled(struct intel_context *ce)
> > -{
> > -	atomic_or(SCHED_STATE_NO_LOCK_ENABLED, &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline void clr_context_enabled(struct intel_context *ce)
> > -{
> > -	atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
> > -		   &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline bool context_pending_enable(struct intel_context *ce)
> > -{
> > -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> > -		SCHED_STATE_NO_LOCK_PENDING_ENABLE);
> > -}
> > -
> > -static inline void set_context_pending_enable(struct intel_context *ce)
> > -{
> > -	atomic_or(SCHED_STATE_NO_LOCK_PENDING_ENABLE,
> > -		  &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline void clr_context_pending_enable(struct intel_context *ce)
> > -{
> > -	atomic_and((u32)~SCHED_STATE_NO_LOCK_PENDING_ENABLE,
> > -		   &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline bool context_registered(struct intel_context *ce)
> > -{
> > -	return (atomic_read(&ce->guc_sched_state_no_lock) &
> > -		SCHED_STATE_NO_LOCK_REGISTERED);
> > -}
> > -
> > -static inline void set_context_registered(struct intel_context *ce)
> > -{
> > -	atomic_or(SCHED_STATE_NO_LOCK_REGISTERED,
> > -		  &ce->guc_sched_state_no_lock);
> > -}
> > -
> > -static inline void clr_context_registered(struct intel_context *ce)
> > -{
> > -	atomic_and((u32)~SCHED_STATE_NO_LOCK_REGISTERED,
> > -		   &ce->guc_sched_state_no_lock);
> > -}
> > -
> >   /*
> >    * Below is a set of functions which control the GuC scheduling state which
> > - * require a lock, aside from the special case where the functions are called
> > - * from guc_lrc_desc_pin(). In that case it isn't possible for any other code
> > - * path to be executing on the context.
> > + * require a lock.
> >    */
> >   #define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
> >   #define SCHED_STATE_DESTROYED				BIT(1)
> >   #define SCHED_STATE_PENDING_DISABLE			BIT(2)
> >   #define SCHED_STATE_BANNED				BIT(3)
> > -#define SCHED_STATE_BLOCKED_SHIFT			4
> > +#define SCHED_STATE_ENABLED				BIT(4)
> > +#define SCHED_STATE_PENDING_ENABLE			BIT(5)
> > +#define SCHED_STATE_REGISTERED				BIT(6)
> > +#define SCHED_STATE_BLOCKED_SHIFT			7
> >   #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
> >   #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
> >   static inline void init_sched_state(struct intel_context *ce)
> >   {
> >   	lockdep_assert_held(&ce->guc_state.lock);
> > -	atomic_set(&ce->guc_sched_state_no_lock, 0);
> >   	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
> >   }
> > @@ -162,9 +99,8 @@ static bool sched_state_is_init(struct intel_context *ce)
> >   	 * XXX: Kernel contexts can have SCHED_STATE_NO_LOCK_REGISTERED after
> >   	 * suspend.
> >   	 */
> > -	return !(atomic_read(&ce->guc_sched_state_no_lock) &
> > -		 ~SCHED_STATE_NO_LOCK_REGISTERED) &&
> > -		!(ce->guc_state.sched_state &= ~SCHED_STATE_BLOCKED_MASK);
> > +	return !(ce->guc_state.sched_state &=
> > +		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
> >   }
> >   static inline bool
> > @@ -237,6 +173,57 @@ static inline void clr_context_banned(struct intel_context *ce)
> >   	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
> >   }
> > +static inline bool context_enabled(struct intel_context *ce)
> > +{
> > +	return ce->guc_state.sched_state & SCHED_STATE_ENABLED;
> > +}
> > +
> > +static inline void set_context_enabled(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state |= SCHED_STATE_ENABLED;
> > +}
> > +
> > +static inline void clr_context_enabled(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
> > +}
> > +
> > +static inline bool context_pending_enable(struct intel_context *ce)
> > +{
> > +	return ce->guc_state.sched_state & SCHED_STATE_PENDING_ENABLE;
> > +}
> > +
> > +static inline void set_context_pending_enable(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state |= SCHED_STATE_PENDING_ENABLE;
> > +}
> > +
> > +static inline void clr_context_pending_enable(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_ENABLE;
> > +}
> > +
> > +static inline bool context_registered(struct intel_context *ce)
> > +{
> > +	return ce->guc_state.sched_state & SCHED_STATE_REGISTERED;
> > +}
> > +
> > +static inline void set_context_registered(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state |= SCHED_STATE_REGISTERED;
> > +}
> > +
> > +static inline void clr_context_registered(struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&ce->guc_state.lock);
> > +	ce->guc_state.sched_state &= ~SCHED_STATE_REGISTERED;
> > +}
> > +
> >   static inline u32 context_blocked(struct intel_context *ce)
> >   {
> >   	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
> > @@ -245,7 +232,6 @@ static inline u32 context_blocked(struct intel_context *ce)
> >   static inline void incr_context_blocked(struct intel_context *ce)
> >   {
> > -	lockdep_assert_held(&ce->engine->sched_engine->lock);
> >   	lockdep_assert_held(&ce->guc_state.lock);
> >   	ce->guc_state.sched_state += SCHED_STATE_BLOCKED;
> > @@ -255,7 +241,6 @@ static inline void incr_context_blocked(struct intel_context *ce)
> >   static inline void decr_context_blocked(struct intel_context *ce)
> >   {
> > -	lockdep_assert_held(&ce->engine->sched_engine->lock);
> >   	lockdep_assert_held(&ce->guc_state.lock);
> >   	GEM_BUG_ON(!context_blocked(ce));	/* Underflow check */
> > @@ -450,6 +435,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   	u32 g2h_len_dw = 0;
> >   	bool enabled;
> > +	lockdep_assert_held(&rq->engine->sched_engine->lock);
> > +
> >   	/*
> >   	 * Corner case where requests were sitting in the priority list or a
> >   	 * request resubmitted after the context was banned.
> > @@ -457,7 +444,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   	if (unlikely(intel_context_is_banned(ce))) {
> >   		i915_request_put(i915_request_mark_eio(rq));
> >   		intel_engine_signal_breadcrumbs(ce->engine);
> > -		goto out;
> > +		return 0;
> >   	}
> >   	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
> > @@ -470,9 +457,11 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
> >   		err = guc_lrc_desc_pin(ce, false);
> >   		if (unlikely(err))
> > -			goto out;
> > +			return err;
> >   	}
> > +	spin_lock(&ce->guc_state.lock);
> > +
> >   	/*
> >   	 * The request / context will be run on the hardware when scheduling
> >   	 * gets enabled in the unblock.
> > @@ -507,6 +496,7 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   		trace_i915_request_guc_submit(rq);
> >   out:
> > +	spin_unlock(&ce->guc_state.lock);
> >   	return err;
> >   }
> > @@ -727,8 +717,6 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   	spin_lock_irq(&guc_to_gt(guc)->irq_lock);
> >   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
> > -	guc_flush_submissions(guc);
> > -
> >   	flush_work(&guc->ct.requests.worker);
> >   	scrub_guc_desc_for_outstanding_g2h(guc);
> > @@ -1133,7 +1121,11 @@ static int steal_guc_id(struct intel_guc *guc)
> >   		list_del_init(&ce->guc_id_link);
> >   		guc_id = ce->guc_id;
> > +
> > +		spin_lock(&ce->guc_state.lock);
> >   		clr_context_registered(ce);
> > +		spin_unlock(&ce->guc_state.lock);
> > +
> >   		set_context_guc_id_invalid(ce);
> >   		return guc_id;
> >   	} else {
> > @@ -1169,6 +1161,8 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   try_again:
> >   	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	might_lock(&ce->guc_state.lock);
> > +
> >   	if (context_guc_id_invalid(ce)) {
> >   		ret = assign_guc_id(guc, &ce->guc_id);
> >   		if (ret)
> > @@ -1248,8 +1242,13 @@ static int register_context(struct intel_context *ce, bool loop)
> >   	trace_intel_context_register(ce);
> >   	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
> > -	if (likely(!ret))
> > +	if (likely(!ret)) {
> > +		unsigned long flags;
> > +
> > +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> >   		set_context_registered(ce);
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	}
> >   	return ret;
> >   }
> > @@ -1525,7 +1524,6 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
> >   static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > -	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
> >   	unsigned long flags;
> >   	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
> >   	intel_wakeref_t wakeref;
> > @@ -1534,13 +1532,7 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > -	/*
> > -	 * Sync with submission path, increment before below changes to context
> > -	 * state.
> > -	 */
> > -	spin_lock(&sched_engine->lock);
> >   	incr_context_blocked(ce);
> > -	spin_unlock(&sched_engine->lock);
> >   	enabled = context_enabled(ce);
> >   	if (unlikely(!enabled || submission_disabled(guc))) {
> > @@ -1569,7 +1561,6 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> >   static void guc_context_unblock(struct intel_context *ce)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > -	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
> >   	unsigned long flags;
> >   	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
> >   	intel_wakeref_t wakeref;
> > @@ -1594,13 +1585,7 @@ static void guc_context_unblock(struct intel_context *ce)
> >   		intel_context_get(ce);
> >   	}
> > -	/*
> > -	 * Sync with submission path, decrement after above changes to context
> > -	 * state.
> > -	 */
> > -	spin_lock(&sched_engine->lock);
> >   	decr_context_blocked(ce);
> > -	spin_unlock(&sched_engine->lock);
> >   	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > @@ -1710,7 +1695,9 @@ static void guc_context_sched_disable(struct intel_context *ce)
> >   	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
> >   	    !lrc_desc_registered(guc, ce->guc_id)) {
> > +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> 
> We do take this lock a few lines below this. Would it be worth just moving
> that up and doing everything under the lock?
> 

Good catch. Yes, we should move everything under the lock, which actually
makes all of this code quite a bit simpler too. Will fix.
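
Roughly, the consolidated early-exit would look something like the sketch
below (illustrative only, based on the hunks in this thread, not the
actual respin):

	static void guc_context_sched_disable(struct intel_context *ce)
	{
		struct intel_guc *guc = ce_to_guc(ce);
		unsigned long flags;

		spin_lock_irqsave(&ce->guc_state.lock, flags);

		/*
		 * Single critical section: both the early-exit path and the
		 * pending-disable path mutate sched_state, so take the lock
		 * once up front instead of lock/unlock/lock.
		 */
		if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
		    !lrc_desc_registered(guc, ce->guc_id)) {
			clr_context_enabled(ce);
			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
			goto unpin;
		}

		/* ... rest of the disable flow, still under the lock ... */
	}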

Matt

> Anyway, all calls to the updated set/clr functions are now correctly
> locked, so:
> 
> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> 
> Daniele
> 
> >   		clr_context_enabled(ce);
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >   		goto unpin;
> >   	}
> > @@ -1760,7 +1747,6 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> >   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
> >   	GEM_BUG_ON(context_enabled(ce));
> > -	clr_context_registered(ce);
> >   	deregister_context(ce, ce->guc_id, true);
> >   }
> > @@ -1833,8 +1819,10 @@ static void guc_context_destroy(struct kref *kref)
> >   	/* Seal race with Reset */
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >   	disabled = submission_disabled(guc);
> > -	if (likely(!disabled))
> > +	if (likely(!disabled)) {
> >   		set_context_destroyed(ce);
> > +		clr_context_registered(ce);
> > +	}
> >   	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >   	if (unlikely(disabled)) {
> >   		release_guc_id(guc, ce);
> > @@ -2697,8 +2685,7 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
> >   		     (!context_pending_enable(ce) &&
> >   		     !context_pending_disable(ce)))) {
> >   		drm_err(&guc_to_gt(guc)->i915->drm,
> > -			"Bad context sched_state 0x%x, 0x%x, desc_idx %u",
> > -			atomic_read(&ce->guc_sched_state_no_lock),
> > +			"Bad context sched_state 0x%x, desc_idx %u",
> >   			ce->guc_state.sched_state, desc_idx);
> >   		return -EPROTO;
> >   	}
> > @@ -2713,7 +2700,9 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
> >   		}
> >   #endif
> > +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> >   		clr_context_pending_enable(ce);
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >   	} else if (context_pending_disable(ce)) {
> >   		bool banned;
> > @@ -2987,9 +2976,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   			   atomic_read(&ce->pin_count));
> >   		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> >   			   atomic_read(&ce->guc_id_ref));
> > -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > -			   ce->guc_state.sched_state,
> > -			   atomic_read(&ce->guc_sched_state_no_lock));
> > +		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> > +			   ce->guc_state.sched_state);
> >   		guc_log_context_priority(p, ce);
> >   	}
> > diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> > index 806ad688274b..0a77eb2944b5 100644
> > --- a/drivers/gpu/drm/i915/i915_trace.h
> > +++ b/drivers/gpu/drm/i915/i915_trace.h
> > @@ -903,7 +903,6 @@ DECLARE_EVENT_CLASS(intel_context,
> >   			     __field(u32, guc_id)
> >   			     __field(int, pin_count)
> >   			     __field(u32, sched_state)
> > -			     __field(u32, guc_sched_state_no_lock)
> >   			     __field(u8, guc_prio)
> >   			     ),
> > @@ -911,15 +910,12 @@ DECLARE_EVENT_CLASS(intel_context,
> >   			   __entry->guc_id = ce->guc_id;
> >   			   __entry->pin_count = atomic_read(&ce->pin_count);
> >   			   __entry->sched_state = ce->guc_state.sched_state;
> > -			   __entry->guc_sched_state_no_lock =
> > -			   atomic_read(&ce->guc_sched_state_no_lock);
> >   			   __entry->guc_prio = ce->guc_prio;
> >   			   ),
> > -		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x,0x%x, guc_prio=%u",
> > +		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
> >   			      __entry->guc_id, __entry->pin_count,
> >   			      __entry->sched_state,
> > -			      __entry->guc_sched_state_no_lock,
> >   			      __entry->guc_prio)
> >   );
> 


* Re: [Intel-gfx] [PATCH 11/27] drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  2021-08-25  0:07   ` Daniele Ceraolo Spurio
@ 2021-08-25 20:03     ` Matthew Brost
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-25 20:03 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Tue, Aug 24, 2021 at 05:07:13PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > GuC submission has exposed an existing memory corruption in
> > live_lrc_isolation. We believe that some writes to the watchdog offsets
> > in the LRC (0x178 & 0x17c) can result in trashing of portions of the
> > address space. With GuC submission there are additional objects which
> > can move the context redzone into the space that is trashed. To
> > work around this, avoid poisoning the watchdog.
> 
> This is kind of a worrying explanation, as it implies an HW issue. AFAICS we
> no longer increase the context size with GuC submission, so the redzone
> should be in the same place relative to the base address of the context;
> although it is true that we have more objects in memory due to supporting
> the GuC, hitting the redzone consistently feels too much like a coincidence.
> When we write the watchdog regs, there is a risk we're triggering a
> watchdog interrupt, which the GuC will then handle; on a media reset, the
> GuC overwrites the context with the golden context in the ADS. Are we sure
> that's not what is causing this problem?
> Looking in the ADS we set the context memcpy size to:
> 
> real_size = intel_engine_context_size(gt, engine_class);
> 
> but then we only initialize real_size - SKIP_SIZE(gt->i915), which IMO could
> be the real cause of the bug as the GuC memcpy starts at SKIP_SIZE().
> 

Good analysis, Daniele. This definitely seems to be the issue, as the
below patch appears to have fixed the failing selftest:

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 9f5f43a16182..c19ce71c9de9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -358,6 +358,11 @@ static int guc_prep_golden_context(struct intel_guc *guc,
        u8 engine_class, guc_class;
        struct guc_gt_system_info *info, local_info;

+       /* Skip execlist and PPGTT registers + HWSP */
+       const u32 lr_hw_context_size = 80 * sizeof(u32);
+       const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE +
+               lr_hw_context_size;
+
        /*
         * Reserve the memory for the golden contexts and point GuC at it but
         * leave it empty for now. The context data will be filled in later
@@ -396,7 +401,7 @@ static int guc_prep_golden_context(struct intel_guc *guc,
                if (!blob)
                        continue;

-               blob->ads.eng_state_size[guc_class] = real_size;
+               blob->ads.eng_state_size[guc_class] = real_size - skip_size;
                blob->ads.golden_context_lrca[guc_class] = addr_ggtt;
                addr_ggtt += alloc_size;
        }
@@ -476,7 +481,8 @@ static void guc_init_golden_context(struct intel_guc *guc)
                        continue;
                }

-               GEM_BUG_ON(blob->ads.eng_state_size[guc_class] != real_size);
+               GEM_BUG_ON(blob->ads.eng_state_size[guc_class] !=
+                          real_size - skip_size);
                GEM_BUG_ON(blob->ads.golden_context_lrca[guc_class] != addr_ggtt);
                addr_ggtt += alloc_size;

This being said, IMO this is actually a bug in the GuC firmware, as it
is basically doing:

memcpy(some_guc_dest, blob->ads.golden_context_lrca +
       guc_calculated_skip_size,
       blob->ads.eng_state_size);

IMO if the GuC is applying an internally calculated offset to
blob->ads.golden_context_lrca, it should subtract that calculated size
from blob->ads.eng_state_size.

e.g. the GuC should be doing:

memcpy(some_guc_dest, blob->ads.golden_context_lrca +
       guc_calculated_skip_size,
       blob->ads.eng_state_size - guc_calculated_skip_size);

We can bring this up with the GuC firmware team today, but in the
meantime I'll include the above patch in the respin of this series as a
workaround.
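
To make the mismatch concrete (the sizes below are made up, only the
relationship matters):

	/*
	 * real_size = intel_engine_context_size(...);     e.g. 0x11000
	 * skip_size = LRC_PPHWSP_SZ * PAGE_SIZE
	 *           + 80 * sizeof(u32);                    e.g. 0x01140
	 *
	 * i915 only initializes [skip_size, real_size) of the golden
	 * image but reports eng_state_size = real_size, so the GuC
	 * effectively does:
	 *
	 *   memcpy(dst, golden_context_lrca + skip_size, real_size);
	 *
	 * i.e. it reads skip_size bytes past the end of the initialized
	 * image, which is where the redzone (or any neighbouring object)
	 * can land. Reporting real_size - skip_size keeps the copy in
	 * bounds.
	 */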

Matt 	

> Daniele
> 
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add VLK ref in code to workaround
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
> >   1 file changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > index b0977a3b699b..cdc6ae48a1e1 100644
> > --- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > @@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
> >   	goto err_after;
> >   }
> > +static u32 safe_offset(u32 offset, u32 reg)
> > +{
> > +	/* XXX skip testing of watchdog - VLK-22772 */
> > +	if (offset == 0x178 || offset == 0x17c)
> > +		reg = 0;
> > +
> > +	return reg;
> > +}
> > +
> > +static int get_offset_mask(struct intel_engine_cs *engine)
> > +{
> > +	if (GRAPHICS_VER(engine->i915) < 12)
> > +		return 0xfff;
> > +
> > +	switch (engine->class) {
> > +	default:
> > +	case RENDER_CLASS:
> > +		return 0x07ff;
> > +	case COPY_ENGINE_CLASS:
> > +		return 0x0fff;
> > +	case VIDEO_DECODE_CLASS:
> > +	case VIDEO_ENHANCEMENT_CLASS:
> > +		return 0x3fff;
> > +	}
> > +}
> > +
> >   static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
> >   {
> >   	struct i915_vma *batch;
> > @@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
> >   		len = (len + 1) / 2;
> >   		*cs++ = MI_LOAD_REGISTER_IMM(len);
> >   		while (len--) {
> > -			*cs++ = hw[dw];
> > +			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
> > +					    hw[dw]);
> >   			*cs++ = poison;
> >   			dw += 2;
> >   		}
> 


* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Move GuC priority fields in context under guc_active
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 23/27] drm/i915/guc: Move GuC priority fields in context under guc_active Matthew Brost
@ 2021-08-25 21:51   ` Daniele Ceraolo Spurio
  2021-08-25 22:53     ` Matthew Brost
  2021-08-25 23:04     ` Matthew Brost
  0 siblings, 2 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-25 21:51 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Move GuC management fields in context under guc_active struct as this is
> where the lock that protects these fields lives. Also only set the guc_prio
> field once during context init.

Can you explain what we gain by setting that only on first pin? AFAICS 
re-setting it doesn't hurt and we would cover the case where a context 
priority gets updated while the context is idle. I know the request 
submission would eventually update the prio so there is no bug, but that 
then requires an extra H2G.
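
For reference, the init-once flow the patch seems to be aiming for would
be something like this (a sketch assembled from the hunks below; the
set_bit() is my assumption, since the posted patch never appears to
actually set CONTEXT_GUC_INIT - see also my comment further down):

	static void guc_context_init(struct intel_context *ce)
	{
		const struct i915_gem_context *ctx;
		int prio = I915_CONTEXT_DEFAULT_PRIORITY;

		rcu_read_lock();
		ctx = rcu_dereference(ce->gem_context);
		if (ctx)
			prio = ctx->sched.priority;
		rcu_read_unlock();

		ce->guc_active.prio = map_i915_prio_to_guc_prio(prio);

		/* Presumably needed so guc_request_alloc() skips this next time */
		set_bit(CONTEXT_GUC_INIT, &ce->flags);
	}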

>
> Fixes: ee242ca704d3 ("drm/i915/guc: Implement GuC priority management")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>   drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++--
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 68 +++++++++++--------
>   drivers/gpu/drm/i915/i915_trace.h             |  2 +-
>   3 files changed, 45 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 524a35a78bf4..9fb0480ccf3b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -112,6 +112,7 @@ struct intel_context {
>   #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
>   #define CONTEXT_NOPREEMPT		8
>   #define CONTEXT_LRCA_DIRTY		9
> +#define CONTEXT_GUC_INIT		10
>   
>   	struct {
>   		u64 timeout_us;
> @@ -178,6 +179,11 @@ struct intel_context {
>   		spinlock_t lock;
>   		/** requests: active requests on this context */
>   		struct list_head requests;
> +		/*
> +		 * GuC priority management
> +		 */
> +		u8 prio;
> +		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
>   	} guc_active;
>   
>   	/* GuC LRC descriptor ID */
> @@ -191,12 +197,6 @@ struct intel_context {
>   	 */
>   	struct list_head guc_id_link;
>   
> -	/*
> -	 * GuC priority management
> -	 */
> -	u8 guc_prio;
> -	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> -
>   #ifdef CONFIG_DRM_I915_SELFTEST
>   	/**
>   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3e90985b0c1b..bb90bedb1305 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1369,8 +1369,6 @@ static void guc_context_policy_init(struct intel_engine_cs *engine,
>   	desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
>   }
>   
> -static inline u8 map_i915_prio_to_guc_prio(int prio);
> -
>   static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   {
>   	struct intel_engine_cs *engine = ce->engine;
> @@ -1378,8 +1376,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	struct intel_guc *guc = &engine->gt->uc.guc;
>   	u32 desc_idx = ce->guc_id;
>   	struct guc_lrc_desc *desc;
> -	const struct i915_gem_context *ctx;
> -	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
>   	bool context_registered;
>   	intel_wakeref_t wakeref;
>   	int ret = 0;
> @@ -1396,12 +1392,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   
>   	context_registered = lrc_desc_registered(guc, desc_idx);
>   
> -	rcu_read_lock();
> -	ctx = rcu_dereference(ce->gem_context);
> -	if (ctx)
> -		prio = ctx->sched.priority;
> -	rcu_read_unlock();
> -
>   	reset_lrc_desc(guc, desc_idx);
>   	set_lrc_desc_registered(guc, desc_idx, ce);
>   
> @@ -1410,8 +1400,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	desc->engine_submit_mask = adjust_engine_mask(engine->class,
>   						      engine->mask);
>   	desc->hw_context_desc = ce->lrc.lrca;
> -	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
> -	desc->priority = ce->guc_prio;
> +	desc->priority = ce->guc_active.prio;
>   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>   	guc_context_policy_init(engine, desc);
>   
> @@ -1813,10 +1802,10 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>   
>   static void __guc_context_destroy(struct intel_context *ce)
>   {
> -	GEM_BUG_ON(ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
> -		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
> -		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
> -		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
> +	GEM_BUG_ON(ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
> +		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
> +		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
> +		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
>   	GEM_BUG_ON(ce->guc_state.number_committed_requests);
>   
>   	lrc_fini(ce);
> @@ -1926,14 +1915,17 @@ static void guc_context_set_prio(struct intel_guc *guc,
>   
>   	GEM_BUG_ON(prio < GUC_CLIENT_PRIORITY_KMD_HIGH ||
>   		   prio > GUC_CLIENT_PRIORITY_NORMAL);
> +	lockdep_assert_held(&ce->guc_active.lock);
>   
> -	if (ce->guc_prio == prio || submission_disabled(guc) ||
> -	    !context_registered(ce))
> +	if (ce->guc_active.prio == prio || submission_disabled(guc) ||
> +	    !context_registered(ce)) {
> +		ce->guc_active.prio = prio;
>   		return;
> +	}
>   
>   	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action), 0, true);
>   
> -	ce->guc_prio = prio;
> +	ce->guc_active.prio = prio;
>   	trace_intel_context_set_prio(ce);
>   }
>   
> @@ -1953,24 +1945,24 @@ static inline void add_context_inflight_prio(struct intel_context *ce,
>   					     u8 guc_prio)
>   {
>   	lockdep_assert_held(&ce->guc_active.lock);
> -	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
> +	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
>   
> -	++ce->guc_prio_count[guc_prio];
> +	++ce->guc_active.prio_count[guc_prio];
>   
>   	/* Overflow protection */
> -	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
> +	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
>   }
>   
>   static inline void sub_context_inflight_prio(struct intel_context *ce,
>   					     u8 guc_prio)
>   {
>   	lockdep_assert_held(&ce->guc_active.lock);
> -	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
> +	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
>   
>   	/* Underflow protection */
> -	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
> +	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
>   
> -	--ce->guc_prio_count[guc_prio];
> +	--ce->guc_active.prio_count[guc_prio];
>   }
>   
>   static inline void update_context_prio(struct intel_context *ce)
> @@ -1983,8 +1975,8 @@ static inline void update_context_prio(struct intel_context *ce)
>   
>   	lockdep_assert_held(&ce->guc_active.lock);
>   
> -	for (i = 0; i < ARRAY_SIZE(ce->guc_prio_count); ++i) {
> -		if (ce->guc_prio_count[i]) {
> +	for (i = 0; i < ARRAY_SIZE(ce->guc_active.prio_count); ++i) {
> +		if (ce->guc_active.prio_count[i]) {
>   			guc_context_set_prio(guc, ce, i);
>   			break;
>   		}
> @@ -2123,6 +2115,20 @@ static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
>   		!submission_disabled(ce_to_guc(ce));
>   }
>   
> +static void guc_context_init(struct intel_context *ce)
> +{
> +	const struct i915_gem_context *ctx;
> +	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
> +
> +	rcu_read_lock();
> +	ctx = rcu_dereference(ce->gem_context);
> +	if (ctx)
> +		prio = ctx->sched.priority;
> +	rcu_read_unlock();
> +
> +	ce->guc_active.prio = map_i915_prio_to_guc_prio(prio);
> +}
> +
>   static int guc_request_alloc(struct i915_request *rq)
>   {
>   	struct intel_context *ce = rq->context;
> @@ -2154,6 +2160,9 @@ static int guc_request_alloc(struct i915_request *rq)
>   
>   	rq->reserved_space -= GUC_REQUEST_SIZE;
>   
> +	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))

Where is CONTEXT_GUC_INIT set? Can't find it

Daniele

> +		guc_context_init(ce);
> +
>   	/*
>   	 * Call pin_guc_id here rather than in the pinning step as with
>   	 * dma_resv, contexts can be repeatedly pinned / unpinned thrashing the
> @@ -3031,13 +3040,12 @@ static inline void guc_log_context_priority(struct drm_printer *p,
>   {
>   	int i;
>   
> -	drm_printf(p, "\t\tPriority: %d\n",
> -		   ce->guc_prio);
> +	drm_printf(p, "\t\tPriority: %d\n", ce->guc_active.prio);
>   	drm_printf(p, "\t\tNumber Requests (lower index == higher priority)\n");
>   	for (i = GUC_CLIENT_PRIORITY_KMD_HIGH;
>   	     i < GUC_CLIENT_PRIORITY_NUM; ++i) {
>   		drm_printf(p, "\t\tNumber requests in priority band[%d]: %d\n",
> -			   i, ce->guc_prio_count[i]);
> +			   i, ce->guc_active.prio_count[i]);
>   	}
>   	drm_printf(p, "\n");
>   }
> diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> index 0a77eb2944b5..6f882e72ed11 100644
> --- a/drivers/gpu/drm/i915/i915_trace.h
> +++ b/drivers/gpu/drm/i915/i915_trace.h
> @@ -910,7 +910,7 @@ DECLARE_EVENT_CLASS(intel_context,
>   			   __entry->guc_id = ce->guc_id;
>   			   __entry->pin_count = atomic_read(&ce->pin_count);
>   			   __entry->sched_state = ce->guc_state.sched_state;
> -			   __entry->guc_prio = ce->guc_prio;
> +			   __entry->guc_prio = ce->guc_active.prio;
>   			   ),
>   
>   		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",



* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Move GuC priority fields in context under guc_active
  2021-08-25 21:51   ` Daniele Ceraolo Spurio
@ 2021-08-25 22:53     ` Matthew Brost
  2021-08-25 23:04     ` Matthew Brost
  1 sibling, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-25 22:53 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Wed, Aug 25, 2021 at 02:51:11PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > Move GuC management fields in context under guc_active struct as this is
> > where the lock that protects these fields lives. Also only set guc_prio
> > field once during context init.
> 
> Can you explain what we gain by setting that only on first pin? AFAICS
> re-setting it doesn't hurt and we would cover the case where a context
> priority gets updated while the context is idle. I know the request
> submission would eventually update the prio so there is no bug, but that
> then requires an extra H2G.
> 

Contexts really shouldn't be getting registered and deregistered often, so
there is no real need to set this field on each register. Also, the
priority really shouldn't be changing all that regularly. IMO this is the
correct place, so I moved it. Lastly, a subsequent patch will also use
guc_context_init(), so the helper makes a bit more sense.

Matt

> > 
> > Fixes: ee242ca704d3 ("drm/i915/guc: Implement GuC priority management")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++--
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 68 +++++++++++--------
> >   drivers/gpu/drm/i915/i915_trace.h             |  2 +-
> >   3 files changed, 45 insertions(+), 37 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 524a35a78bf4..9fb0480ccf3b 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -112,6 +112,7 @@ struct intel_context {
> >   #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
> >   #define CONTEXT_NOPREEMPT		8
> >   #define CONTEXT_LRCA_DIRTY		9
> > +#define CONTEXT_GUC_INIT		10
> >   	struct {
> >   		u64 timeout_us;
> > @@ -178,6 +179,11 @@ struct intel_context {
> >   		spinlock_t lock;
> >   		/** requests: active requests on this context */
> >   		struct list_head requests;
> > +		/*
> > +		 * GuC priority management
> > +		 */
> > +		u8 prio;
> > +		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
> >   	} guc_active;
> >   	/* GuC LRC descriptor ID */
> > @@ -191,12 +197,6 @@ struct intel_context {
> >   	 */
> >   	struct list_head guc_id_link;
> > -	/*
> > -	 * GuC priority management
> > -	 */
> > -	u8 guc_prio;
> > -	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> > -
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> >   	/**
> >   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 3e90985b0c1b..bb90bedb1305 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1369,8 +1369,6 @@ static void guc_context_policy_init(struct intel_engine_cs *engine,
> >   	desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
> >   }
> > -static inline u8 map_i915_prio_to_guc_prio(int prio);
> > -
> >   static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   {
> >   	struct intel_engine_cs *engine = ce->engine;
> > @@ -1378,8 +1376,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	struct intel_guc *guc = &engine->gt->uc.guc;
> >   	u32 desc_idx = ce->guc_id;
> >   	struct guc_lrc_desc *desc;
> > -	const struct i915_gem_context *ctx;
> > -	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
> >   	bool context_registered;
> >   	intel_wakeref_t wakeref;
> >   	int ret = 0;
> > @@ -1396,12 +1392,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	context_registered = lrc_desc_registered(guc, desc_idx);
> > -	rcu_read_lock();
> > -	ctx = rcu_dereference(ce->gem_context);
> > -	if (ctx)
> > -		prio = ctx->sched.priority;
> > -	rcu_read_unlock();
> > -
> >   	reset_lrc_desc(guc, desc_idx);
> >   	set_lrc_desc_registered(guc, desc_idx, ce);
> > @@ -1410,8 +1400,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> >   						      engine->mask);
> >   	desc->hw_context_desc = ce->lrc.lrca;
> > -	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
> > -	desc->priority = ce->guc_prio;
> > +	desc->priority = ce->guc_active.prio;
> >   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> >   	guc_context_policy_init(engine, desc);
> > @@ -1813,10 +1802,10 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> >   static void __guc_context_destroy(struct intel_context *ce)
> >   {
> > -	GEM_BUG_ON(ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
> > -		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
> > -		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
> > -		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
> > +	GEM_BUG_ON(ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
> > +		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
> > +		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
> > +		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
> >   	GEM_BUG_ON(ce->guc_state.number_committed_requests);
> >   	lrc_fini(ce);
> > @@ -1926,14 +1915,17 @@ static void guc_context_set_prio(struct intel_guc *guc,
> >   	GEM_BUG_ON(prio < GUC_CLIENT_PRIORITY_KMD_HIGH ||
> >   		   prio > GUC_CLIENT_PRIORITY_NORMAL);
> > +	lockdep_assert_held(&ce->guc_active.lock);
> > -	if (ce->guc_prio == prio || submission_disabled(guc) ||
> > -	    !context_registered(ce))
> > +	if (ce->guc_active.prio == prio || submission_disabled(guc) ||
> > +	    !context_registered(ce)) {
> > +		ce->guc_active.prio = prio;
> >   		return;
> > +	}
> >   	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action), 0, true);
> > -	ce->guc_prio = prio;
> > +	ce->guc_active.prio = prio;
> >   	trace_intel_context_set_prio(ce);
> >   }
> > @@ -1953,24 +1945,24 @@ static inline void add_context_inflight_prio(struct intel_context *ce,
> >   					     u8 guc_prio)
> >   {
> >   	lockdep_assert_held(&ce->guc_active.lock);
> > -	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
> > +	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
> > -	++ce->guc_prio_count[guc_prio];
> > +	++ce->guc_active.prio_count[guc_prio];
> >   	/* Overflow protection */
> > -	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
> > +	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
> >   }
> >   static inline void sub_context_inflight_prio(struct intel_context *ce,
> >   					     u8 guc_prio)
> >   {
> >   	lockdep_assert_held(&ce->guc_active.lock);
> > -	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
> > +	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
> >   	/* Underflow protection */
> > -	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
> > +	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
> > -	--ce->guc_prio_count[guc_prio];
> > +	--ce->guc_active.prio_count[guc_prio];
> >   }
> >   static inline void update_context_prio(struct intel_context *ce)
> > @@ -1983,8 +1975,8 @@ static inline void update_context_prio(struct intel_context *ce)
> >   	lockdep_assert_held(&ce->guc_active.lock);
> > -	for (i = 0; i < ARRAY_SIZE(ce->guc_prio_count); ++i) {
> > -		if (ce->guc_prio_count[i]) {
> > +	for (i = 0; i < ARRAY_SIZE(ce->guc_active.prio_count); ++i) {
> > +		if (ce->guc_active.prio_count[i]) {
> >   			guc_context_set_prio(guc, ce, i);
> >   			break;
> >   		}
> > @@ -2123,6 +2115,20 @@ static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
> >   		!submission_disabled(ce_to_guc(ce));
> >   }
> > +static void guc_context_init(struct intel_context *ce)
> > +{
> > +	const struct i915_gem_context *ctx;
> > +	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
> > +
> > +	rcu_read_lock();
> > +	ctx = rcu_dereference(ce->gem_context);
> > +	if (ctx)
> > +		prio = ctx->sched.priority;
> > +	rcu_read_unlock();
> > +
> > +	ce->guc_active.prio = map_i915_prio_to_guc_prio(prio);
> > +}
> > +
> >   static int guc_request_alloc(struct i915_request *rq)
> >   {
> >   	struct intel_context *ce = rq->context;
> > @@ -2154,6 +2160,9 @@ static int guc_request_alloc(struct i915_request *rq)
> >   	rq->reserved_space -= GUC_REQUEST_SIZE;
> > +	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))
> 
> Where is CONTEXT_GUC_INIT set? Can't find it
> 
> Daniele
> 
> > +		guc_context_init(ce);
> > +
> >   	/*
> >   	 * Call pin_guc_id here rather than in the pinning step as with
> >   	 * dma_resv, contexts can be repeatedly pinned / unpinned thrashing the
> > @@ -3031,13 +3040,12 @@ static inline void guc_log_context_priority(struct drm_printer *p,
> >   {
> >   	int i;
> > -	drm_printf(p, "\t\tPriority: %d\n",
> > -		   ce->guc_prio);
> > +	drm_printf(p, "\t\tPriority: %d\n", ce->guc_active.prio);
> >   	drm_printf(p, "\t\tNumber Requests (lower index == higher priority)\n");
> >   	for (i = GUC_CLIENT_PRIORITY_KMD_HIGH;
> >   	     i < GUC_CLIENT_PRIORITY_NUM; ++i) {
> >   		drm_printf(p, "\t\tNumber requests in priority band[%d]: %d\n",
> > -			   i, ce->guc_prio_count[i]);
> > +			   i, ce->guc_active.prio_count[i]);
> >   	}
> >   	drm_printf(p, "\n");
> >   }
> > diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> > index 0a77eb2944b5..6f882e72ed11 100644
> > --- a/drivers/gpu/drm/i915/i915_trace.h
> > +++ b/drivers/gpu/drm/i915/i915_trace.h
> > @@ -910,7 +910,7 @@ DECLARE_EVENT_CLASS(intel_context,
> >   			   __entry->guc_id = ce->guc_id;
> >   			   __entry->pin_count = atomic_read(&ce->pin_count);
> >   			   __entry->sched_state = ce->guc_state.sched_state;
> > -			   __entry->guc_prio = ce->guc_prio;
> > +			   __entry->guc_prio = ce->guc_active.prio;
> >   			   ),
> >   		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
> 


* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Move GuC priority fields in context under guc_active
  2021-08-25 21:51   ` Daniele Ceraolo Spurio
  2021-08-25 22:53     ` Matthew Brost
@ 2021-08-25 23:04     ` Matthew Brost
  1 sibling, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-25 23:04 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Wed, Aug 25, 2021 at 02:51:11PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > Move GuC management fields in context under guc_active struct as this is
> > where the lock that protects these fields lives. Also only set guc_prio
> > field once during context init.
> 
> Can you explain what we gain by setting that only on first pin? AFAICS
> re-setting it doesn't hurt and we would cover the case where a context
> priority gets updated while the context is idle. I know the request
> submission would eventually update the prio so there is no bug, but that
> then requires an extra H2G.
> 
> > 
> > Fixes: ee242ca704d3 ("drm/i915/guc: Implement GuC priority management")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++--
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 68 +++++++++++--------
> >   drivers/gpu/drm/i915/i915_trace.h             |  2 +-
> >   3 files changed, 45 insertions(+), 37 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 524a35a78bf4..9fb0480ccf3b 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -112,6 +112,7 @@ struct intel_context {
> >   #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
> >   #define CONTEXT_NOPREEMPT		8
> >   #define CONTEXT_LRCA_DIRTY		9
> > +#define CONTEXT_GUC_INIT		10
> >   	struct {
> >   		u64 timeout_us;
> > @@ -178,6 +179,11 @@ struct intel_context {
> >   		spinlock_t lock;
> >   		/** requests: active requests on this context */
> >   		struct list_head requests;
> > +		/*
> > +		 * GuC priority management
> > +		 */
> > +		u8 prio;
> > +		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
> >   	} guc_active;
> >   	/* GuC LRC descriptor ID */
> > @@ -191,12 +197,6 @@ struct intel_context {
> >   	 */
> >   	struct list_head guc_id_link;
> > -	/*
> > -	 * GuC priority management
> > -	 */
> > -	u8 guc_prio;
> > -	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> > -
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> >   	/**
> >   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 3e90985b0c1b..bb90bedb1305 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1369,8 +1369,6 @@ static void guc_context_policy_init(struct intel_engine_cs *engine,
> >   	desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
> >   }
> > -static inline u8 map_i915_prio_to_guc_prio(int prio);
> > -
> >   static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   {
> >   	struct intel_engine_cs *engine = ce->engine;
> > @@ -1378,8 +1376,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	struct intel_guc *guc = &engine->gt->uc.guc;
> >   	u32 desc_idx = ce->guc_id;
> >   	struct guc_lrc_desc *desc;
> > -	const struct i915_gem_context *ctx;
> > -	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
> >   	bool context_registered;
> >   	intel_wakeref_t wakeref;
> >   	int ret = 0;
> > @@ -1396,12 +1392,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	context_registered = lrc_desc_registered(guc, desc_idx);
> > -	rcu_read_lock();
> > -	ctx = rcu_dereference(ce->gem_context);
> > -	if (ctx)
> > -		prio = ctx->sched.priority;
> > -	rcu_read_unlock();
> > -
> >   	reset_lrc_desc(guc, desc_idx);
> >   	set_lrc_desc_registered(guc, desc_idx, ce);
> > @@ -1410,8 +1400,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> >   						      engine->mask);
> >   	desc->hw_context_desc = ce->lrc.lrca;
> > -	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
> > -	desc->priority = ce->guc_prio;
> > +	desc->priority = ce->guc_active.prio;
> >   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> >   	guc_context_policy_init(engine, desc);
> > @@ -1813,10 +1802,10 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> >   static void __guc_context_destroy(struct intel_context *ce)
> >   {
> > -	GEM_BUG_ON(ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
> > -		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
> > -		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
> > -		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
> > +	GEM_BUG_ON(ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
> > +		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
> > +		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
> > +		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
> >   	GEM_BUG_ON(ce->guc_state.number_committed_requests);
> >   	lrc_fini(ce);
> > @@ -1926,14 +1915,17 @@ static void guc_context_set_prio(struct intel_guc *guc,
> >   	GEM_BUG_ON(prio < GUC_CLIENT_PRIORITY_KMD_HIGH ||
> >   		   prio > GUC_CLIENT_PRIORITY_NORMAL);
> > +	lockdep_assert_held(&ce->guc_active.lock);
> > -	if (ce->guc_prio == prio || submission_disabled(guc) ||
> > -	    !context_registered(ce))
> > +	if (ce->guc_active.prio == prio || submission_disabled(guc) ||
> > +	    !context_registered(ce)) {
> > +		ce->guc_active.prio = prio;
> >   		return;
> > +	}
> >   	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action), 0, true);
> > -	ce->guc_prio = prio;
> > +	ce->guc_active.prio = prio;
> >   	trace_intel_context_set_prio(ce);
> >   }
> > @@ -1953,24 +1945,24 @@ static inline void add_context_inflight_prio(struct intel_context *ce,
> >   					     u8 guc_prio)
> >   {
> >   	lockdep_assert_held(&ce->guc_active.lock);
> > -	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
> > +	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
> > -	++ce->guc_prio_count[guc_prio];
> > +	++ce->guc_active.prio_count[guc_prio];
> >   	/* Overflow protection */
> > -	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
> > +	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
> >   }
> >   static inline void sub_context_inflight_prio(struct intel_context *ce,
> >   					     u8 guc_prio)
> >   {
> >   	lockdep_assert_held(&ce->guc_active.lock);
> > -	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
> > +	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
> >   	/* Underflow protection */
> > -	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
> > +	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
> > -	--ce->guc_prio_count[guc_prio];
> > +	--ce->guc_active.prio_count[guc_prio];
> >   }
> >   static inline void update_context_prio(struct intel_context *ce)
> > @@ -1983,8 +1975,8 @@ static inline void update_context_prio(struct intel_context *ce)
> >   	lockdep_assert_held(&ce->guc_active.lock);
> > -	for (i = 0; i < ARRAY_SIZE(ce->guc_prio_count); ++i) {
> > -		if (ce->guc_prio_count[i]) {
> > +	for (i = 0; i < ARRAY_SIZE(ce->guc_active.prio_count); ++i) {
> > +		if (ce->guc_active.prio_count[i]) {
> >   			guc_context_set_prio(guc, ce, i);
> >   			break;
> >   		}
> > @@ -2123,6 +2115,20 @@ static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
> >   		!submission_disabled(ce_to_guc(ce));
> >   }
> > +static void guc_context_init(struct intel_context *ce)
> > +{
> > +	const struct i915_gem_context *ctx;
> > +	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
> > +
> > +	rcu_read_lock();
> > +	ctx = rcu_dereference(ce->gem_context);
> > +	if (ctx)
> > +		prio = ctx->sched.priority;
> > +	rcu_read_unlock();
> > +
> > +	ce->guc_active.prio = map_i915_prio_to_guc_prio(prio);
> > +}
> > +
> >   static int guc_request_alloc(struct i915_request *rq)
> >   {
> >   	struct intel_context *ce = rq->context;
> > @@ -2154,6 +2160,9 @@ static int guc_request_alloc(struct i915_request *rq)
> >   	rq->reserved_space -= GUC_REQUEST_SIZE;
> > +	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))
> 
> Where is CONTEXT_GUC_INIT set? Can't find it
> 

Missed this commit. Oops, this should be set in guc_context_init.
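
For clarity, a minimal sketch of the intended fix (an assumption on my
part: the flag is simply set at the end of the init helper, as described
above; names match the patch):

	static void guc_context_init(struct intel_context *ce)
	{
		const struct i915_gem_context *ctx;
		int prio = I915_CONTEXT_DEFAULT_PRIORITY;

		rcu_read_lock();
		ctx = rcu_dereference(ce->gem_context);
		if (ctx)
			prio = ctx->sched.priority;
		rcu_read_unlock();

		ce->guc_active.prio = map_i915_prio_to_guc_prio(prio);

		/* Mark init done so guc_request_alloc() only calls this once */
		set_bit(CONTEXT_GUC_INIT, &ce->flags);
	}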

Matt

> Daniele
> 
> > +		guc_context_init(ce);
> > +
> >   	/*
> >   	 * Call pin_guc_id here rather than in the pinning step as with
> >   	 * dma_resv, contexts can be repeatedly pinned / unpinned thrashing the
> > @@ -3031,13 +3040,12 @@ static inline void guc_log_context_priority(struct drm_printer *p,
> >   {
> >   	int i;
> > -	drm_printf(p, "\t\tPriority: %d\n",
> > -		   ce->guc_prio);
> > +	drm_printf(p, "\t\tPriority: %d\n", ce->guc_active.prio);
> >   	drm_printf(p, "\t\tNumber Requests (lower index == higher priority)\n");
> >   	for (i = GUC_CLIENT_PRIORITY_KMD_HIGH;
> >   	     i < GUC_CLIENT_PRIORITY_NUM; ++i) {
> >   		drm_printf(p, "\t\tNumber requests in priority band[%d]: %d\n",
> > -			   i, ce->guc_prio_count[i]);
> > +			   i, ce->guc_active.prio_count[i]);
> >   	}
> >   	drm_printf(p, "\n");
> >   }
> > diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> > index 0a77eb2944b5..6f882e72ed11 100644
> > --- a/drivers/gpu/drm/i915/i915_trace.h
> > +++ b/drivers/gpu/drm/i915/i915_trace.h
> > @@ -910,7 +910,7 @@ DECLARE_EVENT_CLASS(intel_context,
> >   			   __entry->guc_id = ce->guc_id;
> >   			   __entry->pin_count = atomic_read(&ce->pin_count);
> >   			   __entry->sched_state = ce->guc_state.sched_state;
> > -			   __entry->guc_prio = ce->guc_prio;
> > +			   __entry->guc_prio = ce->guc_active.prio;
> >   			   ),
> >   		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
> 


* Re: [Intel-gfx] [PATCH 21/27] drm/i915/guc: Proper xarray usage for contexts_lookup
  2021-08-26  0:44   ` Daniele Ceraolo Spurio
@ 2021-08-26  0:41     ` Matthew Brost
  2021-08-26  0:48       ` Daniele Ceraolo Spurio
  0 siblings, 1 reply; 76+ messages in thread
From: Matthew Brost @ 2021-08-26  0:41 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx, dri-devel, daniel.vetter

On Wed, Aug 25, 2021 at 05:44:11PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > Lock the xarray and take ref to the context if needed.
> > 
> > v2:
> >   (Checkpatch)
> >    - Add new line after declaration
> >   (Daniel Vetter)
> >    - Correct put / get accounting in xa_for_each loops
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 103 +++++++++++++++---
> >   1 file changed, 88 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 509b298e7cf3..5f77f25322ca 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -606,8 +606,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   	unsigned long index, flags;
> >   	bool pending_disable, pending_enable, deregister, destroyed, banned;
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> >   	xa_for_each(&guc->context_lookup, index, ce) {
> > -		spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +		/*
> > +		 * Corner case where the ref count on the object is zero but a
> > +		 * deregister G2H was lost. In this case we don't touch the ref
> > +		 * count and finish the destroy of the context.
> > +		 */
> > +		bool do_put = kref_get_unless_zero(&ce->ref);
> > +
> > +		xa_unlock(&guc->context_lookup);
> > +
> > +		spin_lock(&ce->guc_state.lock);
> >   		/*
> >   		 * Once we are at this point submission_disabled() is guaranteed
> > @@ -623,7 +633,9 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   		banned = context_banned(ce);
> >   		init_sched_state(ce);
> > -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +		spin_unlock(&ce->guc_state.lock);
> > +
> > +		GEM_BUG_ON(!do_put && !destroyed);
> >   		if (pending_enable || destroyed || deregister) {
> >   			decr_outstanding_submission_g2h(guc);
> > @@ -646,13 +658,19 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   			}
> >   			intel_context_sched_disable_unpin(ce);
> >   			decr_outstanding_submission_g2h(guc);
> > -			spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +
> > +			spin_lock(&ce->guc_state.lock);
> >   			guc_blocked_fence_complete(ce);
> > -			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +			spin_unlock(&ce->guc_state.lock);
> >   			intel_context_put(ce);
> >   		}
> > +
> > +		if (do_put)
> > +			intel_context_put(ce);
> 
> is it safe to do the put outside the xa_lock, in case the refcount goes to
> zero with this? I know it is unlikely because the refcount was > 0 if do_put
> is true, but it might've gone down between us checking earlier and now.
> 

It is safe as xa_for_each indicates it is safe to destroy / delete
objects from the array while traversing it. 
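
For anyone following along, the pattern under discussion, distilled from
the patch (just the shape of the loop, with the per-context work elided):

	xa_lock_irqsave(&guc->context_lookup, flags);
	xa_for_each(&guc->context_lookup, index, ce) {
		/* Skip contexts whose ref count already hit zero */
		if (!kref_get_unless_zero(&ce->ref))
			continue;

		xa_unlock(&guc->context_lookup);

		/* ... operate on ce without holding the xarray lock ... */

		intel_context_put(ce);	/* may free ce; xa_for_each tolerates this */
		xa_lock(&guc->context_lookup);
	}
	xa_unlock_irqrestore(&guc->context_lookup, flags);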

> > +		xa_lock(&guc->context_lookup);
> >   	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >   }
> >   static inline bool
> > @@ -873,16 +891,29 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> >   {
> >   	struct intel_context *ce;
> >   	unsigned long index;
> > +	unsigned long flags;
> >   	if (unlikely(!guc_submission_initialized(guc))) {
> >   		/* Reset called during driver load? GuC not yet initialised! */
> >   		return;
> >   	}
> > -	xa_for_each(&guc->context_lookup, index, ce)
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > +	xa_for_each(&guc->context_lookup, index, ce) {
> > +		if (!kref_get_unless_zero(&ce->ref))
> > +			continue;
> > +
> > +		xa_unlock(&guc->context_lookup);
> > +
> >   		if (intel_context_is_pinned(ce))
> >   			__guc_reset_context(ce, stalled);
> > +		intel_context_put(ce);
> > +
> > +		xa_lock(&guc->context_lookup);
> > +	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > +
> >   	/* GuC is blown away, drop all references to contexts */
> >   	xa_destroy(&guc->context_lookup);
> >   }
> > @@ -957,11 +988,24 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
> >   {
> >   	struct intel_context *ce;
> >   	unsigned long index;
> > +	unsigned long flags;
> > +
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > +	xa_for_each(&guc->context_lookup, index, ce) {
> > +		if (!kref_get_unless_zero(&ce->ref))
> > +			continue;
> > +
> > +		xa_unlock(&guc->context_lookup);
> > -	xa_for_each(&guc->context_lookup, index, ce)
> >   		if (intel_context_is_pinned(ce))
> >   			guc_cancel_context_requests(ce);
> > +		intel_context_put(ce);
> > +
> > +		xa_lock(&guc->context_lookup);
> > +	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> > +
> >   	guc_cancel_sched_engine_requests(guc->sched_engine);
> >   	/* GuC is blown away, drop all references to contexts */
> > @@ -2850,21 +2894,28 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
> >   	struct intel_context *ce;
> >   	struct i915_request *rq;
> >   	unsigned long index;
> > +	unsigned long flags;
> >   	/* Reset called during driver load? GuC not yet initialised! */
> >   	if (unlikely(!guc_submission_initialized(guc)))
> >   		return;
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> >   	xa_for_each(&guc->context_lookup, index, ce) {
> > -		if (!intel_context_is_pinned(ce))
> > +		if (!kref_get_unless_zero(&ce->ref))
> >   			continue;
> > +		xa_unlock(&guc->context_lookup);
> > +
> > +		if (!intel_context_is_pinned(ce))
> > +			goto next;
> > +
> >   		if (intel_engine_is_virtual(ce->engine)) {
> >   			if (!(ce->engine->mask & engine->mask))
> > -				continue;
> > +				goto next;
> >   		} else {
> >   			if (ce->engine != engine)
> > -				continue;
> > +				goto next;
> >   		}
> >   		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
> > @@ -2874,9 +2925,17 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
> >   			intel_engine_set_hung_context(engine, ce);
> >   			/* Can only cope with one hang at a time... */
> > -			return;
> > +			intel_context_put(ce);
> > +			xa_lock(&guc->context_lookup);
> > +			goto done;
> >   		}
> > +next:
> > +		intel_context_put(ce);
> > +		xa_lock(&guc->context_lookup);
> > +
> 
> nit: extra newline
> 

Checkpatch got that one too. Already fixed.

Matt

> Daniele
> 
> >   	}
> > +done:
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >   }
> >   void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> > @@ -2892,23 +2951,34 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> >   	if (unlikely(!guc_submission_initialized(guc)))
> >   		return;
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> >   	xa_for_each(&guc->context_lookup, index, ce) {
> > -		if (!intel_context_is_pinned(ce))
> > +		if (!kref_get_unless_zero(&ce->ref))
> >   			continue;
> > +		xa_unlock(&guc->context_lookup);
> > +
> > +		if (!intel_context_is_pinned(ce))
> > +			goto next;
> > +
> >   		if (intel_engine_is_virtual(ce->engine)) {
> >   			if (!(ce->engine->mask & engine->mask))
> > -				continue;
> > +				goto next;
> >   		} else {
> >   			if (ce->engine != engine)
> > -				continue;
> > +				goto next;
> >   		}
> > -		spin_lock_irqsave(&ce->guc_active.lock, flags);
> > +		spin_lock(&ce->guc_active.lock);
> >   		intel_engine_dump_active_requests(&ce->guc_active.requests,
> >   						  hung_rq, m);
> > -		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
> > +		spin_unlock(&ce->guc_active.lock);
> > +
> > +next:
> > +		intel_context_put(ce);
> > +		xa_lock(&guc->context_lookup);
> >   	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >   }
> >   void intel_guc_submission_print_info(struct intel_guc *guc,
> > @@ -2962,7 +3032,9 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   {
> >   	struct intel_context *ce;
> >   	unsigned long index;
> > +	unsigned long flags;
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> >   	xa_for_each(&guc->context_lookup, index, ce) {
> >   		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> >   		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > @@ -2981,6 +3053,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   		guc_log_context_priority(p, ce);
> >   	}
> > +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >   }
> >   static struct intel_context *
> 


* Re: [Intel-gfx] [PATCH 21/27] drm/i915/guc: Proper xarray usage for contexts_lookup
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 21/27] drm/i915/guc: Proper xarray usage for contexts_lookup Matthew Brost
@ 2021-08-26  0:44   ` Daniele Ceraolo Spurio
  2021-08-26  0:41     ` Matthew Brost
  0 siblings, 1 reply; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-26  0:44 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Lock the xarray and take ref to the context if needed.
>
> v2:
>   (Checkpatch)
>    - Add new line after declaration
>   (Daniel Vetter)
>    - Correct put / get accounting in xa_for_each loops
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 103 +++++++++++++++---
>   1 file changed, 88 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 509b298e7cf3..5f77f25322ca 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -606,8 +606,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   	unsigned long index, flags;
>   	bool pending_disable, pending_enable, deregister, destroyed, banned;
>   
> +	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> -		spin_lock_irqsave(&ce->guc_state.lock, flags);
> +		/*
> +		 * Corner case where the ref count on the object is zero but a
> +		 * deregister G2H was lost. In this case we don't touch the ref
> +		 * count and finish the destroy of the context.
> +		 */
> +		bool do_put = kref_get_unless_zero(&ce->ref);
> +
> +		xa_unlock(&guc->context_lookup);
> +
> +		spin_lock(&ce->guc_state.lock);
>   
>   		/*
>   		 * Once we are at this point submission_disabled() is guaranteed
> @@ -623,7 +633,9 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   		banned = context_banned(ce);
>   		init_sched_state(ce);
>   
> -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		GEM_BUG_ON(!do_put && !destroyed);
>   
>   		if (pending_enable || destroyed || deregister) {
>   			decr_outstanding_submission_g2h(guc);
> @@ -646,13 +658,19 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   			}
>   			intel_context_sched_disable_unpin(ce);
>   			decr_outstanding_submission_g2h(guc);
> -			spin_lock_irqsave(&ce->guc_state.lock, flags);
> +
> +			spin_lock(&ce->guc_state.lock);
>   			guc_blocked_fence_complete(ce);
> -			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +			spin_unlock(&ce->guc_state.lock);
>   
>   			intel_context_put(ce);
>   		}
> +
> +		if (do_put)
> +			intel_context_put(ce);

is it safe to do the put outside the xa_lock, in case the refcount goes 
to zero with this? I know it is unlikely because the refcount was > 0 if 
do_put is true, but it might've gone down between us checking earlier 
and now.

> +		xa_lock(&guc->context_lookup);
>   	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>   }
>   
>   static inline bool
> @@ -873,16 +891,29 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
>   {
>   	struct intel_context *ce;
>   	unsigned long index;
> +	unsigned long flags;
>   
>   	if (unlikely(!guc_submission_initialized(guc))) {
>   		/* Reset called during driver load? GuC not yet initialised! */
>   		return;
>   	}
>   
> -	xa_for_each(&guc->context_lookup, index, ce)
> +	xa_lock_irqsave(&guc->context_lookup, flags);
> +	xa_for_each(&guc->context_lookup, index, ce) {
> +		if (!kref_get_unless_zero(&ce->ref))
> +			continue;
> +
> +		xa_unlock(&guc->context_lookup);
> +
>   		if (intel_context_is_pinned(ce))
>   			__guc_reset_context(ce, stalled);
>   
> +		intel_context_put(ce);
> +
> +		xa_lock(&guc->context_lookup);
> +	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> +
>   	/* GuC is blown away, drop all references to contexts */
>   	xa_destroy(&guc->context_lookup);
>   }
> @@ -957,11 +988,24 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
>   {
>   	struct intel_context *ce;
>   	unsigned long index;
> +	unsigned long flags;
> +
> +	xa_lock_irqsave(&guc->context_lookup, flags);
> +	xa_for_each(&guc->context_lookup, index, ce) {
> +		if (!kref_get_unless_zero(&ce->ref))
> +			continue;
> +
> +		xa_unlock(&guc->context_lookup);
>   
> -	xa_for_each(&guc->context_lookup, index, ce)
>   		if (intel_context_is_pinned(ce))
>   			guc_cancel_context_requests(ce);
>   
> +		intel_context_put(ce);
> +
> +		xa_lock(&guc->context_lookup);
> +	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> +
>   	guc_cancel_sched_engine_requests(guc->sched_engine);
>   
>   	/* GuC is blown away, drop all references to contexts */
> @@ -2850,21 +2894,28 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   	struct intel_context *ce;
>   	struct i915_request *rq;
>   	unsigned long index;
> +	unsigned long flags;
>   
>   	/* Reset called during driver load? GuC not yet initialised! */
>   	if (unlikely(!guc_submission_initialized(guc)))
>   		return;
>   
> +	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> -		if (!intel_context_is_pinned(ce))
> +		if (!kref_get_unless_zero(&ce->ref))
>   			continue;
>   
> +		xa_unlock(&guc->context_lookup);
> +
> +		if (!intel_context_is_pinned(ce))
> +			goto next;
> +
>   		if (intel_engine_is_virtual(ce->engine)) {
>   			if (!(ce->engine->mask & engine->mask))
> -				continue;
> +				goto next;
>   		} else {
>   			if (ce->engine != engine)
> -				continue;
> +				goto next;
>   		}
>   
>   		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
> @@ -2874,9 +2925,17 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   			intel_engine_set_hung_context(engine, ce);
>   
>   			/* Can only cope with one hang at a time... */
> -			return;
> +			intel_context_put(ce);
> +			xa_lock(&guc->context_lookup);
> +			goto done;
>   		}
> +next:
> +		intel_context_put(ce);
> +		xa_lock(&guc->context_lookup);
> +

nit: extra newline

Daniele

>   	}
> +done:
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>   }
>   
>   void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> @@ -2892,23 +2951,34 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>   	if (unlikely(!guc_submission_initialized(guc)))
>   		return;
>   
> +	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> -		if (!intel_context_is_pinned(ce))
> +		if (!kref_get_unless_zero(&ce->ref))
>   			continue;
>   
> +		xa_unlock(&guc->context_lookup);
> +
> +		if (!intel_context_is_pinned(ce))
> +			goto next;
> +
>   		if (intel_engine_is_virtual(ce->engine)) {
>   			if (!(ce->engine->mask & engine->mask))
> -				continue;
> +				goto next;
>   		} else {
>   			if (ce->engine != engine)
> -				continue;
> +				goto next;
>   		}
>   
> -		spin_lock_irqsave(&ce->guc_active.lock, flags);
> +		spin_lock(&ce->guc_active.lock);
>   		intel_engine_dump_active_requests(&ce->guc_active.requests,
>   						  hung_rq, m);
> -		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
> +		spin_unlock(&ce->guc_active.lock);
> +
> +next:
> +		intel_context_put(ce);
> +		xa_lock(&guc->context_lookup);
>   	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>   }
>   
>   void intel_guc_submission_print_info(struct intel_guc *guc,
> @@ -2962,7 +3032,9 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   {
>   	struct intel_context *ce;
>   	unsigned long index;
> +	unsigned long flags;
>   
> +	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
>   		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
>   		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> @@ -2981,6 +3053,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   
>   		guc_log_context_priority(p, ce);
>   	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>   }
>   
>   static struct intel_context *



* Re: [Intel-gfx] [PATCH 21/27] drm/i915/guc: Proper xarray usage for contexts_lookup
  2021-08-26  0:41     ` Matthew Brost
@ 2021-08-26  0:48       ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-26  0:48 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter



On 8/25/2021 5:41 PM, Matthew Brost wrote:
> On Wed, Aug 25, 2021 at 05:44:11PM -0700, Daniele Ceraolo Spurio wrote:
>>
>> On 8/18/2021 11:16 PM, Matthew Brost wrote:
>>> Lock the xarray and take ref to the context if needed.
>>>
>>> v2:
>>>    (Checkpatch)
>>>     - Add new line after declaration
>>>    (Daniel Vetter)
>>>     - Correct put / get accounting in xa_for_each loops
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 103 +++++++++++++++---
>>>    1 file changed, 88 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 509b298e7cf3..5f77f25322ca 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -606,8 +606,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>>>    	unsigned long index, flags;
>>>    	bool pending_disable, pending_enable, deregister, destroyed, banned;
>>> +	xa_lock_irqsave(&guc->context_lookup, flags);
>>>    	xa_for_each(&guc->context_lookup, index, ce) {
>>> -		spin_lock_irqsave(&ce->guc_state.lock, flags);
>>> +		/*
>>> +		 * Corner case where the ref count on the object is zero but a
>>> +		 * deregister G2H was lost. In this case we don't touch the ref
>>> +		 * count and finish the destroy of the context.
>>> +		 */
>>> +		bool do_put = kref_get_unless_zero(&ce->ref);
>>> +
>>> +		xa_unlock(&guc->context_lookup);
>>> +
>>> +		spin_lock(&ce->guc_state.lock);
>>>    		/*
>>>    		 * Once we are at this point submission_disabled() is guaranteed
>>> @@ -623,7 +633,9 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>>>    		banned = context_banned(ce);
>>>    		init_sched_state(ce);
>>> -		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> +		spin_unlock(&ce->guc_state.lock);
>>> +
>>> +		GEM_BUG_ON(!do_put && !destroyed);
>>>    		if (pending_enable || destroyed || deregister) {
>>>    			decr_outstanding_submission_g2h(guc);
>>> @@ -646,13 +658,19 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>>>    			}
>>>    			intel_context_sched_disable_unpin(ce);
>>>    			decr_outstanding_submission_g2h(guc);
>>> -			spin_lock_irqsave(&ce->guc_state.lock, flags);
>>> +
>>> +			spin_lock(&ce->guc_state.lock);
>>>    			guc_blocked_fence_complete(ce);
>>> -			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> +			spin_unlock(&ce->guc_state.lock);
>>>    			intel_context_put(ce);
>>>    		}
>>> +
>>> +		if (do_put)
>>> +			intel_context_put(ce);
>> is it safe to do the put outside the xa_lock, in case the refcount goes to
>> zero with this? I know it is unlikely because the refcount was > 0 if do_put
>> is true, but it might've gone down between us checking earlier and now.
>>
> It is safe as xa_for_each indicates it is safe to destroy / delete
> objects from the array while traversing it.

ok.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>>> +		xa_lock(&guc->context_lookup);
>>>    	}
>>> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>>>    }
>>>    static inline bool
>>> @@ -873,16 +891,29 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
>>>    {
>>>    	struct intel_context *ce;
>>>    	unsigned long index;
>>> +	unsigned long flags;
>>>    	if (unlikely(!guc_submission_initialized(guc))) {
>>>    		/* Reset called during driver load? GuC not yet initialised! */
>>>    		return;
>>>    	}
>>> -	xa_for_each(&guc->context_lookup, index, ce)
>>> +	xa_lock_irqsave(&guc->context_lookup, flags);
>>> +	xa_for_each(&guc->context_lookup, index, ce) {
>>> +		if (!kref_get_unless_zero(&ce->ref))
>>> +			continue;
>>> +
>>> +		xa_unlock(&guc->context_lookup);
>>> +
>>>    		if (intel_context_is_pinned(ce))
>>>    			__guc_reset_context(ce, stalled);
>>> +		intel_context_put(ce);
>>> +
>>> +		xa_lock(&guc->context_lookup);
>>> +	}
>>> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>>> +
>>>    	/* GuC is blown away, drop all references to contexts */
>>>    	xa_destroy(&guc->context_lookup);
>>>    }
>>> @@ -957,11 +988,24 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
>>>    {
>>>    	struct intel_context *ce;
>>>    	unsigned long index;
>>> +	unsigned long flags;
>>> +
>>> +	xa_lock_irqsave(&guc->context_lookup, flags);
>>> +	xa_for_each(&guc->context_lookup, index, ce) {
>>> +		if (!kref_get_unless_zero(&ce->ref))
>>> +			continue;
>>> +
>>> +		xa_unlock(&guc->context_lookup);
>>> -	xa_for_each(&guc->context_lookup, index, ce)
>>>    		if (intel_context_is_pinned(ce))
>>>    			guc_cancel_context_requests(ce);
>>> +		intel_context_put(ce);
>>> +
>>> +		xa_lock(&guc->context_lookup);
>>> +	}
>>> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>>> +
>>>    	guc_cancel_sched_engine_requests(guc->sched_engine);
>>>    	/* GuC is blown away, drop all references to contexts */
>>> @@ -2850,21 +2894,28 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>>>    	struct intel_context *ce;
>>>    	struct i915_request *rq;
>>>    	unsigned long index;
>>> +	unsigned long flags;
>>>    	/* Reset called during driver load? GuC not yet initialised! */
>>>    	if (unlikely(!guc_submission_initialized(guc)))
>>>    		return;
>>> +	xa_lock_irqsave(&guc->context_lookup, flags);
>>>    	xa_for_each(&guc->context_lookup, index, ce) {
>>> -		if (!intel_context_is_pinned(ce))
>>> +		if (!kref_get_unless_zero(&ce->ref))
>>>    			continue;
>>> +		xa_unlock(&guc->context_lookup);
>>> +
>>> +		if (!intel_context_is_pinned(ce))
>>> +			goto next;
>>> +
>>>    		if (intel_engine_is_virtual(ce->engine)) {
>>>    			if (!(ce->engine->mask & engine->mask))
>>> -				continue;
>>> +				goto next;
>>>    		} else {
>>>    			if (ce->engine != engine)
>>> -				continue;
>>> +				goto next;
>>>    		}
>>>    		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
>>> @@ -2874,9 +2925,17 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>>>    			intel_engine_set_hung_context(engine, ce);
>>>    			/* Can only cope with one hang at a time... */
>>> -			return;
>>> +			intel_context_put(ce);
>>> +			xa_lock(&guc->context_lookup);
>>> +			goto done;
>>>    		}
>>> +next:
>>> +		intel_context_put(ce);
>>> +		xa_lock(&guc->context_lookup);
>>> +
>> nit: extra newline
>>
> Checkpatch got that one too. Already fixed.
>
> Matt
>
>> Daniele
>>
>>>    	}
>>> +done:
>>> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>>>    }
>>>    void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>>> @@ -2892,23 +2951,34 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>>>    	if (unlikely(!guc_submission_initialized(guc)))
>>>    		return;
>>> +	xa_lock_irqsave(&guc->context_lookup, flags);
>>>    	xa_for_each(&guc->context_lookup, index, ce) {
>>> -		if (!intel_context_is_pinned(ce))
>>> +		if (!kref_get_unless_zero(&ce->ref))
>>>    			continue;
>>> +		xa_unlock(&guc->context_lookup);
>>> +
>>> +		if (!intel_context_is_pinned(ce))
>>> +			goto next;
>>> +
>>>    		if (intel_engine_is_virtual(ce->engine)) {
>>>    			if (!(ce->engine->mask & engine->mask))
>>> -				continue;
>>> +				goto next;
>>>    		} else {
>>>    			if (ce->engine != engine)
>>> -				continue;
>>> +				goto next;
>>>    		}
>>> -		spin_lock_irqsave(&ce->guc_active.lock, flags);
>>> +		spin_lock(&ce->guc_active.lock);
>>>    		intel_engine_dump_active_requests(&ce->guc_active.requests,
>>>    						  hung_rq, m);
>>> -		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
>>> +		spin_unlock(&ce->guc_active.lock);
>>> +
>>> +next:
>>> +		intel_context_put(ce);
>>> +		xa_lock(&guc->context_lookup);
>>>    	}
>>> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>>>    }
>>>    void intel_guc_submission_print_info(struct intel_guc *guc,
>>> @@ -2962,7 +3032,9 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>>>    {
>>>    	struct intel_context *ce;
>>>    	unsigned long index;
>>> +	unsigned long flags;
>>> +	xa_lock_irqsave(&guc->context_lookup, flags);
>>>    	xa_for_each(&guc->context_lookup, index, ce) {
>>>    		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
>>>    		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
>>> @@ -2981,6 +3053,7 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>>>    		guc_log_context_priority(p, ce);
>>>    	}
>>> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
>>>    }
>>>    static struct intel_context *



* Re: [Intel-gfx] [PATCH 22/27] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 22/27] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin Matthew Brost
@ 2021-08-26  0:50   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-26  0:50 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Drop the pin count check trick between a sched_disable and re-pin; now rely
> on the lock and a counter of the number of committed requests to determine
> if scheduling should be disabled on the context.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele
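
In short, the race is now sealed by a counter under ce->guc_state.lock
instead of the pin count trick; a rough sketch of the idea, abbreviated
from the hunks below:

	/* submission side, guc_request_alloc() */
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	/* ... queue fences if a G2H is still pending ... */
	incr_context_committed_requests(ce);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

	/* disable side, guc_context_sched_disable() */
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	if (unlikely(context_has_committed_requests(ce))) {
		/* a request was committed in the window, keep scheduling on */
		intel_context_sched_disable_unpin(ce);
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
		return;
	}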

> ---
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  2 +
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 49 ++++++++++++-------
>   2 files changed, 34 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index d5d643b04d54..524a35a78bf4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -169,6 +169,8 @@ struct intel_context {
>   		struct list_head fences;
>   		/* GuC context blocked fence */
>   		struct i915_sw_fence blocked_fence;
> +		/* GuC committed requests */
> +		int number_committed_requests;
>   	} guc_state;
>   
>   	struct {
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 5f77f25322ca..3e90985b0c1b 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -248,6 +248,25 @@ static inline void decr_context_blocked(struct intel_context *ce)
>   	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
>   }
>   
> +static inline bool context_has_committed_requests(struct intel_context *ce)
> +{
> +	return !!ce->guc_state.number_committed_requests;
> +}
> +
> +static inline void incr_context_committed_requests(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	++ce->guc_state.number_committed_requests;
> +	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
> +}
> +
> +static inline void decr_context_committed_requests(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	--ce->guc_state.number_committed_requests;
> +	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
> +}
> +
>   static inline bool context_guc_id_invalid(struct intel_context *ce)
>   {
>   	return ce->guc_id == GUC_INVALID_LRC_ID;
> @@ -1751,14 +1770,11 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   
>   	/*
> -	 * We have to check if the context has been disabled by another thread.
> -	 * We also have to check if the context has been pinned again as another
> -	 * pin operation is allowed to pass this function. Checking the pin
> -	 * count, within ce->guc_state.lock, synchronizes this function with
> -	 * guc_request_alloc ensuring a request doesn't slip through the
> -	 * 'context_pending_disable' fence. Checking within the spin lock (can't
> -	 * sleep) ensures another process doesn't pin this context and generate
> -	 * a request before we set the 'context_pending_disable' flag here.
> +	 * We have to check if the context has been disabled by another thread,
> +	 * check if submission has been disabled to seal a race with reset and
> +	 * finally check if any more requests have been committed to the
> +	 * context ensuring that a request doesn't slip through the
> +	 * 'context_pending_disable' fence.
>   	 */
>   	enabled = context_enabled(ce);
>   	if (unlikely(!enabled || submission_disabled(guc))) {
> @@ -1767,7 +1783,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   		goto unpin;
>   	}
> -	if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> +	if (unlikely(context_has_committed_requests(ce))) {
> +		intel_context_sched_disable_unpin(ce);
>   		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   		return;
>   	}
> @@ -1800,6 +1817,7 @@ static void __guc_context_destroy(struct intel_context *ce)
>   		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
>   		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
>   		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
> +	GEM_BUG_ON(ce->guc_state.number_committed_requests);
>   
>   	lrc_fini(ce);
>   	intel_context_fini(ce);
> @@ -2030,6 +2048,10 @@ static void remove_from_context(struct i915_request *rq)
>   
>   	spin_unlock_irq(&ce->guc_active.lock);
>   
> +	spin_lock_irq(&ce->guc_state.lock);
> +	decr_context_committed_requests(ce);
> +	spin_unlock_irq(&ce->guc_state.lock);
> +
>   	atomic_dec(&ce->guc_id_ref);
>   	i915_request_notify_execute_cb_imm(rq);
>   }
> @@ -2177,15 +2199,7 @@ static int guc_request_alloc(struct i915_request *rq)
>   	 * schedule enable or context registration if either G2H is pending
>   	 * respectively. Once a G2H returns, the fence is released that is
>   	 * blocking these requests (see guc_signal_context_fence).
> -	 *
> -	 * We can safely check the below fields outside of the lock as it isn't
> -	 * possible for these fields to transition from being clear to set but
> -	 * converse is possible, hence the need for the check within the lock.
>   	 */
> -	if (likely(!context_wait_for_deregister_to_register(ce) &&
> -		   !context_pending_disable(ce)))
> -		return 0;
> -
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   	if (context_wait_for_deregister_to_register(ce) ||
>   	    context_pending_disable(ce)) {
> @@ -2194,6 +2208,7 @@ static int guc_request_alloc(struct i915_request *rq)
>   
>   		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
>   	}
> +	incr_context_committed_requests(ce);
>   	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   
>   	return 0;



* Re: [Intel-gfx] [PATCH 25/27] drm/i915/guc: Drop guc_active move everything into guc_state
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 25/27] drm/i915/guc: Drop guc_active move everything into guc_state Matthew Brost
@ 2021-08-26  0:54   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-26  0:54 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Now that we have a locking hierarchy of sched_engine->lock ->
> ce->guc_state.lock, everything from guc_active can be moved into
> guc_state and protected by guc_state.lock.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       | 10 +--
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  7 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 88 +++++++++----------
>   drivers/gpu/drm/i915/i915_trace.h             |  2 +-
>   4 files changed, 49 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 87b84c1d5393..adfe49b53b1b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -394,9 +394,7 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   
>   	spin_lock_init(&ce->guc_state.lock);
>   	INIT_LIST_HEAD(&ce->guc_state.fences);
> -
> -	spin_lock_init(&ce->guc_active.lock);
> -	INIT_LIST_HEAD(&ce->guc_active.requests);
> +	INIT_LIST_HEAD(&ce->guc_state.requests);
>   
>   	ce->guc_id.id = GUC_INVALID_LRC_ID;
>   	INIT_LIST_HEAD(&ce->guc_id.link);
> @@ -521,15 +519,15 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>   
>   	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
>   
> -	spin_lock_irqsave(&ce->guc_active.lock, flags);
> -	list_for_each_entry_reverse(rq, &ce->guc_active.requests,
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
>   				    sched.link) {
>   		if (i915_request_completed(rq))
>   			break;
>   
>   		active = rq;
>   	}
> -	spin_unlock_irqrestore(&ce->guc_active.lock, flags);
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   
>   	return active;
>   }
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 7a1d1537cf67..66286ce36c84 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -172,11 +172,6 @@ struct intel_context {
>   		struct i915_sw_fence blocked_fence;
>   		/* GuC committed requests */
>   		int number_committed_requests;
> -	} guc_state;
> -
> -	struct {
> -		/** lock: protects everything in guc_active */
> -		spinlock_t lock;
>   		/** requests: active requests on this context */
>   		struct list_head requests;
>   		/*
> @@ -184,7 +179,7 @@ struct intel_context {
>   		 */
>   		u8 prio;
>   		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
> -	} guc_active;
> +	} guc_state;
>   
>   	struct {
>   		/* GuC LRC descriptor ID */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index c4c018348ac0..4b9a2f3774d5 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -827,9 +827,9 @@ __unwind_incomplete_requests(struct intel_context *ce)
>   	unsigned long flags;
>   
>   	spin_lock_irqsave(&sched_engine->lock, flags);
> -	spin_lock(&ce->guc_active.lock);
> +	spin_lock(&ce->guc_state.lock);
>   	list_for_each_entry_safe_reverse(rq, rn,
> -					 &ce->guc_active.requests,
> +					 &ce->guc_state.requests,
>   					 sched.link) {
>   		if (i915_request_completed(rq))
>   			continue;
> @@ -848,7 +848,7 @@ __unwind_incomplete_requests(struct intel_context *ce)
>   		list_add(&rq->sched.link, pl);
>   		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>   	}
> -	spin_unlock(&ce->guc_active.lock);
> +	spin_unlock(&ce->guc_state.lock);
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> @@ -945,10 +945,10 @@ static void guc_cancel_context_requests(struct intel_context *ce)
>   
>   	/* Mark all executing requests as skipped. */
>   	spin_lock_irqsave(&sched_engine->lock, flags);
> -	spin_lock(&ce->guc_active.lock);
> -	list_for_each_entry(rq, &ce->guc_active.requests, sched.link)
> +	spin_lock(&ce->guc_state.lock);
> +	list_for_each_entry(rq, &ce->guc_state.requests, sched.link)
>   		i915_request_put(i915_request_mark_eio(rq));
> -	spin_unlock(&ce->guc_active.lock);
> +	spin_unlock(&ce->guc_state.lock);
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> @@ -1400,7 +1400,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	desc->engine_submit_mask = adjust_engine_mask(engine->class,
>   						      engine->mask);
>   	desc->hw_context_desc = ce->lrc.lrca;
> -	desc->priority = ce->guc_active.prio;
> +	desc->priority = ce->guc_state.prio;
>   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>   	guc_context_policy_init(engine, desc);
>   
> @@ -1802,10 +1802,10 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>   
>   static void __guc_context_destroy(struct intel_context *ce)
>   {
> -	GEM_BUG_ON(ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
> -		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
> -		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
> -		   ce->guc_active.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
> +	GEM_BUG_ON(ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
> +		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
> +		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
> +		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
>   	GEM_BUG_ON(ce->guc_state.number_committed_requests);
>   
>   	lrc_fini(ce);
> @@ -1915,17 +1915,17 @@ static void guc_context_set_prio(struct intel_guc *guc,
>   
>   	GEM_BUG_ON(prio < GUC_CLIENT_PRIORITY_KMD_HIGH ||
>   		   prio > GUC_CLIENT_PRIORITY_NORMAL);
> -	lockdep_assert_held(&ce->guc_active.lock);
> +	lockdep_assert_held(&ce->guc_state.lock);
>   
> -	if (ce->guc_active.prio == prio || submission_disabled(guc) ||
> +	if (ce->guc_state.prio == prio || submission_disabled(guc) ||
>   	    !context_registered(ce)) {
> -		ce->guc_active.prio = prio;
> +		ce->guc_state.prio = prio;
>   		return;
>   	}
>   
>   	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action), 0, true);
>   
> -	ce->guc_active.prio = prio;
> +	ce->guc_state.prio = prio;
>   	trace_intel_context_set_prio(ce);
>   }
>   
> @@ -1944,25 +1944,25 @@ static inline u8 map_i915_prio_to_guc_prio(int prio)
>   static inline void add_context_inflight_prio(struct intel_context *ce,
>   					     u8 guc_prio)
>   {
> -	lockdep_assert_held(&ce->guc_active.lock);
> -	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
>   
> -	++ce->guc_active.prio_count[guc_prio];
> +	++ce->guc_state.prio_count[guc_prio];
>   
>   	/* Overflow protection */
> -	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
> +	GEM_WARN_ON(!ce->guc_state.prio_count[guc_prio]);
>   }
>   
>   static inline void sub_context_inflight_prio(struct intel_context *ce,
>   					     u8 guc_prio)
>   {
> -	lockdep_assert_held(&ce->guc_active.lock);
> -	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_active.prio_count));
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
>   
>   	/* Underflow protection */
> -	GEM_WARN_ON(!ce->guc_active.prio_count[guc_prio]);
> +	GEM_WARN_ON(!ce->guc_state.prio_count[guc_prio]);
>   
> -	--ce->guc_active.prio_count[guc_prio];
> +	--ce->guc_state.prio_count[guc_prio];
>   }
>   
>   static inline void update_context_prio(struct intel_context *ce)
> @@ -1973,10 +1973,10 @@ static inline void update_context_prio(struct intel_context *ce)
>   	BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH != 0);
>   	BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH > GUC_CLIENT_PRIORITY_NORMAL);
>   
> -	lockdep_assert_held(&ce->guc_active.lock);
> +	lockdep_assert_held(&ce->guc_state.lock);
>   
> -	for (i = 0; i < ARRAY_SIZE(ce->guc_active.prio_count); ++i) {
> -		if (ce->guc_active.prio_count[i]) {
> +	for (i = 0; i < ARRAY_SIZE(ce->guc_state.prio_count); ++i) {
> +		if (ce->guc_state.prio_count[i]) {
>   			guc_context_set_prio(guc, ce, i);
>   			break;
>   		}
> @@ -1996,8 +1996,8 @@ static void add_to_context(struct i915_request *rq)
>   
>   	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
>   
> -	spin_lock(&ce->guc_active.lock);
> -	list_move_tail(&rq->sched.link, &ce->guc_active.requests);
> +	spin_lock(&ce->guc_state.lock);
> +	list_move_tail(&rq->sched.link, &ce->guc_state.requests);
>   
>   	if (rq->guc_prio == GUC_PRIO_INIT) {
>   		rq->guc_prio = new_guc_prio;
> @@ -2009,12 +2009,12 @@ static void add_to_context(struct i915_request *rq)
>   	}
>   	update_context_prio(ce);
>   
> -	spin_unlock(&ce->guc_active.lock);
> +	spin_unlock(&ce->guc_state.lock);
>   }
>   
>   static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
>   {
> -	lockdep_assert_held(&ce->guc_active.lock);
> +	lockdep_assert_held(&ce->guc_state.lock);
>   
>   	if (rq->guc_prio != GUC_PRIO_INIT &&
>   	    rq->guc_prio != GUC_PRIO_FINI) {
> @@ -2028,7 +2028,7 @@ static void remove_from_context(struct i915_request *rq)
>   {
>   	struct intel_context *ce = rq->context;
>   
> -	spin_lock_irq(&ce->guc_active.lock);
> +	spin_lock_irq(&ce->guc_state.lock);
>   
>   	list_del_init(&rq->sched.link);
>   	clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> @@ -2038,10 +2038,8 @@ static void remove_from_context(struct i915_request *rq)
>   
>   	guc_prio_fini(rq, ce);
>   
> -	spin_unlock_irq(&ce->guc_active.lock);
> -
> -	spin_lock_irq(&ce->guc_state.lock);
>   	decr_context_committed_requests(ce);
> +
>   	spin_unlock_irq(&ce->guc_state.lock);
>   
>   	atomic_dec(&ce->guc_id.ref);
> @@ -2126,7 +2124,7 @@ static void guc_context_init(struct intel_context *ce)
>   		prio = ctx->sched.priority;
>   	rcu_read_unlock();
>   
> -	ce->guc_active.prio = map_i915_prio_to_guc_prio(prio);
> +	ce->guc_state.prio = map_i915_prio_to_guc_prio(prio);
>   }
>   
>   static int guc_request_alloc(struct i915_request *rq)
> @@ -2359,7 +2357,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
>   	     !new_guc_prio_higher(rq->guc_prio, new_guc_prio)))
>   		return;
>   
> -	spin_lock(&ce->guc_active.lock);
> +	spin_lock(&ce->guc_state.lock);
>   	if (rq->guc_prio != GUC_PRIO_FINI) {
>   		if (rq->guc_prio != GUC_PRIO_INIT)
>   			sub_context_inflight_prio(ce, rq->guc_prio);
> @@ -2367,16 +2365,16 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
>   		add_context_inflight_prio(ce, rq->guc_prio);
>   		update_context_prio(ce);
>   	}
> -	spin_unlock(&ce->guc_active.lock);
> +	spin_unlock(&ce->guc_state.lock);
>   }
>   
>   static void guc_retire_inflight_request_prio(struct i915_request *rq)
>   {
>   	struct intel_context *ce = rq->context;
>   
> -	spin_lock(&ce->guc_active.lock);
> +	spin_lock(&ce->guc_state.lock);
>   	guc_prio_fini(rq, ce);
> -	spin_unlock(&ce->guc_active.lock);
> +	spin_unlock(&ce->guc_state.lock);
>   }
>   
>   static void sanitize_hwsp(struct intel_engine_cs *engine)
> @@ -2942,7 +2940,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   				goto next;
>   		}
>   
> -		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
> +		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>   			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>   				continue;
>   
> @@ -2993,10 +2991,10 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>   				goto next;
>   		}
>   
> -		spin_lock(&ce->guc_active.lock);
> -		intel_engine_dump_active_requests(&ce->guc_active.requests,
> +		spin_lock(&ce->guc_state.lock);
> +		intel_engine_dump_active_requests(&ce->guc_state.requests,
>   						  hung_rq, m);
> -		spin_unlock(&ce->guc_active.lock);
> +		spin_unlock(&ce->guc_state.lock);
>   
>   next:
>   		intel_context_put(ce);
> @@ -3040,12 +3038,12 @@ static inline void guc_log_context_priority(struct drm_printer *p,
>   {
>   	int i;
>   
> -	drm_printf(p, "\t\tPriority: %d\n", ce->guc_active.prio);
> +	drm_printf(p, "\t\tPriority: %d\n", ce->guc_state.prio);
>   	drm_printf(p, "\t\tNumber Requests (lower index == higher priority)\n");
>   	for (i = GUC_CLIENT_PRIORITY_KMD_HIGH;
>   	     i < GUC_CLIENT_PRIORITY_NUM; ++i) {
>   		drm_printf(p, "\t\tNumber requests in priority band[%d]: %d\n",
> -			   i, ce->guc_active.prio_count[i]);
> +			   i, ce->guc_state.prio_count[i]);
>   	}
>   	drm_printf(p, "\n");
>   }
> diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> index 0574f5c7a985..ec7fe12b94aa 100644
> --- a/drivers/gpu/drm/i915/i915_trace.h
> +++ b/drivers/gpu/drm/i915/i915_trace.h
> @@ -910,7 +910,7 @@ DECLARE_EVENT_CLASS(intel_context,
>   			   __entry->guc_id = ce->guc_id.id;
>   			   __entry->pin_count = atomic_read(&ce->pin_count);
>   			   __entry->sched_state = ce->guc_state.sched_state;
> -			   __entry->guc_prio = ce->guc_active.prio;
> +			   __entry->guc_prio = ce->guc_state.prio;
>   			   ),
>   
>   		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
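
With guc_active folded into guc_state, the locking hierarchy from the commit
message reduces to a single nesting pattern, visible above in
__unwind_incomplete_requests() and guc_cancel_context_requests(). A minimal
sketch of that ordering (the function name here is hypothetical; the real
callers also walk ce->guc_state.requests):

	static void lock_ordering_sketch(struct i915_sched_engine *sched_engine,
					 struct intel_context *ce)
	{
		unsigned long flags;

		/* Outer lock: serializes submission, disables interrupts */
		spin_lock_irqsave(&sched_engine->lock, flags);
		/* Inner lock: per-context GuC state, irqs already off */
		spin_lock(&ce->guc_state.lock);

		/* ... operate on ce->guc_state.requests here ... */

		spin_unlock(&ce->guc_state.lock);
		spin_unlock_irqrestore(&sched_engine->lock, flags);
	}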


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Intel-gfx] [PATCH 26/27] drm/i915/guc: Add GuC kernel doc
  2021-08-19  6:16 ` [Intel-gfx] [PATCH 26/27] drm/i915/guc: Add GuC kernel doc Matthew Brost
@ 2021-08-26  1:03   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 76+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-08-26  1:03 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter



On 8/18/2021 11:16 PM, Matthew Brost wrote:
> Add GuC kernel doc for all structures added thus far for GuC submission
> and update the main GuC submission section with the new interface
> details.
>
> v2:
>   - Drop guc_active.lock DOC
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context_types.h | 44 ++++++---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        | 19 +++-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 97 ++++++++++++++-----
>   drivers/gpu/drm/i915/i915_request.h           | 18 ++--
>   4 files changed, 128 insertions(+), 50 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 66286ce36c84..80bbdc7810f6 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -156,40 +156,52 @@ struct intel_context {
>   	u8 wa_bb_page; /* if set, page num reserved for context workarounds */
>   
>   	struct {
> -		/** lock: protects everything in guc_state */
> +		/** @lock: protects everything in guc_state */
>   		spinlock_t lock;
>   		/**
> -		 * sched_state: scheduling state of this context using GuC
> +		 * @sched_state: scheduling state of this context using GuC
>   		 * submission
>   		 */
>   		u32 sched_state;
>   		/*
> -		 * fences: maintains of list of requests that have a submit
> -		 * fence related to GuC submission
> +		 * @fences: maintains a list of requests that are currently
> +		 * being fenced until a GuC operation completes
>   		 */
>   		struct list_head fences;
> -		/* GuC context blocked fence */
> +		/**
> +		 * @blocked_fence: fence used to signal when the blocking of a
> +		 * context's submissions is complete.
> +		 */
>   		struct i915_sw_fence blocked_fence;
> -		/* GuC committed requests */
> +		/** @number_committed_requests: number of committed requests */
>   		int number_committed_requests;
> -		/** requests: active requests on this context */
> +		/** @requests: list of active requests on this context */
>   		struct list_head requests;
> -		/*
> -		 * GuC priority management
> -		 */
> +		/** @prio: the context's current guc priority */
>   		u8 prio;
> +		/**
> +		 * @prio_count: a counter of the number of requests inflight in
> +		 * each priority bucket
> +		 */
>   		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
>   	} guc_state;
>   
>   	struct {
> -		/* GuC LRC descriptor ID */
> +		/**
> +		 * @id: unique handle used to communicate with the GuC about
> +		 * this context, protected by guc->contexts_lock
> +		 */
>   		u16 id;
> -
> -		/* GuC LRC descriptor reference count */
> +		/**
> +		 * @ref: the number of references to the guc_id; protected by
> +		 * guc->contexts_lock when transitioning in and out of zero
> +		 */
>   		atomic_t ref;
> -
> -		/*
> -		 * GuC ID link - in list when unpinned but guc_id still valid in GuC
> +		/**
> +		 * @link: in guc->guc_id_list when the guc_id has no refs but is
> +		 * still valid, protected by guc->contexts_lock
>   		 */
>   		struct list_head link;
>   	} guc_id;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 2e27fe59786b..112dd29a63fe 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -41,6 +41,10 @@ struct intel_guc {
>   	spinlock_t irq_lock;
>   	unsigned int msg_enabled_mask;
>   
> +	/**
> +	 * @outstanding_submission_g2h: number of outstanding G2H related to GuC
> +	 * submission, used to determine if the GT is idle
> +	 */
>   	atomic_t outstanding_submission_g2h;
>   
>   	struct {
> @@ -49,12 +53,16 @@ struct intel_guc {
>   		void (*disable)(struct intel_guc *guc);
>   	} interrupts;
>   
> -	/*
> -	 * contexts_lock protects the pool of free guc ids and a linked list of
> -	 * guc ids available to be stolen
> +	/**
> +	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
> +	 * ce->guc_id.ref when transitioning in and out of zero
>   	 */
>   	spinlock_t contexts_lock;
> +	/** @guc_ids: used to allocate new guc_ids */
>   	struct ida guc_ids;
> +	/**
> +	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> +	 */
>   	struct list_head guc_id_list;
>   
>   	bool submission_supported;
> @@ -70,7 +78,10 @@ struct intel_guc {
>   	struct i915_vma *lrc_desc_pool;
>   	void *lrc_desc_pool_vaddr;
>   
> -	/* guc_id to intel_context lookup */
> +	/**
> +	 * @context_lookup: used to resolve an intel_context from a guc_id; if
> +	 * a context is present in this structure it is registered with the GuC
> +	 */
>   	struct xarray context_lookup;
>   
>   	/* Control params for fw initialization */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 4b9a2f3774d5..7e0a32e729c2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -28,21 +28,6 @@
>   /**
>    * DOC: GuC-based command submission
>    *
> - * IMPORTANT NOTE: GuC submission is currently not supported in i915. The GuC
> - * firmware is moving to an updated submission interface and we plan to
> - * turn submission back on when that lands. The below documentation (and related
> - * code) matches the old submission model and will be updated as part of the
> - * upgrade to the new flow.
> - *
> - * GuC stage descriptor:
> - * During initialization, the driver allocates a static pool of 1024 such
> - * descriptors, and shares them with the GuC. Currently, we only use one
> - * descriptor. This stage descriptor lets the GuC know about the workqueue and
> - * process descriptor. Theoretically, it also lets the GuC know about our HW
> - * contexts (context ID, etc...), but we actually employ a kind of submission
> - * where the GuC uses the LRCA sent via the work item instead. This is called
> - * a "proxy" submission.
> - *
>    * The Scratch registers:
>    * There are 16 MMIO-based registers starting from 0xC180. The kernel driver writes
>    * a value to the action register (SOFT_SCRATCH_0) along with any data. It then
> @@ -51,14 +36,82 @@
>    * processes the request. The kernel driver polls waiting for this update and
>    * then proceeds.
>    *
> - * Work Items:
> - * There are several types of work items that the host may place into a
> - * workqueue, each with its own requirements and limitations. Currently only
> - * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
> - * represents in-order queue. The kernel driver packs ring tail pointer and an
> - * ELSP context descriptor dword into Work Item.
> - * See guc_add_request()
> + * Command Transport buffers (CTBs):
> + * Covered in detail in other sections but CTBs (host-to-GuC, H2G; GuC-to-host,
> + * G2H) are a message interface between the i915 and the GuC used to control
> + * submissions.
> + *
> + * Context registration:
> + * Before a context can be submitted it must be registered with the GuC via a
> + * H2G. A unique guc_id is associated with each context. The context is either
> + * registered at request creation time (normal operation) or at submission time
> + * (abnormal operation, e.g. after a reset).
> + *
> + * Context submission:
> + * The i915 updates the LRC tail value in memory. Either a schedule enable H2G
> + * or context submit H2G is used to submit a context.

I would elaborate a bit more here. something like:

"i915 must enable the scheduling of the context within the GuC for the 
GuC to actually consider it. Therefore, the first time a disabled 
context is submitted we use a schedule enable H2G, while follow up 
submissions are done via the context submit H2G, which informs the GuC 
that a previously enabled context has new work available."


> + *
> + * Context unpin:
> + * To unpin a context, a H2G is used to disable scheduling, and when the
> + * corresponding G2H returns indicating the scheduling disable operation has
> + * completed, it is safe to unpin the context. While a disable is in flight it
> + * isn't safe to resubmit the context, so a fence is used to stall all future
> + * requests until the G2H is returned.
> + *
> + * Context deregistration:
> + * Before a context can be destroyed or we steal its guc_id we must deregister

or "if" we steal ... ?

Looks ok apart from these nits.

Daniele

> + * the context with the GuC via H2G. If stealing the guc_id it isn't safe to
> + * submit anything to this guc_id until the deregister completes so a fence is
> + * used to stall all requests associated with this guc_id until the
> + * corresponding G2H returns indicating the guc_id has been deregistered.
> + *
> + * guc_ids:
> + * Unique number associated with private GuC context data passed in during
> + * context registration / submission / deregistration. 64k are available. A
> + * simple ida is used for allocation.
> + *
> + * Stealing guc_ids:
> + * If no guc_ids are available they can be stolen from another context at
> + * request creation time if that context is unpinned. If a guc_id can't be found
> + * we punt this problem to the user as we believe this is near impossible to hit
> + * during normal use cases.
> + *
> + * Locking:
> + * In the GuC submission code we have 3 basic spin locks which protect
> + * everything. Details about each below.
> + *
> + * sched_engine->lock
> + * This is the submission lock for all contexts that share an i915 schedule
> + * engine (sched_engine), thus only one of the contexts which share a
> + * sched_engine can be submitting at a time. Currently only one sched_engine
> + * is used for all of GuC submission but that could change in the future.
> + *
> + * guc->contexts_lock
> + * Protects guc_id allocation. Global lock, i.e. only one context that uses
> + * GuC submission can hold this at a time.
> + *
> + * ce->guc_state.lock
> + * Protects everything under ce->guc_state. Ensures that a context is in the
> + * correct state before issuing a H2G, e.g. we don't issue a schedule disable
> + * on a disabled context (bad idea), and we don't issue a schedule enable when
> + * a schedule disable is inflight, etc... Also protects the list of inflight
> + * requests on the context and the priority management state. This lock is
> + * individual to each context.
> + *
> + * Lock ordering rules:
> + * sched_engine->lock -> ce->guc_state.lock
> + * guc->contexts_lock -> ce->guc_state.lock
>    *
> + * Reset races:
> + * When a full GPU reset is triggered it is assumed that some G2H responses to
> + * a H2G can be lost as the GuC is likely toast. Losing these G2H can prove
> + * fatal as we do certain operations upon receiving a G2H (e.g. destroy
> + * contexts, release guc_ids, etc...). Luckily when this occurs we can scrub
> + * context state and clean up appropriately, however this is quite racy. To
> + * avoid races the rule is to check for submission being disabled (i.e. a
> + * reset in progress) with the appropriate lock held. If submission is
> + * disabled don't send the H2G or update the context state. The reset code
> + * must disable submission and grab all these locks before scrubbing for the
> + * missing G2H.
>    */
>   
>   /* GuC Virtual Engine */
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index d818cfbfc41d..177eaf55adff 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -290,18 +290,20 @@ struct i915_request {
>   		struct hrtimer timer;
>   	} watchdog;
>   
> -	/*
> -	 * Requests may need to be stalled when using GuC submission waiting for
> -	 * certain GuC operations to complete. If that is the case, stalled
> -	 * requests are added to a per context list of stalled requests. The
> -	 * below list_head is the link in that list.
> +	/**
> +	 * @guc_fence_link: Requests may need to be stalled when using GuC
> +	 * submission waiting for certain GuC operations to complete. If that is
> +	 * the case, stalled requests are added to a per-context list of stalled
> +	 * requests. The below list_head is the link in that list. Protected by
> +	 * ce->guc_state.lock.
>   	 */
>   	struct list_head guc_fence_link;
>   
>   	/**
> -	 * Priority level while the request is inflight. Differs from i915
> -	 * scheduler priority. See comment above
> -	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details.
> +	 * @guc_prio: Priority level while the request is inflight. Differs from
> +	 * i915 scheduler priority. See comment above
> +	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details. Protected by
> +	 * ce->guc_state.lock.
>   	 */
>   #define	GUC_PRIO_INIT	0xff
>   #define	GUC_PRIO_FINI	0xfe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [Intel-gfx] [PATCH 10/27] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
  2021-08-26  3:23 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
@ 2021-08-26  3:23 ` Matthew Brost
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Brost @ 2021-08-26  3:23 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

When unblocking a context, do not enable scheduling if the context is
banned, its guc_id is invalid, or it is not registered.

v2:
 (Daniele)
  - Add helper for unblock

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: <stable@vger.kernel.org>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 22 ++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b30fdccc71d4..56f11accd6cc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -148,6 +148,7 @@ static inline void clr_context_registered(struct intel_context *ce)
 #define SCHED_STATE_BLOCKED_SHIFT			4
 #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
 #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
+
 static inline void init_sched_state(struct intel_context *ce)
 {
 	/* Only should be called from guc_lrc_desc_pin() */
@@ -1569,6 +1570,23 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 	return &ce->guc_blocked;
 }
 
+#define SCHED_STATE_MULTI_BLOCKED_MASK \
+	(SCHED_STATE_BLOCKED_MASK & ~SCHED_STATE_BLOCKED)
+#define SCHED_STATE_NO_UNBLOCK \
+	(SCHED_STATE_MULTI_BLOCKED_MASK | \
+	 SCHED_STATE_PENDING_DISABLE | \
+	 SCHED_STATE_BANNED)
+
+static bool context_cant_unblock(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+
+	return (ce->guc_state.sched_state & SCHED_STATE_NO_UNBLOCK) ||
+		context_guc_id_invalid(ce) ||
+		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id) ||
+		!intel_context_is_pinned(ce);
+}
+
 static void guc_context_unblock(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
@@ -1583,9 +1601,7 @@ static void guc_context_unblock(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	if (unlikely(submission_disabled(guc) ||
-		     !intel_context_is_pinned(ce) ||
-		     context_pending_disable(ce) ||
-		     context_blocked(ce) > 1)) {
+		     context_cant_unblock(ce))) {
 		enable = false;
 	} else {
 		enable = true;
-- 
2.32.0
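
For reference, SCHED_STATE_MULTI_BLOCKED_MASK turns the old
context_blocked(ce) > 1 comparison into a pure mask test: the blocked count
occupies bits [4..15] of sched_state, so clearing the lowest bit of that
field leaves a non-zero value exactly when the count is at least two. A
small sketch (the helper name is hypothetical; the patch open-codes this
test inside context_cant_unblock()):

	/*
	 * blocked count = (sched_state & SCHED_STATE_BLOCKED_MASK) >> 4
	 *   count 0: field 0x000 -> & ~BIT(4) = 0x000 (not multi-blocked)
	 *   count 1: field 0x010 -> & ~BIT(4) = 0x000 (not multi-blocked)
	 *   count 2: field 0x020 -> & ~BIT(4) = 0x020 (multi-blocked)
	 */
	static bool context_multi_blocked(struct intel_context *ce)
	{
		lockdep_assert_held(&ce->guc_state.lock);
		return ce->guc_state.sched_state & SCHED_STATE_MULTI_BLOCKED_MASK;
	}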


^ permalink raw reply related	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2021-08-26  3:29 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-19  6:16 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 01/27] drm/i915/guc: Fix blocked context accounting Matthew Brost
2021-08-24 23:24   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 02/27] drm/i915/guc: Fix outstanding G2H accounting Matthew Brost
2021-08-19 21:31   ` Daniele Ceraolo Spurio
2021-08-19 21:30     ` Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 03/27] drm/i915/guc: Unwind context requests in reverse order Matthew Brost
2021-08-19 23:54   ` Daniele Ceraolo Spurio
2021-08-19 23:53     ` Matthew Brost
2021-08-20  0:03       ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 04/27] drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context Matthew Brost
2021-08-20  0:01   ` Daniele Ceraolo Spurio
2021-08-19 23:58     ` Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 05/27] drm/i915/guc: Process all G2H message at once in work queue Matthew Brost
2021-08-20  0:06   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 06/27] drm/i915/guc: Workaround reset G2H is received after schedule done G2H Matthew Brost
2021-08-24 23:31   ` Daniele Ceraolo Spurio
2021-08-25  4:05     ` Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 07/27] Revert "drm/i915/gt: Propagate change in error status to children on unhold" Matthew Brost
2021-08-20 19:47   ` Jason Ekstrand
2021-08-19  6:16 ` [Intel-gfx] [PATCH 08/27] drm/i915/selftests: Add a cancel request selftest that triggers a reset Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 09/27] drm/i915/guc: Kick tasklet after queuing a request Matthew Brost
2021-08-20 18:31   ` Daniele Ceraolo Spurio
2021-08-20 18:36     ` Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 10/27] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered Matthew Brost
2021-08-20 18:42   ` Daniele Ceraolo Spurio
2021-08-20 18:42     ` Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 11/27] drm/i915/selftests: Fix memory corruption in live_lrc_isolation Matthew Brost
2021-08-25  0:07   ` Daniele Ceraolo Spurio
2021-08-25 20:03     ` Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 12/27] drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H Matthew Brost
2021-08-25  0:58   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 13/27] drm/i915/guc: Take context ref when cancelling request Matthew Brost
2021-08-21  0:07   ` Daniele Ceraolo Spurio
2021-08-24 15:42     ` Matthew Brost
2021-08-25  1:21       ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 14/27] drm/i915/guc: Don't touch guc_state.sched_state without a lock Matthew Brost
2021-08-25  1:20   ` Daniele Ceraolo Spurio
2021-08-25  1:44     ` Matthew Brost
2021-08-25  1:51       ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 15/27] drm/i915/guc: Reset LRC descriptor if register returns -ENODEV Matthew Brost
2021-08-21  0:14   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 16/27] drm/i915: Allocate error capture in nowait context Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 17/27] drm/i915/guc: Flush G2H work queue during reset Matthew Brost
2021-08-21  0:25   ` Daniele Ceraolo Spurio
2021-08-24 15:44     ` Matthew Brost
2021-08-25  1:22       ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 18/27] drm/i915/guc: Release submit fence from an irq_work Matthew Brost
2021-08-25  1:44   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 19/27] drm/i915/guc: Move guc_blocked fence to struct guc_state Matthew Brost
2021-08-21  0:30   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 20/27] drm/i915/guc: Rework and simplify locking Matthew Brost
2021-08-25 16:52   ` Daniele Ceraolo Spurio
2021-08-25 19:22     ` Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 21/27] drm/i915/guc: Proper xarray usage for contexts_lookup Matthew Brost
2021-08-26  0:44   ` Daniele Ceraolo Spurio
2021-08-26  0:41     ` Matthew Brost
2021-08-26  0:48       ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 22/27] drm/i915/guc: Drop pin count check trick between sched_disable and re-pin Matthew Brost
2021-08-26  0:50   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 23/27] drm/i915/guc: Move GuC priority fields in context under guc_active Matthew Brost
2021-08-25 21:51   ` Daniele Ceraolo Spurio
2021-08-25 22:53     ` Matthew Brost
2021-08-25 23:04     ` Matthew Brost
2021-08-19  6:16 ` [Intel-gfx] [PATCH 24/27] drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure Matthew Brost
2021-08-25  2:00   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 25/27] drm/i915/guc: Drop guc_active move everything into guc_state Matthew Brost
2021-08-26  0:54   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 26/27] drm/i915/guc: Add GuC kernel doc Matthew Brost
2021-08-26  1:03   ` Daniele Ceraolo Spurio
2021-08-19  6:16 ` [Intel-gfx] [PATCH 27/27] drm/i915/guc: Drop static inline functions intel_guc_submission.c Matthew Brost
2021-08-19  7:18 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Clean up GuC CI failures, simplify locking, and kernel DOC (rev3) Patchwork
2021-08-19  7:20 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2021-08-19  7:51 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-08-19  9:08 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2021-08-26  3:23 [Intel-gfx] [PATCH 00/27] Clean up GuC CI failures, simplify locking, and kernel DOC Matthew Brost
2021-08-26  3:23 ` [Intel-gfx] [PATCH 10/27] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered Matthew Brost

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).