* [PATCH 0/7] Fix stealing guc_ids + test
@ 2021-12-11 17:35 Matthew Brost
  2021-12-11 17:35 ` [PATCH 1/7] drm/i915/guc: Use correct context lock when calling clr_context_registered Matthew Brost
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Matthew Brost @ 2021-12-11 17:35 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

Patches 1 & 2 address bugs in the stealing of guc_ids, and patch 7 tests
this path.

Patches 4-6 address issues with the CTs exposed by the selftest in patch
7: if a large number of contexts are all deregistered at once, the CT
channel can deadlock.

Patch 3 is a small fix that has already been reviewed; it is included
here for CI.
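
To make the deadlock concrete, here is a rough sketch of the credit flow
(distilled from the CT code touched by patches 5 & 6; h2g_has_room(),
g2h_has_room() and the receive tasklet are real, while the loop and the
for_each/deregister helpers below are illustrative only):

	/* Each context-deregister H2G also reserves credits for its G2H reply. */
	for_each_destroyed_context(ce) {
		while (!h2g_has_room(ct, h2g_dw) || !g2h_has_room(ct, g2h_dw)) {
			/*
			 * Credits only come back when the receive tasklet
			 * dequeues G2H messages. If it never runs, this
			 * wait lasts until ct_deadlocked() declares the
			 * channel broken -- hence the tasklet kick added
			 * in patch 6.
			 */
		}
		deregister_context(ce);	/* illustrative helper name */
	}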

v2: Address comments, resend for CI

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

John Harrison (1):
  drm/i915/guc: Don't hog IRQs when destroying contexts

Matthew Brost (6):
  drm/i915/guc: Use correct context lock when calling
    clr_context_registered
  drm/i915/guc: Only assign guc_id.id when stealing guc_id
  drm/i915/guc: Remove racey GEM_BUG_ON
  drm/i915/guc: Add extra debug on CT deadlock
  drm/i915/guc: Kick G2H tasklet if no credits
  drm/i915/guc: Selftest for stealing of guc ids

 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  12 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  18 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  69 ++++---
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 173 ++++++++++++++++++
 4 files changed, 244 insertions(+), 28 deletions(-)

-- 
2.33.1



* [PATCH 1/7] drm/i915/guc: Use correct context lock when calling clr_context_registered
  2021-12-11 17:35 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
@ 2021-12-11 17:35 ` Matthew Brost
  2021-12-11 17:35 ` [PATCH 2/7] drm/i915/guc: Only assign guc_id.id when stealing guc_id Matthew Brost
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2021-12-11 17:35 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

s/ce/cn/ when grabbing guc_state.lock before calling
clr_context_registered.

Fixes: 0f7976506de61 ("drm/i915/guc: Rework and simplify locking")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 1f9d4fde421f..9b7b4f4e0d91 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1937,9 +1937,9 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
 		list_del_init(&cn->guc_id.link);
 		ce->guc_id = cn->guc_id;
 
-		spin_lock(&ce->guc_state.lock);
+		spin_lock(&cn->guc_state.lock);
 		clr_context_registered(cn);
-		spin_unlock(&ce->guc_state.lock);
+		spin_unlock(&cn->guc_state.lock);
 
 		set_context_guc_id_invalid(cn);
 
-- 
2.33.1



* [PATCH 2/7] drm/i915/guc: Only assign guc_id.id when stealing guc_id
  2021-12-11 17:35 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
  2021-12-11 17:35 ` [PATCH 1/7] drm/i915/guc: Use correct context lock when calling clr_context_registered Matthew Brost
@ 2021-12-11 17:35 ` Matthew Brost
  2021-12-11 17:35 ` [PATCH 3/7] drm/i915/guc: Remove racey GEM_BUG_ON Matthew Brost
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2021-12-11 17:35 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

Previously the whole guc_id structure (list, spin lock) was assigned,
which is incorrect; only assign the guc_id.id.
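
For context, guc_id is a small per-context struct; roughly (a sketch
based on the fields named above, not the exact definition):

	struct {
		u16 id;			/* the actual GuC id */
		struct list_head link;	/* guc_id list membership */
		/* per the message above, a spin lock lives here too */
	} guc_id;

Copying the whole struct also copies cn's live list_head into ce,
corrupting the guc_id list as soon as either context touches it; only
the scalar id may be transferred.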

Fixes: 0f7976506de61 ("drm/i915/guc: Rework and simplify locking")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9b7b4f4e0d91..0fb2eeff0262 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1935,7 +1935,7 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
 		GEM_BUG_ON(intel_context_is_parent(cn));
 
 		list_del_init(&cn->guc_id.link);
-		ce->guc_id = cn->guc_id;
+		ce->guc_id.id = cn->guc_id.id;
 
 		spin_lock(&cn->guc_state.lock);
 		clr_context_registered(cn);
-- 
2.33.1



* [PATCH 3/7] drm/i915/guc: Remove racey GEM_BUG_ON
  2021-12-11 17:35 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
  2021-12-11 17:35 ` [PATCH 1/7] drm/i915/guc: Use correct context lock when calling clr_context_registered Matthew Brost
  2021-12-11 17:35 ` [PATCH 2/7] drm/i915/guc: Only assign guc_id.id when stealing guc_id Matthew Brost
@ 2021-12-11 17:35 ` Matthew Brost
  2021-12-11 17:35 ` [PATCH 4/7] drm/i915/guc: Don't hog IRQs when destroying contexts Matthew Brost
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2021-12-11 17:35 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

A full GT reset can race with the last context put, resulting in the
context ref count being zero but the destroyed bit not yet being set.
Remove the GEM_BUG_ON in scrub_guc_desc_for_outstanding_g2h that asserts
the destroyed bit must be set if the ref count is zero.
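
The race, as a sketch (the interleaving is illustrative; the functions
are the ones named above):

	/*
	 * CPU A: last intel_context_put()  -> ce ref count drops to zero
	 * CPU B: full GT reset             -> scrub_guc_desc_for_outstanding_g2h()
	 *                                     sees ref == 0 but destroyed not
	 *                                     yet set, so the GEM_BUG_ON fires
	 *                                     spuriously
	 * CPU A: destroy path continues    -> destroyed bit is finally set
	 */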

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0fb2eeff0262..36c2965db49b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1040,8 +1040,6 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 
 		spin_unlock(&ce->guc_state.lock);
 
-		GEM_BUG_ON(!do_put && !destroyed);
-
 		if (pending_enable || destroyed || deregister) {
 			decr_outstanding_submission_g2h(guc);
 			if (deregister)
-- 
2.33.1



* [PATCH 4/7] drm/i915/guc: Don't hog IRQs when destroying contexts
  2021-12-11 17:35 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
                   ` (2 preceding siblings ...)
  2021-12-11 17:35 ` [PATCH 3/7] drm/i915/guc: Remove racey GEM_BUG_ON Matthew Brost
@ 2021-12-11 17:35 ` Matthew Brost
  2021-12-11 17:35 ` [PATCH 5/7] drm/i915/guc: Add extra debug on CT deadlock Matthew Brost
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2021-12-11 17:35 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

From: John Harrison <John.C.Harrison@Intel.com>

While attempting to debug a CT deadlock issue in various CI failures
(most easily reproduced with gem_ctx_create/basic-files), I was seeing
CPU deadlock errors being reported. These were because the context
destroy loop was blocking, waiting on H2G space, from inside an IRQ
spinlock. There was no deadlock as such; it's just that the H2G queue
was full of context destroy commands and GuC was taking a long time to
process them. However, the kernel was seeing the large amount of time
spent inside the IRQ lock as a dead CPU. Various Bad Things(tm) would
then happen (heartbeat failures, CT deadlock errors, outstanding H2G
WARNs, etc.).

Re-working the loop to only acquire the spinlock around the list
management (which is all it is meant to protect), rather than around the
entire destroy operation, seems to fix all the above issues.
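
Distilled, the problematic shape was the following (illustrative; it
matches the removed lines in the diff below, and guc_lrc_desc_unpin()
is where the potentially long H2G wait happens):

	spin_lock_irqsave(&guc->submission_state.lock, flags); /* IRQs off */
	list_for_each_entry_safe(ce, cn,
				 &guc->submission_state.destroyed_contexts,
				 destroyed_link) {
		list_del_init(&ce->destroyed_link);
		guc_lrc_desc_unpin(ce);	/* may wait for H2G space */
	}
	spin_unlock_irqrestore(&guc->submission_state.lock, flags);

With enough contexts queued, the CPU sits in that loop with interrupts
disabled long enough to trip the lockup detectors even though forward
progress is being made; the rework below instead takes the lock only to
pop one entry at a time and runs the destroy work outside it.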

v2:
 (John Harrison)
  - Fix typo in comment message

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 ++++++++++++-------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 36c2965db49b..96fcf869e3ff 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2644,7 +2644,6 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 	unsigned long flags;
 	bool disabled;
 
-	lockdep_assert_held(&guc->submission_state.lock);
 	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
 	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
 	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
@@ -2660,7 +2659,7 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 	}
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	if (unlikely(disabled)) {
-		__release_guc_id(guc, ce);
+		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 		return;
 	}
@@ -2694,36 +2693,48 @@ static void __guc_context_destroy(struct intel_context *ce)
 
 static void guc_flush_destroyed_contexts(struct intel_guc *guc)
 {
-	struct intel_context *ce, *cn;
+	struct intel_context *ce;
 	unsigned long flags;
 
 	GEM_BUG_ON(!submission_disabled(guc) &&
 		   guc_submission_initialized(guc));
 
-	spin_lock_irqsave(&guc->submission_state.lock, flags);
-	list_for_each_entry_safe(ce, cn,
-				 &guc->submission_state.destroyed_contexts,
-				 destroyed_link) {
-		list_del_init(&ce->destroyed_link);
-		__release_guc_id(guc, ce);
+	while (!list_empty(&guc->submission_state.destroyed_contexts)) {
+		spin_lock_irqsave(&guc->submission_state.lock, flags);
+		ce = list_first_entry_or_null(&guc->submission_state.destroyed_contexts,
+					      struct intel_context,
+					      destroyed_link);
+		if (ce)
+			list_del_init(&ce->destroyed_link);
+		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+		if (!ce)
+			break;
+
+		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 	}
-	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
 static void deregister_destroyed_contexts(struct intel_guc *guc)
 {
-	struct intel_context *ce, *cn;
+	struct intel_context *ce;
 	unsigned long flags;
 
-	spin_lock_irqsave(&guc->submission_state.lock, flags);
-	list_for_each_entry_safe(ce, cn,
-				 &guc->submission_state.destroyed_contexts,
-				 destroyed_link) {
-		list_del_init(&ce->destroyed_link);
+	while (!list_empty(&guc->submission_state.destroyed_contexts)) {
+		spin_lock_irqsave(&guc->submission_state.lock, flags);
+		ce = list_first_entry_or_null(&guc->submission_state.destroyed_contexts,
+					      struct intel_context,
+					      destroyed_link);
+		if (ce)
+			list_del_init(&ce->destroyed_link);
+		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+		if (!ce)
+			break;
+
 		guc_lrc_desc_unpin(ce);
 	}
-	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
 static void destroyed_worker_func(struct work_struct *w)
-- 
2.33.1



* [PATCH 5/7] drm/i915/guc: Add extra debug on CT deadlock
  2021-12-11 17:35 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
                   ` (3 preceding siblings ...)
  2021-12-11 17:35 ` [PATCH 4/7] drm/i915/guc: Don't hog IRQs when destroying contexts Matthew Brost
@ 2021-12-11 17:35 ` Matthew Brost
  2021-12-11 17:35 ` [PATCH 6/7] drm/i915/guc: Kick G2H tasklet if no credits Matthew Brost
  2021-12-11 17:35 ` [PATCH 7/7] drm/i915/guc: Selftest for stealing of guc ids Matthew Brost
  6 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2021-12-11 17:35 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

Print CT state (H2G + G2H head / tail pointers, credits) on CT
deadlock.

v2:
 (John Harrison)
  - Add units to debug messages

Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index a0cc34be7b56..741be9abab68 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -523,6 +523,15 @@ static inline bool ct_deadlocked(struct intel_guc_ct *ct)
 		CT_ERROR(ct, "Communication stalled for %lld ms, desc status=%#x,%#x\n",
 			 ktime_ms_delta(ktime_get(), ct->stall_time),
 			 send->status, recv->status);
+		CT_ERROR(ct, "H2G Space: %u (Bytes)\n",
+			 atomic_read(&ct->ctbs.send.space) * 4);
+		CT_ERROR(ct, "Head: %u (Dwords)\n", ct->ctbs.send.desc->head);
+		CT_ERROR(ct, "Tail: %u (Dwords)\n", ct->ctbs.send.desc->tail);
+		CT_ERROR(ct, "G2H Space: %u (Bytes)\n",
+			 atomic_read(&ct->ctbs.recv.space) * 4);
+		CT_ERROR(ct, "Head: %u (Dwords)\n", ct->ctbs.recv.desc->head);
+		CT_ERROR(ct, "Tail: %u (Dwords)\n", ct->ctbs.recv.desc->tail);
+
 		ct->ctbs.send.broken = true;
 	}
 
-- 
2.33.1



* [PATCH 6/7] drm/i915/guc: Kick G2H tasklet if no credits
  2021-12-11 17:35 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
                   ` (4 preceding siblings ...)
  2021-12-11 17:35 ` [PATCH 5/7] drm/i915/guc: Add extra debug on CT deadlock Matthew Brost
@ 2021-12-11 17:35 ` Matthew Brost
  2021-12-14  0:27   ` John Harrison
  2021-12-11 17:35 ` [PATCH 7/7] drm/i915/guc: Selftest for stealing of guc ids Matthew Brost
  6 siblings, 1 reply; 13+ messages in thread
From: Matthew Brost @ 2021-12-11 17:35 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

Let's be paranoid and kick the G2H tasklet, which dequeues messages, if
G2H credits are exhausted.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 741be9abab68..aa6dd6415202 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -591,12 +591,19 @@ static inline bool h2g_has_room(struct intel_guc_ct *ct, u32 len_dw)
 
 static int has_room_nb(struct intel_guc_ct *ct, u32 h2g_dw, u32 g2h_dw)
 {
+	bool h2g = h2g_has_room(ct, h2g_dw);
+	bool g2h = g2h_has_room(ct, g2h_dw);
+
 	lockdep_assert_held(&ct->ctbs.send.lock);
 
-	if (unlikely(!h2g_has_room(ct, h2g_dw) || !g2h_has_room(ct, g2h_dw))) {
+	if (unlikely(!h2g || !g2h)) {
 		if (ct->stall_time == KTIME_MAX)
 			ct->stall_time = ktime_get();
 
+		/* Be paranoid and kick G2H tasklet to free credits */
+		if (!g2h)
+			tasklet_hi_schedule(&ct->receive_tasklet);
+
 		if (unlikely(ct_deadlocked(ct)))
 			return -EPIPE;
 		else
-- 
2.33.1



* [PATCH 7/7] drm/i915/guc: Selftest for stealing of guc ids
  2021-12-11 17:35 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
                   ` (5 preceding siblings ...)
  2021-12-11 17:35 ` [PATCH 6/7] drm/i915/guc: Kick G2H tasklet if no credits Matthew Brost
@ 2021-12-11 17:35 ` Matthew Brost
  2021-12-14  0:26   ` John Harrison
  6 siblings, 1 reply; 13+ messages in thread
From: Matthew Brost @ 2021-12-11 17:35 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

Testing the stealing of guc ids is hard from user space as we have 64k
guc_ids. Add a selftest, which artificially reduces the number of guc
ids, and forces a steal. Description of test below.

The test creates a spinner which is used to block all subsequent
submissions until it completes. Next, a loop creates a context and a NOP
request each iteration until the guc_ids are exhausted (request creation
returns -EAGAIN). The spinner is ended, unblocking all requests created
in the loop. At this point all guc_ids are exhausted but are available
to steal. Try to create another request which should successfully steal
a guc_id. Wait on last request to complete, idle GPU, verify a guc_id
was stolen via a counter, and exit the test. Test also artificially
reduces the number of guc_ids so the test runs in a timely manner.

v2:
 (John Harrison)
  - s/stole/stolen
  - Fix some wording in test description
  - Rework indexing into context array
  - Add test description to commit message
  - Fix typo in commit message
 (Checkpatch)
  - s/guc/(guc) in NUMBER_MULTI_LRC_GUC_ID

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  12 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  16 +-
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 173 ++++++++++++++++++
 3 files changed, 196 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 1cb46098030d..f9240d4baa69 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -94,6 +94,11 @@ struct intel_guc {
 		 * @guc_ids: used to allocate new guc_ids, single-lrc
 		 */
 		struct ida guc_ids;
+		/**
+		 * @num_guc_ids: Number of guc_ids, selftest feature to be able
+		 * to reduce this number while testing.
+		 */
+		int num_guc_ids;
 		/**
 		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
 		 */
@@ -202,6 +207,13 @@ struct intel_guc {
 		 */
 		struct delayed_work work;
 	} timestamp;
+
+#ifdef CONFIG_DRM_I915_SELFTEST
+	/**
+	 * @number_guc_id_stolen: The number of guc_ids that have been stolen
+	 */
+	int number_guc_id_stolen;
+#endif
 };
 
 static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 96fcf869e3ff..99414b49ca6d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -145,7 +145,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
  * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
  * multi-lrc.
  */
-#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
+#define NUMBER_MULTI_LRC_GUC_ID(guc)	\
+	((guc)->submission_state.num_guc_ids / 16)
 
 /*
  * Below is a set of functions which control the GuC scheduling state which
@@ -1775,7 +1776,7 @@ int intel_guc_submission_init(struct intel_guc *guc)
 		  destroyed_worker_func);
 
 	guc->submission_state.guc_ids_bitmap =
-		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
+		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
 	if (!guc->submission_state.guc_ids_bitmap)
 		return -ENOMEM;
 
@@ -1869,13 +1870,13 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
 
 	if (intel_context_is_parent(ce))
 		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
-					      NUMBER_MULTI_LRC_GUC_ID,
+					      NUMBER_MULTI_LRC_GUC_ID(guc),
 					      order_base_2(ce->parallel.number_children
 							   + 1));
 	else
 		ret = ida_simple_get(&guc->submission_state.guc_ids,
-				     NUMBER_MULTI_LRC_GUC_ID,
-				     GUC_MAX_LRC_DESCRIPTORS,
+				     NUMBER_MULTI_LRC_GUC_ID(guc),
+				     guc->submission_state.num_guc_ids,
 				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
 				     __GFP_NOWARN);
 	if (unlikely(ret < 0))
@@ -1941,6 +1942,10 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
 
 		set_context_guc_id_invalid(cn);
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+		guc->number_guc_id_stolen++;
+#endif
+
 		return 0;
 	} else {
 		return -EAGAIN;
@@ -3779,6 +3784,7 @@ static bool __guc_submission_selected(struct intel_guc *guc)
 
 void intel_guc_submission_init_early(struct intel_guc *guc)
 {
+	guc->submission_state.num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
 	guc->submission_supported = __guc_submission_supported(guc);
 	guc->submission_selected = __guc_submission_selected(guc);
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
index fb0e4a7bd8ca..90f5eb66281c 100644
--- a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
@@ -3,8 +3,21 @@
  * Copyright © 2021 Intel Corporation
  */
 
+#include "selftests/igt_spinner.h"
 #include "selftests/intel_scheduler_helpers.h"
 
+static int request_add_spin(struct i915_request *rq, struct igt_spinner *spin)
+{
+	int err = 0;
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+	if (spin && !igt_wait_for_spinner(spin, rq))
+		err = -ETIMEDOUT;
+
+	return err;
+}
+
 static struct i915_request *nop_user_request(struct intel_context *ce,
 					     struct i915_request *from)
 {
@@ -110,10 +123,170 @@ static int intel_guc_scrub_ctbs(void *arg)
 	return ret;
 }
 
+/*
+ * intel_guc_steal_guc_ids - Test to exhaust all guc_ids and then steal one
+ *
+ * This test creates a spinner which is used to block all subsequent submissions
+ * until it completes. Next, a loop creates a context and a NOP request each
+ * iteration until the guc_ids are exhausted (request creation returns -EAGAIN).
+ * The spinner is ended, unblocking all requests created in the loop. At this
+ * point all guc_ids are exhausted but are available to steal. Try to create
+ * another request which should successfully steal a guc_id. Wait on last
+ * request to complete, idle GPU, verify a guc_id was stolen via a counter, and
+ * exit the test. Test also artificially reduces the number of guc_ids so the
+ * test runs in a timely manner.
+ */
+static int intel_guc_steal_guc_ids(void *arg)
+{
+	struct intel_gt *gt = arg;
+	struct intel_guc *guc = &gt->uc.guc;
+	int ret, sv, context_index = 0;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct intel_context **ce;
+	struct igt_spinner spin;
+	struct i915_request *spin_rq = NULL, *rq, *last = NULL;
+	int number_guc_id_stolen = guc->number_guc_id_stolen;
+
+	ce = kzalloc(sizeof(*ce) * GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL);
+	if (!ce) {
+		pr_err("Context array allocation failed\n");
+		return -ENOMEM;
+	}
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+	engine = intel_selftest_find_any_engine(gt);
+	sv = guc->submission_state.num_guc_ids;
+	guc->submission_state.num_guc_ids = 4096;
+
+	/* Create spinner to block requests in below loop */
+	ce[context_index] = intel_context_create(engine);
+	if (IS_ERR(ce[context_index])) {
+		ce[context_index] = NULL;
+		ret = PTR_ERR(ce[context_index]);
+		pr_err("Failed to create context: %d\n", ret);
+		goto err_wakeref;
+	}
+	ret = igt_spinner_init(&spin, engine->gt);
+	if (ret) {
+		pr_err("Failed to create spinner: %d\n", ret);
+		goto err_contexts;
+	}
+	spin_rq = igt_spinner_create_request(&spin, ce[context_index],
+					     MI_ARB_CHECK);
+	if (IS_ERR(spin_rq)) {
+		ret = PTR_ERR(spin_rq);
+		pr_err("Failed to create spinner request: %d\n", ret);
+		goto err_contexts;
+	}
+	ret = request_add_spin(spin_rq, &spin);
+	if (ret) {
+		pr_err("Failed to add Spinner request: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	/* Use all guc_ids */
+	while (ret != -EAGAIN) {
+		ce[++context_index] = intel_context_create(engine);
+		if (IS_ERR(ce[context_index])) {
+			ce[context_index] = NULL;
+			ret = PTR_ERR(ce[context_index--]);
+			pr_err("Failed to create context: %d\n", ret);
+			goto err_spin_rq;
+		}
+
+		rq = nop_user_request(ce[context_index], spin_rq);
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			rq = NULL;
+			if (ret != -EAGAIN) {
+				pr_err("Failed to create request, %d: %d\n",
+				       context_index, ret);
+				goto err_spin_rq;
+			}
+		} else {
+			if (last)
+				i915_request_put(last);
+			last = rq;
+		}
+	}
+
+	/* Release blocked requests */
+	igt_spinner_end(&spin);
+	ret = intel_selftest_wait_for_rq(spin_rq);
+	if (ret) {
+		pr_err("Spin request failed to complete: %d\n", ret);
+		i915_request_put(last);
+		goto err_spin_rq;
+	}
+	i915_request_put(spin_rq);
+	igt_spinner_fini(&spin);
+	spin_rq = NULL;
+
+	/* Wait for last request */
+	ret = i915_request_wait(last, 0, HZ * 30);
+	i915_request_put(last);
+	if (ret < 0) {
+		pr_err("Last request failed to complete: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	/* Try to steal guc_id */
+	rq = nop_user_request(ce[context_index], NULL);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		pr_err("Failed to steal guc_id, %d: %d\n", context_index, ret);
+		goto err_spin_rq;
+	}
+
+	/* Wait for request with stolen guc_id */
+	ret = i915_request_wait(rq, 0, HZ);
+	i915_request_put(rq);
+	if (ret < 0) {
+		pr_err("Request with stolen guc_id failed to complete: %d\n",
+		       ret);
+		goto err_spin_rq;
+	}
+
+	/* Wait for idle */
+	ret = intel_gt_wait_for_idle(gt, HZ * 30);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	/* Verify a guc_id got stolen */
+	if (guc->number_guc_id_stolen == number_guc_id_stolen) {
+		pr_err("No guc_ids stolenn");
+		ret = -EINVAL;
+	} else {
+		ret = 0;
+	}
+
+err_spin_rq:
+	if (spin_rq) {
+		igt_spinner_end(&spin);
+		intel_selftest_wait_for_rq(spin_rq);
+		i915_request_put(spin_rq);
+		igt_spinner_fini(&spin);
+		intel_gt_wait_for_idle(gt, HZ * 30);
+	}
+err_contexts:
+	for (; context_index >= 0 && ce[context_index]; --context_index)
+		intel_context_put(ce[context_index]);
+err_wakeref:
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+	kfree(ce);
+	guc->submission_state.num_guc_ids = sv;
+
+	return ret;
+}
+
 int intel_guc_live_selftests(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
 		SUBTEST(intel_guc_scrub_ctbs),
+		SUBTEST(intel_guc_steal_guc_ids),
 	};
 	struct intel_gt *gt = &i915->gt;
 
-- 
2.33.1



* Re: [PATCH 7/7] drm/i915/guc: Selftest for stealing of guc ids
  2021-12-11 17:35 ` [PATCH 7/7] drm/i915/guc: Selftest for stealing of guc ids Matthew Brost
@ 2021-12-14  0:26   ` John Harrison
  2021-12-14 16:57     ` Matthew Brost
  0 siblings, 1 reply; 13+ messages in thread
From: John Harrison @ 2021-12-14  0:26 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 12/11/2021 09:35, Matthew Brost wrote:
> Testing the stealing of guc ids is hard from user space as we have 64k
> guc_ids. Add a selftest, which artificially reduces the number of guc
> ids, and forces a steal. Description of test below.
Last sentence seems redundant.

>
> The test creates a spinner which is used to block all subsequent
> submissions until it completes. Next, a loop creates a context and a NOP
> request each iteration until the guc_ids are exhausted (request creation
> returns -EAGAIN). The spinner is ended, unblocking all requests created
> in the loop. At this point all guc_ids are exhausted but are available
> to steal. Try to create another request which should successfully steal
> a guc_id. Wait on last request to complete, idle GPU, verify a guc_id
> was stolen via a counter, and exit the test. Test also artificially
> reduces the number of guc_ids so the test runs in a timely manner.
>
> v2:
>   (John Harrison)
>    - s/stole/stolen
>    - Fix some wording in test description
>    - Rework indexing into context array
>    - Add test description to commit message
>    - Fix typo in commit message
>   (Checkpatch)
>    - s/guc/(guc) in NUMBER_MULTI_LRC_GUC_ID
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  12 ++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  16 +-
>   drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 173 ++++++++++++++++++
>   3 files changed, 196 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 1cb46098030d..f9240d4baa69 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -94,6 +94,11 @@ struct intel_guc {
>   		 * @guc_ids: used to allocate new guc_ids, single-lrc
>   		 */
>   		struct ida guc_ids;
> +		/**
> +		 * @num_guc_ids: Number of guc_ids, selftest feature to be able
> +		 * to reduce this number while testing.
> +		 */
> +		int num_guc_ids;
>   		/**
>   		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
>   		 */
> @@ -202,6 +207,13 @@ struct intel_guc {
>   		 */
>   		struct delayed_work work;
>   	} timestamp;
> +
> +#ifdef CONFIG_DRM_I915_SELFTEST
> +	/**
> +	 * @number_guc_id_stolen: The number of guc_ids that have been stolen
> +	 */
> +	int number_guc_id_stolen;
> +#endif
>   };
>   
>   static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 96fcf869e3ff..99414b49ca6d 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -145,7 +145,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
>    * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
>    * multi-lrc.
>    */
> -#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
> +#define NUMBER_MULTI_LRC_GUC_ID(guc)	\
> +	((guc)->submission_state.num_guc_ids / 16)
>   
>   /*
>    * Below is a set of functions which control the GuC scheduling state which
> @@ -1775,7 +1776,7 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   		  destroyed_worker_func);
>   
>   	guc->submission_state.guc_ids_bitmap =
> -		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
> +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
>   	if (!guc->submission_state.guc_ids_bitmap)
>   		return -ENOMEM;
>   
> @@ -1869,13 +1870,13 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   
>   	if (intel_context_is_parent(ce))
>   		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
> -					      NUMBER_MULTI_LRC_GUC_ID,
> +					      NUMBER_MULTI_LRC_GUC_ID(guc),
>   					      order_base_2(ce->parallel.number_children
>   							   + 1));
>   	else
>   		ret = ida_simple_get(&guc->submission_state.guc_ids,
> -				     NUMBER_MULTI_LRC_GUC_ID,
> -				     GUC_MAX_LRC_DESCRIPTORS,
> +				     NUMBER_MULTI_LRC_GUC_ID(guc),
> +				     guc->submission_state.num_guc_ids,
>   				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
>   				     __GFP_NOWARN);
>   	if (unlikely(ret < 0))
> @@ -1941,6 +1942,10 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   
>   		set_context_guc_id_invalid(cn);
>   
> +#ifdef CONFIG_DRM_I915_SELFTEST
> +		guc->number_guc_id_stolen++;
> +#endif
> +
>   		return 0;
>   	} else {
>   		return -EAGAIN;
> @@ -3779,6 +3784,7 @@ static bool __guc_submission_selected(struct intel_guc *guc)
>   
>   void intel_guc_submission_init_early(struct intel_guc *guc)
>   {
> +	guc->submission_state.num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
>   	guc->submission_supported = __guc_submission_supported(guc);
>   	guc->submission_selected = __guc_submission_selected(guc);
>   }
> diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> index fb0e4a7bd8ca..90f5eb66281c 100644
> --- a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> @@ -3,8 +3,21 @@
>    * Copyright © 2021 Intel Corporation
>    */
>   
> +#include "selftests/igt_spinner.h"
>   #include "selftests/intel_scheduler_helpers.h"
>   
> +static int request_add_spin(struct i915_request *rq, struct igt_spinner *spin)
> +{
> +	int err = 0;
> +
> +	i915_request_get(rq);
> +	i915_request_add(rq);
> +	if (spin && !igt_wait_for_spinner(spin, rq))
> +		err = -ETIMEDOUT;
> +
> +	return err;
> +}
> +
>   static struct i915_request *nop_user_request(struct intel_context *ce,
>   					     struct i915_request *from)
>   {
> @@ -110,10 +123,170 @@ static int intel_guc_scrub_ctbs(void *arg)
>   	return ret;
>   }
>   
> +/*
> + * intel_guc_steal_guc_ids - Test to exhaust all guc_ids and then steal one
> + *
> + * This test creates a spinner which is used to block all subsequent submissions
> + * until it completes. Next, a loop creates a context and a NOP request each
> + * iteration until the guc_ids are exhausted (request creation returns -EAGAIN).
> + * The spinner is ended, unblocking all requests created in the loop. At this
> + * point all guc_ids are exhausted but are available to steal. Try to create
> + * another request which should successfully steal a guc_id. Wait on last
> + * request to complete, idle GPU, verify a guc_id was stolen via a counter, and
> + * exit the test. Test also artificially reduces the number of guc_ids so the
> + * test runs in a timely manner.
> + */
> +static int intel_guc_steal_guc_ids(void *arg)
> +{
> +	struct intel_gt *gt = arg;
> +	struct intel_guc *guc = &gt->uc.guc;
> +	int ret, sv, context_index = 0;
> +	intel_wakeref_t wakeref;
> +	struct intel_engine_cs *engine;
> +	struct intel_context **ce;
> +	struct igt_spinner spin;
> +	struct i915_request *spin_rq = NULL, *rq, *last = NULL;
> +	int number_guc_id_stolen = guc->number_guc_id_stolen;
> +
> +	ce = kzalloc(sizeof(*ce) * GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL);
> +	if (!ce) {
> +		pr_err("Context array allocation failed\n");
> +		return -ENOMEM;
> +	}
> +
> +	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
> +	engine = intel_selftest_find_any_engine(gt);
> +	sv = guc->submission_state.num_guc_ids;
> +	guc->submission_state.num_guc_ids = 4096;
> +
> +	/* Create spinner to block requests in below loop */
> +	ce[context_index] = intel_context_create(engine);
> +	if (IS_ERR(ce[context_index])) {
> +		ce[context_index] = NULL;
> +		ret = PTR_ERR(ce[context_index]);
Um, this is now null!

> +		pr_err("Failed to create context: %d\n", ret);
> +		goto err_wakeref;
> +	}
> +	ret = igt_spinner_init(&spin, engine->gt);
> +	if (ret) {
> +		pr_err("Failed to create spinner: %d\n", ret);
> +		goto err_contexts;
> +	}
> +	spin_rq = igt_spinner_create_request(&spin, ce[context_index],
> +					     MI_ARB_CHECK);
> +	if (IS_ERR(spin_rq)) {
> +		ret = PTR_ERR(spin_rq);
> +		pr_err("Failed to create spinner request: %d\n", ret);
> +		goto err_contexts;
> +	}
> +	ret = request_add_spin(spin_rq, &spin);
> +	if (ret) {
> +		pr_err("Failed to add Spinner request: %d\n", ret);
> +		goto err_spin_rq;
> +	}
> +
> +	/* Use all guc_ids */
> +	while (ret != -EAGAIN) {
> +		ce[++context_index] = intel_context_create(engine);
> +		if (IS_ERR(ce[context_index])) {
> +			ce[context_index] = NULL;
> +			ret = PTR_ERR(ce[context_index--]);
And again.

> +			pr_err("Failed to create context: %d\n", ret);
> +			goto err_spin_rq;
> +		}
> +
> +		rq = nop_user_request(ce[context_index], spin_rq);
> +		if (IS_ERR(rq)) {
> +			ret = PTR_ERR(rq);
> +			rq = NULL;
> +			if (ret != -EAGAIN) {
> +				pr_err("Failed to create request, %d: %d\n",
> +				       context_index, ret);
> +				goto err_spin_rq;
> +			}
> +		} else {
> +			if (last)
> +				i915_request_put(last);
> +			last = rq;
> +		}
> +	}
> +
> +	/* Release blocked requests */
> +	igt_spinner_end(&spin);
> +	ret = intel_selftest_wait_for_rq(spin_rq);
> +	if (ret) {
> +		pr_err("Spin request failed to complete: %d\n", ret);
> +		i915_request_put(last);
> +		goto err_spin_rq;
> +	}
> +	i915_request_put(spin_rq);
> +	igt_spinner_fini(&spin);
> +	spin_rq = NULL;
> +
> +	/* Wait for last request */
> +	ret = i915_request_wait(last, 0, HZ * 30);
> +	i915_request_put(last);
> +	if (ret < 0) {
> +		pr_err("Last request failed to complete: %d\n", ret);
> +		goto err_spin_rq;
> +	}
> +
> +	/* Try to steal guc_id */
> +	rq = nop_user_request(ce[context_index], NULL);
> +	if (IS_ERR(rq)) {
> +		ret = PTR_ERR(rq);
> +		pr_err("Failed to steal guc_id, %d: %d\n", context_index, ret);
> +		goto err_spin_rq;
> +	}
> +
> +	/* Wait for request with stolen guc_id */
> +	ret = i915_request_wait(rq, 0, HZ);
> +	i915_request_put(rq);
> +	if (ret < 0) {
> +		pr_err("Request with stolen guc_id failed to complete: %d\n",
> +		       ret);
> +		goto err_spin_rq;
> +	}
> +
> +	/* Wait for idle */
> +	ret = intel_gt_wait_for_idle(gt, HZ * 30);
> +	if (ret < 0) {
> +		pr_err("GT failed to idle: %d\n", ret);
> +		goto err_spin_rq;
> +	}
> +
> +	/* Verify a guc_id got stolen */
got stolen -> was stolen

> +	if (guc->number_guc_id_stolen == number_guc_id_stolen) {
> +		pr_err("No guc_ids stolenn");
ids stolenn -> id was stolen

John.

> +		ret = -EINVAL;
> +	} else {
> +		ret = 0;
> +	}
> +
> +err_spin_rq:
> +	if (spin_rq) {
> +		igt_spinner_end(&spin);
> +		intel_selftest_wait_for_rq(spin_rq);
> +		i915_request_put(spin_rq);
> +		igt_spinner_fini(&spin);
> +		intel_gt_wait_for_idle(gt, HZ * 30);
> +	}
> +err_contexts:
> +	for (; context_index >= 0 && ce[context_index]; --context_index)
> +		intel_context_put(ce[context_index]);
> +err_wakeref:
> +	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
> +	kfree(ce);
> +	guc->submission_state.num_guc_ids = sv;
> +
> +	return ret;
> +}
> +
>   int intel_guc_live_selftests(struct drm_i915_private *i915)
>   {
>   	static const struct i915_subtest tests[] = {
>   		SUBTEST(intel_guc_scrub_ctbs),
> +		SUBTEST(intel_guc_steal_guc_ids),
>   	};
>   	struct intel_gt *gt = &i915->gt;
>   



* Re: [PATCH 6/7] drm/i915/guc: Kick G2H tasklet if no credits
  2021-12-11 17:35 ` [PATCH 6/7] drm/i915/guc: Kick G2H tasklet if no credits Matthew Brost
@ 2021-12-14  0:27   ` John Harrison
  0 siblings, 0 replies; 13+ messages in thread
From: John Harrison @ 2021-12-14  0:27 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 12/11/2021 09:35, Matthew Brost wrote:
> Let's be paranoid and kick the G2H tasklet, which dequeues messages, if
> G2H credits are exhausted.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index 741be9abab68..aa6dd6415202 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -591,12 +591,19 @@ static inline bool h2g_has_room(struct intel_guc_ct *ct, u32 len_dw)
>   
>   static int has_room_nb(struct intel_guc_ct *ct, u32 h2g_dw, u32 g2h_dw)
>   {
> +	bool h2g = h2g_has_room(ct, h2g_dw);
> +	bool g2h = g2h_has_room(ct, g2h_dw);
> +
>   	lockdep_assert_held(&ct->ctbs.send.lock);
>   
> -	if (unlikely(!h2g_has_room(ct, h2g_dw) || !g2h_has_room(ct, g2h_dw))) {
> +	if (unlikely(!h2g || !g2h)) {
>   		if (ct->stall_time == KTIME_MAX)
>   			ct->stall_time = ktime_get();
>   
> +		/* Be paranoid and kick G2H tasklet to free credits */
> +		if (!g2h)
> +			tasklet_hi_schedule(&ct->receive_tasklet);
> +
>   		if (unlikely(ct_deadlocked(ct)))
>   			return -EPIPE;
>   		else



* Re: [PATCH 7/7] drm/i915/guc: Selftest for stealing of guc ids
  2021-12-14  0:26   ` John Harrison
@ 2021-12-14 16:57     ` Matthew Brost
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2021-12-14 16:57 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, daniele.ceraolospurio, dri-devel

On Mon, Dec 13, 2021 at 04:26:07PM -0800, John Harrison wrote:
> On 12/11/2021 09:35, Matthew Brost wrote:
> > Testing the stealing of guc ids is hard from user space as we have 64k
> > guc_ids. Add a selftest, which artificially reduces the number of guc
> > ids, and forces a steal. Description of test below.
> Last sentence seems redundant.
> 

Will delete.

> > 
> > The test creates a spinner which is used to block all subsequent
> > submissions until it completes. Next, a loop creates a context and a NOP
> > request each iteration until the guc_ids are exhausted (request creation
> > returns -EAGAIN). The spinner is ended, unblocking all requests created
> > in the loop. At this point all guc_ids are exhausted but are available
> > to steal. Try to create another request which should successfully steal
> > a guc_id. Wait on last request to complete, idle GPU, verify a guc_id
> > was stolen via a counter, and exit the test. Test also artificially
> > reduces the number of guc_ids so the test runs in a timely manner.
> > 
> > v2:
> >   (John Harrison)
> >    - s/stole/stolen
> >    - Fix some wording in test description
> >    - Rework indexing into context array
> >    - Add test description to commit message
> >    - Fix typo in commit message
> >   (Checkpatch)
> >    - s/guc/(guc) in NUMBER_MULTI_LRC_GUC_ID
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  12 ++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  16 +-
> >   drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 173 ++++++++++++++++++
> >   3 files changed, 196 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 1cb46098030d..f9240d4baa69 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -94,6 +94,11 @@ struct intel_guc {
> >   		 * @guc_ids: used to allocate new guc_ids, single-lrc
> >   		 */
> >   		struct ida guc_ids;
> > +		/**
> > +		 * @num_guc_ids: Number of guc_ids, selftest feature to be able
> > +		 * to reduce this number while testing.
> > +		 */
> > +		int num_guc_ids;
> >   		/**
> >   		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
> >   		 */
> > @@ -202,6 +207,13 @@ struct intel_guc {
> >   		 */
> >   		struct delayed_work work;
> >   	} timestamp;
> > +
> > +#ifdef CONFIG_DRM_I915_SELFTEST
> > +	/**
> > +	 * @number_guc_id_stolen: The number of guc_ids that have been stolen
> > +	 */
> > +	int number_guc_id_stolen;
> > +#endif
> >   };
> >   static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 96fcf869e3ff..99414b49ca6d 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -145,7 +145,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
> >    * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
> >    * multi-lrc.
> >    */
> > -#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
> > +#define NUMBER_MULTI_LRC_GUC_ID(guc)	\
> > +	((guc)->submission_state.num_guc_ids / 16)
> >   /*
> >    * Below is a set of functions which control the GuC scheduling state which
> > @@ -1775,7 +1776,7 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   		  destroyed_worker_func);
> >   	guc->submission_state.guc_ids_bitmap =
> > -		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
> > +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> >   	if (!guc->submission_state.guc_ids_bitmap)
> >   		return -ENOMEM;
> > @@ -1869,13 +1870,13 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	if (intel_context_is_parent(ce))
> >   		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
> > -					      NUMBER_MULTI_LRC_GUC_ID,
> > +					      NUMBER_MULTI_LRC_GUC_ID(guc),
> >   					      order_base_2(ce->parallel.number_children
> >   							   + 1));
> >   	else
> >   		ret = ida_simple_get(&guc->submission_state.guc_ids,
> > -				     NUMBER_MULTI_LRC_GUC_ID,
> > -				     GUC_MAX_LRC_DESCRIPTORS,
> > +				     NUMBER_MULTI_LRC_GUC_ID(guc),
> > +				     guc->submission_state.num_guc_ids,
> >   				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
> >   				     __GFP_NOWARN);
> >   	if (unlikely(ret < 0))
> > @@ -1941,6 +1942,10 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   		set_context_guc_id_invalid(cn);
> > +#ifdef CONFIG_DRM_I915_SELFTEST
> > +		guc->number_guc_id_stolen++;
> > +#endif
> > +
> >   		return 0;
> >   	} else {
> >   		return -EAGAIN;
> > @@ -3779,6 +3784,7 @@ static bool __guc_submission_selected(struct intel_guc *guc)
> >   void intel_guc_submission_init_early(struct intel_guc *guc)
> >   {
> > +	guc->submission_state.num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> >   	guc->submission_supported = __guc_submission_supported(guc);
> >   	guc->submission_selected = __guc_submission_selected(guc);
> >   }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> > index fb0e4a7bd8ca..90f5eb66281c 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
> > @@ -3,8 +3,21 @@
> >    * Copyright © 2021 Intel Corporation
> >    */
> > +#include "selftests/igt_spinner.h"
> >   #include "selftests/intel_scheduler_helpers.h"
> > +static int request_add_spin(struct i915_request *rq, struct igt_spinner *spin)
> > +{
> > +	int err = 0;
> > +
> > +	i915_request_get(rq);
> > +	i915_request_add(rq);
> > +	if (spin && !igt_wait_for_spinner(spin, rq))
> > +		err = -ETIMEDOUT;
> > +
> > +	return err;
> > +}
> > +
> >   static struct i915_request *nop_user_request(struct intel_context *ce,
> >   					     struct i915_request *from)
> >   {
> > @@ -110,10 +123,170 @@ static int intel_guc_scrub_ctbs(void *arg)
> >   	return ret;
> >   }
> > +/*
> > + * intel_guc_steal_guc_ids - Test to exhaust all guc_ids and then steal one
> > + *
> > + * This test creates a spinner which is used to block all subsequent submissions
> > + * until it completes. Next, a loop creates a context and a NOP request each
> > + * iteration until the guc_ids are exhausted (request creation returns -EAGAIN).
> > + * The spinner is ended, unblocking all requests created in the loop. At this
> > + * point all guc_ids are exhausted but are available to steal. Try to create
> > + * another request which should successfully steal a guc_id. Wait on last
> > + * request to complete, idle GPU, verify a guc_id was stolen via a counter, and
> > + * exit the test. Test also artificially reduces the number of guc_ids so the
> > + * test runs in a timely manner.
> > + */
> > +static int intel_guc_steal_guc_ids(void *arg)
> > +{
> > +	struct intel_gt *gt = arg;
> > +	struct intel_guc *guc = &gt->uc.guc;
> > +	int ret, sv, context_index = 0;
> > +	intel_wakeref_t wakeref;
> > +	struct intel_engine_cs *engine;
> > +	struct intel_context **ce;
> > +	struct igt_spinner spin;
> > +	struct i915_request *spin_rq = NULL, *rq, *last = NULL;
> > +	int number_guc_id_stolen = guc->number_guc_id_stolen;
> > +
> > +	ce = kzalloc(sizeof(*ce) * GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL);
> > +	if (!ce) {
> > +		pr_err("Context array allocation failed\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
> > +	engine = intel_selftest_find_any_engine(gt);
> > +	sv = guc->submission_state.num_guc_ids;
> > +	guc->submission_state.num_guc_ids = 4096;
> > +
> > +	/* Create spinner to block requests in below loop */
> > +	ce[context_index] = intel_context_create(engine);
> > +	if (IS_ERR(ce[context_index])) {
> > +		ce[context_index] = NULL;
> > +		ret = PTR_ERR(ce[context_index]);
> Um, this is now null!
> 

Ugh, wrong order. Will fix.

> > +		pr_err("Failed to create context: %d\n", ret);
> > +		goto err_wakeref;
> > +	}
> > +	ret = igt_spinner_init(&spin, engine->gt);
> > +	if (ret) {
> > +		pr_err("Failed to create spinner: %d\n", ret);
> > +		goto err_contexts;
> > +	}
> > +	spin_rq = igt_spinner_create_request(&spin, ce[context_index],
> > +					     MI_ARB_CHECK);
> > +	if (IS_ERR(spin_rq)) {
> > +		ret = PTR_ERR(spin_rq);
> > +		pr_err("Failed to create spinner request: %d\n", ret);
> > +		goto err_contexts;
> > +	}
> > +	ret = request_add_spin(spin_rq, &spin);
> > +	if (ret) {
> > +		pr_err("Failed to add Spinner request: %d\n", ret);
> > +		goto err_spin_rq;
> > +	}
> > +
> > +	/* Use all guc_ids */
> > +	while (ret != -EAGAIN) {
> > +		ce[++context_index] = intel_context_create(engine);
> > +		if (IS_ERR(ce[context_index])) {
> > +			ce[context_index] = NULL;
> > +			ret = PTR_ERR(ce[context_index--]);
> And again.
>

Same.
 
> > +			pr_err("Failed to create context: %d\n", ret);
> > +			goto err_spin_rq;
> > +		}
> > +
> > +		rq = nop_user_request(ce[context_index], spin_rq);
> > +		if (IS_ERR(rq)) {
> > +			ret = PTR_ERR(rq);
> > +			rq = NULL;
> > +			if (ret != -EAGAIN) {
> > +				pr_err("Failed to create request, %d: %d\n",
> > +				       context_index, ret);
> > +				goto err_spin_rq;
> > +			}
> > +		} else {
> > +			if (last)
> > +				i915_request_put(last);
> > +			last = rq;
> > +		}
> > +	}
> > +
> > +	/* Release blocked requests */
> > +	igt_spinner_end(&spin);
> > +	ret = intel_selftest_wait_for_rq(spin_rq);
> > +	if (ret) {
> > +		pr_err("Spin request failed to complete: %d\n", ret);
> > +		i915_request_put(last);
> > +		goto err_spin_rq;
> > +	}
> > +	i915_request_put(spin_rq);
> > +	igt_spinner_fini(&spin);
> > +	spin_rq = NULL;
> > +
> > +	/* Wait for last request */
> > +	ret = i915_request_wait(last, 0, HZ * 30);
> > +	i915_request_put(last);
> > +	if (ret < 0) {
> > +		pr_err("Last request failed to complete: %d\n", ret);
> > +		goto err_spin_rq;
> > +	}
> > +
> > +	/* Try to steal guc_id */
> > +	rq = nop_user_request(ce[context_index], NULL);
> > +	if (IS_ERR(rq)) {
> > +		ret = PTR_ERR(rq);
> > +		pr_err("Failed to steal guc_id, %d: %d\n", context_index, ret);
> > +		goto err_spin_rq;
> > +	}
> > +
> > +	/* Wait for request with stolen guc_id */
> > +	ret = i915_request_wait(rq, 0, HZ);
> > +	i915_request_put(rq);
> > +	if (ret < 0) {
> > +		pr_err("Request with stolen guc_id failed to complete: %d\n",
> > +		       ret);
> > +		goto err_spin_rq;
> > +	}
> > +
> > +	/* Wait for idle */
> > +	ret = intel_gt_wait_for_idle(gt, HZ * 30);
> > +	if (ret < 0) {
> > +		pr_err("GT failed to idle: %d\n", ret);
> > +		goto err_spin_rq;
> > +	}
> > +
> > +	/* Verify a guc_id got stolen */
> got stolen -> was stolen
> 

Yes.

> > +	if (guc->number_guc_id_stolen == number_guc_id_stolen) {
> > +		pr_err("No guc_ids stolenn");
> ids stolenn -> id was stolen
>

And yes.

> John.
> 
> > +		ret = -EINVAL;
> > +	} else {
> > +		ret = 0;
> > +	}
> > +
> > +err_spin_rq:
> > +	if (spin_rq) {
> > +		igt_spinner_end(&spin);
> > +		intel_selftest_wait_for_rq(spin_rq);
> > +		i915_request_put(spin_rq);
> > +		igt_spinner_fini(&spin);
> > +		intel_gt_wait_for_idle(gt, HZ * 30);
> > +	}
> > +err_contexts:
> > +	for (; context_index >= 0 && ce[context_index]; --context_index)
> > +		intel_context_put(ce[context_index]);
> > +err_wakeref:
> > +	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
> > +	kfree(ce);
> > +	guc->submission_state.num_guc_ids = sv;
> > +
> > +	return ret;
> > +}
> > +
> >   int intel_guc_live_selftests(struct drm_i915_private *i915)
> >   {
> >   	static const struct i915_subtest tests[] = {
> >   		SUBTEST(intel_guc_scrub_ctbs),
> > +		SUBTEST(intel_guc_steal_guc_ids),
> >   	};
> >   	struct intel_gt *gt = &i915->gt;
> 


* [PATCH 3/7] drm/i915/guc: Remove racey GEM_BUG_ON
  2021-12-14 17:04 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
@ 2021-12-14 17:04 ` Matthew Brost
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2021-12-14 17:04 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

A full GT reset can race with the last context put, resulting in the
context ref count being zero but the destroyed bit not yet being set.
Remove the GEM_BUG_ON in scrub_guc_desc_for_outstanding_g2h that asserts
the destroyed bit must be set if the ref count is zero.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0fb2eeff0262..36c2965db49b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1040,8 +1040,6 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 
 		spin_unlock(&ce->guc_state.lock);
 
-		GEM_BUG_ON(!do_put && !destroyed);
-
 		if (pending_enable || destroyed || deregister) {
 			decr_outstanding_submission_g2h(guc);
 			if (deregister)
-- 
2.33.1



* [PATCH 3/7] drm/i915/guc: Remove racey GEM_BUG_ON
  2021-12-11  0:56 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
@ 2021-12-11  0:56 ` Matthew Brost
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2021-12-11  0:56 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison

A full GT reset can race with the last context put, resulting in the
context ref count being zero but the destroyed bit not yet being set.
Remove the GEM_BUG_ON in scrub_guc_desc_for_outstanding_g2h that asserts
the destroyed bit must be set if the ref count is zero.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0fb2eeff0262..36c2965db49b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1040,8 +1040,6 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 
 		spin_unlock(&ce->guc_state.lock);
 
-		GEM_BUG_ON(!do_put && !destroyed);
-
 		if (pending_enable || destroyed || deregister) {
 			decr_outstanding_submission_g2h(guc);
 			if (deregister)
-- 
2.33.1



end of thread, other threads:[~2021-12-14 17:14 UTC | newest]

Thread overview: 13+ messages
2021-12-11 17:35 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
2021-12-11 17:35 ` [PATCH 1/7] drm/i915/guc: Use correct context lock when calling clr_context_registered Matthew Brost
2021-12-11 17:35 ` [PATCH 2/7] drm/i915/guc: Only assign guc_id.id when stealing guc_id Matthew Brost
2021-12-11 17:35 ` [PATCH 3/7] drm/i915/guc: Remove racey GEM_BUG_ON Matthew Brost
2021-12-11 17:35 ` [PATCH 4/7] drm/i915/guc: Don't hog IRQs when destroying contexts Matthew Brost
2021-12-11 17:35 ` [PATCH 5/7] drm/i915/guc: Add extra debug on CT deadlock Matthew Brost
2021-12-11 17:35 ` [PATCH 6/7] drm/i915/guc: Kick G2H tasklet if no credits Matthew Brost
2021-12-14  0:27   ` John Harrison
2021-12-11 17:35 ` [PATCH 7/7] drm/i915/guc: Selftest for stealing of guc ids Matthew Brost
2021-12-14  0:26   ` John Harrison
2021-12-14 16:57     ` Matthew Brost
  -- strict thread matches above, loose matches on Subject: below --
2021-12-14 17:04 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
2021-12-14 17:04 ` [PATCH 3/7] drm/i915/guc: Remove racey GEM_BUG_ON Matthew Brost
2021-12-11  0:56 [PATCH 0/7] Fix stealing guc_ids + test Matthew Brost
2021-12-11  0:56 ` [PATCH 3/7] drm/i915/guc: Remove racey GEM_BUG_ON Matthew Brost
