* [PATCH 00/26] Parallel submission aka multi-bb execbuf
From: Matthew Brost @ 2021-10-04 22:06 UTC
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

As discussed in [1], we are introducing a new parallel submission uAPI
for the i915 which allows more than one BB to be submitted in an execbuf
IOCTL. This is the implementation for both GuC and execlists.

In addition to the selftests in this series, an IGT test is available,
implemented in the first 4 patches of [2].

The execbuf IOCTL changes have been done in a single large patch (#21)
as all the changes flow together, and I believe a single patch will be
better if someone has to look up this change in the future. It can be
split into a series of smaller patches if desired.

This code is available in a public repo [3] for UMD teams to test their
code against.
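
For a rough idea of the flow from userspace, the sketch below sets up a
2-wide parallel engine and submits two BBs in a single execbuf IOCTL.
This is illustrative only: the struct and flag names follow the uAPI
proposed in this series' i915_drm.h and may differ in detail, and bb0 /
bb1 / ctx_id / fd stand in for GEM handles, the context id, and the DRM
fd.

/* Hedged sketch, not part of the series: a 2-wide parallel engine on
 * two video engines, with one possible placement per BB slot.
 */
struct {
	struct i915_context_engines_parallel_submit p;
	struct i915_engine_class_instance engines[2];
} ext = {
	.p.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
	.p.engine_index = 0,	/* slot in the context's engine map */
	.p.width = 2,		/* number of BBs per execbuf */
	.p.num_siblings = 1,	/* possible placements per BB slot */
	.engines = {
		{ I915_ENGINE_CLASS_VIDEO, 0 },	/* placement for BB 0 */
		{ I915_ENGINE_CLASS_VIDEO, 1 },	/* placement for BB 1 */
	},
};
/* ext is attached via I915_CONTEXT_PARAM_ENGINES on the context
 * (context creation / set_param omitted for brevity).
 */

struct drm_i915_gem_exec_object2 obj[2] = {
	{ .handle = bb0 },	/* batch for slot 0 */
	{ .handle = bb1 },	/* batch for slot 1 */
};
struct drm_i915_gem_execbuffer2 execbuf = {
	.buffers_ptr = (uintptr_t)obj,
	.buffer_count = 2,
	/* engine 0 of the map; with BATCH_FIRST the BBs lead the list */
	.flags = I915_EXEC_BATCH_FIRST,
	.rsvd1 = ctx_id,	/* context with the parallel engine map */
};
ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);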

v2: Drop the complicated state machine that blocked in the kernel if no
guc_ids were available, perma-pin parallel contexts, rework the execbuf
IOCTL to be a series of loops inside the IOCTL rather than one large one
on the outside, and address Daniel Vetter's comments
v3: Address John Harrison's comments, add a couple of patches which fix
bugs found internally

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

[1] https://patchwork.freedesktop.org/series/92028/
[2] https://patchwork.freedesktop.org/series/93071/
[3] https://gitlab.freedesktop.org/mbrost/mbrost-drm-intel/-/tree/drm-intel-parallel

Matthew Brost (26):
  drm/i915/guc: Move GuC guc_id allocation under submission state
    sub-struct
  drm/i915/guc: Take GT PM ref when deregistering context
  drm/i915/guc: Take engine PM when a context is pinned with GuC
    submission
  drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  drm/i915: Add logical engine mapping
  drm/i915: Expose logical engine instance to user
  drm/i915/guc: Introduce context parent-child relationship
  drm/i915/guc: Add multi-lrc context registration
  drm/i915/guc: Ensure GuC schedule operations do not operate on child
    contexts
  drm/i915/guc: Assign contexts in parent-child relationship consecutive
    guc_ids
  drm/i915/guc: Implement parallel context pin / unpin functions
  drm/i915/guc: Implement multi-lrc submission
  drm/i915/guc: Insert submit fences between requests in parent-child
    relationship
  drm/i915/guc: Implement multi-lrc reset
  drm/i915/guc: Update debugfs for GuC multi-lrc
  drm/i915: Fix bug in user proto-context creation that leaked contexts
  drm/i915/guc: Connect UAPI to GuC multi-lrc interface
  drm/i915/doc: Update parallel submit doc to point to i915_drm.h
  drm/i915/guc: Add basic GuC multi-lrc selftest
  drm/i915/guc: Implement no mid batch preemption for multi-lrc
  drm/i915: Multi-BB execbuf
  drm/i915/guc: Handle errors in multi-lrc requests
  drm/i915: Make request conflict tracking understand parallel submits
  drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
  drm/i915: Enable multi-bb execbuf
  drm/i915/execlists: Weak parallel submission support for execlists

 Documentation/gpu/rfc/i915_parallel_execbuf.h |  122 --
 Documentation/gpu/rfc/i915_scheduler.rst      |    4 +-
 drivers/gpu/drm/i915/gem/i915_gem_busy.c      |   60 +-
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  225 ++-
 .../gpu/drm/i915/gem/i915_gem_context_types.h |    6 +
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  796 ++++++---
 drivers/gpu/drm/i915/gt/intel_context.c       |   50 +-
 drivers/gpu/drm/i915/gt/intel_context.h       |   54 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |   73 +-
 drivers/gpu/drm/i915/gt/intel_engine.h        |   12 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   66 +-
 drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   13 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   37 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |    7 +
 .../drm/i915/gt/intel_execlists_submission.c  |   63 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |   14 +
 drivers/gpu/drm/i915/gt/intel_lrc.c           |    7 +
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |   12 +-
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |    1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.c        |   26 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   49 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |    2 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |   24 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   27 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 1456 ++++++++++++++---
 .../drm/i915/gt/uc/selftest_guc_multi_lrc.c   |  179 ++
 drivers/gpu/drm/i915/i915_query.c             |    2 +
 drivers/gpu/drm/i915/i915_request.c           |  143 +-
 drivers/gpu/drm/i915/i915_request.h           |   23 +
 drivers/gpu/drm/i915/i915_vma.c               |   21 +-
 drivers/gpu/drm/i915/i915_vma.h               |   13 +-
 drivers/gpu/drm/i915/intel_wakeref.h          |   12 +
 .../drm/i915/selftests/i915_live_selftests.h  |    1 +
 include/uapi/drm/i915_drm.h                   |  139 +-
 34 files changed, 3038 insertions(+), 701 deletions(-)
 delete mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c

-- 
2.32.0



* [PATCH 01/26] drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
From: Matthew Brost @ 2021-10-04 22:06 UTC
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Move guc_id allocation under a submission state sub-struct, as a future
patch will reuse the spin lock as a global submission state lock. Moving
this into a sub-struct makes ownership of the fields / lock clear.
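
For reference, with this change the documented lock ordering nests
guc->submission_state.lock outside ce->guc_state.lock. A minimal
illustrative sketch of legal nesting (not code added by this patch):

/* Legal nesting per the documented ordering rules:
 *   guc->submission_state.lock -> ce->guc_state.lock
 */
unsigned long flags;

spin_lock_irqsave(&guc->submission_state.lock, flags);
spin_lock(&ce->guc_state.lock);
/* ... update guc_id and per-context submission state ... */
spin_unlock(&ce->guc_state.lock);
spin_unlock_irqrestore(&guc->submission_state.lock, flags);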

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  6 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        | 26 +++++----
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 ++++++++++---------
 3 files changed, 47 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 12252c411159..e7e3984aab78 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -197,18 +197,18 @@ struct intel_context {
 	struct {
 		/**
 		 * @id: handle which is used to uniquely identify this context
-		 * with the GuC, protected by guc->contexts_lock
+		 * with the GuC, protected by guc->submission_state.lock
 		 */
 		u16 id;
 		/**
 		 * @ref: the number of references to the guc_id, when
 		 * transitioning in and out of zero protected by
-		 * guc->contexts_lock
+		 * guc->submission_state.lock
 		 */
 		atomic_t ref;
 		/**
 		 * @link: in guc->guc_id_list when the guc_id has no refs but is
-		 * still valid, protected by guc->contexts_lock
+		 * still valid, protected by guc->submission_state.lock
 		 */
 		struct list_head link;
 	} guc_id;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 5dd174babf7a..65b5e8eeef96 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -70,17 +70,21 @@ struct intel_guc {
 		void (*disable)(struct intel_guc *guc);
 	} interrupts;
 
-	/**
-	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
-	 * ce->guc_id.ref when transitioning in and out of zero
-	 */
-	spinlock_t contexts_lock;
-	/** @guc_ids: used to allocate unique ce->guc_id.id values */
-	struct ida guc_ids;
-	/**
-	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
-	 */
-	struct list_head guc_id_list;
+	struct {
+		/**
+		 * @lock: protects everything in submission_state
+		 */
+		spinlock_t lock;
+		/**
+		 * @guc_ids: used to allocate new guc_ids
+		 */
+		struct ida guc_ids;
+		/**
+		 * @guc_id_list: list of intel_context with valid guc_ids but no
+		 * refs
+		 */
+		struct list_head guc_id_list;
+	} submission_state;
 
 	/**
 	 * @submission_supported: tracks whether we support GuC submission on
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ba0de35f6323..ad5c18119d92 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -68,16 +68,16 @@
  * fence is used to stall all requests associated with this guc_id until the
  * corresponding G2H returns indicating the guc_id has been deregistered.
  *
- * guc_ids:
+ * submission_state.guc_ids:
  * Unique number associated with private GuC context data passed in during
  * context registration / submission / deregistration. 64k available. Simple ida
  * is used for allocation.
  *
- * Stealing guc_ids:
- * If no guc_ids are available they can be stolen from another context at
- * request creation time if that context is unpinned. If a guc_id can't be found
- * we punt this problem to the user as we believe this is near impossible to hit
- * during normal use cases.
+ * Stealing submission_state.guc_ids:
+ * If no submission_state.guc_ids are available they can be stolen from another
+ * context at request creation time if that context is unpinned. If a guc_id
+ * can't be found we punt this problem to the user as we believe this is near
+ * impossible to hit during normal use cases.
  *
  * Locking:
  * In the GuC submission code we have 3 basic spin locks which protect
@@ -89,7 +89,7 @@
  * sched_engine can be submitting at a time. Currently only one sched_engine is
  * used for all of GuC submission but that could change in the future.
  *
- * guc->contexts_lock
+ * guc->submission_state.lock
  * Protects guc_id allocation for the given GuC, i.e. only one context can be
  * doing guc_id allocation operations at a time for each GuC in the system.
  *
@@ -103,7 +103,7 @@
  *
  * Lock ordering rules:
  * sched_engine->lock -> ce->guc_state.lock
- * guc->contexts_lock -> ce->guc_state.lock
+ * guc->submission_state.lock -> ce->guc_state.lock
  *
  * Reset races:
  * When a full GT reset is triggered it is assumed that some G2H responses to
@@ -1148,9 +1148,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
 
 	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
 
-	spin_lock_init(&guc->contexts_lock);
-	INIT_LIST_HEAD(&guc->guc_id_list);
-	ida_init(&guc->guc_ids);
+	spin_lock_init(&guc->submission_state.lock);
+	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
+	ida_init(&guc->submission_state.guc_ids);
 
 	return 0;
 }
@@ -1215,7 +1215,7 @@ static void guc_submit_request(struct i915_request *rq)
 
 static int new_guc_id(struct intel_guc *guc)
 {
-	return ida_simple_get(&guc->guc_ids, 0,
+	return ida_simple_get(&guc->submission_state.guc_ids, 0,
 			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
 			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
 }
@@ -1223,7 +1223,8 @@ static int new_guc_id(struct intel_guc *guc)
 static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	if (!context_guc_id_invalid(ce)) {
-		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
+		ida_simple_remove(&guc->submission_state.guc_ids,
+				  ce->guc_id.id);
 		reset_lrc_desc(guc, ce->guc_id.id);
 		set_context_guc_id_invalid(ce);
 	}
@@ -1235,9 +1236,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&guc->contexts_lock, flags);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
 	__release_guc_id(guc, ce);
-	spin_unlock_irqrestore(&guc->contexts_lock, flags);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
 static int steal_guc_id(struct intel_guc *guc)
@@ -1245,10 +1246,10 @@ static int steal_guc_id(struct intel_guc *guc)
 	struct intel_context *ce;
 	int guc_id;
 
-	lockdep_assert_held(&guc->contexts_lock);
+	lockdep_assert_held(&guc->submission_state.lock);
 
-	if (!list_empty(&guc->guc_id_list)) {
-		ce = list_first_entry(&guc->guc_id_list,
+	if (!list_empty(&guc->submission_state.guc_id_list)) {
+		ce = list_first_entry(&guc->submission_state.guc_id_list,
 				      struct intel_context,
 				      guc_id.link);
 
@@ -1273,7 +1274,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
 {
 	int ret;
 
-	lockdep_assert_held(&guc->contexts_lock);
+	lockdep_assert_held(&guc->submission_state.lock);
 
 	ret = new_guc_id(guc);
 	if (unlikely(ret < 0)) {
@@ -1295,7 +1296,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
 
 try_again:
-	spin_lock_irqsave(&guc->contexts_lock, flags);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
 
 	might_lock(&ce->guc_state.lock);
 
@@ -1310,7 +1311,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	atomic_inc(&ce->guc_id.ref);
 
 out_unlock:
-	spin_unlock_irqrestore(&guc->contexts_lock, flags);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
 	/*
 	 * -EAGAIN indicates no guc_id are available, let's retire any
@@ -1346,11 +1347,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	if (unlikely(context_guc_id_invalid(ce)))
 		return;
 
-	spin_lock_irqsave(&guc->contexts_lock, flags);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
 	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
 	    !atomic_read(&ce->guc_id.ref))
-		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
-	spin_unlock_irqrestore(&guc->contexts_lock, flags);
+		list_add_tail(&ce->guc_id.link,
+			      &guc->submission_state.guc_id_list);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
 static int __guc_action_register_context(struct intel_guc *guc,
@@ -1921,16 +1923,16 @@ static void guc_context_destroy(struct kref *kref)
 	 * returns indicating this context has been deregistered the guc_id is
 	 * returned to the pool of available guc_id.
 	 */
-	spin_lock_irqsave(&guc->contexts_lock, flags);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
 	if (context_guc_id_invalid(ce)) {
-		spin_unlock_irqrestore(&guc->contexts_lock, flags);
+		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 		__guc_context_destroy(ce);
 		return;
 	}
 
 	if (!list_empty(&ce->guc_id.link))
 		list_del_init(&ce->guc_id.link);
-	spin_unlock_irqrestore(&guc->contexts_lock, flags);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
 	/* Seal race with Reset */
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
-- 
2.32.0



* [PATCH 02/26] drm/i915/guc: Take GT PM ref when deregistering context
From: Matthew Brost @ 2021-10-04 22:06 UTC
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Take a PM reference to prevent intel_gt_wait_for_idle from short
circuiting while a deregister context H2G is in flight. To do this we
must issue the deregister H2G from a worker, as a context can be
destroyed from an atomic context and taking a GT PM ref there blows up.
Previously we took a runtime PM ref from this atomic context, which
worked but will stop working once runtime PM autosuspend is enabled.

So this patch is twofold: stop intel_gt_wait_for_idle from short
circuiting and fix runtime PM autosuspend.
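
In shape this is the usual defer-to-worker pattern, condensed below from
the diff: the possibly-atomic final context put only queues work, and
the worker, which may sleep, takes the GT PM ref before issuing the H2G:

static void destroyed_worker_func(struct work_struct *w)
{
	struct intel_guc *guc = container_of(w, struct intel_guc,
					     submission_state.destroyed_worker);
	int tmp;

	/* Worker context: sleeping is allowed, so the GT PM get is safe */
	with_intel_gt_pm(guc_to_gt(guc), tmp)
		deregister_destroyed_contexts(guc);
}

/* ... while the final context put just queues the work: */
queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);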

v2:
 (John Harrison)
  - Split structure changes out into a different patch
 (Tvrtko)
  - Don't drop lock in deregister_destroyed_contexts

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |   4 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  11 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 146 +++++++++++-------
 6 files changed, 121 insertions(+), 54 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e9a0cad5c34d..1076066f41e0 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 	ce->guc_id.id = GUC_INVALID_LRC_ID;
 	INIT_LIST_HEAD(&ce->guc_id.link);
 
+	INIT_LIST_HEAD(&ce->destroyed_link);
+
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index e7e3984aab78..4613d027cbc3 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -213,6 +213,13 @@ struct intel_context {
 		struct list_head link;
 	} guc_id;
 
+	/**
+	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
+	 * list when context is pending to be destroyed (deregistered with the
+	 * GuC), protected by guc->submission_state.lock
+	 */
+	struct list_head destroyed_link;
+
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
 	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
index 8520c595f5e1..6fdeae668e6e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
@@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
 	return intel_wakeref_is_active(&engine->wakeref);
 }
 
+static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
+{
+	__intel_wakeref_get(&engine->wakeref);
+}
+
 static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
 {
 	intel_wakeref_get(&engine->wakeref);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index d0588d8aaa44..05de6c1af25b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -41,6 +41,10 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
 	intel_wakeref_put_async(&gt->wakeref);
 }
 
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)
+
 static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
 {
 	return intel_wakeref_wait_for_idle(&gt->wakeref);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 65b5e8eeef96..25a598e2b6e8 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -84,6 +84,17 @@ struct intel_guc {
 		 * refs
 		 */
 		struct list_head guc_id_list;
+		/**
+		 * @destroyed_contexts: list of contexts waiting to be destroyed
+		 * (deregistered with the GuC)
+		 */
+		struct list_head destroyed_contexts;
+		/**
+		 * @destroyed_worker: worker to deregister contexts, need as we
+		 * need to take a GT PM reference and can't from destroy
+		 * function as it might be in an atomic context (no sleeping)
+		 */
+		struct work_struct destroyed_worker;
 	} submission_state;
 
 	/**
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ad5c18119d92..17da2fea1bff 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -90,8 +90,8 @@
  * used for all of GuC submission but that could change in the future.
  *
  * guc->submission_state.lock
- * Protects guc_id allocation for the given GuC, i.e. only one context can be
- * doing guc_id allocation operations at a time for each GuC in the system.
+ * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
+ * list.
  *
  * ce->guc_state.lock
  * Protects everything under ce->guc_state. Ensures that a context is in the
@@ -719,6 +719,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 			if (deregister)
 				guc_signal_context_fence(ce);
 			if (destroyed) {
+				intel_gt_pm_put_async(guc_to_gt(guc));
 				release_guc_id(guc, ce);
 				__guc_context_destroy(ce);
 			}
@@ -797,6 +798,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
+static void guc_flush_destroyed_contexts(struct intel_guc *guc);
+
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
 	int i;
@@ -815,6 +818,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
 
 	guc_flush_submissions(guc);
+	guc_flush_destroyed_contexts(guc);
 
 	/*
 	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
@@ -1126,6 +1130,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 	intel_gt_unpark_heartbeats(guc_to_gt(guc));
 }
 
+static void destroyed_worker_func(struct work_struct *w);
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
@@ -1151,6 +1157,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	spin_lock_init(&guc->submission_state.lock);
 	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
 	ida_init(&guc->submission_state.guc_ids);
+	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
+	INIT_WORK(&guc->submission_state.destroyed_worker,
+		  destroyed_worker_func);
 
 	return 0;
 }
@@ -1161,6 +1170,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 		return;
 
 	guc_lrc_desc_pool_destroy(guc);
+	guc_flush_destroyed_contexts(guc);
 	i915_sched_engine_put(guc->sched_engine);
 }
 
@@ -1859,11 +1869,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
 static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
+	struct intel_gt *gt = guc_to_gt(guc);
+	unsigned long flags;
+	bool disabled;
 
+	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
 	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
 	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
 	GEM_BUG_ON(context_enabled(ce));
 
+	/* Seal race with Reset */
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	disabled = submission_disabled(guc);
+	if (likely(!disabled)) {
+		__intel_gt_pm_get(gt);
+		set_context_destroyed(ce);
+		clr_context_registered(ce);
+	}
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	if (unlikely(disabled)) {
+		release_guc_id(guc, ce);
+		__guc_context_destroy(ce);
+		return;
+	}
+
 	deregister_context(ce, ce->guc_id.id);
 }
 
@@ -1891,78 +1920,86 @@ static void __guc_context_destroy(struct intel_context *ce)
 	}
 }
 
+static void guc_flush_destroyed_contexts(struct intel_guc *guc)
+{
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+
+	GEM_BUG_ON(!submission_disabled(guc) &&
+		   guc_submission_initialized(guc));
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	list_for_each_entry_safe(ce, cn,
+				 &guc->submission_state.destroyed_contexts,
+				 destroyed_link) {
+		list_del_init(&ce->destroyed_link);
+		__release_guc_id(guc, ce);
+		__guc_context_destroy(ce);
+	}
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+}
+
+static void deregister_destroyed_contexts(struct intel_guc *guc)
+{
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	list_for_each_entry_safe(ce, cn,
+				 &guc->submission_state.destroyed_contexts,
+				 destroyed_link) {
+		list_del_init(&ce->destroyed_link);
+		guc_lrc_desc_unpin(ce);
+	}
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+}
+
+static void destroyed_worker_func(struct work_struct *w)
+{
+	struct intel_guc *guc = container_of(w, struct intel_guc,
+					     submission_state.destroyed_worker);
+	struct intel_gt *gt = guc_to_gt(guc);
+	int tmp;
+
+	with_intel_gt_pm(gt, tmp)
+		deregister_destroyed_contexts(guc);
+}
+
 static void guc_context_destroy(struct kref *kref)
 {
 	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
-	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	struct intel_guc *guc = ce_to_guc(ce);
-	intel_wakeref_t wakeref;
 	unsigned long flags;
-	bool disabled;
+	bool destroy;
 
 	/*
 	 * If the guc_id is invalid this context has been stolen and we can free
 	 * it immediately. Also can be freed immediately if the context is not
 	 * registered with the GuC or the GuC is in the middle of a reset.
 	 */
-	if (context_guc_id_invalid(ce)) {
-		__guc_context_destroy(ce);
-		return;
-	} else if (submission_disabled(guc) ||
-		   !lrc_desc_registered(guc, ce->guc_id.id)) {
-		release_guc_id(guc, ce);
-		__guc_context_destroy(ce);
-		return;
-	}
-
-	/*
-	 * We have to acquire the context spinlock and check guc_id again, if it
-	 * is valid it hasn't been stolen and needs to be deregistered. We
-	 * delete this context from the list of unpinned guc_id available to
-	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
-	 * returns indicating this context has been deregistered the guc_id is
-	 * returned to the pool of available guc_id.
-	 */
 	spin_lock_irqsave(&guc->submission_state.lock, flags);
-	if (context_guc_id_invalid(ce)) {
-		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
-		__guc_context_destroy(ce);
-		return;
+	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
+		!lrc_desc_registered(guc, ce->guc_id.id);
+	if (likely(!destroy)) {
+		if (!list_empty(&ce->guc_id.link))
+			list_del_init(&ce->guc_id.link);
+		list_add_tail(&ce->destroyed_link,
+			      &guc->submission_state.destroyed_contexts);
+	} else {
+		__release_guc_id(guc, ce);
 	}
-
-	if (!list_empty(&ce->guc_id.link))
-		list_del_init(&ce->guc_id.link);
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
-
-	/* Seal race with Reset */
-	spin_lock_irqsave(&ce->guc_state.lock, flags);
-	disabled = submission_disabled(guc);
-	if (likely(!disabled)) {
-		set_context_destroyed(ce);
-		clr_context_registered(ce);
-	}
-	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-	if (unlikely(disabled)) {
-		release_guc_id(guc, ce);
+	if (unlikely(destroy)) {
 		__guc_context_destroy(ce);
 		return;
 	}
 
 	/*
-	 * We defer GuC context deregistration until the context is destroyed
-	 * in order to save on CTBs. With this optimization ideally we only need
-	 * 1 CTB to register the context during the first pin and 1 CTB to
-	 * deregister the context when the context is destroyed. Without this
-	 * optimization, a CTB would be needed every pin & unpin.
-	 *
-	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
-	 * from context_free_worker when runtime wakeref is not held.
-	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
-	 * in H2G CTB to deregister the context. A future patch may defer this
-	 * H2G CTB if the runtime wakeref is zero.
+	 * We use a worker to issue the H2G to deregister the context as we can
+	 * take the GT PM for the first time which isn't allowed from an atomic
+	 * context.
 	 */
-	with_intel_runtime_pm(runtime_pm, wakeref)
-		guc_lrc_desc_unpin(ce);
+	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
 }
 
 static int guc_context_alloc(struct intel_context *ce)
@@ -2798,6 +2835,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 		intel_context_put(ce);
 	} else if (context_destroyed(ce)) {
 		/* Context has been destroyed */
+		intel_gt_pm_put_async(guc_to_gt(guc));
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 	}
-- 
2.32.0



* [Intel-gfx] [PATCH 03/26] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
From: Matthew Brost @ 2021-10-04 22:06 UTC
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Take a PM reference to prevent intel_gt_wait_for_idle from short
circuiting while scheduling of a user context could be enabled.
Returning GT idle when it is not can cause all sorts of issues
throughout the stack.
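
The new might_get/might_put annotations reduce to might_lock() on the
underlying wakeref mutex, so lockdep can flag an unsafe pin even on
paths where the wakeref happens to already be active. Roughly (a sketch
of the intel_wakeref.h helper; see that hunk of the diff):

static inline void intel_wakeref_might_get(struct intel_wakeref *wf)
{
	might_lock(&wf->mutex);
}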

v2:
 (Daniel Vetter)
  - Add might_lock annotations to pin / unpin function
v3:
 (CI)
  - Drop intel_engine_pm_might_put from unpin path as an async put is
    used
v4:
 (John Harrison)
  - Make intel_engine_pm_might_get/put work with GuC virtual engines
  - Update commit message

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |  2 ++
 drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 32 +++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
 drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
 5 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 1076066f41e0..f601323b939f 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
 	if (err)
 		goto err_post_unpin;
 
+	intel_engine_pm_might_get(ce->engine);
+
 	if (unlikely(intel_context_is_closed(ce))) {
 		err = -ENOENT;
 		goto err_unlock;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
index 6fdeae668e6e..d68675925b79 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
@@ -6,9 +6,11 @@
 #ifndef INTEL_ENGINE_PM_H
 #define INTEL_ENGINE_PM_H
 
+#include "i915_drv.h"
 #include "i915_request.h"
 #include "intel_engine_types.h"
 #include "intel_wakeref.h"
+#include "intel_gt_pm.h"
 
 static inline bool
 intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
@@ -31,6 +33,21 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
 	return intel_wakeref_get_if_active(&engine->wakeref);
 }
 
+static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
+{
+	if (!intel_engine_is_virtual(engine)) {
+		intel_wakeref_might_get(&engine->wakeref);
+	} else {
+		struct intel_gt *gt = engine->gt;
+		struct intel_engine_cs *tengine;
+		intel_engine_mask_t tmp, mask = engine->mask;
+
+		for_each_engine_masked(tengine, gt, mask, tmp)
+			intel_wakeref_might_get(&tengine->wakeref);
+	}
+	intel_gt_pm_might_get(engine->gt);
+}
+
 static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
 {
 	intel_wakeref_put(&engine->wakeref);
@@ -52,6 +69,21 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
 	intel_wakeref_unlock_wait(&engine->wakeref);
 }
 
+static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
+{
+	if (!intel_engine_is_virtual(engine)) {
+		intel_wakeref_might_put(&engine->wakeref);
+	} else {
+		struct intel_gt *gt = engine->gt;
+		struct intel_engine_cs *tengine;
+		intel_engine_mask_t tmp, mask = engine->mask;
+
+		for_each_engine_masked(tengine, gt, mask, tmp)
+			intel_wakeref_might_put(&tengine->wakeref);
+	}
+	intel_gt_pm_might_put(engine->gt);
+}
+
 static inline struct i915_request *
 intel_engine_create_kernel_request(struct intel_engine_cs *engine)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index 05de6c1af25b..bc898df7a48c 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
 	return intel_wakeref_get_if_active(&gt->wakeref);
 }
 
+static inline void intel_gt_pm_might_get(struct intel_gt *gt)
+{
+	intel_wakeref_might_get(&gt->wakeref);
+}
+
 static inline void intel_gt_pm_put(struct intel_gt *gt)
 {
 	intel_wakeref_put(&gt->wakeref);
@@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
 	intel_wakeref_put_async(&gt->wakeref);
 }
 
+static inline void intel_gt_pm_might_put(struct intel_gt *gt)
+{
+	intel_wakeref_might_put(&gt->wakeref);
+}
+
 #define with_intel_gt_pm(gt, tmp) \
 	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
 	     intel_gt_pm_put(gt), tmp = 0)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 17da2fea1bff..8b82da50c2bc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1571,7 +1571,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
 
 static int guc_context_pin(struct intel_context *ce, void *vaddr)
 {
-	return __guc_context_pin(ce, ce->engine, vaddr);
+	int ret = __guc_context_pin(ce, ce->engine, vaddr);
+
+	if (likely(!ret && !intel_context_is_barrier(ce)))
+		intel_engine_pm_get(ce->engine);
+
+	return ret;
 }
 
 static void guc_context_unpin(struct intel_context *ce)
@@ -1580,6 +1585,9 @@ static void guc_context_unpin(struct intel_context *ce)
 
 	unpin_guc_id(guc, ce);
 	lrc_unpin(ce);
+
+	if (likely(!intel_context_is_barrier(ce)))
+		intel_engine_pm_put_async(ce->engine);
 }
 
 static void guc_context_post_unpin(struct intel_context *ce)
@@ -2341,8 +2349,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
 static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
+	int ret = __guc_context_pin(ce, engine, vaddr);
+	intel_engine_mask_t tmp, mask = ce->engine->mask;
+
+	if (likely(!ret))
+		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
+			intel_engine_pm_get(engine);
 
-	return __guc_context_pin(ce, engine, vaddr);
+	return ret;
+}
+
+static void guc_virtual_context_unpin(struct intel_context *ce)
+{
+	intel_engine_mask_t tmp, mask = ce->engine->mask;
+	struct intel_engine_cs *engine;
+	struct intel_guc *guc = ce_to_guc(ce);
+
+	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_barrier(ce));
+
+	unpin_guc_id(guc, ce);
+	lrc_unpin(ce);
+
+	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
+		intel_engine_pm_put_async(engine);
 }
 
 static void guc_virtual_context_enter(struct intel_context *ce)
@@ -2379,7 +2409,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 
 	.pre_pin = guc_virtual_context_pre_pin,
 	.pin = guc_virtual_context_pin,
-	.unpin = guc_context_unpin,
+	.unpin = guc_virtual_context_unpin,
 	.post_unpin = guc_context_post_unpin,
 
 	.ban = guc_context_ban,
diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
index 545c8f277c46..4f4c2e15e736 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.h
+++ b/drivers/gpu/drm/i915/intel_wakeref.h
@@ -123,6 +123,12 @@ enum {
 	__INTEL_WAKEREF_PUT_LAST_BIT__
 };
 
+static inline void
+intel_wakeref_might_get(struct intel_wakeref *wf)
+{
+	might_lock(&wf->mutex);
+}
+
 /**
  * intel_wakeref_put_flags: Release the wakeref
  * @wf: the wakeref
@@ -170,6 +176,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
 			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
 }
 
+static inline void
+intel_wakeref_might_put(struct intel_wakeref *wf)
+{
+	might_lock(&wf->mutex);
+}
+
 /**
  * intel_wakeref_lock: Lock the wakeref (mutex)
  * @wf: the wakeref
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 04/26] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Calling switch_to_kernel_context isn't needed if the engine PM reference
is taken while all user contexts are pinned, as not holding the PM
reference guarantees that scheduling is disabled for all user contexts.
By not calling switch_to_kernel_context we save on issuing a request to
the engine.
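
As a sketch, the behaviour added below amounts to dispatching on the
submission backend (simplified; the real patch open codes the check at
the top of switch_to_kernel_context, and the wrapper name here is
hypothetical):

	/* sketch: park-time idling, simplified (hypothetical wrapper) */
	static bool park_to_known_idle_state(struct intel_engine_cs *engine)
	{
		/* GuC: disabling scheduling already guarantees idle */
		if (intel_engine_uses_guc(engine))
			return true;

		/* execlists: emit a request switching to the kernel context */
		return switch_to_kernel_context(engine);
	}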

v2:
 (Daniel Vetter)
  - Add FIXME comment about pushing switch_to_kernel_context to backend
v3:
 (John Harrison)
  - Update commit message
  - Fix wording in comment

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/gt/intel_engine_pm.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index dacd62773735..a1334b48dde7 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -162,6 +162,19 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
 	unsigned long flags;
 	bool result = true;
 
+	/*
+	 * This is execlist specific behaviour intended to ensure the GPU is
+	 * idle by switching to a known 'safe' context. With GuC submission, the
+	 * same idle guarantee is achieved by other means (disabling
+	 * scheduling). Further, switching to a 'safe' context has no effect
+	 * with GuC submission as the scheduler can just switch back again.
+	 *
+	 * FIXME: Move this backend scheduler specific behaviour into the
+	 * scheduler backend.
+	 */
+	if (intel_engine_uses_guc(engine))
+		return true;
+
 	/* GPU is pointing to the void, as good as in the kernel context. */
 	if (intel_gt_is_wedged(engine->gt))
 		return true;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [Intel-gfx] [PATCH 05/26] drm/i915: Add logical engine mapping
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Add logical engine mapping. This is required for split-frame, as
workloads need to be placed on engines in a logically contiguous manner.
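
As an illustrative example (engine set assumed, exact fusing is per the
bspec; the helper below is hypothetical): on a part with VCS1 fused
off, the remaining video engines get logically contiguous instances,
and the logical instance is simply the bit position of logical_mask:

	/*
	 * physical instance:  VCS0  VCS2
	 * logical instance:      0     1
	 */
	static u8 engine_logical_instance(const struct intel_engine_cs *engine)
	{
		/* logical_mask has a single bit set per physical engine */
		return ilog2(engine->logical_mask);
	}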

v2:
 (Daniel Vetter)
  - Add kernel doc for new fields
v3
 (Tvrtko)
  - Update comment for new logical_mask field

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  7 +++
 .../drm/i915/gt/intel_execlists_submission.c  |  1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
 5 files changed, 62 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..2eb798ad068b 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
 	GEM_DEBUG_WARN_ON(iir);
 }
 
-static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
+static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
+			      u8 logical_instance)
 {
 	const struct engine_info *info = &intel_engines[id];
 	struct drm_i915_private *i915 = gt->i915;
@@ -335,6 +336,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
 
 	engine->class = info->class;
 	engine->instance = info->instance;
+	engine->logical_mask = BIT(logical_instance);
 	__sprint_engine_name(engine);
 
 	engine->props.heartbeat_interval_ms =
@@ -588,6 +590,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
 	return info->engine_mask;
 }
 
+static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
+				 u8 class, const u8 *map, u8 num_instances)
+{
+	int i, j;
+	u8 current_logical_id = 0;
+
+	for (j = 0; j < num_instances; ++j) {
+		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
+			if (!HAS_ENGINE(gt, i) ||
+			    intel_engines[i].class != class)
+				continue;
+
+			if (intel_engines[i].instance == map[j]) {
+				logical_ids[intel_engines[i].instance] =
+					current_logical_id++;
+				break;
+			}
+		}
+	}
+}
+
+static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
+{
+	int i;
+	u8 map[MAX_ENGINE_INSTANCE + 1];
+
+	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
+		map[i] = i;
+	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
+}
+
 /**
  * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
  * @gt: pointer to struct intel_gt
@@ -599,7 +632,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
 	struct drm_i915_private *i915 = gt->i915;
 	const unsigned int engine_mask = init_engine_mask(gt);
 	unsigned int mask = 0;
-	unsigned int i;
+	unsigned int i, class;
+	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
 	int err;
 
 	drm_WARN_ON(&i915->drm, engine_mask == 0);
@@ -609,15 +643,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
 	if (i915_inject_probe_failure(i915))
 		return -ENODEV;
 
-	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
-		if (!HAS_ENGINE(gt, i))
-			continue;
+	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
+		setup_logical_ids(gt, logical_ids, class);
 
-		err = intel_engine_setup(gt, i);
-		if (err)
-			goto cleanup;
+		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
+			u8 instance = intel_engines[i].instance;
+
+			if (intel_engines[i].class != class ||
+			    !HAS_ENGINE(gt, i))
+				continue;
 
-		mask |= BIT(i);
+			err = intel_engine_setup(gt, i,
+						 logical_ids[instance]);
+			if (err)
+				goto cleanup;
+
+			mask |= BIT(i);
+		}
 	}
 
 	/*
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 5ae1207c363b..68010da468a4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -269,6 +269,13 @@ struct intel_engine_cs {
 	unsigned int guc_id;
 
 	intel_engine_mask_t mask;
+	/**
+	 * @logical_mask: logical mask of engine, reported to user space via
+	 * query IOCTL and used to communicate with the GuC in logical space.
+	 * The logical instance of a physical engine can change based on product
+	 * / fusing and is defined in the bspec.
+	 */
+	intel_engine_mask_t logical_mask;
 
 	u8 class;
 	u8 instance;
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 7147fe80919e..5ed1e222c308 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -3877,6 +3877,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 
 		ve->siblings[ve->num_siblings++] = sibling;
 		ve->base.mask |= sibling->mask;
+		ve->base.logical_mask |= sibling->logical_mask;
 
 		/*
 		 * All physical engines must be compatible for their emission
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 2c6ea64af7ec..621c893a009f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
 	for_each_engine(engine, gt, id) {
 		u8 guc_class = engine_class_to_guc_class(engine->class);
 
-		system_info->mapping_table[guc_class][engine->instance] =
+		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
 			engine->instance;
 	}
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 8b82da50c2bc..451d9ae861a6 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1423,23 +1423,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id)
 	return __guc_action_deregister_context(guc, guc_id);
 }
 
-static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
-{
-	switch (class) {
-	case RENDER_CLASS:
-		return mask >> RCS0;
-	case VIDEO_ENHANCEMENT_CLASS:
-		return mask >> VECS0;
-	case VIDEO_DECODE_CLASS:
-		return mask >> VCS0;
-	case COPY_ENGINE_CLASS:
-		return mask >> BCS0;
-	default:
-		MISSING_CASE(class);
-		return 0;
-	}
-}
-
 static void guc_context_policy_init(struct intel_engine_cs *engine,
 				    struct guc_lrc_desc *desc)
 {
@@ -1481,8 +1464,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 
 	desc = __get_lrc_desc(guc, desc_idx);
 	desc->engine_class = engine_class_to_guc_class(engine->class);
-	desc->engine_submit_mask = adjust_engine_mask(engine->class,
-						      engine->mask);
+	desc->engine_submit_mask = engine->logical_mask;
 	desc->hw_context_desc = ce->lrc.lrca;
 	desc->priority = ce->guc_state.prio;
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
@@ -3271,6 +3253,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 		}
 
 		ve->base.mask |= sibling->mask;
+		ve->base.logical_mask |= sibling->logical_mask;
 
 		if (n != 0 && ve->base.class != sibling->class) {
 			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 06/26] drm/i915: Expose logical engine instance to user
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Expose the logical engine instance to the user via the query engine info
IOCTL. This is required for split-frame workloads as these need to be
placed on engines in a logically contiguous order. The logical mapping
can change based on fusing. Rather than requiring the user to have
knowledge of the fusing, we simply expose the logical mapping via the
existing query engine info IOCTL.

IGT: https://patchwork.freedesktop.org/patch/445637/?series=92854&rev=1
media UMD: https://github.com/intel/media-driver/pull/1252
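
A sketch of the intended userspace consumption (error handling and the
usual two-pass DRM_IOCTL_I915_QUERY blob setup omitted;
query_engine_info_blob() and place_workload() are hypothetical
stand-ins):

	struct drm_i915_query_engine_info *info = query_engine_info_blob(fd);
	unsigned int i;

	for (i = 0; i < info->num_engines; i++) {
		const struct drm_i915_engine_info *e = &info->engines[i];

		if (e->flags & I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE)
			place_workload(e->engine.engine_class,
				       e->logical_instance);
	}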

v2:
 (Daniel Vetter)
  - Add IGT link, placeholder for media UMD

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_query.c | 2 ++
 include/uapi/drm/i915_drm.h       | 8 +++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
index 5e2b909827f4..51b368be0fc4 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -124,7 +124,9 @@ query_engine_info(struct drm_i915_private *i915,
 	for_each_uabi_engine(engine, i915) {
 		info.engine.engine_class = engine->uabi_class;
 		info.engine.engine_instance = engine->uabi_instance;
+		info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;
 		info.capabilities = engine->uabi_capabilities;
+		info.logical_instance = ilog2(engine->logical_mask);
 
 		if (copy_to_user(info_ptr, &info, sizeof(info)))
 			return -EFAULT;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index bde5860b3686..b1248a67b4f8 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -2726,14 +2726,20 @@ struct drm_i915_engine_info {
 
 	/** @flags: Engine flags. */
 	__u64 flags;
+#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
 
 	/** @capabilities: Capabilities of this engine. */
 	__u64 capabilities;
 #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
 #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
 
+	/** @logical_instance: Logical instance of engine */
+	__u16 logical_instance;
+
 	/** @rsvd1: Reserved fields. */
-	__u64 rsvd1[4];
+	__u16 rsvd1[3];
+	/** @rsvd2: Reserved fields. */
+	__u64 rsvd2[3];
 };
 
 /**
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 07/26] drm/i915/guc: Introduce context parent-child relationship
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Introduce context parent-child relationship. Once this relationship is
created all pinning / unpinning operations are directed to the parent
context. The parent context is responsible for pinning all of its
children and itself.

This is a precursor to the full GuC multi-lrc implementation but aligns
to how the GuC multi-lrc interface is defined - a single H2G is used to
register / deregister all of the contexts simultaneously.

Subsequent patches in the series will implement the pinning / unpinning
operations for parent / child contexts.
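
As a usage sketch (hypothetical call site; prepare_child() is a
stand-in), a 2-wide parallel context would be wired up as:

	struct intel_context *parent, *child, *ce;

	parent = intel_context_create(engines[0]);
	child = intel_context_create(engines[1]);
	intel_context_bind_parent_child(parent, child);

	/* from here on, pin / unpin is directed at the parent */
	for_each_child(parent, ce)
		prepare_child(ce);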

v2:
 (Daniel Vetter)
  - Add kernel doc, add wrapper to access parent to ensure safety
v3:
 (John Harrison)
  - Fix comment explaining GEM_BUG_ON in to_parent()
  - Make variable names generic (non-GuC specific)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 29 +++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h       | 41 +++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context_types.h | 21 ++++++++++
 3 files changed, 91 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index f601323b939f..c5bb7ccfb3f8 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -403,6 +403,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 	INIT_LIST_HEAD(&ce->destroyed_link);
 
+	INIT_LIST_HEAD(&ce->parallel.child_list);
+
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
@@ -417,10 +419,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 void intel_context_fini(struct intel_context *ce)
 {
+	struct intel_context *child, *next;
+
 	if (ce->timeline)
 		intel_timeline_put(ce->timeline);
 	i915_vm_put(ce->vm);
 
+	/* Need to put the creation ref for the children */
+	if (intel_context_is_parent(ce))
+		for_each_child_safe(ce, child, next)
+			intel_context_put(child);
+
 	mutex_destroy(&ce->pin_mutex);
 	i915_active_fini(&ce->active);
 	i915_sw_fence_fini(&ce->guc_state.blocked);
@@ -537,6 +546,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 	return active;
 }
 
+void intel_context_bind_parent_child(struct intel_context *parent,
+				     struct intel_context *child)
+{
+	/*
+	 * It is the caller's responsibility to validate that this function
+	 * is used correctly, but we use GEM_BUG_ON here to ensure that they do.
+	 */
+	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
+	GEM_BUG_ON(intel_context_is_pinned(parent));
+	GEM_BUG_ON(intel_context_is_child(parent));
+	GEM_BUG_ON(intel_context_is_pinned(child));
+	GEM_BUG_ON(intel_context_is_child(child));
+	GEM_BUG_ON(intel_context_is_parent(child));
+
+	parent->parallel.number_children++;
+	list_add_tail(&child->parallel.child_link,
+		      &parent->parallel.child_list);
+	child->parallel.parent = parent;
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_context.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index c41098950746..b63c10a144af 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -44,6 +44,47 @@ void intel_context_free(struct intel_context *ce);
 int intel_context_reconfigure_sseu(struct intel_context *ce,
 				   const struct intel_sseu sseu);
 
+static inline bool intel_context_is_child(struct intel_context *ce)
+{
+	return !!ce->parallel.parent;
+}
+
+static inline bool intel_context_is_parent(struct intel_context *ce)
+{
+	return !!ce->parallel.number_children;
+}
+
+static inline bool intel_context_is_pinned(struct intel_context *ce);
+
+static inline struct intel_context *
+intel_context_to_parent(struct intel_context *ce)
+{
+	if (intel_context_is_child(ce)) {
+		/*
+		 * The parent holds ref count to the child so it is always safe
+		 * for the parent to access the child, but the child has a
+		 * pointer to the parent without a ref. To ensure this is safe
+		 * the child should only access the parent pointer while the
+		 * parent is pinned.
+		 */
+		GEM_BUG_ON(!intel_context_is_pinned(ce->parallel.parent));
+
+		return ce->parallel.parent;
+	} else {
+		return ce;
+	}
+}
+
+void intel_context_bind_parent_child(struct intel_context *parent,
+				     struct intel_context *child);
+
+#define for_each_child(parent, ce)\
+	list_for_each_entry(ce, &(parent)->parallel.child_list,\
+			    parallel.child_link)
+#define for_each_child_safe(parent, ce, cn)\
+	list_for_each_entry_safe(ce, cn, &(parent)->parallel.child_list,\
+				 parallel.child_link)
+
 /**
  * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
  * @ce - the context
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 4613d027cbc3..76dfca57cb45 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -220,6 +220,27 @@ struct intel_context {
 	 */
 	struct list_head destroyed_link;
 
+	/** @parallel: sub-structure for parallel submission members */
+	struct {
+		union {
+			/**
+			 * @child_list: parent's list of child
+			 * contexts, no protection as immutable after context
+			 * creation
+			 */
+			struct list_head child_list;
+			/**
+			 * @child_link: child's link into parent's list of
+			 * children
+			 */
+			struct list_head child_link;
+		};
+		/** @parent: pointer to parent if child */
+		struct intel_context *parent;
+		/** @number_children: number of children if parent */
+		u8 number_children;
+	} parallel;
+
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
 	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 08/26] drm/i915/guc: Add multi-lrc context registration
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Add multi-lrc context registration H2G. In addition, a workqueue and
process descriptor are set up during multi-lrc context registration, as
these data structures are needed for multi-lrc submission.
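
For reference, a worked example of the offset math for the extra
parent page (assuming 4k pages; this mirrors the layout comment in the
patch):

	/*
	 * parent_page:        index of the extra page within ce->state
	 *
	 * process descriptor: parent_page * PAGE_SIZE        (page + 0x000)
	 * work queue start:   process desc + PAGE_SIZE / 2   (page + 0x800)
	 * work queue size:    PAGE_SIZE / 2                         (0x800)
	 */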

v2:
 (John Harrison)
  - Move GuC specific fields into sub-struct
  - Clean up WQ defines
  - Add comment explaining math to derive WQ / PD address

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
 drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 -
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 +++++++++++++++++-
 5 files changed, 131 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 76dfca57cb45..48decb5ee954 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -239,6 +239,18 @@ struct intel_context {
 		struct intel_context *parent;
 		/** @number_children: number of children if parent */
 		u8 number_children;
+		/** @guc: GuC specific members for parallel submission */
+		struct {
+			/** @wqi_head: head pointer in work queue */
+			u16 wqi_head;
+			/** @wqi_tail: tail pointer in work queue */
+			u16 wqi_tail;
+			/**
+			 * @parent_page: page in context state (ce->state) used
+			 * by parent for work queue, process descriptor
+			 */
+			u8 parent_page;
+		} guc;
 	} parallel;
 
 #ifdef CONFIG_DRM_I915_SELFTEST
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 3ef9eaf8c50e..57339d5c1fc8 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -942,6 +942,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
 		context_size += PAGE_SIZE;
 	}
 
+	if (intel_context_is_parent(ce) && intel_engine_uses_guc(engine)) {
+		ce->parallel.guc.parent_page = context_size / PAGE_SIZE;
+		context_size += PAGE_SIZE;
+	}
+
 	obj = i915_gem_object_create_lmem(engine->i915, context_size,
 					  I915_BO_ALLOC_PM_VOLATILE);
 	if (IS_ERR(obj))
diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index 8ff582222aff..ba10bd374cee 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -142,6 +142,7 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
 	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
 	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
+	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
 	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
 	INTEL_GUC_ACTION_LIMIT
 };
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index fa4be13c8854..0eeb2a9feeed 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -52,8 +52,6 @@
 
 #define GUC_DOORBELL_INVALID		256
 
-#define GUC_WQ_SIZE			(PAGE_SIZE * 2)
-
 /* Work queue item header definitions */
 #define WQ_STATUS_ACTIVE		1
 #define WQ_STATUS_SUSPENDED		2
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 451d9ae861a6..ab6d7fc1b0b1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -344,6 +344,45 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 	return rb_entry(rb, struct i915_priolist, node);
 }
 
+/*
+ * When using multi-lrc submission an extra page in the context state is
+ * reserved for the process descriptor and work queue.
+ *
+ * The layout of this page is below:
+ * 0						guc_process_desc
+ * ...						unused
+ * PAGE_SIZE / 2				work queue start
+ * ...						work queue
+ * PAGE_SIZE - 1				work queue end
+ */
+#define WQ_SIZE			(PAGE_SIZE / 2)
+#define WQ_OFFSET		(PAGE_SIZE - WQ_SIZE)
+static u32 __get_process_desc_offset(struct intel_context *ce)
+{
+	GEM_BUG_ON(!ce->parallel.guc.parent_page);
+
+	return ce->parallel.guc.parent_page * PAGE_SIZE;
+}
+
+static u32 __get_wq_offset(struct intel_context *ce)
+{
+	return __get_process_desc_offset(ce) + WQ_OFFSET;
+}
+
+static struct guc_process_desc *
+__get_process_desc(struct intel_context *ce)
+{
+	/*
+	 * Need to subtract LRC_STATE_OFFSET here as the
+	 * parallel.guc.parent_page is the offset into ce->state while
+	 * ce->lrc_reg_reg is ce->state + LRC_STATE_OFFSET.
+	 */
+	return (struct guc_process_desc *)
+		(ce->lrc_reg_state +
+		 ((__get_process_desc_offset(ce) -
+		   LRC_STATE_OFFSET) / sizeof(u32)));
+}
+
 static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 {
 	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
@@ -1365,6 +1404,30 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
+static int __guc_action_register_multi_lrc(struct intel_guc *guc,
+					   struct intel_context *ce,
+					   u32 guc_id,
+					   u32 offset,
+					   bool loop)
+{
+	struct intel_context *child;
+	u32 action[4 + MAX_ENGINE_INSTANCE];
+	int len = 0;
+
+	GEM_BUG_ON(ce->parallel.number_children > MAX_ENGINE_INSTANCE);
+
+	action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
+	action[len++] = guc_id;
+	action[len++] = ce->parallel.number_children + 1;
+	action[len++] = offset;
+	for_each_child(ce, child) {
+		offset += sizeof(struct guc_lrc_desc);
+		action[len++] = offset;
+	}
+
+	return guc_submission_send_busy_loop(guc, action, len, 0, loop);
+}
+
 static int __guc_action_register_context(struct intel_guc *guc,
 					 u32 guc_id,
 					 u32 offset,
@@ -1387,9 +1450,15 @@ static int register_context(struct intel_context *ce, bool loop)
 		ce->guc_id.id * sizeof(struct guc_lrc_desc);
 	int ret;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	trace_intel_context_register(ce);
 
-	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
+	if (intel_context_is_parent(ce))
+		ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
+						      offset, loop);
+	else
+		ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
+						    loop);
 	if (likely(!ret)) {
 		unsigned long flags;
 
@@ -1418,6 +1487,7 @@ static int deregister_context(struct intel_context *ce, u32 guc_id)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	trace_intel_context_deregister(ce);
 
 	return __guc_action_deregister_context(guc, guc_id);
@@ -1445,6 +1515,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	struct guc_lrc_desc *desc;
 	bool context_registered;
 	intel_wakeref_t wakeref;
+	struct intel_context *child;
 	int ret = 0;
 
 	GEM_BUG_ON(!engine->mask);
@@ -1470,6 +1541,41 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 	guc_context_policy_init(engine, desc);
 
+	/*
+	 * The context is a parent, so we need to register a process descriptor
+	 * describing a work queue and register all child contexts.
+	 */
+	if (intel_context_is_parent(ce)) {
+		struct guc_process_desc *pdesc;
+
+		ce->parallel.guc.wqi_tail = 0;
+		ce->parallel.guc.wqi_head = 0;
+
+		desc->process_desc = i915_ggtt_offset(ce->state) +
+			__get_process_desc_offset(ce);
+		desc->wq_addr = i915_ggtt_offset(ce->state) +
+			__get_wq_offset(ce);
+		desc->wq_size = WQ_SIZE;
+
+		pdesc = __get_process_desc(ce);
+		memset(pdesc, 0, sizeof(*(pdesc)));
+		pdesc->stage_id = ce->guc_id.id;
+		pdesc->wq_base_addr = desc->wq_addr;
+		pdesc->wq_size_bytes = desc->wq_size;
+		pdesc->wq_status = WQ_STATUS_ACTIVE;
+
+		for_each_child(ce, child) {
+			desc = __get_lrc_desc(guc, child->guc_id.id);
+
+			desc->engine_class =
+				engine_class_to_guc_class(engine->class);
+			desc->hw_context_desc = child->lrc.lrca;
+			desc->priority = ce->guc_state.prio;
+			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
+			guc_context_policy_init(engine, desc);
+		}
+	}
+
 	/*
 	 * The context_lookup xarray is used to determine if the hardware
 	 * context is currently registered. There are two cases in which it
@@ -2804,6 +2910,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 		return NULL;
 	}
 
+	if (unlikely(intel_context_is_child(ce))) {
+		drm_err(&guc_to_gt(guc)->i915->drm,
+			"Context is child, desc_idx %u", desc_idx);
+		return NULL;
+	}
+
 	return ce;
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 09/26] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

In GuC parent-child contexts, the parent context controls the scheduling;
ensure that only the parent performs scheduling operations.
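
For illustration, every scheduling path first resolves a request to its
scheduling context, i.e. the parent. A minimal sketch of a hypothetical
caller (illustrative only, not part of this patch):

  static void example_sched_op(struct i915_request *rq)
  {
          struct intel_context *ce = request_to_scheduling_context(rq);

          /* Scheduling operations never target a child context */
          GEM_BUG_ON(intel_context_is_child(ce));

          /* ... issue the scheduling H2G against ce->guc_id.id ... */
  }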

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ab6d7fc1b0b1..1f2809187513 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -324,6 +324,12 @@ static inline void decr_context_committed_requests(struct intel_context *ce)
 	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
 }
 
+static struct intel_context *
+request_to_scheduling_context(struct i915_request *rq)
+{
+	return intel_context_to_parent(rq->context);
+}
+
 static inline bool context_guc_id_invalid(struct intel_context *ce)
 {
 	return ce->guc_id.id == GUC_INVALID_LRC_ID;
@@ -1710,6 +1716,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
 
 	GEM_BUG_ON(guc_id == GUC_INVALID_LRC_ID);
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	trace_intel_context_sched_disable(ce);
 
 	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action),
@@ -1935,6 +1942,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	intel_wakeref_t wakeref;
 	u16 guc_id;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	/*
@@ -2303,6 +2312,8 @@ static void guc_signal_context_fence(struct intel_context *ce)
 {
 	unsigned long flags;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	clr_context_wait_for_deregister_to_register(ce);
 	__guc_signal_context_fence(ce);
@@ -2333,7 +2344,7 @@ static void guc_context_init(struct intel_context *ce)
 
 static int guc_request_alloc(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	struct intel_guc *guc = ce_to_guc(ce);
 	unsigned long flags;
 	int ret;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 10/26] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Assign contexts in a parent-child relationship consecutive guc_ids. This
is accomplished by partitioning the guc_id space between IDs that need
to be consecutive (1/16 of the available guc_ids) and IDs that do not
(the remaining 15/16). The consecutive search is implemented via the
bitmap API.

This is a precursor to the full GuC multi-lrc implementation but aligns
with how the GuC multi-lrc interface is defined - guc_ids must be
consecutive when using the GuC multi-lrc interface.
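
For example, a parent with two children needs three consecutive guc_ids;
since bitmap_find_free_region() hands out power-of-two sized regions,
order_base_2() rounds the request up and a region of four ids is
consumed. A minimal sketch of the allocation (illustrative only, using
the NUMBER_MULTI_LRC_GUC_ID partition size defined in this patch):

  /* Allocate consecutive guc_ids for a parent and its N children */
  static int example_alloc_parallel_ids(unsigned long *bitmap,
                                        int number_children)
  {
          /* Region size is rounded up to the next power of two */
          return bitmap_find_free_region(bitmap,
                                         NUMBER_MULTI_LRC_GUC_ID,
                                         order_base_2(number_children + 1));
  }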

v2:
 (Daniel Vetter)
  - Explicitly state why we assign consecutive guc_ids
v3:
 (John Harrison)
  - Bring back in spin lock

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 104 ++++++++++++++----
 2 files changed, 86 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 25a598e2b6e8..a9f4ec972bfb 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -76,9 +76,13 @@ struct intel_guc {
 		 */
 		spinlock_t lock;
 		/**
-		 * @guc_ids: used to allocate new guc_ids
+		 * @guc_ids: used to allocate new guc_ids, single-lrc
 		 */
 		struct ida guc_ids;
+		/**
+		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
+		 */
+		unsigned long *guc_ids_bitmap;
 		/**
 		 * @guc_id_list: list of intel_context with valid guc_ids but no
 		 * refs
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 1f2809187513..79e7732e83b2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -128,6 +128,16 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
+/*
+ * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
+ * per the GuC submission interface. A different allocation algorithm is used
+ * (bitmap vs. ida) for multi-lrc and single-lrc, hence the need to
+ * partition the guc_id space. We believe the number of multi-lrc contexts in
+ * use should be low and that 1/16 should be sufficient. A minimum of 32
+ * guc_ids is reserved for multi-lrc.
+ */
+#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
+
 /*
  * Below is a set of functions which control the GuC scheduling state which
  * require a lock.
@@ -1206,6 +1216,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_WORK(&guc->submission_state.destroyed_worker,
 		  destroyed_worker_func);
 
+	guc->submission_state.guc_ids_bitmap =
+		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
+	if (!guc->submission_state.guc_ids_bitmap)
+		return -ENOMEM;
+
 	return 0;
 }
 
@@ -1217,6 +1232,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 	guc_lrc_desc_pool_destroy(guc);
 	guc_flush_destroyed_contexts(guc);
 	i915_sched_engine_put(guc->sched_engine);
+	bitmap_free(guc->submission_state.guc_ids_bitmap);
 }
 
 static inline void queue_request(struct i915_sched_engine *sched_engine,
@@ -1268,18 +1284,43 @@ static void guc_submit_request(struct i915_request *rq)
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
-static int new_guc_id(struct intel_guc *guc)
+static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
-	return ida_simple_get(&guc->submission_state.guc_ids, 0,
-			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
-			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+	int ret;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	if (intel_context_is_parent(ce))
+		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
+					      NUMBER_MULTI_LRC_GUC_ID,
+					      order_base_2(ce->parallel.number_children
+							   + 1));
+	else
+		ret = ida_simple_get(&guc->submission_state.guc_ids,
+				     NUMBER_MULTI_LRC_GUC_ID,
+				     GUC_MAX_LRC_DESCRIPTORS,
+				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
+				     __GFP_NOWARN);
+	if (unlikely(ret < 0))
+		return ret;
+
+	ce->guc_id.id = ret;
+	return 0;
 }
 
 static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	if (!context_guc_id_invalid(ce)) {
-		ida_simple_remove(&guc->submission_state.guc_ids,
-				  ce->guc_id.id);
+		if (intel_context_is_parent(ce))
+			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
+					      ce->guc_id.id,
+					      order_base_2(ce->parallel.number_children
+							   + 1));
+		else
+			ida_simple_remove(&guc->submission_state.guc_ids,
+					  ce->guc_id.id);
 		reset_lrc_desc(guc, ce->guc_id.id);
 		set_context_guc_id_invalid(ce);
 	}
@@ -1296,49 +1337,64 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
-static int steal_guc_id(struct intel_guc *guc)
+static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
-	struct intel_context *ce;
-	int guc_id;
+	struct intel_context *cn;
 
 	lockdep_assert_held(&guc->submission_state.lock);
+	GEM_BUG_ON(intel_context_is_child(ce));
+	GEM_BUG_ON(intel_context_is_parent(ce));
 
 	if (!list_empty(&guc->submission_state.guc_id_list)) {
-		ce = list_first_entry(&guc->submission_state.guc_id_list,
+		cn = list_first_entry(&guc->submission_state.guc_id_list,
 				      struct intel_context,
 				      guc_id.link);
 
-		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
-		GEM_BUG_ON(context_guc_id_invalid(ce));
+		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
+		GEM_BUG_ON(context_guc_id_invalid(cn));
+		GEM_BUG_ON(intel_context_is_child(cn));
+		GEM_BUG_ON(intel_context_is_parent(cn));
 
-		list_del_init(&ce->guc_id.link);
-		guc_id = ce->guc_id.id;
+		list_del_init(&cn->guc_id.link);
+		ce->guc_id = cn->guc_id;
 
 		spin_lock(&ce->guc_state.lock);
-		clr_context_registered(ce);
+		clr_context_registered(cn);
 		spin_unlock(&ce->guc_state.lock);
 
-		set_context_guc_id_invalid(ce);
-		return guc_id;
+		set_context_guc_id_invalid(cn);
+
+		return 0;
 	} else {
 		return -EAGAIN;
 	}
 }
 
-static int assign_guc_id(struct intel_guc *guc, u16 *out)
+static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	int ret;
 
 	lockdep_assert_held(&guc->submission_state.lock);
+	GEM_BUG_ON(intel_context_is_child(ce));
 
-	ret = new_guc_id(guc);
+	ret = new_guc_id(guc, ce);
 	if (unlikely(ret < 0)) {
-		ret = steal_guc_id(guc);
+		if (intel_context_is_parent(ce))
+			return -ENOSPC;
+
+		ret = steal_guc_id(guc, ce);
 		if (ret < 0)
 			return ret;
 	}
 
-	*out = ret;
+	if (intel_context_is_parent(ce)) {
+		struct intel_context *child;
+		int i = 1;
+
+		for_each_child(ce, child)
+			child->guc_id.id = ce->guc_id.id + i++;
+	}
+
 	return 0;
 }
 
@@ -1356,7 +1412,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	might_lock(&ce->guc_state.lock);
 
 	if (context_guc_id_invalid(ce)) {
-		ret = assign_guc_id(guc, &ce->guc_id.id);
+		ret = assign_guc_id(guc, ce);
 		if (ret)
 			goto out_unlock;
 		ret = 1;	/* Indicates newly assigned guc_id */
@@ -1398,8 +1454,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	unsigned long flags;
 
 	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
+	GEM_BUG_ON(intel_context_is_child(ce));
 
-	if (unlikely(context_guc_id_invalid(ce)))
+	if (unlikely(context_guc_id_invalid(ce) ||
+		     intel_context_is_parent(ce)))
 		return;
 
 	spin_lock_irqsave(&guc->submission_state.lock, flags);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [Intel-gfx] [PATCH 11/26] drm/i915/guc: Implement parallel context pin / unpin functions
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Parallel contexts are perma-pinned by the upper layers, which makes the
backend implementation rather simple. The parent pins the guc_id and the
children increment the parent's pin count on pin, ensuring all the
contexts are unpinned before we disable scheduling with the GuC or
deregister the context.
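
A condensed view of the resulting reference ordering, for illustration
(the real paths are the pin / unpin functions added below):

  /*
   * child pin:        __intel_context_pin(parent)  -> parent pin count++
   * child post_unpin: intel_context_unpin(parent)  -> parent pin count--
   *
   * so the parent cannot be unpinned - and hence the context cannot be
   * deregistered or have scheduling disabled - while any child is still
   * pinned.
   */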

v2:
 (Daniel Vetter)
  - Perma-pin parallel contexts

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 70 +++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 79e7732e83b2..031b1bf5ba91 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2583,6 +2583,76 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 	.get_sibling = guc_virtual_get_sibling,
 };
 
+/* Future patches will use this function */
+__maybe_unused
+static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
+{
+	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
+	struct intel_guc *guc = ce_to_guc(ce);
+	int ret;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	ret = pin_guc_id(guc, ce);
+	if (unlikely(ret < 0))
+		return ret;
+
+	return __guc_context_pin(ce, engine, vaddr);
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
+{
+	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	__intel_context_pin(ce->parallel.parent);
+	return __guc_context_pin(ce, engine, vaddr);
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static void guc_parent_context_unpin(struct intel_context *ce)
+{
+	struct intel_guc *guc = ce_to_guc(ce);
+
+	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_barrier(ce));
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	unpin_guc_id(guc, ce);
+	lrc_unpin(ce);
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static void guc_child_context_unpin(struct intel_context *ce)
+{
+	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_barrier(ce));
+	GEM_BUG_ON(!intel_context_is_child(ce));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	lrc_unpin(ce);
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static void guc_child_context_post_unpin(struct intel_context *ce)
+{
+	GEM_BUG_ON(!intel_context_is_child(ce));
+	GEM_BUG_ON(!intel_context_is_pinned(ce->parallel.parent));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	lrc_post_unpin(ce);
+	intel_context_unpin(ce->parallel.parent);
+}
+
 static bool
 guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
 {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [Intel-gfx] [PATCH 12/26] drm/i915/guc: Implement multi-lrc submission
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Implement multi-lrc submission via a single workqueue entry and a single
H2G. The workqueue entry contains an updated ring tail value for each
context in the multi-lrc submission, allowing the GuC to update all of
the tails simultaneously. As such, the tasklet and bypass path have been
updated to coalesce requests into a single submission.
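
For reference, the multi-LRC work queue item built in
__guc_wq_item_append() below has the following dword layout (N = number
of children; ring tails are expressed in qwords):

  dw0:         WQ_TYPE_MULTI_LRC | len_dw (= N + 3)
  dw1:         parent LRCA
  dw2:         parent guc_id | parent ring tail
  dw3:         fence_id (currently always 0)
  dw4..dw3+N:  child ring tails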

v2:
 (John Harrison)
  - s/wqe/wqi
  - Use FIELD_PREP macros
  - Add GEM_BUG_ONs to ensure the length fits within the field
  - Add comment / white space to intel_guc_write_barrier
 (Kernel test robot)
  - Make need_tasklet a static function

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  26 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  23 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 319 ++++++++++++++++--
 drivers/gpu/drm/i915/i915_request.h           |   8 +
 6 files changed, 335 insertions(+), 73 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
index 8f8182bf7c11..7191e8439290 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
@@ -756,3 +756,29 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
 		}
 	}
 }
+
+void intel_guc_write_barrier(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+
+	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
+		/*
+		 * Ensure intel_uncore_write_fw can be used rather than
+		 * intel_uncore_write.
+		 */
+		GEM_BUG_ON(guc->send_regs.fw_domains);
+
+		/*
+		 * This register is used by the i915 and GuC for MMIO based
+		 * communication. Once we are in this code CTBs are the only
+		 * method the i915 uses to communicate with the GuC so it is
+		 * safe to write to this register (a value of 0 is NOP for MMIO
+		 * communication). If we ever start mixing CTBs and MMIOs a new
+		 * register will have to be chosen.
+		 */
+		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
+	} else {
+		/* wmb() sufficient for a barrier if in smem */
+		wmb();
+	}
+}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index a9f4ec972bfb..147f39cc0f2f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -46,6 +46,12 @@ struct intel_guc {
 	 * submitted until the stalled request is processed.
 	 */
 	struct i915_request *stalled_request;
+	enum {
+		STALL_NONE,
+		STALL_REGISTER_CONTEXT,
+		STALL_MOVE_LRC_TAIL,
+		STALL_ADD_REQUEST,
+	} submission_stall_reason;
 
 	/* intel_guc_recv interrupt related state */
 	/** @irq_lock: protects GuC irq state */
@@ -361,4 +367,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
 
 void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
 
+void intel_guc_write_barrier(struct intel_guc *guc);
+
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 20c710a74498..10d1878d2826 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
 	return ++ct->requests.last_fence;
 }
 
-static void write_barrier(struct intel_guc_ct *ct)
-{
-	struct intel_guc *guc = ct_to_guc(ct);
-	struct intel_gt *gt = guc_to_gt(guc);
-
-	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
-		GEM_BUG_ON(guc->send_regs.fw_domains);
-		/*
-		 * This register is used by the i915 and GuC for MMIO based
-		 * communication. Once we are in this code CTBs are the only
-		 * method the i915 uses to communicate with the GuC so it is
-		 * safe to write to this register (a value of 0 is NOP for MMIO
-		 * communication). If we ever start mixing CTBs and MMIOs a new
-		 * register will have to be chosen.
-		 */
-		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
-	} else {
-		/* wmb() sufficient for a barrier if in smem */
-		wmb();
-	}
-}
-
 static int ct_write(struct intel_guc_ct *ct,
 		    const u32 *action,
 		    u32 len /* in dwords */,
@@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
 	 * make sure H2G buffer update and LRC tail update (if this triggering a
 	 * submission) are visible before updating the descriptor tail
 	 */
-	write_barrier(ct);
+	intel_guc_write_barrier(ct_to_guc(ct));
 
 	/* update local copies */
 	ctb->tail = tail;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 0eeb2a9feeed..a00eeddc1449 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -58,19 +58,16 @@
 #define WQ_STATUS_CMD_ERROR		3
 #define WQ_STATUS_ENGINE_ID_NOT_USED	4
 #define WQ_STATUS_SUSPENDED_FROM_RESET	5
-#define WQ_TYPE_SHIFT			0
-#define   WQ_TYPE_BATCH_BUF		(0x1 << WQ_TYPE_SHIFT)
-#define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
-#define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
-#define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
-#define WQ_TARGET_SHIFT			10
-#define WQ_LEN_SHIFT			16
-#define WQ_NO_WCFLUSH_WAIT		(1 << 27)
-#define WQ_PRESENT_WORKLOAD		(1 << 28)
-
-#define WQ_RING_TAIL_SHIFT		20
-#define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
-#define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
+#define WQ_TYPE_BATCH_BUF		0x1
+#define WQ_TYPE_PSEUDO			0x2
+#define WQ_TYPE_INORDER			0x3
+#define WQ_TYPE_NOOP			0x4
+#define WQ_TYPE_MULTI_LRC		0x5
+#define WQ_TYPE_MASK			GENMASK(7, 0)
+#define WQ_LEN_MASK			GENMASK(26, 16)
+
+#define WQ_GUC_ID_MASK			GENMASK(15, 0)
+#define WQ_RING_TAIL_MASK		GENMASK(28, 18)
 
 #define GUC_STAGE_DESC_ATTR_ACTIVE	BIT(0)
 #define GUC_STAGE_DESC_ATTR_PENDING_DB	BIT(1)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 031b1bf5ba91..1610120e31a1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -399,6 +399,29 @@ __get_process_desc(struct intel_context *ce)
 		   LRC_STATE_OFFSET) / sizeof(u32)));
 }
 
+static u32 *get_wq_pointer(struct guc_process_desc *desc,
+			   struct intel_context *ce,
+			   u32 wqi_size)
+{
+	/*
+	 * Check for space in the work queue. We cache a value of the head
+	 * pointer in the intel_context structure to reduce the number of
+	 * accesses to shared GPU memory, which may be across a PCIe bus.
+	 */
+#define AVAILABLE_SPACE	\
+	CIRC_SPACE(ce->parallel.guc.wqi_tail, ce->parallel.guc.wqi_head, WQ_SIZE)
+	if (wqi_size > AVAILABLE_SPACE) {
+		ce->parallel.guc.wqi_head = READ_ONCE(desc->head);
+
+		if (wqi_size > AVAILABLE_SPACE)
+			return NULL;
+	}
+#undef AVAILABLE_SPACE
+
+	return ((u32 *)__get_process_desc(ce)) +
+		((WQ_OFFSET + ce->parallel.guc.wqi_tail) / sizeof(u32));
+}
+
 static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 {
 	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
@@ -558,10 +581,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
 
 static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
 
-static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
+static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 {
 	int err = 0;
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u32 action[3];
 	int len = 0;
 	u32 g2h_len_dw = 0;
@@ -582,26 +605,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
 	GEM_BUG_ON(context_guc_id_invalid(ce));
 
-	/*
-	 * Corner case where the GuC firmware was blown away and reloaded while
-	 * this context was pinned.
-	 */
-	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
-		err = guc_lrc_desc_pin(ce, false);
-		if (unlikely(err))
-			return err;
-	}
-
 	spin_lock(&ce->guc_state.lock);
 
 	/*
 	 * The request / context will be run on the hardware when scheduling
-	 * gets enabled in the unblock.
+	 * gets enabled in the unblock. For multi-lrc we still submit the
+	 * context to move the LRC tails.
 	 */
-	if (unlikely(context_blocked(ce)))
+	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
 		goto out;
 
-	enabled = context_enabled(ce);
+	enabled = context_enabled(ce) || context_blocked(ce);
 
 	if (!enabled) {
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
@@ -620,6 +634,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		trace_intel_context_sched_enable(ce);
 		atomic_inc(&guc->outstanding_submission_g2h);
 		set_context_enabled(ce);
+
+		/*
+		 * Without multi-lrc, the KMD does the submission step (moving
+		 * the LRC tail), so enabling scheduling is sufficient to
+		 * submit the context. This isn't the case with multi-lrc
+		 * submission, as the GuC needs to move the tails, hence the
+		 * need for another H2G after enabling scheduling.
+		 */
+		if (intel_context_is_parent(ce)) {
+			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
+			err = intel_guc_send_nb(guc, action, len - 1, 0);
+		}
 	} else if (!enabled) {
 		clr_context_pending_enable(ce);
 		intel_context_put(ce);
@@ -632,6 +658,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	return err;
 }
 
+static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
+{
+	int ret = __guc_add_request(guc, rq);
+
+	if (unlikely(ret == -EBUSY)) {
+		guc->stalled_request = rq;
+		guc->submission_stall_reason = STALL_ADD_REQUEST;
+	}
+
+	return ret;
+}
+
 static inline void guc_set_lrc_tail(struct i915_request *rq)
 {
 	rq->context->lrc_reg_state[CTX_RING_TAIL] =
@@ -643,6 +681,134 @@ static inline int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
+static bool is_multi_lrc_rq(struct i915_request *rq)
+{
+	return intel_context_is_child(rq->context) ||
+		intel_context_is_parent(rq->context);
+}
+
+static bool can_merge_rq(struct i915_request *rq,
+			 struct i915_request *last)
+{
+	return request_to_scheduling_context(rq) ==
+		request_to_scheduling_context(last);
+}
+
+static u32 wq_space_until_wrap(struct intel_context *ce)
+{
+	return (WQ_SIZE - ce->parallel.guc.wqi_tail);
+}
+
+static void write_wqi(struct guc_process_desc *desc,
+		      struct intel_context *ce,
+		      u32 wqi_size)
+{
+	/*
+	 * Ensure WQIs are visible before updating the tail
+	 */
+	intel_guc_write_barrier(ce_to_guc(ce));
+
+	ce->parallel.guc.wqi_tail = (ce->parallel.guc.wqi_tail + wqi_size) &
+		(WQ_SIZE - 1);
+	WRITE_ONCE(desc->tail, ce->parallel.guc.wqi_tail);
+}
+
+static int guc_wq_noop_append(struct intel_context *ce)
+{
+	struct guc_process_desc *desc = __get_process_desc(ce);
+	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
+	u32 len_dw = wq_space_until_wrap(ce) / sizeof(u32) - 1;
+
+	if (!wqi)
+		return -EBUSY;
+
+	GEM_BUG_ON(!FIELD_FIT(WQ_LEN_MASK, len_dw));
+
+	*wqi = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
+		FIELD_PREP(WQ_LEN_MASK, len_dw);
+	ce->parallel.guc.wqi_tail = 0;
+
+	return 0;
+}
+
+static int __guc_wq_item_append(struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+	struct intel_context *child;
+	struct guc_process_desc *desc = __get_process_desc(ce);
+	unsigned int wqi_size = (ce->parallel.number_children + 4) *
+		sizeof(u32);
+	u32 *wqi;
+	u32 len_dw = (wqi_size / sizeof(u32)) - 1;
+	int ret;
+
+	/* Ensure the context is in the correct state before updating the work queue */
+	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
+	GEM_BUG_ON(context_guc_id_invalid(ce));
+	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
+	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
+
+	/* Insert NOOP if this work queue item will wrap the tail pointer. */
+	if (wqi_size > wq_space_until_wrap(ce)) {
+		ret = guc_wq_noop_append(ce);
+		if (ret)
+			return ret;
+	}
+
+	wqi = get_wq_pointer(desc, ce, wqi_size);
+	if (!wqi)
+		return -EBUSY;
+
+	GEM_BUG_ON(!FIELD_FIT(WQ_LEN_MASK, len_dw));
+
+	*wqi++ = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_MULTI_LRC) |
+		FIELD_PREP(WQ_LEN_MASK, len_dw);
+	*wqi++ = ce->lrc.lrca;
+	*wqi++ = FIELD_PREP(WQ_GUC_ID_MASK, ce->guc_id.id) |
+	       FIELD_PREP(WQ_RING_TAIL_MASK, ce->ring->tail / sizeof(u64));
+	*wqi++ = 0;	/* fence_id */
+	for_each_child(ce, child)
+		*wqi++ = child->ring->tail / sizeof(u64);
+
+	write_wqi(desc, ce, wqi_size);
+
+	return 0;
+}
+
+static int guc_wq_item_append(struct intel_guc *guc,
+			      struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+	int ret = 0;
+
+	if (likely(!intel_context_is_banned(ce))) {
+		ret = __guc_wq_item_append(rq);
+
+		if (unlikely(ret == -EBUSY)) {
+			guc->stalled_request = rq;
+			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
+		}
+	}
+
+	return ret;
+}
+
+static bool multi_lrc_submit(struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	intel_ring_set_tail(rq->ring, rq->tail);
+
+	/*
+	 * We expect the front end (execbuf IOCTL) to set this flag on the last
+	 * request generated from a multi-BB submission. This indicates to the
+	 * backend (GuC interface) that we should submit this context, thus
+	 * submitting all the requests generated in parallel.
+	 */
+	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
+		intel_context_is_banned(ce);
+}
+
 static int guc_dequeue_one_context(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
@@ -656,7 +822,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 	if (guc->stalled_request) {
 		submit = true;
 		last = guc->stalled_request;
-		goto resubmit;
+
+		switch (guc->submission_stall_reason) {
+		case STALL_REGISTER_CONTEXT:
+			goto register_context;
+		case STALL_MOVE_LRC_TAIL:
+			goto move_lrc_tail;
+		case STALL_ADD_REQUEST:
+			goto add_request;
+		default:
+			MISSING_CASE(guc->submission_stall_reason);
+		}
 	}
 
 	while ((rb = rb_first_cached(&sched_engine->queue))) {
@@ -664,8 +840,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 		struct i915_request *rq, *rn;
 
 		priolist_for_each_request_consume(rq, rn, p) {
-			if (last && rq->context != last->context)
-				goto done;
+			if (last && !can_merge_rq(rq, last))
+				goto register_context;
 
 			list_del_init(&rq->sched.link);
 
@@ -673,33 +849,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 
 			trace_i915_request_in(rq, 0);
 			last = rq;
-			submit = true;
+
+			if (is_multi_lrc_rq(rq)) {
+				/*
+				 * We need to coalesce all multi-lrc requests in
+				 * a relationship into a single H2G. We are
+				 * guaranteed that all of these requests will be
+				 * submitted sequentially.
+				 */
+				if (multi_lrc_submit(rq)) {
+					submit = true;
+					goto register_context;
+				}
+			} else {
+				submit = true;
+			}
 		}
 
 		rb_erase_cached(&p->node, &sched_engine->queue);
 		i915_priolist_free(p);
 	}
-done:
+
+register_context:
 	if (submit) {
-		guc_set_lrc_tail(last);
-resubmit:
+		struct intel_context *ce = request_to_scheduling_context(last);
+
+		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
+			     !intel_context_is_banned(ce))) {
+			ret = guc_lrc_desc_pin(ce, false);
+			if (unlikely(ret == -EPIPE)) {
+				goto deadlk;
+			} else if (ret == -EBUSY) {
+				guc->stalled_request = last;
+				guc->submission_stall_reason =
+					STALL_REGISTER_CONTEXT;
+				goto schedule_tasklet;
+			} else if (ret != 0) {
+				GEM_WARN_ON(ret);	/* Unexpected */
+				goto deadlk;
+			}
+		}
+
+move_lrc_tail:
+		if (is_multi_lrc_rq(last)) {
+			ret = guc_wq_item_append(guc, last);
+			if (ret == -EBUSY) {
+				goto schedule_tasklet;
+			} else if (ret != 0) {
+				GEM_WARN_ON(ret);	/* Unexpected */
+				goto deadlk;
+			}
+		} else {
+			guc_set_lrc_tail(last);
+		}
+
+add_request:
 		ret = guc_add_request(guc, last);
-		if (unlikely(ret == -EPIPE))
+		if (unlikely(ret == -EPIPE)) {
+			goto deadlk;
+		} else if (ret == -EBUSY) {
+			goto schedule_tasklet;
+		} else if (ret != 0) {
+			GEM_WARN_ON(ret);	/* Unexpected */
 			goto deadlk;
-		else if (ret == -EBUSY) {
-			tasklet_schedule(&sched_engine->tasklet);
-			guc->stalled_request = last;
-			return false;
 		}
 	}
 
 	guc->stalled_request = NULL;
+	guc->submission_stall_reason = STALL_NONE;
 	return submit;
 
 deadlk:
 	sched_engine->tasklet.callback = NULL;
 	tasklet_disable_nosync(&sched_engine->tasklet);
 	return false;
+
+schedule_tasklet:
+	tasklet_schedule(&sched_engine->tasklet);
+	return false;
 }
 
 static void guc_submission_tasklet(struct tasklet_struct *t)
@@ -1255,10 +1482,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
 
 	trace_i915_request_in(rq, 0);
 
-	guc_set_lrc_tail(rq);
-	ret = guc_add_request(guc, rq);
-	if (ret == -EBUSY)
-		guc->stalled_request = rq;
+	if (is_multi_lrc_rq(rq)) {
+		if (multi_lrc_submit(rq)) {
+			ret = guc_wq_item_append(guc, rq);
+			if (!ret)
+				ret = guc_add_request(guc, rq);
+		}
+	} else {
+		guc_set_lrc_tail(rq);
+		ret = guc_add_request(guc, rq);
+	}
 
 	if (unlikely(ret == -EPIPE))
 		disable_submission(guc);
@@ -1266,6 +1499,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
 	return ret;
 }
 
+static bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
+{
+	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	return submission_disabled(guc) || guc->stalled_request ||
+		!i915_sched_engine_is_empty(sched_engine) ||
+		!lrc_desc_registered(guc, ce->guc_id.id);
+}
+
 static void guc_submit_request(struct i915_request *rq)
 {
 	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
@@ -1275,8 +1518,7 @@ static void guc_submit_request(struct i915_request *rq)
 	/* Will be called from irq-context when using foreign fences. */
 	spin_lock_irqsave(&sched_engine->lock, flags);
 
-	if (submission_disabled(guc) || guc->stalled_request ||
-	    !i915_sched_engine_is_empty(sched_engine))
+	if (need_tasklet(guc, rq))
 		queue_request(sched_engine, rq, rq_prio(rq));
 	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
 		tasklet_hi_schedule(&sched_engine->tasklet);
@@ -2258,9 +2500,10 @@ static inline bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
 
 static void add_to_context(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
 
 	spin_lock(&ce->guc_state.lock);
@@ -2293,7 +2536,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
 
 static void remove_from_context(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	spin_lock_irq(&ce->guc_state.lock);
 
@@ -2712,7 +2957,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
 static void guc_bump_inflight_request_prio(struct i915_request *rq,
 					   int prio)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
 
 	/* Short circuit function */
@@ -2735,7 +2980,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 
 static void guc_retire_inflight_request_prio(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 
 	spin_lock(&ce->guc_state.lock);
 	guc_prio_fini(rq, ce);
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 7bd9ed20623e..8950785e55d6 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -139,6 +139,14 @@ enum {
 	 * the GPU. Here we track such boost requests on a per-request basis.
 	 */
 	I915_FENCE_FLAG_BOOST,
+
+	/*
+	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
+	 * parent-child relationship (parallel submission, multi-lrc) should
+	 * trigger a submission to the GuC rather than just moving the context
+	 * tail.
+	 */
+	I915_FENCE_FLAG_SUBMIT_PARALLEL,
 };
 
 /**
-- 
2.32.0


* [PATCH 13/26] drm/i915/guc: Insert submit fences between requests in parent-child relationship
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

For contexts in a parent-child relationship to function correctly, the
GuC must receive their requests in the order they were submitted. To
ensure this, insert a submit fence between the current request and the
last request submitted for contexts in the parent-child relationship.
This is conceptually similar to a single timeline.
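
In essence, the ordering added below in i915_request.c reduces to the
following (a condensed sketch of __i915_request_ensure_parallel_ordering(),
with the scheduler dependency hook omitted):

	prev = request_to_parent(rq)->parallel.last_rq;
	if (prev) {
		if (!__i915_request_is_complete(prev))
			/* rq must not be submitted before prev */
			i915_sw_fence_await_sw_fence(&rq->submit,
						     &prev->submit,
						     &rq->submitq);
		i915_request_put(prev);
	}
	request_to_parent(rq)->parallel.last_rq = i915_request_get(rq);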

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.h       |   5 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   6 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   5 +-
 drivers/gpu/drm/i915/i915_request.c           | 120 ++++++++++++++----
 4 files changed, 108 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index b63c10a144af..1bc705f98e2a 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -75,6 +75,11 @@ intel_context_to_parent(struct intel_context *ce)
 	}
 }
 
+static inline bool intel_context_is_parallel(struct intel_context *ce)
+{
+	return intel_context_is_child(ce) || intel_context_is_parent(ce);
+}
+
 void intel_context_bind_parent_child(struct intel_context *parent,
 				     struct intel_context *child);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 48decb5ee954..8309d1141d0a 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -237,6 +237,12 @@ struct intel_context {
 		};
 		/** @parent: pointer to parent if child */
 		struct intel_context *parent;
+		/**
+		 * @last_rq: last request submitted on a parallel context, used
+		 * to insert submit fences between requests in the parallel
+		 * context
+		 */
+		struct i915_request *last_rq;
 		/** @number_children: number of children if parent */
 		u8 number_children;
 		/** @guc: GuC specific members for parallel submission */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 1610120e31a1..6be7adf89e4f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -683,8 +683,7 @@ static inline int rq_prio(const struct i915_request *rq)
 
 static bool is_multi_lrc_rq(struct i915_request *rq)
 {
-	return intel_context_is_child(rq->context) ||
-		intel_context_is_parent(rq->context);
+	return intel_context_is_parallel(rq->context);
 }
 
 static bool can_merge_rq(struct i915_request *rq,
@@ -2870,6 +2869,8 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 	GEM_BUG_ON(!intel_context_is_parent(ce));
 	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
 
+	if (ce->parallel.last_rq)
+		i915_request_put(ce->parallel.last_rq);
 	unpin_guc_id(guc, ce);
 	lrc_unpin(ce);
 }
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 79da5eca60af..e9bfa32f9270 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1539,36 +1539,62 @@ i915_request_await_object(struct i915_request *to,
 	return ret;
 }
 
+static inline bool is_parallel_rq(struct i915_request *rq)
+{
+	return intel_context_is_parallel(rq->context);
+}
+
+static inline struct intel_context *request_to_parent(struct i915_request *rq)
+{
+	return intel_context_to_parent(rq->context);
+}
+
 static struct i915_request *
-__i915_request_add_to_timeline(struct i915_request *rq)
+__i915_request_ensure_parallel_ordering(struct i915_request *rq,
+					struct intel_timeline *timeline)
 {
-	struct intel_timeline *timeline = i915_request_timeline(rq);
 	struct i915_request *prev;
 
-	/*
-	 * Dependency tracking and request ordering along the timeline
-	 * is special cased so that we can eliminate redundant ordering
-	 * operations while building the request (we know that the timeline
-	 * itself is ordered, and here we guarantee it).
-	 *
-	 * As we know we will need to emit tracking along the timeline,
-	 * we embed the hooks into our request struct -- at the cost of
-	 * having to have specialised no-allocation interfaces (which will
-	 * be beneficial elsewhere).
-	 *
-	 * A second benefit to open-coding i915_request_await_request is
-	 * that we can apply a slight variant of the rules specialised
-	 * for timelines that jump between engines (such as virtual engines).
-	 * If we consider the case of virtual engine, we must emit a dma-fence
-	 * to prevent scheduling of the second request until the first is
-	 * complete (to maximise our greedy late load balancing) and this
-	 * precludes optimising to use semaphores serialisation of a single
-	 * timeline across engines.
-	 */
+	GEM_BUG_ON(!is_parallel_rq(rq));
+
+	prev = request_to_parent(rq)->parallel.last_rq;
+	if (prev) {
+		if (!__i915_request_is_complete(prev)) {
+			i915_sw_fence_await_sw_fence(&rq->submit,
+						     &prev->submit,
+						     &rq->submitq);
+
+			if (rq->engine->sched_engine->schedule)
+				__i915_sched_node_add_dependency(&rq->sched,
+								 &prev->sched,
+								 &rq->dep,
+								 0);
+		}
+		i915_request_put(prev);
+	}
+
+	request_to_parent(rq)->parallel.last_rq = i915_request_get(rq);
+
+	return to_request(__i915_active_fence_set(&timeline->last_request,
+						  &rq->fence));
+}
+
+static struct i915_request *
+__i915_request_ensure_ordering(struct i915_request *rq,
+			       struct intel_timeline *timeline)
+{
+	struct i915_request *prev;
+
+	GEM_BUG_ON(is_parallel_rq(rq));
+
 	prev = to_request(__i915_active_fence_set(&timeline->last_request,
 						  &rq->fence));
+
 	if (prev && !__i915_request_is_complete(prev)) {
 		bool uses_guc = intel_engine_uses_guc(rq->engine);
+		bool pow2 = is_power_of_2(READ_ONCE(prev->engine)->mask |
+					  rq->engine->mask);
+		bool same_context = prev->context == rq->context;
 
 		/*
 		 * The requests are supposed to be kept in order. However,
@@ -1576,13 +1602,11 @@ __i915_request_add_to_timeline(struct i915_request *rq)
 		 * is used as a barrier for external modification to this
 		 * context.
 		 */
-		GEM_BUG_ON(prev->context == rq->context &&
+		GEM_BUG_ON(same_context &&
 			   i915_seqno_passed(prev->fence.seqno,
 					     rq->fence.seqno));
 
-		if ((!uses_guc &&
-		     is_power_of_2(READ_ONCE(prev->engine)->mask | rq->engine->mask)) ||
-		    (uses_guc && prev->context == rq->context))
+		if ((same_context && uses_guc) || (!uses_guc && pow2))
 			i915_sw_fence_await_sw_fence(&rq->submit,
 						     &prev->submit,
 						     &rq->submitq);
@@ -1597,6 +1621,50 @@ __i915_request_add_to_timeline(struct i915_request *rq)
 							 0);
 	}
 
+	return prev;
+}
+
+static struct i915_request *
+__i915_request_add_to_timeline(struct i915_request *rq)
+{
+	struct intel_timeline *timeline = i915_request_timeline(rq);
+	struct i915_request *prev;
+
+	/*
+	 * Dependency tracking and request ordering along the timeline
+	 * is special cased so that we can eliminate redundant ordering
+	 * operations while building the request (we know that the timeline
+	 * itself is ordered, and here we guarantee it).
+	 *
+	 * As we know we will need to emit tracking along the timeline,
+	 * we embed the hooks into our request struct -- at the cost of
+	 * having to have specialised no-allocation interfaces (which will
+	 * be beneficial elsewhere).
+	 *
+	 * A second benefit to open-coding i915_request_await_request is
+	 * that we can apply a slight variant of the rules specialised
+	 * for timelines that jump between engines (such as virtual engines).
+	 * If we consider the case of virtual engine, we must emit a dma-fence
+	 * to prevent scheduling of the second request until the first is
+	 * complete (to maximise our greedy late load balancing) and this
+	 * precludes optimising to use semaphores serialisation of a single
+	 * timeline across engines.
+	 *
+	 * We do not order parallel submission requests on the timeline as each
+	 * parallel submission context has its own timeline and the ordering
+	 * rules for parallel requests are that they must be submitted in the
+	 * order received from the execbuf IOCTL. So rather than using the
+	 * timeline we store a pointer to the last request submitted in the
+	 * relationship in the gem context and insert a submission fence
+	 * between that request and the request passed into this function, or
+	 * alternatively we use the completion fence if the gem context has a
+	 * single timeline and this is the first submission of an execbuf IOCTL.
+	 */
+	if (likely(!is_parallel_rq(rq)))
+		prev = __i915_request_ensure_ordering(rq, timeline);
+	else
+		prev = __i915_request_ensure_parallel_ordering(rq, timeline);
+
 	/*
 	 * Make sure that no request gazumped us - if it was allocated after
 	 * our i915_request_alloc() and called __i915_request_add() before
-- 
2.32.0


* [PATCH 14/26] drm/i915/guc: Implement multi-lrc reset
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Update context and full GPU reset to work with multi-lrc. The idea is
that the parent context tracks all the active requests inflight for
itself and its children. The parent context owns the reset, replaying /
canceling requests as needed.
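
Conceptually, the reset flow added to __guc_reset_context() below is (a
condensed sketch; the pin checks and the skip path are omitted):

	ce = parent;
	for (i = 0; i < parent->parallel.number_children + 1; ++i) {
		rq = intel_context_find_active_request(ce);
		if (rq) {
			head = intel_ring_wrap(ce->ring, rq->head);
			__i915_request_reset(rq, i915_request_started(rq) &&
					     stalled);
			guc_reset_state(ce, head,
					i915_request_started(rq) && stalled);
		} else {
			/* No hanging request, replay from the current tail */
			guc_reset_state(ce, ce->ring->tail, false);
		}
		if (i != parent->parallel.number_children)
			ce = list_next_entry(ce, parallel.child_link);
	}
	__unwind_incomplete_requests(parent);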

v2:
 (John Harrison)
  - Simplify loop in find active request
  - Add comments to find active request / reset loop

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 15 +++-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 69 ++++++++++++++-----
 2 files changed, 63 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index c5bb7ccfb3f8..3b340eb59ada 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -528,20 +528,29 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
 
 struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 {
+	struct intel_context *parent = intel_context_to_parent(ce);
 	struct i915_request *rq, *active = NULL;
 	unsigned long flags;
 
 	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
 
-	spin_lock_irqsave(&ce->guc_state.lock, flags);
-	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
+	/*
+	 * We search the parent list to find an active request on the submitted
+	 * context. The parent list contains the requests for all the contexts
+	 * in the relationship, so we have to check each request's context
+	 * against the context we are searching for.
+	 */
+	spin_lock_irqsave(&parent->guc_state.lock, flags);
+	list_for_each_entry_reverse(rq, &parent->guc_state.requests,
 				    sched.link) {
+		if (rq->context != ce)
+			continue;
 		if (i915_request_completed(rq))
 			break;
 
 		active = rq;
 	}
-	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
 	return active;
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 6be7adf89e4f..d661a69ef4f7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -681,6 +681,11 @@ static inline int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
+static inline bool is_multi_lrc(struct intel_context *ce)
+{
+	return intel_context_is_parallel(ce);
+}
+
 static bool is_multi_lrc_rq(struct i915_request *rq)
 {
 	return intel_context_is_parallel(rq->context);
@@ -1214,10 +1219,15 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
+	bool local_stalled;
 	struct i915_request *rq;
 	unsigned long flags;
 	u32 head;
+	int i, number_children = ce->parallel.number_children;
 	bool skip = false;
+	struct intel_context *parent = ce;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	intel_context_get(ce);
 
@@ -1243,25 +1253,38 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
 	if (unlikely(skip))
 		goto out_put;
 
-	rq = intel_context_find_active_request(ce);
-	if (!rq) {
-		head = ce->ring->tail;
-		stalled = false;
-		goto out_replay;
-	}
+	/*
+	 * For each context in the relationship, find the hanging request,
+	 * resetting each context / request as needed
+	 */
+	for (i = 0; i < number_children + 1; ++i) {
+		if (!intel_context_is_pinned(ce))
+			goto next_context;
+
+		local_stalled = false;
+		rq = intel_context_find_active_request(ce);
+		if (!rq) {
+			head = ce->ring->tail;
+			goto out_replay;
+		}
 
-	if (!i915_request_started(rq))
-		stalled = false;
+		GEM_BUG_ON(i915_active_is_idle(&ce->active));
+		head = intel_ring_wrap(ce->ring, rq->head);
 
-	GEM_BUG_ON(i915_active_is_idle(&ce->active));
-	head = intel_ring_wrap(ce->ring, rq->head);
-	__i915_request_reset(rq, stalled);
+		if (i915_request_started(rq))
+			local_stalled = true;
 
+		__i915_request_reset(rq, local_stalled && stalled);
 out_replay:
-	guc_reset_state(ce, head, stalled);
-	__unwind_incomplete_requests(ce);
+		guc_reset_state(ce, head, local_stalled && stalled);
+next_context:
+		if (i != number_children)
+			ce = list_next_entry(ce, parallel.child_link);
+	}
+
+	__unwind_incomplete_requests(parent);
 out_put:
-	intel_context_put(ce);
+	intel_context_put(parent);
 }
 
 void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
@@ -1282,7 +1305,8 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
 
 		xa_unlock(&guc->context_lookup);
 
-		if (intel_context_is_pinned(ce))
+		if (intel_context_is_pinned(ce) &&
+		    !intel_context_is_child(ce))
 			__guc_reset_context(ce, stalled);
 
 		intel_context_put(ce);
@@ -1374,7 +1398,8 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 
 		xa_unlock(&guc->context_lookup);
 
-		if (intel_context_is_pinned(ce))
+		if (intel_context_is_pinned(ce) &&
+		    !intel_context_is_child(ce))
 			guc_cancel_context_requests(ce);
 
 		intel_context_put(ce);
@@ -2067,6 +2092,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 	u16 guc_id;
 	bool enabled;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	incr_context_blocked(ce);
@@ -2121,6 +2148,7 @@ static void guc_context_unblock(struct intel_context *ce)
 	bool enable;
 
 	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
@@ -2147,11 +2175,14 @@ static void guc_context_unblock(struct intel_context *ce)
 static void guc_context_cancel_request(struct intel_context *ce,
 				       struct i915_request *rq)
 {
+	struct intel_context *block_context =
+		request_to_scheduling_context(rq);
+
 	if (i915_sw_fence_signaled(&rq->submit)) {
 		struct i915_sw_fence *fence;
 
 		intel_context_get(ce);
-		fence = guc_context_block(ce);
+		fence = guc_context_block(block_context);
 		i915_sw_fence_wait(fence);
 		if (!i915_request_completed(rq)) {
 			__i915_request_skip(rq);
@@ -2165,7 +2196,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
 		 */
 		flush_work(&ce_to_guc(ce)->ct.requests.worker);
 
-		guc_context_unblock(ce);
+		guc_context_unblock(block_context);
 		intel_context_put(ce);
 	}
 }
@@ -2191,6 +2222,8 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
 	intel_wakeref_t wakeref;
 	unsigned long flags;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	guc_flush_submissions(guc);
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
-- 
2.32.0


* [PATCH 15/26] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Display the workqueue status in debugfs for GuC contexts that are in a
parent-child relationship.
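
The per-child dump relies on the for_each_child() iterator added
earlier in the series. As a rough sketch, assuming the macro simply
walks the parent's child list via the same parallel.child_link used
elsewhere in the series:

	/* hedged sketch; the actual macro lives in intel_context.h */
	#define for_each_child(parent__, ce__) \
		list_for_each_entry(ce__, \
				    &(parent__)->parallel.child_list, \
				    parallel.child_link)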

v2:
 (John Harrison)
  - Output the number of children in debugfs

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 53 ++++++++++++++-----
 1 file changed, 39 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index d661a69ef4f7..f69e984683aa 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3704,6 +3704,26 @@ static inline void guc_log_context_priority(struct drm_printer *p,
 	drm_printf(p, "\n");
 }
 
+
+static inline void guc_log_context(struct drm_printer *p,
+				   struct intel_context *ce)
+{
+	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
+	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
+	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
+		   ce->ring->head,
+		   ce->lrc_reg_state[CTX_RING_HEAD]);
+	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
+		   ce->ring->tail,
+		   ce->lrc_reg_state[CTX_RING_TAIL]);
+	drm_printf(p, "\t\tContext Pin Count: %u\n",
+		   atomic_read(&ce->pin_count));
+	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
+		   atomic_read(&ce->guc_id.ref));
+	drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
+		   ce->guc_state.sched_state);
+}
+
 void intel_guc_submission_print_context_info(struct intel_guc *guc,
 					     struct drm_printer *p)
 {
@@ -3713,22 +3733,27 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 
 	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
-		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
-		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
-			   ce->ring->head,
-			   ce->lrc_reg_state[CTX_RING_HEAD]);
-		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
-			   ce->ring->tail,
-			   ce->lrc_reg_state[CTX_RING_TAIL]);
-		drm_printf(p, "\t\tContext Pin Count: %u\n",
-			   atomic_read(&ce->pin_count));
-		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
-			   atomic_read(&ce->guc_id.ref));
-		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
-			   ce->guc_state.sched_state);
+		GEM_BUG_ON(intel_context_is_child(ce));
 
+		guc_log_context(p, ce);
 		guc_log_context_priority(p, ce);
+
+		if (intel_context_is_parent(ce)) {
+			struct guc_process_desc *desc = __get_process_desc(ce);
+			struct intel_context *child;
+
+			drm_printf(p, "\t\tNumber children: %u\n",
+				   ce->parallel.number_children);
+			drm_printf(p, "\t\tWQI Head: %u\n",
+				   READ_ONCE(desc->head));
+			drm_printf(p, "\t\tWQI Tail: %u\n",
+				   READ_ONCE(desc->tail));
+			drm_printf(p, "\t\tWQI Status: %u\n\n",
+				   READ_ONCE(desc->wq_status));
+
+			for_each_child(ce, child)
+				guc_log_context(p, child);
+		}
 	}
 	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 16/26] drm/i915: Fix bug in user proto-context creation that leaked contexts
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Set the number of engines before attempting to create the contexts so
that free_engines can clean up properly if a context creation fails.
Also check the return of alloc_engines for NULL.
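
A minimal sketch of why the ordering matters, assuming free_engines()
simply hands e->num_engines to __free_engines() (the same pattern the
unpin_engines() helper uses later in this series):

	static void free_engines(struct i915_gem_engines *e)
	{
		/*
		 * If num_engines is still 0 when an error is hit, this
		 * walks zero entries and every context already stored
		 * in e->engines[] is leaked.
		 */
		__free_engines(e, e->num_engines);
	}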

v2:
 (Tvrtko)
  - Send as a standalone patch
 (John Harrison)
  - Check for alloc_engines returning NULL
v3:
 (Checkpatch / Tvrtko)
  - Remove braces around single line if statement

Cc: Jason Ekstrand <jason@jlekstrand.net>
Fixes: d4433c7600f7 ("drm/i915/gem: Use the proto-context to handle create parameters (v5)")
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 8208fd5b72c3..8c7ea6e56262 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -898,6 +898,10 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 	unsigned int n;
 
 	e = alloc_engines(num_engines);
+	if (!e)
+		return ERR_PTR(-ENOMEM);
+	e->num_engines = num_engines;
+
 	for (n = 0; n < num_engines; n++) {
 		struct intel_context *ce;
 		int ret;
@@ -931,7 +935,6 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 			goto free_engines;
 		}
 	}
-	e->num_engines = num_engines;
 
 	return e;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 17/26] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Introduce a 'set parallel submit' extension to connect the UAPI to the
GuC multi-lrc interface. The kernel doc in the new uAPI should explain
it all.
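
A rough userspace sketch of the new extension (engine class / instance
values are illustrative placeholders; this mirrors Example 1 from the
kernel doc below):

	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 2) = {
		.base = { .name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT },
		.engine_index = 0,
		.width = 2,		/* 2 BBs per execbuf */
		.num_siblings = 1,	/* 1 possible placement per BB */
		.engines = {
			{ .engine_class = I915_ENGINE_CLASS_RENDER,
			  .engine_instance = 0 },
			{ .engine_class = I915_ENGINE_CLASS_RENDER,
			  .engine_instance = 1 },
		},
	};
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
		.extensions = (uintptr_t)&parallel,
		.engines = { { I915_ENGINE_CLASS_INVALID,
			       I915_ENGINE_CLASS_INVALID_NONE } },
	};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id, /* context created beforehand */
		.param = I915_CONTEXT_PARAM_ENGINES,
		.value = (uintptr_t)&engines,
		.size = sizeof(engines),
	};
	/* ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param); */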

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: https://github.com/intel/media-driver/pull/1252

v2:
 (Daniel Vetter)
  - Add IGT link and placeholder for media UMD link
v3:
 (Kernel test robot)
  - Fix warning in unpin engines call
 (John Harrison)
  - Reword a bunch of the kernel doc

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 221 +++++++++++++++++-
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +-
 drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
 .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 ++++++++-
 include/uapi/drm/i915_drm.h                   | 131 +++++++++++
 9 files changed, 489 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 8c7ea6e56262..6290bc20ccb1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -522,9 +522,150 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
 	return 0;
 }
 
+static int
+set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
+				      void *data)
+{
+	struct i915_context_engines_parallel_submit __user *ext =
+		container_of_user(base, typeof(*ext), base);
+	const struct set_proto_ctx_engines *set = data;
+	struct drm_i915_private *i915 = set->i915;
+	u64 flags;
+	int err = 0, n, i, j;
+	u16 slot, width, num_siblings;
+	struct intel_engine_cs **siblings = NULL;
+	intel_engine_mask_t prev_mask;
+
+	/* Disabling for now */
+	return -ENODEV;
+
+	/* FIXME: This is NIY for execlists */
+	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
+		return -ENODEV;
+
+	if (get_user(slot, &ext->engine_index))
+		return -EFAULT;
+
+	if (get_user(width, &ext->width))
+		return -EFAULT;
+
+	if (get_user(num_siblings, &ext->num_siblings))
+		return -EFAULT;
+
+	if (slot >= set->num_engines) {
+		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
+			slot, set->num_engines);
+		return -EINVAL;
+	}
+
+	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
+		drm_dbg(&i915->drm,
+			"Invalid placement[%d], already occupied\n", slot);
+		return -EINVAL;
+	}
+
+	if (get_user(flags, &ext->flags))
+		return -EFAULT;
+
+	if (flags) {
+		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
+		return -EINVAL;
+	}
+
+	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
+		err = check_user_mbz(&ext->mbz64[n]);
+		if (err)
+			return err;
+	}
+
+	if (width < 2) {
+		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
+		return -EINVAL;
+	}
+
+	if (num_siblings < 1) {
+		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
+			num_siblings);
+		return -EINVAL;
+	}
+
+	siblings = kmalloc_array(num_siblings * width,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return -ENOMEM;
+
+	/* Create contexts / engines */
+	for (i = 0; i < width; ++i) {
+		intel_engine_mask_t current_mask = 0;
+		struct i915_engine_class_instance prev_engine;
+
+		for (j = 0; j < num_siblings; ++j) {
+			struct i915_engine_class_instance ci;
+
+			n = i * num_siblings + j;
+			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
+				err = -EFAULT;
+				goto out_err;
+			}
+
+			siblings[n] =
+				intel_engine_lookup_user(i915, ci.engine_class,
+							 ci.engine_instance);
+			if (!siblings[n]) {
+				drm_dbg(&i915->drm,
+					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
+					n, ci.engine_class, ci.engine_instance);
+				err = -EINVAL;
+				goto out_err;
+			}
+
+			if (n) {
+				if (prev_engine.engine_class !=
+				    ci.engine_class) {
+					drm_dbg(&i915->drm,
+						"Mismatched class %d, %d\n",
+						prev_engine.engine_class,
+						ci.engine_class);
+					err = -EINVAL;
+					goto out_err;
+				}
+			}
+
+			prev_engine = ci;
+			current_mask |= siblings[n]->logical_mask;
+		}
+
+		if (i > 0) {
+			if (current_mask != prev_mask << 1) {
+				drm_dbg(&i915->drm,
+					"Non contiguous logical mask 0x%x, 0x%x\n",
+					prev_mask, current_mask);
+				err = -EINVAL;
+				goto out_err;
+			}
+		}
+		prev_mask = current_mask;
+	}
+
+	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
+	set->engines[slot].num_siblings = num_siblings;
+	set->engines[slot].width = width;
+	set->engines[slot].siblings = siblings;
+
+	return 0;
+
+out_err:
+	kfree(siblings);
+
+	return err;
+}
+
 static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
 	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
 	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
+	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
+		set_proto_ctx_engines_parallel_submit,
 };
 
 static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
@@ -775,6 +916,25 @@ static int intel_context_set_gem(struct intel_context *ce,
 	return ret;
 }
 
+static void __unpin_engines(struct i915_gem_engines *e, unsigned int count)
+{
+	while (count--) {
+		struct intel_context *ce = e->engines[count], *child;
+
+		if (!ce || !test_bit(CONTEXT_PERMA_PIN, &ce->flags))
+			continue;
+
+		for_each_child(ce, child)
+			intel_context_unpin(child);
+		intel_context_unpin(ce);
+	}
+}
+
+static void unpin_engines(struct i915_gem_engines *e)
+{
+	__unpin_engines(e, e->num_engines);
+}
+
 static void __free_engines(struct i915_gem_engines *e, unsigned int count)
 {
 	while (count--) {
@@ -890,6 +1050,40 @@ static struct i915_gem_engines *default_engines(struct i915_gem_context *ctx,
 	return err;
 }
 
+static int perma_pin_contexts(struct intel_context *ce)
+{
+	struct intel_context *child;
+	int i = 0, j = 0, ret;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	ret = intel_context_pin(ce);
+	if (unlikely(ret))
+		return ret;
+
+	for_each_child(ce, child) {
+		ret = intel_context_pin(child);
+		if (unlikely(ret))
+			goto unwind;
+		++i;
+	}
+
+	set_bit(CONTEXT_PERMA_PIN, &ce->flags);
+
+	return 0;
+
+unwind:
+	intel_context_unpin(ce);
+	for_each_child(ce, child) {
+		if (j++ < i)
+			intel_context_unpin(child);
+		else
+			break;
+	}
+
+	return ret;
+}
+
 static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 					     unsigned int num_engines,
 					     struct i915_gem_proto_engine *pe)
@@ -903,7 +1097,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 	e->num_engines = num_engines;
 
 	for (n = 0; n < num_engines; n++) {
-		struct intel_context *ce;
+		struct intel_context *ce, *child;
 		int ret;
 
 		switch (pe[n].type) {
@@ -913,7 +1107,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 
 		case I915_GEM_ENGINE_TYPE_BALANCED:
 			ce = intel_engine_create_virtual(pe[n].siblings,
-							 pe[n].num_siblings);
+							 pe[n].num_siblings, 0);
+			break;
+
+		case I915_GEM_ENGINE_TYPE_PARALLEL:
+			ce = intel_engine_create_parallel(pe[n].siblings,
+							  pe[n].num_siblings,
+							  pe[n].width);
 			break;
 
 		case I915_GEM_ENGINE_TYPE_INVALID:
@@ -934,6 +1134,22 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 			err = ERR_PTR(ret);
 			goto free_engines;
 		}
+		for_each_child(ce, child) {
+			ret = intel_context_set_gem(child, ctx, pe->sseu);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto free_engines;
+			}
+		}
+
+		/* XXX: Must be done after setting gem context */
+		if (pe[n].type == I915_GEM_ENGINE_TYPE_PARALLEL) {
+			ret = perma_pin_contexts(ce);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto free_engines;
+			}
+		}
 	}
 
 	return e;
@@ -1173,6 +1389,7 @@ static void context_close(struct i915_gem_context *ctx)
 
 	/* Flush any concurrent set_engines() */
 	mutex_lock(&ctx->engines_mutex);
+	unpin_engines(__context_engines_static(ctx));
 	engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
 	i915_gem_context_set_closed(ctx);
 	mutex_unlock(&ctx->engines_mutex);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index c4617e4d9fa9..eb5f9b4f2d19 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -78,6 +78,9 @@ enum i915_gem_engine_type {
 
 	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
 	I915_GEM_ENGINE_TYPE_BALANCED,
+
+	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
+	I915_GEM_ENGINE_TYPE_PARALLEL,
 };
 
 /**
@@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
 	/** @num_siblings: Number of balanced siblings */
 	unsigned int num_siblings;
 
+	/** @width: Width of each sibling */
+	unsigned int width;
+
 	/** @siblings: Balanced siblings */
 	struct intel_engine_cs **siblings;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 8309d1141d0a..1d880303a7e4 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -55,9 +55,13 @@ struct intel_context_ops {
 	void (*reset)(struct intel_context *ce);
 	void (*destroy)(struct kref *kref);
 
-	/* virtual engine/context interface */
+	/* virtual/parallel engine/context interface */
 	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
-						unsigned int count);
+						unsigned int count,
+						unsigned long flags);
+	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
+						 unsigned int num_siblings,
+						 unsigned int width);
 	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
 					       unsigned int sibling);
 };
@@ -113,6 +117,7 @@ struct intel_context {
 #define CONTEXT_NOPREEMPT		8
 #define CONTEXT_LRCA_DIRTY		9
 #define CONTEXT_GUC_INIT		10
+#define CONTEXT_PERMA_PIN		11
 
 	struct {
 		u64 timeout_us;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 87579affb952..43f16a8347ee 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
 	return intel_engine_has_preemption(engine);
 }
 
+#define FORCE_VIRTUAL	BIT(0)
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count);
+			    unsigned int count, unsigned long flags);
+
+static inline struct intel_context *
+intel_engine_create_parallel(struct intel_engine_cs **engines,
+			     unsigned int num_engines,
+			     unsigned int width)
+{
+	GEM_BUG_ON(!engines[0]->cops->create_parallel);
+	return engines[0]->cops->create_parallel(engines, num_engines, width);
+}
 
 static inline bool
 intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2eb798ad068b..ff6753ccb129 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1953,16 +1953,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count)
+			    unsigned int count, unsigned long flags)
 {
 	if (count == 0)
 		return ERR_PTR(-EINVAL);
 
-	if (count == 1)
+	if (count == 1 && !(flags & FORCE_VIRTUAL))
 		return intel_context_create(siblings[0]);
 
 	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
-	return siblings[0]->cops->create_virtual(siblings, count);
+	return siblings[0]->cops->create_virtual(siblings, count, flags);
 }
 
 struct i915_request *
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 5ed1e222c308..8d7f571029df 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags);
 
 static struct i915_request *
 __active_request(const struct intel_timeline * const tl,
@@ -3784,7 +3785,8 @@ static void virtual_submit_request(struct i915_request *rq)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags)
 {
 	struct virtual_engine *ve;
 	unsigned int n;
diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index b3863abc51f5..74986b094b96 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
 	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
 
 	for (n = 0; n < nctx; n++) {
-		ve[n] = intel_engine_create_virtual(siblings, nsibling);
+		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ve[n])) {
 			err = PTR_ERR(ve[n]);
 			nctx = n;
@@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
 	 * restrict it to our desired engine within the virtual engine.
 	 */
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_close;
@@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
 		i915_request_add(rq);
 	}
 
-	ce = intel_engine_create_virtual(siblings, nsibling);
+	ce = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ce)) {
 		err = PTR_ERR(ce);
 		goto out;
@@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
 
 	/* XXX We do not handle oversubscription and fairness with normal rq */
 	for (n = 0; n < nsibling; n++) {
-		ce = intel_engine_create_virtual(siblings, nsibling);
+		ce = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ce)) {
 			err = PTR_ERR(ce);
 			goto out;
@@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
 	if (err)
 		goto out_scratch;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_scratch;
@@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
 	if (igt_spinner_init(&spin, gt))
 		return -ENOMEM;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_spin;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f69e984683aa..9b19e0d830a2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -124,7 +124,13 @@ struct guc_virtual_engine {
 };
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags);
+
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
@@ -2611,6 +2617,7 @@ static const struct intel_context_ops guc_context_ops = {
 	.destroy = guc_context_destroy,
 
 	.create_virtual = guc_create_virtual,
+	.create_parallel = guc_create_parallel,
 };
 
 static void submit_work_cb(struct irq_work *wrk)
@@ -2860,8 +2867,6 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 	.get_sibling = guc_virtual_get_sibling,
 };
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
@@ -2878,8 +2883,6 @@ static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
 	return __guc_context_pin(ce, engine, vaddr);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
@@ -2891,8 +2894,6 @@ static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
 	return __guc_context_pin(ce, engine, vaddr);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_parent_context_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
@@ -2908,8 +2909,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 	lrc_unpin(ce);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_child_context_unpin(struct intel_context *ce)
 {
 	GEM_BUG_ON(context_enabled(ce));
@@ -2920,8 +2919,6 @@ static void guc_child_context_unpin(struct intel_context *ce)
 	lrc_unpin(ce);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_child_context_post_unpin(struct intel_context *ce)
 {
 	GEM_BUG_ON(!intel_context_is_child(ce));
@@ -2932,6 +2929,98 @@ static void guc_child_context_post_unpin(struct intel_context *ce)
 	intel_context_unpin(ce->parallel.parent);
 }
 
+static void guc_child_context_destroy(struct kref *kref)
+{
+	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
+
+	__guc_context_destroy(ce);
+}
+
+static const struct intel_context_ops virtual_parent_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.pre_pin = guc_context_pre_pin,
+	.pin = guc_parent_context_pin,
+	.unpin = guc_parent_context_unpin,
+	.post_unpin = guc_context_post_unpin,
+
+	.ban = guc_context_ban,
+
+	.cancel_request = guc_context_cancel_request,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.sched_disable = guc_context_sched_disable,
+
+	.destroy = guc_context_destroy,
+
+	.get_sibling = guc_virtual_get_sibling,
+};
+
+static const struct intel_context_ops virtual_child_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.pre_pin = guc_context_pre_pin,
+	.pin = guc_child_context_pin,
+	.unpin = guc_child_context_unpin,
+	.post_unpin = guc_child_context_post_unpin,
+
+	.cancel_request = guc_context_cancel_request,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.destroy = guc_child_context_destroy,
+
+	.get_sibling = guc_virtual_get_sibling,
+};
+
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width)
+{
+	struct intel_engine_cs **siblings = NULL;
+	struct intel_context *parent = NULL, *ce, *err;
+	int i, j;
+
+	siblings = kmalloc_array(num_siblings,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < width; ++i) {
+		for (j = 0; j < num_siblings; ++j)
+			siblings[j] = engines[i * num_siblings + j];
+
+		ce = intel_engine_create_virtual(siblings, num_siblings,
+						 FORCE_VIRTUAL);
+		if (IS_ERR(ce)) {
+			err = ce;
+			goto unwind;
+		}
+
+		if (i == 0) {
+			parent = ce;
+			parent->ops = &virtual_parent_context_ops;
+		} else {
+			ce->ops = &virtual_child_context_ops;
+			intel_context_bind_parent_child(parent, ce);
+		}
+	}
+
+	kfree(siblings);
+	return parent;
+
+unwind:
+	if (parent)
+		intel_context_put(parent);
+	kfree(siblings);
+	return err;
+}
+
 static bool
 guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
 {
@@ -3759,7 +3848,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 }
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags)
 {
 	struct guc_virtual_engine *ve;
 	struct intel_guc *guc;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index b1248a67b4f8..f7c19e5464ae 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1824,6 +1824,7 @@ struct drm_i915_gem_context_param {
  * Extensions:
  *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
  *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
+ *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
  */
 #define I915_CONTEXT_PARAM_ENGINES	0xa
 
@@ -2049,6 +2050,135 @@ struct i915_context_engines_bond {
 	struct i915_engine_class_instance engines[N__]; \
 } __attribute__((packed)) name__
 
+/**
+ * struct i915_context_engines_parallel_submit - Configure engine for
+ * parallel submission.
+ *
+ * Set up a slot in the context engine map to allow multiple BBs to be
+ * submitted in a single execbuf IOCTL. Those BBs will then be scheduled to run
+ * on the GPU in parallel. Multiple hardware contexts are created internally in
+ * the i915 to run these BBs. Once a slot is configured for N BBs only N BBs
+ * can be submitted in each execbuf IOCTL and this is implicit behavior, e.g.
+ * the user doesn't tell the execbuf IOCTL there are N BBs; the execbuf IOCTL
+ * knows how many BBs there are based on the slot's configuration. The N BBs
+ * are the last N buffer objects or the first N if I915_EXEC_BATCH_FIRST is
+ * set.
+ *
+ * The default placement behavior is to create implicit bonds between each
+ * context if each context maps to more than 1 physical engine (e.g. the
+ * context is a virtual engine). Also, we only allow contexts of the same
+ * engine class, and these contexts must be in logically contiguous order.
+ * Examples of the placement behavior are described below. Lastly, the default
+ * is to not allow BBs to be preempted mid-batch. Rather, coordinated
+ * preemption points are inserted on all hardware contexts between each set of
+ * BBs. Flags could be added in the future to change both of these default
+ * behaviors.
+ *
+ * Returns -EINVAL if hardware context placement configuration is invalid or if
+ * the placement configuration isn't supported on the platform / submission
+ * interface.
+ * Returns -ENODEV if extension isn't supported on the platform / submission
+ * interface.
+ *
+ * .. code-block:: none
+ *
+ *	Example syntax:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *
+ *	Example 1 pseudo code:
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=1,
+ *		     engines=CS[0],CS[1])
+ *
+ *	Results in the following valid placement:
+ *	CS[0], CS[1]
+ *
+ *	Example 2 pseudo code:
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[2],CS[1],CS[3])
+ *
+ *	Results in the following valid placements:
+ *	CS[0], CS[1]
+ *	CS[2], CS[3]
+ *
+ *	This can be thought of as two virtual engines, each containing two
+ *	engines thereby making a 2D array. However, there are bonds tying the
+ *	entries together and placing restrictions on how they can be scheduled.
+ *	Specifically, the scheduler can choose only vertical columns from the 2D
+ *	array. That is, CS[0] is bonded to CS[1] and CS[2] to CS[3]. So if the
+ *	scheduler wants to submit to CS[0], it must also choose CS[1] and vice
+ *	versa. Same for CS[2] requires also using CS[3].
+ *	VE[0] = CS[0], CS[2]
+ *	VE[1] = CS[1], CS[3]
+ *
+ *	Example 3 pseudo code:
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[1],CS[1],CS[3])
+ *
+ *	Results in the following valid and invalid placements:
+ *	CS[0], CS[1]
+ *	CS[1], CS[3] - Not logically contiguous, return -EINVAL
+ */
+struct i915_context_engines_parallel_submit {
+	/**
+	 * @base: base user extension.
+	 */
+	struct i915_user_extension base;
+
+	/**
+	 * @engine_index: slot for parallel engine
+	 */
+	__u16 engine_index;
+
+	/**
+	 * @width: number of contexts per parallel engine or in other words the
+	 * number of batches in each submission
+	 */
+	__u16 width;
+
+	/**
+	 * @num_siblings: number of siblings per context, or in other words, the
+	 * number of possible placements for each submission
+	 */
+	__u16 num_siblings;
+
+	/**
+	 * @mbz16: reserved for future use; must be zero
+	 */
+	__u16 mbz16;
+
+	/**
+	 * @flags: all undefined flags must be zero; no flags are currently defined
+	 */
+	__u64 flags;
+
+	/**
+	 * @mbz64: reserved for future use; must be zero
+	 */
+	__u64 mbz64[3];
+
+	/**
+	 * @engines: 2-d array of engine instances to configure parallel engine
+	 *
+	 * length = width (i) * num_siblings (j)
+	 * index = j + i * num_siblings
+	 */
+	struct i915_engine_class_instance engines[0];
+
+} __packed;
+
+#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
+	struct i915_user_extension base; \
+	__u16 engine_index; \
+	__u16 width; \
+	__u16 num_siblings; \
+	__u16 mbz16; \
+	__u64 flags; \
+	__u64 mbz64[3]; \
+	struct i915_engine_class_instance engines[N__]; \
+} __attribute__((packed)) name__
+
 /**
  * DOC: Context Engine Map uAPI
  *
@@ -2108,6 +2238,7 @@ struct i915_context_param_engines {
 	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
 #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
 #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
+#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
 	struct i915_engine_class_instance engines[0];
 } __attribute__((packed));
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [Intel-gfx] [PATCH 17/26] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
@ 2021-10-04 22:06   ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Introduce 'set parallel submit' extension to connect UAPI to GuC
multi-lrc interface. Kernel doc in new uAPI should explain it all.

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: https://github.com/intel/media-driver/pull/1252

v2:
 (Daniel Vetter)
  - Add IGT link and placeholder for media UMD link
v3:
 (Kernel test robot)
  - Fix warning in unpin engines call
 (John Harrison)
  - Reword a bunch of the kernel doc

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 221 +++++++++++++++++-
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +-
 drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
 .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 ++++++++-
 include/uapi/drm/i915_drm.h                   | 131 +++++++++++
 9 files changed, 489 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 8c7ea6e56262..6290bc20ccb1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -522,9 +522,150 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
 	return 0;
 }
 
+static int
+set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
+				      void *data)
+{
+	struct i915_context_engines_parallel_submit __user *ext =
+		container_of_user(base, typeof(*ext), base);
+	const struct set_proto_ctx_engines *set = data;
+	struct drm_i915_private *i915 = set->i915;
+	u64 flags;
+	int err = 0, n, i, j;
+	u16 slot, width, num_siblings;
+	struct intel_engine_cs **siblings = NULL;
+	intel_engine_mask_t prev_mask;
+
+	/* Disabling for now */
+	return -ENODEV;
+
+	/* FIXME: This is NIY for execlists */
+	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
+		return -ENODEV;
+
+	if (get_user(slot, &ext->engine_index))
+		return -EFAULT;
+
+	if (get_user(width, &ext->width))
+		return -EFAULT;
+
+	if (get_user(num_siblings, &ext->num_siblings))
+		return -EFAULT;
+
+	if (slot >= set->num_engines) {
+		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
+			slot, set->num_engines);
+		return -EINVAL;
+	}
+
+	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
+		drm_dbg(&i915->drm,
+			"Invalid placement[%d], already occupied\n", slot);
+		return -EINVAL;
+	}
+
+	if (get_user(flags, &ext->flags))
+		return -EFAULT;
+
+	if (flags) {
+		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
+		return -EINVAL;
+	}
+
+	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
+		err = check_user_mbz(&ext->mbz64[n]);
+		if (err)
+			return err;
+	}
+
+	if (width < 2) {
+		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
+		return -EINVAL;
+	}
+
+	if (num_siblings < 1) {
+		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
+			num_siblings);
+		return -EINVAL;
+	}
+
+	siblings = kmalloc_array(num_siblings * width,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return -ENOMEM;
+
+	/* Create contexts / engines */
+	for (i = 0; i < width; ++i) {
+		intel_engine_mask_t current_mask = 0;
+		struct i915_engine_class_instance prev_engine;
+
+		for (j = 0; j < num_siblings; ++j) {
+			struct i915_engine_class_instance ci;
+
+			n = i * num_siblings + j;
+			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
+				err = -EFAULT;
+				goto out_err;
+			}
+
+			siblings[n] =
+				intel_engine_lookup_user(i915, ci.engine_class,
+							 ci.engine_instance);
+			if (!siblings[n]) {
+				drm_dbg(&i915->drm,
+					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
+					n, ci.engine_class, ci.engine_instance);
+				err = -EINVAL;
+				goto out_err;
+			}
+
+			if (n) {
+				if (prev_engine.engine_class !=
+				    ci.engine_class) {
+					drm_dbg(&i915->drm,
+						"Mismatched class %d, %d\n",
+						prev_engine.engine_class,
+						ci.engine_class);
+					err = -EINVAL;
+					goto out_err;
+				}
+			}
+
+			prev_engine = ci;
+			current_mask |= siblings[n]->logical_mask;
+		}
+
+		if (i > 0) {
+			if (current_mask != prev_mask << 1) {
+				drm_dbg(&i915->drm,
+					"Non contiguous logical mask 0x%x, 0x%x\n",
+					prev_mask, current_mask);
+				err = -EINVAL;
+				goto out_err;
+			}
+		}
+		prev_mask = current_mask;
+	}
+
+	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
+	set->engines[slot].num_siblings = num_siblings;
+	set->engines[slot].width = width;
+	set->engines[slot].siblings = siblings;
+
+	return 0;
+
+out_err:
+	kfree(siblings);
+
+	return err;
+}
+
 static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
 	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
 	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
+	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
+		set_proto_ctx_engines_parallel_submit,
 };
 
 static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
@@ -775,6 +916,25 @@ static int intel_context_set_gem(struct intel_context *ce,
 	return ret;
 }
 
+static void __unpin_engines(struct i915_gem_engines *e, unsigned int count)
+{
+	while (count--) {
+		struct intel_context *ce = e->engines[count], *child;
+
+		if (!ce || !test_bit(CONTEXT_PERMA_PIN, &ce->flags))
+			continue;
+
+		for_each_child(ce, child)
+			intel_context_unpin(child);
+		intel_context_unpin(ce);
+	}
+}
+
+static void unpin_engines(struct i915_gem_engines *e)
+{
+	__unpin_engines(e, e->num_engines);
+}
+
 static void __free_engines(struct i915_gem_engines *e, unsigned int count)
 {
 	while (count--) {
@@ -890,6 +1050,40 @@ static struct i915_gem_engines *default_engines(struct i915_gem_context *ctx,
 	return err;
 }
 
+static int perma_pin_contexts(struct intel_context *ce)
+{
+	struct intel_context *child;
+	int i = 0, j = 0, ret;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	ret = intel_context_pin(ce);
+	if (unlikely(ret))
+		return ret;
+
+	for_each_child(ce, child) {
+		ret = intel_context_pin(child);
+		if (unlikely(ret))
+			goto unwind;
+		++i;
+	}
+
+	set_bit(CONTEXT_PERMA_PIN, &ce->flags);
+
+	return 0;
+
+unwind:
+	intel_context_unpin(ce);
+	for_each_child(ce, child) {
+		if (j++ < i)
+			intel_context_unpin(child);
+		else
+			break;
+	}
+
+	return ret;
+}
+
 static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 					     unsigned int num_engines,
 					     struct i915_gem_proto_engine *pe)
@@ -903,7 +1097,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 	e->num_engines = num_engines;
 
 	for (n = 0; n < num_engines; n++) {
-		struct intel_context *ce;
+		struct intel_context *ce, *child;
 		int ret;
 
 		switch (pe[n].type) {
@@ -913,7 +1107,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 
 		case I915_GEM_ENGINE_TYPE_BALANCED:
 			ce = intel_engine_create_virtual(pe[n].siblings,
-							 pe[n].num_siblings);
+							 pe[n].num_siblings, 0);
+			break;
+
+		case I915_GEM_ENGINE_TYPE_PARALLEL:
+			ce = intel_engine_create_parallel(pe[n].siblings,
+							  pe[n].num_siblings,
+							  pe[n].width);
 			break;
 
 		case I915_GEM_ENGINE_TYPE_INVALID:
@@ -934,6 +1134,22 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 			err = ERR_PTR(ret);
 			goto free_engines;
 		}
+		for_each_child(ce, child) {
+			ret = intel_context_set_gem(child, ctx, pe->sseu);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto free_engines;
+			}
+		}
+
+		/* XXX: Must be done after setting gem context */
+		if (pe[n].type == I915_GEM_ENGINE_TYPE_PARALLEL) {
+			ret = perma_pin_contexts(ce);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto free_engines;
+			}
+		}
 	}
 
 	return e;
@@ -1173,6 +1389,7 @@ static void context_close(struct i915_gem_context *ctx)
 
 	/* Flush any concurrent set_engines() */
 	mutex_lock(&ctx->engines_mutex);
+	unpin_engines(__context_engines_static(ctx));
 	engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
 	i915_gem_context_set_closed(ctx);
 	mutex_unlock(&ctx->engines_mutex);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index c4617e4d9fa9..eb5f9b4f2d19 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -78,6 +78,9 @@ enum i915_gem_engine_type {
 
 	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
 	I915_GEM_ENGINE_TYPE_BALANCED,
+
+	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
+	I915_GEM_ENGINE_TYPE_PARALLEL,
 };
 
 /**
@@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
 	/** @num_siblings: Number of balanced siblings */
 	unsigned int num_siblings;
 
+	/** @width: Width of each sibling */
+	unsigned int width;
+
 	/** @siblings: Balanced siblings */
 	struct intel_engine_cs **siblings;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 8309d1141d0a..1d880303a7e4 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -55,9 +55,13 @@ struct intel_context_ops {
 	void (*reset)(struct intel_context *ce);
 	void (*destroy)(struct kref *kref);
 
-	/* virtual engine/context interface */
+	/* virtual/parallel engine/context interface */
 	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
-						unsigned int count);
+						unsigned int count,
+						unsigned long flags);
+	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
+						 unsigned int num_siblings,
+						 unsigned int width);
 	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
 					       unsigned int sibling);
 };
@@ -113,6 +117,7 @@ struct intel_context {
 #define CONTEXT_NOPREEMPT		8
 #define CONTEXT_LRCA_DIRTY		9
 #define CONTEXT_GUC_INIT		10
+#define CONTEXT_PERMA_PIN		11
 
 	struct {
 		u64 timeout_us;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 87579affb952..43f16a8347ee 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
 	return intel_engine_has_preemption(engine);
 }
 
+#define FORCE_VIRTUAL	BIT(0)
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count);
+			    unsigned int count, unsigned long flags);
+
+static inline struct intel_context *
+intel_engine_create_parallel(struct intel_engine_cs **engines,
+			     unsigned int num_engines,
+			     unsigned int width)
+{
+	GEM_BUG_ON(!engines[0]->cops->create_parallel);
+	return engines[0]->cops->create_parallel(engines, num_engines, width);
+}
 
 static inline bool
 intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2eb798ad068b..ff6753ccb129 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1953,16 +1953,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count)
+			    unsigned int count, unsigned long flags)
 {
 	if (count == 0)
 		return ERR_PTR(-EINVAL);
 
-	if (count == 1)
+	if (count == 1 && !(flags & FORCE_VIRTUAL))
 		return intel_context_create(siblings[0]);
 
 	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
-	return siblings[0]->cops->create_virtual(siblings, count);
+	return siblings[0]->cops->create_virtual(siblings, count, flags);
 }
 
 struct i915_request *
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 5ed1e222c308..8d7f571029df 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags);
 
 static struct i915_request *
 __active_request(const struct intel_timeline * const tl,
@@ -3784,7 +3785,8 @@ static void virtual_submit_request(struct i915_request *rq)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags)
 {
 	struct virtual_engine *ve;
 	unsigned int n;
diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index b3863abc51f5..74986b094b96 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
 	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
 
 	for (n = 0; n < nctx; n++) {
-		ve[n] = intel_engine_create_virtual(siblings, nsibling);
+		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ve[n])) {
 			err = PTR_ERR(ve[n]);
 			nctx = n;
@@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
 	 * restrict it to our desired engine within the virtual engine.
 	 */
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_close;
@@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
 		i915_request_add(rq);
 	}
 
-	ce = intel_engine_create_virtual(siblings, nsibling);
+	ce = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ce)) {
 		err = PTR_ERR(ce);
 		goto out;
@@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
 
 	/* XXX We do not handle oversubscription and fairness with normal rq */
 	for (n = 0; n < nsibling; n++) {
-		ce = intel_engine_create_virtual(siblings, nsibling);
+		ce = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ce)) {
 			err = PTR_ERR(ce);
 			goto out;
@@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
 	if (err)
 		goto out_scratch;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_scratch;
@@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
 	if (igt_spinner_init(&spin, gt))
 		return -ENOMEM;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_spin;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f69e984683aa..9b19e0d830a2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -124,7 +124,13 @@ struct guc_virtual_engine {
 };
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags);
+
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
@@ -2611,6 +2617,7 @@ static const struct intel_context_ops guc_context_ops = {
 	.destroy = guc_context_destroy,
 
 	.create_virtual = guc_create_virtual,
+	.create_parallel = guc_create_parallel,
 };
 
 static void submit_work_cb(struct irq_work *wrk)
@@ -2860,8 +2867,6 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 	.get_sibling = guc_virtual_get_sibling,
 };
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
@@ -2878,8 +2883,6 @@ static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
 	return __guc_context_pin(ce, engine, vaddr);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
@@ -2891,8 +2894,6 @@ static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
 	return __guc_context_pin(ce, engine, vaddr);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_parent_context_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
@@ -2908,8 +2909,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 	lrc_unpin(ce);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_child_context_unpin(struct intel_context *ce)
 {
 	GEM_BUG_ON(context_enabled(ce));
@@ -2920,8 +2919,6 @@ static void guc_child_context_unpin(struct intel_context *ce)
 	lrc_unpin(ce);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_child_context_post_unpin(struct intel_context *ce)
 {
 	GEM_BUG_ON(!intel_context_is_child(ce));
@@ -2932,6 +2929,98 @@ static void guc_child_context_post_unpin(struct intel_context *ce)
 	intel_context_unpin(ce->parallel.parent);
 }
 
+static void guc_child_context_destroy(struct kref *kref)
+{
+	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
+
+	__guc_context_destroy(ce);
+}
+
+static const struct intel_context_ops virtual_parent_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.pre_pin = guc_context_pre_pin,
+	.pin = guc_parent_context_pin,
+	.unpin = guc_parent_context_unpin,
+	.post_unpin = guc_context_post_unpin,
+
+	.ban = guc_context_ban,
+
+	.cancel_request = guc_context_cancel_request,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.sched_disable = guc_context_sched_disable,
+
+	.destroy = guc_context_destroy,
+
+	.get_sibling = guc_virtual_get_sibling,
+};
+
+static const struct intel_context_ops virtual_child_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.pre_pin = guc_context_pre_pin,
+	.pin = guc_child_context_pin,
+	.unpin = guc_child_context_unpin,
+	.post_unpin = guc_child_context_post_unpin,
+
+	.cancel_request = guc_context_cancel_request,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.destroy = guc_child_context_destroy,
+
+	.get_sibling = guc_virtual_get_sibling,
+};
+
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width)
+{
+	struct intel_engine_cs **siblings = NULL;
+	struct intel_context *parent = NULL, *ce, *err;
+	int i, j;
+
+	siblings = kmalloc_array(num_siblings,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < width; ++i) {
+		for (j = 0; j < num_siblings; ++j)
+			siblings[j] = engines[i * num_siblings + j];
+
+		ce = intel_engine_create_virtual(siblings, num_siblings,
+						 FORCE_VIRTUAL);
+		if (IS_ERR(ce)) {
+			err = ce;
+			goto unwind;
+		}
+
+		if (i == 0) {
+			parent = ce;
+			parent->ops = &virtual_parent_context_ops;
+		} else {
+			ce->ops = &virtual_child_context_ops;
+			intel_context_bind_parent_child(parent, ce);
+		}
+	}
+
+	kfree(siblings);
+	return parent;
+
+unwind:
+	if (parent)
+		intel_context_put(parent);
+	kfree(siblings);
+	return err;
+}
+
 static bool
 guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
 {
@@ -3759,7 +3848,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 }
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags)
 {
 	struct guc_virtual_engine *ve;
 	struct intel_guc *guc;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index b1248a67b4f8..f7c19e5464ae 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1824,6 +1824,7 @@ struct drm_i915_gem_context_param {
  * Extensions:
  *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
  *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
+ *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
  */
 #define I915_CONTEXT_PARAM_ENGINES	0xa
 
@@ -2049,6 +2050,135 @@ struct i915_context_engines_bond {
 	struct i915_engine_class_instance engines[N__]; \
 } __attribute__((packed)) name__
 
+/**
+ * struct i915_context_engines_parallel_submit - Configure engine for
+ * parallel submission.
+ *
+ * Setup a slot in the context engine map to allow multiple BBs to be submitted
+ * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
+ * in parallel. Multiple hardware contexts are created internally in the i915 to
+ * run these BBs. Once a slot is configured for N BBs only N BBs can be
+ * submitted in each execbuf IOCTL and this is implicit behavior, i.e. the user
+ * doesn't tell the execbuf IOCTL there are N BBs; the execbuf IOCTL knows how
+ * many BBs there are based on the slot's configuration. The N BBs are the last
+ * N buffer objects or the first N if I915_EXEC_BATCH_FIRST is set.
+ *
+ * The default placement behavior is to create implicit bonds between each
+ * context if each context maps to more than 1 physical engine (e.g. context is
+ * a virtual engine). We also only allow contexts of the same engine class, and
+ * these contexts must be in logically contiguous order. Examples of the placement
+ * behavior are described below. Lastly, the default is to not allow BBs to be
+ * preempted mid-batch. Rather insert coordinated preemption points on all
+ * hardware contexts between each set of BBs. Flags could be added in the future
+ * to change both of these default behaviors.
+ *
+ * Returns -EINVAL if hardware context placement configuration is invalid or if
+ * the placement configuration isn't supported on the platform / submission
+ * interface.
+ * Returns -ENODEV if extension isn't supported on the platform / submission
+ * interface.
+ *
+ * .. code-block:: none
+ *
+ *	Examples syntax:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *
+ *	Example 1 pseudo code:
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=1,
+ *		     engines=CS[0],CS[1])
+ *
+ *	Results in the following valid placement:
+ *	CS[0], CS[1]
+ *
+ *	Example 2 pseudo code:
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[2],CS[1],CS[3])
+ *
+ *	Results in the following valid placements:
+ *	CS[0], CS[1]
+ *	CS[2], CS[3]
+ *
+ *	This can be thought of as two virtual engines, each containing two
+ *	engines thereby making a 2D array. However, there are bonds tying the
+ *	entries together and placing restrictions on how they can be scheduled.
+ *	Specifically, the scheduler can choose only vertical columns from the 2D
+ *	array. That is, CS[0] is bonded to CS[1] and CS[2] to CS[3]. So if the
+ *	scheduler wants to submit to CS[0], it must also choose CS[1] and vice
+ *	versa. Likewise, choosing CS[2] also requires choosing CS[3].
+ *	VE[0] = CS[0], CS[2]
+ *	VE[1] = CS[1], CS[3]
+ *
+ *	Example 3 pseudo code:
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[1],CS[1],CS[3])
+ *
+ *	Results in the following valid and invalid placements:
+ *	CS[0], CS[1]
+ *	CS[1], CS[3] - Not logically contiguous, return -EINVAL
+ */
+struct i915_context_engines_parallel_submit {
+	/**
+	 * @base: base user extension.
+	 */
+	struct i915_user_extension base;
+
+	/**
+	 * @engine_index: slot for parallel engine
+	 */
+	__u16 engine_index;
+
+	/**
+	 * @width: number of contexts per parallel engine or in other words the
+	 * number of batches in each submission
+	 */
+	__u16 width;
+
+	/**
+	 * @num_siblings: number of siblings per context or in other words the
+	 * number of possible placements for each submission
+	 */
+	__u16 num_siblings;
+
+	/**
+	 * @mbz16: reserved for future use; must be zero
+	 */
+	__u16 mbz16;
+
+	/**
+	 * @flags: all undefined flags must be zero; currently no flags are defined
+	 */
+	__u64 flags;
+
+	/**
+	 * @mbz64: reserved for future use; must be zero
+	 */
+	__u64 mbz64[3];
+
+	/**
+	 * @engines: 2-d array of engine instances to configure parallel engine
+	 *
+	 * length = width (i) * num_siblings (j)
+	 * index = j + i * num_siblings
+	 */
+	struct i915_engine_class_instance engines[0];
+
+} __packed;
+
+#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
+	struct i915_user_extension base; \
+	__u16 engine_index; \
+	__u16 width; \
+	__u16 num_siblings; \
+	__u16 mbz16; \
+	__u64 flags; \
+	__u64 mbz64[3]; \
+	struct i915_engine_class_instance engines[N__]; \
+} __attribute__((packed)) name__
+
 /**
  * DOC: Context Engine Map uAPI
  *
@@ -2108,6 +2238,7 @@ struct i915_context_param_engines {
 	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
 #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
 #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
+#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
 	struct i915_engine_class_instance engines[0];
 } __attribute__((packed));
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread
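
For illustration, a minimal userspace sketch (not part of this series) of
configuring engine-map slot 0 for 2-wide parallel submission, mirroring
Example 2 in the kernel-doc above. It only uses the uAPI added by this patch
plus the existing context-create extensions; the four VCS logical instances,
the open DRM fd, and the omitted error handling are assumptions of the sketch:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static uint32_t create_parallel_ctx(int fd)
{
	/* width=2, num_siblings=2: engines[] index = j + i * num_siblings */
	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 4) = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.engine_index = 0,
		.width = 2,
		.num_siblings = 2,
		.engines[0] = { I915_ENGINE_CLASS_VIDEO, 0 },
		.engines[1] = { I915_ENGINE_CLASS_VIDEO, 2 },
		.engines[2] = { I915_ENGINE_CLASS_VIDEO, 1 },
		.engines[3] = { I915_ENGINE_CLASS_VIDEO, 3 },
	};
	/* set_engines(INVALID): slot 0 is filled in by the extension above */
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
		.extensions = (uintptr_t)&parallel,
		.engines[0] = { I915_ENGINE_CLASS_INVALID,
				I915_ENGINE_CLASS_INVALID_NONE },
	};
	struct drm_i915_gem_context_create_ext_setparam p_engines = {
		.base.name = I915_CONTEXT_CREATE_EXT_SETPARAM,
		.param = {
			.param = I915_CONTEXT_PARAM_ENGINES,
			.value = (uintptr_t)&engines,
			.size = sizeof(engines),
		},
	};
	struct drm_i915_gem_context_create_ext create = {
		.flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
		.extensions = (uintptr_t)&p_engines,
	};

	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
	return create.ctx_id;
}

Each execbuf2 call on slot 0 of such a context then carries exactly two batch
buffers (the last two exec objects, or the first two if I915_EXEC_BATCH_FIRST
is set), per the kernel-doc above.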

* [PATCH 18/26] drm/i915/doc: Update parallel submit doc to point to i915_drm.h
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Update parallel submit doc to point to i915_drm.h

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
---
 Documentation/gpu/rfc/i915_parallel_execbuf.h | 122 ------------------
 Documentation/gpu/rfc/i915_scheduler.rst      |   4 +-
 2 files changed, 2 insertions(+), 124 deletions(-)
 delete mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h

diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h b/Documentation/gpu/rfc/i915_parallel_execbuf.h
deleted file mode 100644
index 8cbe2c4e0172..000000000000
--- a/Documentation/gpu/rfc/i915_parallel_execbuf.h
+++ /dev/null
@@ -1,122 +0,0 @@
-/* SPDX-License-Identifier: MIT */
-/*
- * Copyright © 2021 Intel Corporation
- */
-
-#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
-
-/**
- * struct drm_i915_context_engines_parallel_submit - Configure engine for
- * parallel submission.
- *
- * Setup a slot in the context engine map to allow multiple BBs to be submitted
- * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
- * in parallel. Multiple hardware contexts are created internally in the i915
- * run these BBs. Once a slot is configured for N BBs only N BBs can be
- * submitted in each execbuf IOCTL and this is implicit behavior e.g. The user
- * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
- * many BBs there are based on the slot's configuration. The N BBs are the last
- * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
- *
- * The default placement behavior is to create implicit bonds between each
- * context if each context maps to more than 1 physical engine (e.g. context is
- * a virtual engine). Also we only allow contexts of same engine class and these
- * contexts must be in logically contiguous order. Examples of the placement
- * behavior described below. Lastly, the default is to not allow BBs to
- * preempted mid BB rather insert coordinated preemption on all hardware
- * contexts between each set of BBs. Flags may be added in the future to change
- * both of these default behaviors.
- *
- * Returns -EINVAL if hardware context placement configuration is invalid or if
- * the placement configuration isn't supported on the platform / submission
- * interface.
- * Returns -ENODEV if extension isn't supported on the platform / submission
- * interface.
- *
- * .. code-block:: none
- *
- *	Example 1 pseudo code:
- *	CS[X] = generic engine of same class, logical instance X
- *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
- *	set_engines(INVALID)
- *	set_parallel(engine_index=0, width=2, num_siblings=1,
- *		     engines=CS[0],CS[1])
- *
- *	Results in the following valid placement:
- *	CS[0], CS[1]
- *
- *	Example 2 pseudo code:
- *	CS[X] = generic engine of same class, logical instance X
- *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
- *	set_engines(INVALID)
- *	set_parallel(engine_index=0, width=2, num_siblings=2,
- *		     engines=CS[0],CS[2],CS[1],CS[3])
- *
- *	Results in the following valid placements:
- *	CS[0], CS[1]
- *	CS[2], CS[3]
- *
- *	This can also be thought of as 2 virtual engines described by 2-D array
- *	in the engines the field with bonds placed between each index of the
- *	virtual engines. e.g. CS[0] is bonded to CS[1], CS[2] is bonded to
- *	CS[3].
- *	VE[0] = CS[0], CS[2]
- *	VE[1] = CS[1], CS[3]
- *
- *	Example 3 pseudo code:
- *	CS[X] = generic engine of same class, logical instance X
- *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
- *	set_engines(INVALID)
- *	set_parallel(engine_index=0, width=2, num_siblings=2,
- *		     engines=CS[0],CS[1],CS[1],CS[3])
- *
- *	Results in the following valid and invalid placements:
- *	CS[0], CS[1]
- *	CS[1], CS[3] - Not logical contiguous, return -EINVAL
- */
-struct drm_i915_context_engines_parallel_submit {
-	/**
-	 * @base: base user extension.
-	 */
-	struct i915_user_extension base;
-
-	/**
-	 * @engine_index: slot for parallel engine
-	 */
-	__u16 engine_index;
-
-	/**
-	 * @width: number of contexts per parallel engine
-	 */
-	__u16 width;
-
-	/**
-	 * @num_siblings: number of siblings per context
-	 */
-	__u16 num_siblings;
-
-	/**
-	 * @mbz16: reserved for future use; must be zero
-	 */
-	__u16 mbz16;
-
-	/**
-	 * @flags: all undefined flags must be zero, currently not defined flags
-	 */
-	__u64 flags;
-
-	/**
-	 * @mbz64: reserved for future use; must be zero
-	 */
-	__u64 mbz64[3];
-
-	/**
-	 * @engines: 2-d array of engine instances to configure parallel engine
-	 *
-	 * length = width (i) * num_siblings (j)
-	 * index = j + i * num_siblings
-	 */
-	struct i915_engine_class_instance engines[0];
-
-} __packed;
-
diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
index cbda75065dad..d630f15ab795 100644
--- a/Documentation/gpu/rfc/i915_scheduler.rst
+++ b/Documentation/gpu/rfc/i915_scheduler.rst
@@ -135,8 +135,8 @@ Add I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT and
 drm_i915_context_engines_parallel_submit to the uAPI to implement this
 extension.
 
-.. kernel-doc:: Documentation/gpu/rfc/i915_parallel_execbuf.h
-        :functions: drm_i915_context_engines_parallel_submit
+.. kernel-doc:: include/uapi/drm/i915_drm.h
+        :functions: i915_context_engines_parallel_submit
 
 Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
 -------------------------------------------------------------------
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 19/26] drm/i915/guc: Add basic GuC multi-lrc selftest
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Add very basic (single submission) multi-lrc selftest.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   1 +
 .../drm/i915/gt/uc/selftest_guc_multi_lrc.c   | 179 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 3 files changed, 181 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9b19e0d830a2..12ee8ca76249 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3957,4 +3957,5 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_guc.c"
+#include "selftest_guc_multi_lrc.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
new file mode 100644
index 000000000000..50953c8e8b53
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2019 Intel Corporation
+ */
+
+#include "selftests/igt_spinner.h"
+#include "selftests/igt_reset.h"
+#include "selftests/intel_scheduler_helpers.h"
+#include "gt/intel_engine_heartbeat.h"
+#include "gem/selftests/mock_context.h"
+
+static void logical_sort(struct intel_engine_cs **engines, int num_engines)
+{
+	struct intel_engine_cs *sorted[MAX_ENGINE_INSTANCE + 1];
+	int i, j;
+
+	for (i = 0; i < num_engines; ++i)
+		for (j = 0; j < num_engines; ++j) {
+			if (engines[j]->logical_mask & BIT(i)) {
+				sorted[i] = engines[j];
+				break;
+			}
+		}
+
+	memcpy(engines, sorted,
+	       sizeof(struct intel_engine_cs *) * num_engines);
+}
+
+static struct intel_context *
+multi_lrc_create_parent(struct intel_gt *gt, u8 class,
+			unsigned long flags)
+{
+	struct intel_engine_cs *siblings[MAX_ENGINE_INSTANCE + 1];
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	int i = 0;
+
+	for_each_engine(engine, gt, id) {
+		if (engine->class != class)
+			continue;
+
+		siblings[i++] = engine;
+	}
+
+	if (i <= 1)
+		return ERR_PTR(0);
+
+	logical_sort(siblings, i);
+
+	return intel_engine_create_parallel(siblings, 1, i);
+}
+
+static void multi_lrc_context_unpin(struct intel_context *ce)
+{
+	struct intel_context *child;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	for_each_child(ce, child)
+		intel_context_unpin(child);
+	intel_context_unpin(ce);
+}
+
+static void multi_lrc_context_put(struct intel_context *ce)
+{
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	/*
+	 * Only the parent gets the creation ref put in the uAPI, the parent
+	 * itself is responsible for creation ref put on the children.
+	 */
+	intel_context_put(ce);
+}
+
+static struct i915_request *
+multi_lrc_nop_request(struct intel_context *ce)
+{
+	struct intel_context *child;
+	struct i915_request *rq, *child_rq;
+	int i = 0;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	for_each_child(ce, child) {
+		child_rq = intel_context_create_request(child);
+		if (IS_ERR(child_rq))
+			goto child_error;
+
+		if (++i == ce->parallel.number_children)
+			set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
+				&child_rq->fence.flags);
+		i915_request_add(child_rq);
+	}
+
+	return rq;
+
+child_error:
+	i915_request_put(rq);
+
+	return ERR_CAST(child_rq);
+}
+
+static int __intel_guc_multi_lrc_basic(struct intel_gt *gt, unsigned int class)
+{
+	struct intel_context *parent;
+	struct i915_request *rq;
+	int ret;
+
+	parent = multi_lrc_create_parent(gt, class, 0);
+	if (IS_ERR(parent)) {
+		pr_err("Failed creating contexts: %ld", PTR_ERR(parent));
+		return PTR_ERR(parent);
+	} else if (!parent) {
+		pr_debug("Not enough engines in class: %d", class);
+		return 0;
+	}
+
+	rq = multi_lrc_nop_request(parent);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		pr_err("Failed creating requests: %d", ret);
+		goto out;
+	}
+
+	ret = intel_selftest_wait_for_rq(rq);
+	if (ret)
+		pr_err("Failed waiting on request: %d", ret);
+
+	i915_request_put(rq);
+
+	if (ret >= 0) {
+		ret = intel_gt_wait_for_idle(gt, HZ * 5);
+		if (ret < 0)
+			pr_err("GT failed to idle: %d\n", ret);
+	}
+
+out:
+	multi_lrc_context_unpin(parent);
+	multi_lrc_context_put(parent);
+	return ret;
+}
+
+static int intel_guc_multi_lrc_basic(void *arg)
+{
+	struct intel_gt *gt = arg;
+	unsigned int class;
+	int ret;
+
+	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
+		ret = __intel_guc_multi_lrc_basic(gt, class);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+int intel_guc_multi_lrc_live_selftests(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_multi_lrc_basic),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index 3cf6758931f9..bdd290f2bf3c 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -48,5 +48,6 @@ selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
 selftest(guc, intel_guc_live_selftests)
+selftest(guc_multi_lrc, intel_guc_multi_lrc_live_selftests)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread
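
As a quick worked example of logical_sort() above (assuming each physical
engine's logical_mask has exactly one bit set): given a filtered array
{ vcs0 with logical_mask = BIT(1), vcs1 with logical_mask = BIT(0) }, the
i = 0 pass selects vcs1 and the i = 1 pass selects vcs0, so the array is
rewritten to { vcs1, vcs0 }. That is the logically contiguous order the
parallel interface expects, matching the uAPI requirement from earlier in the
series.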

* [PATCH 20/26] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
mid BB. To safely enable preemption at the BB boundary, a handshake
between the parent and child is needed. This is implemented via custom
emit_bb_start & emit_fini_breadcrumb functions and enabled by default
if a context is configured via the set parallel extension.

v2:
 (John Harrison)
  - Fix wording in a few comments
  - Add structure for parent page layout

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |   2 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 330 +++++++++++++++++-
 4 files changed, 324 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 3b340eb59ada..ee84259959d0 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -569,7 +569,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
 	GEM_BUG_ON(intel_context_is_child(child));
 	GEM_BUG_ON(intel_context_is_parent(child));
 
-	parent->parallel.number_children++;
+	parent->parallel.child_index = parent->parallel.number_children++;
 	list_add_tail(&child->parallel.child_link,
 		      &parent->parallel.child_list);
 	child->parallel.parent = parent;
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 1d880303a7e4..95a5b94b4ece 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -250,6 +250,8 @@ struct intel_context {
 		struct i915_request *last_rq;
 		/** @number_children: number of children if parent */
 		u8 number_children;
+		/** @child_index: index into child_list if child */
+		u8 child_index;
 		/** @guc: GuC specific members for parallel submission */
 		struct {
 			/** @wqi_head: head pointer in work queue */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index a00eeddc1449..663950d3badc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -181,7 +181,7 @@ struct guc_process_desc {
 	u32 wq_status;
 	u32 engine_presence;
 	u32 priority;
-	u32 reserved[30];
+	u32 reserved[36];
 } __packed;
 
 #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 12ee8ca76249..f28e36aa77c2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -11,6 +11,7 @@
 #include "gt/intel_context.h"
 #include "gt/intel_engine_pm.h"
 #include "gt/intel_engine_heartbeat.h"
+#include "gt/intel_gpu_commands.h"
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_irq.h"
 #include "gt/intel_gt_pm.h"
@@ -368,10 +369,16 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 
 /*
  * When using multi-lrc submission an extra page in the context state is
- * reserved for the process descriptor and work queue.
+ * reserved for the process descriptor, work queue, and handshake between the
+ * parent + children contexts to insert safe preemption points between each set
+ * of BBs.
  *
  * The layout of this page is below:
  * 0						guc_process_desc
+ * + sizeof(struct guc_process_desc)		child go
+ * + CACHELINE_BYTES				child join[0]
+ * ...
+ * + CACHELINE_BYTES				child join[n - 1]
  * ...						unused
  * PAGE_SIZE / 2				work queue start
  * ...						work queue
@@ -379,7 +386,25 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
  */
 #define WQ_SIZE			(PAGE_SIZE / 2)
 #define WQ_OFFSET		(PAGE_SIZE - WQ_SIZE)
-static u32 __get_process_desc_offset(struct intel_context *ce)
+
+struct parent_page {
+	struct guc_process_desc pdesc;
+
+	u32 child_go_memory;
+	u8 unused0[CACHELINE_BYTES - sizeof(u32)];
+
+	struct {
+		u32 child_join_memory;
+		u8 unused1[CACHELINE_BYTES - sizeof(u32)];
+	} join[MAX_ENGINE_INSTANCE + 1];
+
+	u8 unused2[(WQ_OFFSET - sizeof(struct guc_process_desc) -
+		    CACHELINE_BYTES * (MAX_ENGINE_INSTANCE + 2))];
+
+	u32 wq[WQ_SIZE / sizeof(u32)];
+};
+
+static u32 __get_parent_page_offset(struct intel_context *ce)
 {
 	GEM_BUG_ON(!ce->parallel.guc.parent_page);
 
@@ -388,23 +413,35 @@ static u32 __get_process_desc_offset(struct intel_context *ce)
 
 static u32 __get_wq_offset(struct intel_context *ce)
 {
-	return __get_process_desc_offset(ce) + WQ_OFFSET;
+	BUILD_BUG_ON(offsetof(struct parent_page, wq) != WQ_OFFSET);
+
+	return __get_parent_page_offset(ce) + WQ_OFFSET;
 }
 
-static struct guc_process_desc *
-__get_process_desc(struct intel_context *ce)
+static struct parent_page *
+__get_parent_page(struct intel_context *ce)
 {
+	BUILD_BUG_ON(sizeof(struct parent_page) != PAGE_SIZE);
+
 	/*
 	 * Need to subtract LRC_STATE_OFFSET here as the
 	 * parallel.guc.parent_page is the offset into ce->state while
 	 * ce->lrc_reg_reg is ce->state + LRC_STATE_OFFSET.
 	 */
-	return (struct guc_process_desc *)
+	return (struct parent_page *)
 		(ce->lrc_reg_state +
-		 ((__get_process_desc_offset(ce) -
+		 ((__get_parent_page_offset(ce) -
 		   LRC_STATE_OFFSET) / sizeof(u32)));
 }
 
+static struct guc_process_desc *
+__get_process_desc(struct intel_context *ce)
+{
+	struct parent_page *pp = __get_parent_page(ce);
+
+	return &pp->pdesc;
+}
+
 static u32 *get_wq_pointer(struct guc_process_desc *desc,
 			   struct intel_context *ce,
 			   u32 wqi_size)
@@ -424,8 +461,7 @@ static u32 *get_wq_pointer(struct guc_process_desc *desc,
 	}
 #undef AVAILABLE_SPACE
 
-	return ((u32 *)__get_process_desc(ce)) +
-		((WQ_OFFSET + ce->parallel.guc.wqi_tail) / sizeof(u32));
+	return &__get_parent_page(ce)->wq[ce->parallel.guc.wqi_tail / sizeof(u32)];
 }
 
 static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
@@ -1829,6 +1865,26 @@ static int deregister_context(struct intel_context *ce, u32 guc_id)
 	return __guc_action_deregister_context(guc, guc_id);
 }
 
+static inline void clear_children_join_go_memory(struct intel_context *ce)
+{
+	u32 *mem = (u32 *)(&__get_parent_page(ce)->child_go_memory);
+	u8 i;
+
+	for (i = 0; i < ce->parallel.number_children + 1; ++i)
+		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
+}
+
+static inline u32 get_children_go_value(struct intel_context *ce)
+{
+	return __get_parent_page(ce)->child_go_memory;
+}
+
+static inline u32 get_children_join_value(struct intel_context *ce,
+					  u8 child_index)
+{
+	return __get_parent_page(ce)->join[child_index].child_join_memory;
+}
+
 static void guc_context_policy_init(struct intel_engine_cs *engine,
 				    struct guc_lrc_desc *desc)
 {
@@ -1888,7 +1944,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 		ce->parallel.guc.wqi_head = 0;
 
 		desc->process_desc = i915_ggtt_offset(ce->state) +
-			__get_process_desc_offset(ce);
+			__get_parent_page_offset(ce);
 		desc->wq_addr = i915_ggtt_offset(ce->state) +
 			__get_wq_offset(ce);
 		desc->wq_size = WQ_SIZE;
@@ -1910,6 +1966,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 			guc_context_policy_init(engine, desc);
 		}
+
+		clear_children_join_go_memory(ce);
 	}
 
 	/*
@@ -2976,6 +3034,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
 	.get_sibling = guc_virtual_get_sibling,
 };
 
+/*
+ * The below override of the breadcrumbs is enabled when the user configures a
+ * context for parallel submission (multi-lrc, parent-child).
+ *
+ * The overridden breadcrumbs implement an algorithm which allows the GuC to
+ * safely preempt all the hw contexts configured for parallel submission
+ * between each BB. The contract between the i915 and the GuC is that if the
+ * parent context can be preempted, all the children can be preempted, and the
+ * GuC will always try to preempt the parent before the children. A handshake
+ * between the parent / children breadcrumbs ensures the i915 holds up its end
+ * of the deal, creating a window to preempt between each set of BBs.
+ */
+static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
+						     u64 offset, u32 len,
+						     const unsigned int flags);
+static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
+						    u64 offset, u32 len,
+						    const unsigned int flags);
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs);
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
+						u32 *cs);
+
 static struct intel_context *
 guc_create_parallel(struct intel_engine_cs **engines,
 		    unsigned int num_siblings,
@@ -3011,6 +3094,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
 		}
 	}
 
+	parent->engine->emit_bb_start =
+		emit_bb_start_parent_no_preempt_mid_batch;
+	parent->engine->emit_fini_breadcrumb =
+		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
+	parent->engine->emit_fini_breadcrumb_dw =
+		12 + 4 * parent->parallel.number_children;
+	for_each_child(parent, ce) {
+		ce->engine->emit_bb_start =
+			emit_bb_start_child_no_preempt_mid_batch;
+		ce->engine->emit_fini_breadcrumb =
+			emit_fini_breadcrumb_child_no_preempt_mid_batch;
+		ce->engine->emit_fini_breadcrumb_dw = 16;
+	}
+
 	kfree(siblings);
 	return parent;
 
@@ -3840,6 +3937,17 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 			drm_printf(p, "\t\tWQI Status: %u\n\n",
 				   READ_ONCE(desc->wq_status));
 
+			if (ce->engine->emit_bb_start ==
+			    emit_bb_start_parent_no_preempt_mid_batch) {
+				u8 i;
+
+				drm_printf(p, "\t\tChildren Go: %u\n\n",
+					   get_children_go_value(ce));
+				for (i = 0; i < ce->parallel.number_children; ++i)
+					drm_printf(p, "\t\tChildren Join: %u\n",
+						   get_children_join_value(ce, i));
+			}
+
 			for_each_child(ce, child)
 				guc_log_context(p, child);
 		}
@@ -3847,6 +3955,208 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
+static inline u32 get_children_go_addr(struct intel_context *ce)
+{
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+	BUILD_BUG_ON(offsetof(struct parent_page, child_go_memory) !=
+		     sizeof(struct guc_process_desc));
+
+	return i915_ggtt_offset(ce->state) +
+		__get_parent_page_offset(ce) +
+		sizeof(struct guc_process_desc);
+}
+
+static inline u32 get_children_join_addr(struct intel_context *ce,
+					 u8 child_index)
+{
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
+}
+
+#define PARENT_GO_BB			1
+#define PARENT_GO_FINI_BREADCRUMB	0
+#define CHILD_GO_BB			1
+#define CHILD_GO_FINI_BREADCRUMB	0
+static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
+						     u64 offset, u32 len,
+						     const unsigned int flags)
+{
+	struct intel_context *ce = rq->context;
+	u32 *cs;
+	u8 i;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	cs = intel_ring_begin(rq, 10 + 4 * ce->parallel.number_children);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/* Wait on children */
+	for (i = 0; i < ce->parallel.number_children; ++i) {
+		*cs++ = (MI_SEMAPHORE_WAIT |
+			 MI_SEMAPHORE_GLOBAL_GTT |
+			 MI_SEMAPHORE_POLL |
+			 MI_SEMAPHORE_SAD_EQ_SDD);
+		*cs++ = PARENT_GO_BB;
+		*cs++ = get_children_join_addr(ce, i);
+		*cs++ = 0;
+	}
+
+	/* Turn off preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+	*cs++ = MI_NOOP;
+
+	/* Tell children go */
+	cs = gen8_emit_ggtt_write(cs,
+				  CHILD_GO_BB,
+				  get_children_go_addr(ce),
+				  0);
+
+	/* Jump to batch */
+	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
+		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
+	*cs++ = lower_32_bits(offset);
+	*cs++ = upper_32_bits(offset);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
+						    u64 offset, u32 len,
+						    const unsigned int flags)
+{
+	struct intel_context *ce = rq->context;
+	struct intel_context *parent = intel_context_to_parent(ce);
+	u32 *cs;
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	cs = intel_ring_begin(rq, 12);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/* Signal parent */
+	cs = gen8_emit_ggtt_write(cs,
+				  PARENT_GO_BB,
+				  get_children_join_addr(parent,
+							 ce->parallel.child_index),
+				  0);
+
+	/* Wait on parent for go */
+	*cs++ = (MI_SEMAPHORE_WAIT |
+		 MI_SEMAPHORE_GLOBAL_GTT |
+		 MI_SEMAPHORE_POLL |
+		 MI_SEMAPHORE_SAD_EQ_SDD);
+	*cs++ = CHILD_GO_BB;
+	*cs++ = get_children_go_addr(parent);
+	*cs++ = 0;
+
+	/* Turn off preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+
+	/* Jump to batch */
+	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
+		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
+	*cs++ = lower_32_bits(offset);
+	*cs++ = upper_32_bits(offset);
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+	u8 i;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	/* Wait on children */
+	for (i = 0; i < ce->parallel.number_children; ++i) {
+		*cs++ = (MI_SEMAPHORE_WAIT |
+			 MI_SEMAPHORE_GLOBAL_GTT |
+			 MI_SEMAPHORE_POLL |
+			 MI_SEMAPHORE_SAD_EQ_SDD);
+		*cs++ = PARENT_GO_FINI_BREADCRUMB;
+		*cs++ = get_children_join_addr(ce, i);
+		*cs++ = 0;
+	}
+
+	/* Turn on preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+	*cs++ = MI_NOOP;
+
+	/* Tell children go */
+	cs = gen8_emit_ggtt_write(cs,
+				  CHILD_GO_FINI_BREADCRUMB,
+				  get_children_go_addr(ce),
+				  0);
+
+	/* Emit fini breadcrumb */
+	cs = gen8_emit_ggtt_write(cs,
+				  rq->fence.seqno,
+				  i915_request_active_timeline(rq)->hwsp_offset,
+				  0);
+
+	/* User interrupt */
+	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
+
+	rq->tail = intel_ring_offset(rq, cs);
+
+	return cs;
+}
+
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+	struct intel_context *parent = intel_context_to_parent(ce);
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	/* Turn on preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+	*cs++ = MI_NOOP;
+
+	/* Signal parent */
+	cs = gen8_emit_ggtt_write(cs,
+				  PARENT_GO_FINI_BREADCRUMB,
+				  get_children_join_addr(parent,
+							 ce->parallel.child_index),
+				  0);
+
+	/* Wait on parent for go */
+	*cs++ = (MI_SEMAPHORE_WAIT |
+		 MI_SEMAPHORE_GLOBAL_GTT |
+		 MI_SEMAPHORE_POLL |
+		 MI_SEMAPHORE_SAD_EQ_SDD);
+	*cs++ = CHILD_GO_FINI_BREADCRUMB;
+	*cs++ = get_children_go_addr(parent);
+	*cs++ = 0;
+
+	/* Emit fini breadcrumb */
+	cs = gen8_emit_ggtt_write(cs,
+				  rq->fence.seqno,
+				  i915_request_active_timeline(rq)->hwsp_offset,
+				  0);
+
+	/* User interrupt */
+	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
+
+	rq->tail = intel_ring_offset(rq, cs);
+
+	return cs;
+}
+
 static struct intel_context *
 guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
 		   unsigned long flags)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread
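
To make the handshake these emitters implement easier to follow, the semaphore
traffic can be sketched as below (a reading of the code in this patch, not an
additional contract; join[i] is the per-child cacheline and go the shared
cacheline in the parent page, with PARENT_GO_BB = CHILD_GO_BB = 1 and both
*_FINI_BREADCRUMB values = 0):

	parent emit_bb_start                 child[i] emit_bb_start
	--------------------                 ----------------------
	                                     write join[i] = PARENT_GO_BB
	wait all join[i] == PARENT_GO_BB
	MI_ARB_ON_OFF (preemption off)
	write go = CHILD_GO_BB
	                                     wait go == CHILD_GO_BB
	                                     MI_ARB_ON_OFF (preemption off)
	jump to batch                        jump to batch

	parent fini breadcrumb               child[i] fini breadcrumb
	----------------------               ------------------------
	                                     MI_ARB_ON_OFF (preemption on)
	                                     write join[i] = PARENT_GO_FINI_BREADCRUMB
	wait all join[i] == PARENT_GO_FINI_BREADCRUMB
	MI_ARB_ON_OFF (preemption on)
	write go = CHILD_GO_FINI_BREADCRUMB
	                                     wait go == CHILD_GO_FINI_BREADCRUMB
	write seqno, MI_USER_INTERRUPT       write seqno, MI_USER_INTERRUPT

Arbitration is only re-enabled once a context has worked through the fini
breadcrumb handshake, so the window in which the GuC may preempt the parent
(and with it the children) lies strictly between one set of BBs and the next.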

+							 ce->parallel.child_index),
+				  0);
+
+	/* Wait on parent for go */
+	*cs++ = (MI_SEMAPHORE_WAIT |
+		 MI_SEMAPHORE_GLOBAL_GTT |
+		 MI_SEMAPHORE_POLL |
+		 MI_SEMAPHORE_SAD_EQ_SDD);
+	*cs++ = CHILD_GO_BB;
+	*cs++ = get_children_go_addr(parent);
+	*cs++ = 0;
+
+	/* Turn off preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+
+	/* Jump to batch */
+	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
+		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
+	*cs++ = lower_32_bits(offset);
+	*cs++ = upper_32_bits(offset);
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+	u8 i;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	/* Wait on children */
+	for (i = 0; i < ce->parallel.number_children; ++i) {
+		*cs++ = (MI_SEMAPHORE_WAIT |
+			 MI_SEMAPHORE_GLOBAL_GTT |
+			 MI_SEMAPHORE_POLL |
+			 MI_SEMAPHORE_SAD_EQ_SDD);
+		*cs++ = PARENT_GO_FINI_BREADCRUMB;
+		*cs++ = get_children_join_addr(ce, i);
+		*cs++ = 0;
+	}
+
+	/* Turn on preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+	*cs++ = MI_NOOP;
+
+	/* Tell children go */
+	cs = gen8_emit_ggtt_write(cs,
+				  CHILD_GO_FINI_BREADCRUMB,
+				  get_children_go_addr(ce),
+				  0);
+
+	/* Emit fini breadcrumb */
+	cs = gen8_emit_ggtt_write(cs,
+				  rq->fence.seqno,
+				  i915_request_active_timeline(rq)->hwsp_offset,
+				  0);
+
+	/* User interrupt */
+	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
+
+	rq->tail = intel_ring_offset(rq, cs);
+
+	return cs;
+}
+
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+	struct intel_context *parent = intel_context_to_parent(ce);
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	/* Turn on preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+	*cs++ = MI_NOOP;
+
+	/* Signal parent */
+	cs = gen8_emit_ggtt_write(cs,
+				  PARENT_GO_FINI_BREADCRUMB,
+				  get_children_join_addr(parent,
+							 ce->parallel.child_index),
+				  0);
+
+	/* Wait on parent for go */
+	*cs++ = (MI_SEMAPHORE_WAIT |
+		 MI_SEMAPHORE_GLOBAL_GTT |
+		 MI_SEMAPHORE_POLL |
+		 MI_SEMAPHORE_SAD_EQ_SDD);
+	*cs++ = CHILD_GO_FINI_BREADCRUMB;
+	*cs++ = get_children_go_addr(parent);
+	*cs++ = 0;
+
+	/* Emit fini breadcrumb */
+	cs = gen8_emit_ggtt_write(cs,
+				  rq->fence.seqno,
+				  i915_request_active_timeline(rq)->hwsp_offset,
+				  0);
+
+	/* User interrupt */
+	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
+
+	rq->tail = intel_ring_offset(rq, cs);
+
+	return cs;
+}
+
 static struct intel_context *
 guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
 		   unsigned long flags)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 21/26] drm/i915: Multi-BB execbuf
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Allow multiple batch buffers to be submitted in a single execbuf IOCTL
after a context has been configured with the 'set_parallel' extension.
The number of batches is implicit, based on the context's configuration.
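
From userspace the flow is roughly the below (a hedged sketch, not part
of the patch: 'fd', 'ctx_id' and N_BUFFERS are placeholders, drmIoctl()
comes from libdrm, and ctx_id is assumed to already be configured with
the set_parallel extension added earlier in this series):

	/*
	 * One batch per HW context in the parallel configuration. By
	 * default the batch handles go last in the buffer list (first
	 * with I915_EXEC_BATCH_FIRST); batch_start_offset / batch_len
	 * must be left 0 for a parallel context.
	 */
	struct drm_i915_gem_exec_object2 objs[N_BUFFERS]; /* .handle set per BO */
	struct drm_i915_gem_execbuffer2 execbuf = {
		.buffers_ptr = (__u64)(uintptr_t)objs,
		.buffer_count = N_BUFFERS,
		.rsvd1 = ctx_id, /* the parallel context */
	};

	drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);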

This is implemented with a series of loops: first a loop to find all the
batches, then a loop to pin all the HW contexts, a loop to create all
the requests, a loop to submit (emit BB start, etc.) all the requests, a
loop to tie the requests to the VMAs they touch, and finally a loop to
commit the requests to the backend.
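
Roughly, the create / add ordering looks like the below (a minimal
sketch using the helpers this patch introduces; error handling
omitted):

	int i;

	/*
	 * Create parent-first: i915_request_create() takes the
	 * context's timeline mutex and it stays held until the request
	 * is added to the backend.
	 */
	for_each_batch_create_order(eb, i)
		eb->requests[i] = i915_request_create(eb_find_context(eb, i));

	/*
	 * Add last-child-first so the timeline mutexes are released in
	 * the reverse order they were acquired (see
	 * intel_context_timeline_lock() for the lockdep nesting).
	 */
	for_each_batch_add_order(eb, i)
		eb_request_add(eb, eb->requests[i]);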

A composite fence is also created for the generated requests to return
to the user and to stick in dma resv slots.
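
The composite fence is a dma_fence_array over the per-batch request
fences, built roughly as below (simplified from
eb_composite_fence_create() in the diff; error handling omitted):

	struct dma_fence **fences;
	struct dma_fence_array *fence_array;
	unsigned int i;

	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
	for_each_batch_create_order(eb, i)
		fences[i] = dma_fence_get(&eb->requests[i]->fence);

	/*
	 * Signals only once all batches have completed; the fence
	 * context / seqno come from the parallel (parent) context. The
	 * array takes ownership of the fences[] allocation.
	 */
	fence_array = dma_fence_array_create(eb->num_batches, fences,
					     eb->context->parallel.fence_context,
					     eb->context->parallel.seqno,
					     false);
	eb->composite_fence = &fence_array->base;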

No behavior from the existing IOCTL should be changed, aside from
throttling when the ring for a context is full: we now wait on the
request while holding the object locks.
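
That is, when eb_throttle() returns a request, the wait now happens
with the object locks still held, roughly (simplified from
eb_pin_timeline() in the diff):

	bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
	long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;

	if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, timeout) < 0) {
		i915_request_put(rq);
		return nonblock ? -EWOULDBLOCK : -EINTR;
	}
	i915_request_put(rq);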

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: https://github.com/intel/media-driver/pull/1252

v2:
 (Matthew Brost)
  - Return proper error value if i915_request_create fails
v3:
 (John Harrison)
  - Add comment explaining create / add order loops + locking
  - Update commit message explaining the difference in IOCTL behavior
  - Line wrap some comments
  - eb_add_request returns void
  - Return -EINVAL rather than triggering BUG_ON if cmd parser is used
 (Checkpatch)
  - Check eb->batch_len[*current_batch]

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 793 ++++++++++++------
 drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  10 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
 drivers/gpu/drm/i915/i915_request.h           |   9 +
 drivers/gpu/drm/i915/i915_vma.c               |  21 +-
 drivers/gpu/drm/i915/i915_vma.h               |  13 +-
 7 files changed, 599 insertions(+), 257 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 2f2434b52317..5c7fb6f68bbb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -244,17 +244,25 @@ struct i915_execbuffer {
 	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
 	struct eb_vma *vma;
 
-	struct intel_engine_cs *engine; /** engine to queue the request to */
+	struct intel_gt *gt; /* gt for the execbuf */
 	struct intel_context *context; /* logical state for the request */
 	struct i915_gem_context *gem_context; /** caller's context */
 
-	struct i915_request *request; /** our request to build */
-	struct eb_vma *batch; /** identity of the batch obj/vma */
+	/** our requests to build */
+	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
+	/** identity of the batch obj/vma */
+	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1];
 	struct i915_vma *trampoline; /** trampoline used for chaining */
 
+	/** used for excl fence in dma_resv objects when > 1 BB submitted */
+	struct dma_fence *composite_fence;
+
 	/** actual size of execobj[] as we may extend it for the cmdparser */
 	unsigned int buffer_count;
 
+	/* number of batches in execbuf IOCTL */
+	unsigned int num_batches;
+
 	/** list of vma not yet bound during reservation phase */
 	struct list_head unbound;
 
@@ -281,7 +289,8 @@ struct i915_execbuffer {
 
 	u64 invalid_flags; /** Set of execobj.flags that are invalid */
 
-	u64 batch_len; /** Length of batch within object */
+	/** Length of batch within object */
+	u64 batch_len[MAX_ENGINE_INSTANCE + 1];
 	u32 batch_start_offset; /** Location within object of batch */
 	u32 batch_flags; /** Flags composed for emit_bb_start() */
 	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
@@ -299,14 +308,13 @@ struct i915_execbuffer {
 };
 
 static int eb_parse(struct i915_execbuffer *eb);
-static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
-					  bool throttle);
+static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
 static void eb_unpin_engine(struct i915_execbuffer *eb);
 
 static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
 {
-	return intel_engine_requires_cmd_parser(eb->engine) ||
-		(intel_engine_using_cmd_parser(eb->engine) &&
+	return intel_engine_requires_cmd_parser(eb->context->engine) ||
+		(intel_engine_using_cmd_parser(eb->context->engine) &&
 		 eb->args->batch_len);
 }
 
@@ -544,11 +552,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
 	return 0;
 }
 
-static void
+static inline bool
+is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
+{
+	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
+		buffer_idx < eb->num_batches :
+		buffer_idx >= eb->args->buffer_count - eb->num_batches;
+}
+
+static int
 eb_add_vma(struct i915_execbuffer *eb,
-	   unsigned int i, unsigned batch_idx,
+	   unsigned int *current_batch,
+	   unsigned int i,
 	   struct i915_vma *vma)
 {
+	struct drm_i915_private *i915 = eb->i915;
 	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
 	struct eb_vma *ev = &eb->vma[i];
 
@@ -575,15 +593,41 @@ eb_add_vma(struct i915_execbuffer *eb,
 	 * Note that actual hangs have only been observed on gen7, but for
 	 * paranoia do it everywhere.
 	 */
-	if (i == batch_idx) {
+	if (is_batch_buffer(eb, i)) {
 		if (entry->relocation_count &&
 		    !(ev->flags & EXEC_OBJECT_PINNED))
 			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
 		if (eb->reloc_cache.has_fence)
 			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
 
-		eb->batch = ev;
+		eb->batches[*current_batch] = ev;
+
+		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
+			drm_dbg(&i915->drm,
+				"Attempting to use self-modifying batch buffer\n");
+			return -EINVAL;
+		}
+
+		if (range_overflows_t(u64,
+				      eb->batch_start_offset,
+				      eb->args->batch_len,
+				      ev->vma->size)) {
+			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
+			return -EINVAL;
+		}
+
+		if (eb->args->batch_len == 0)
+			eb->batch_len[*current_batch] = ev->vma->size -
+				eb->batch_start_offset;
+		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
+			drm_dbg(&i915->drm, "Invalid batch length\n");
+			return -EINVAL;
+		}
+
+		++*current_batch;
 	}
+
+	return 0;
 }
 
 static inline int use_cpu_reloc(const struct reloc_cache *cache,
@@ -727,14 +771,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 	} while (1);
 }
 
-static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
-{
-	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
-		return 0;
-	else
-		return eb->buffer_count - 1;
-}
-
 static int eb_select_context(struct i915_execbuffer *eb)
 {
 	struct i915_gem_context *ctx;
@@ -839,9 +875,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
 
 static int eb_lookup_vmas(struct i915_execbuffer *eb)
 {
-	struct drm_i915_private *i915 = eb->i915;
-	unsigned int batch = eb_batch_index(eb);
-	unsigned int i;
+	unsigned int i, current_batch = 0;
 	int err = 0;
 
 	INIT_LIST_HEAD(&eb->relocs);
@@ -861,7 +895,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 			goto err;
 		}
 
-		eb_add_vma(eb, i, batch, vma);
+		err = eb_add_vma(eb, &current_batch, i, vma);
+		if (err)
+			return err;
 
 		if (i915_gem_object_is_userptr(vma->obj)) {
 			err = i915_gem_object_userptr_submit_init(vma->obj);
@@ -884,26 +920,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 		}
 	}
 
-	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
-		drm_dbg(&i915->drm,
-			"Attempting to use self-modifying batch buffer\n");
-		return -EINVAL;
-	}
-
-	if (range_overflows_t(u64,
-			      eb->batch_start_offset, eb->batch_len,
-			      eb->batch->vma->size)) {
-		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
-		return -EINVAL;
-	}
-
-	if (eb->batch_len == 0)
-		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
-	if (unlikely(eb->batch_len == 0)) { /* impossible! */
-		drm_dbg(&i915->drm, "Invalid batch length\n");
-		return -EINVAL;
-	}
-
 	return 0;
 
 err:
@@ -1636,8 +1652,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
 	return 0;
 }
 
-static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
-					   struct i915_request *rq)
+static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
 {
 	bool have_copy = false;
 	struct eb_vma *ev;
@@ -1653,21 +1668,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	eb_release_vmas(eb, false);
 	i915_gem_ww_ctx_fini(&eb->ww);
 
-	if (rq) {
-		/* nonblocking is always false */
-		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
-				      MAX_SCHEDULE_TIMEOUT) < 0) {
-			i915_request_put(rq);
-			rq = NULL;
-
-			err = -EINTR;
-			goto err_relock;
-		}
-
-		i915_request_put(rq);
-		rq = NULL;
-	}
-
 	/*
 	 * We take 3 passes through the slowpatch.
 	 *
@@ -1694,28 +1694,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	if (!err)
 		err = eb_reinit_userptr(eb);
 
-err_relock:
 	i915_gem_ww_ctx_init(&eb->ww, true);
 	if (err)
 		goto out;
 
 	/* reacquire the objects */
 repeat_validate:
-	rq = eb_pin_engine(eb, false);
-	if (IS_ERR(rq)) {
-		err = PTR_ERR(rq);
-		rq = NULL;
+	err = eb_pin_engine(eb, false);
+	if (err)
 		goto err;
-	}
-
-	/* We didn't throttle, should be NULL */
-	GEM_WARN_ON(rq);
 
 	err = eb_validate_vmas(eb);
 	if (err)
 		goto err;
 
-	GEM_BUG_ON(!eb->batch);
+	GEM_BUG_ON(!eb->batches[0]);
 
 	list_for_each_entry(ev, &eb->relocs, reloc_link) {
 		if (!have_copy) {
@@ -1779,46 +1772,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 		}
 	}
 
-	if (rq)
-		i915_request_put(rq);
-
 	return err;
 }
 
 static int eb_relocate_parse(struct i915_execbuffer *eb)
 {
 	int err;
-	struct i915_request *rq = NULL;
 	bool throttle = true;
 
 retry:
-	rq = eb_pin_engine(eb, throttle);
-	if (IS_ERR(rq)) {
-		err = PTR_ERR(rq);
-		rq = NULL;
+	err = eb_pin_engine(eb, throttle);
+	if (err) {
 		if (err != -EDEADLK)
 			return err;
 
 		goto err;
 	}
 
-	if (rq) {
-		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
-
-		/* Need to drop all locks now for throttling, take slowpath */
-		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
-		if (err == -ETIME) {
-			if (nonblock) {
-				err = -EWOULDBLOCK;
-				i915_request_put(rq);
-				goto err;
-			}
-			goto slow;
-		}
-		i915_request_put(rq);
-		rq = NULL;
-	}
-
 	/* only throttle once, even if we didn't need to throttle */
 	throttle = false;
 
@@ -1858,7 +1828,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 
 slow:
-	err = eb_relocate_parse_slow(eb, rq);
+	err = eb_relocate_parse_slow(eb);
 	if (err)
 		/*
 		 * If the user expects the execobject.offset and
@@ -1872,11 +1842,40 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
+/*
+ * Two helper loops define the order in which requests / batches are created
+ * and added to the backend. Requests are created in order from the parent to
+ * the last child. Requests are added in the reverse order, from the last child
+ * to the parent. This is done for locking reasons as the timeline lock is
+ * acquired during request creation and released when the request is added to
+ * the backend. To make lockdep happy (see intel_context_timeline_lock) this
+ * must be the ordering.
+ */
+#define for_each_batch_create_order(_eb, _i) \
+	for (_i = 0; _i < (_eb)->num_batches; ++_i)
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
+
+static struct i915_request *
+eb_find_first_request_added(struct i915_execbuffer *eb)
+{
+	int i;
+
+	for_each_batch_add_order(eb, i)
+		if (eb->requests[i])
+			return eb->requests[i];
+
+	GEM_BUG_ON("Request not found");
+
+	return NULL;
+}
+
 static int eb_move_to_gpu(struct i915_execbuffer *eb)
 {
 	const unsigned int count = eb->buffer_count;
 	unsigned int i = count;
-	int err = 0;
+	int err = 0, j;
 
 	while (i--) {
 		struct eb_vma *ev = &eb->vma[i];
@@ -1889,11 +1888,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		if (flags & EXEC_OBJECT_CAPTURE) {
 			struct i915_capture_list *capture;
 
-			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
-			if (capture) {
-				capture->next = eb->request->capture_list;
-				capture->vma = vma;
-				eb->request->capture_list = capture;
+			for_each_batch_create_order(eb, j) {
+				if (!eb->requests[j])
+					break;
+
+				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
+				if (capture) {
+					capture->next =
+						eb->requests[j]->capture_list;
+					capture->vma = vma;
+					eb->requests[j]->capture_list = capture;
+				}
 			}
 		}
 
@@ -1914,14 +1919,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 				flags &= ~EXEC_OBJECT_ASYNC;
 		}
 
+		/* We only need to await on the first request */
 		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
 			err = i915_request_await_object
-				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
+				(eb_find_first_request_added(eb), obj,
+				 flags & EXEC_OBJECT_WRITE);
 		}
 
-		if (err == 0)
-			err = i915_vma_move_to_active(vma, eb->request,
-						      flags | __EXEC_OBJECT_NO_RESERVE);
+		for_each_batch_add_order(eb, j) {
+			if (err)
+				break;
+			if (!eb->requests[j])
+				continue;
+
+			err = _i915_vma_move_to_active(vma, eb->requests[j],
+						       j ? NULL :
+						       eb->composite_fence ?
+						       eb->composite_fence :
+						       &eb->requests[j]->fence,
+						       flags | __EXEC_OBJECT_NO_RESERVE);
+		}
 	}
 
 #ifdef CONFIG_MMU_NOTIFIER
@@ -1952,11 +1969,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		goto err_skip;
 
 	/* Unconditionally flush any chipset caches (for streaming writes). */
-	intel_gt_chipset_flush(eb->engine->gt);
+	intel_gt_chipset_flush(eb->gt);
 	return 0;
 
 err_skip:
-	i915_request_set_error_once(eb->request, err);
+	for_each_batch_create_order(eb, j) {
+		if (!eb->requests[j])
+			break;
+
+		i915_request_set_error_once(eb->requests[j], err);
+	}
 	return err;
 }
 
@@ -2051,14 +2073,17 @@ static int eb_parse(struct i915_execbuffer *eb)
 	int err;
 
 	if (!eb_use_cmdparser(eb)) {
-		batch = eb_dispatch_secure(eb, eb->batch->vma);
+		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
 		if (IS_ERR(batch))
 			return PTR_ERR(batch);
 
 		goto secure_batch;
 	}
 
-	len = eb->batch_len;
+	if (intel_context_is_parallel(eb->context))
+		return -EINVAL;
+
+	len = eb->batch_len[0];
 	if (!CMDPARSER_USES_GGTT(eb->i915)) {
 		/*
 		 * ppGTT backed shadow buffers must be mapped RO, to prevent
@@ -2072,11 +2097,11 @@ static int eb_parse(struct i915_execbuffer *eb)
 	} else {
 		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
 	}
-	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
+	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
 		return -EINVAL;
 
 	if (!pool) {
-		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
+		pool = intel_gt_get_buffer_pool(eb->gt, len,
 						I915_MAP_WB);
 		if (IS_ERR(pool))
 			return PTR_ERR(pool);
@@ -2101,7 +2126,7 @@ static int eb_parse(struct i915_execbuffer *eb)
 		trampoline = shadow;
 
 		shadow = shadow_batch_pin(eb, pool->obj,
-					  &eb->engine->gt->ggtt->vm,
+					  &eb->gt->ggtt->vm,
 					  PIN_GLOBAL);
 		if (IS_ERR(shadow)) {
 			err = PTR_ERR(shadow);
@@ -2123,26 +2148,29 @@ static int eb_parse(struct i915_execbuffer *eb)
 	if (err)
 		goto err_trampoline;
 
-	err = intel_engine_cmd_parser(eb->engine,
-				      eb->batch->vma,
+	err = intel_engine_cmd_parser(eb->context->engine,
+				      eb->batches[0]->vma,
 				      eb->batch_start_offset,
-				      eb->batch_len,
+				      eb->batch_len[0],
 				      shadow, trampoline);
 	if (err)
 		goto err_unpin_batch;
 
-	eb->batch = &eb->vma[eb->buffer_count++];
-	eb->batch->vma = i915_vma_get(shadow);
-	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
+	eb->batches[0] = &eb->vma[eb->buffer_count++];
+	eb->batches[0]->vma = i915_vma_get(shadow);
+	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
 
 	eb->trampoline = trampoline;
 	eb->batch_start_offset = 0;
 
 secure_batch:
 	if (batch) {
-		eb->batch = &eb->vma[eb->buffer_count++];
-		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
-		eb->batch->vma = i915_vma_get(batch);
+		if (intel_context_is_parallel(eb->context))
+			return -EINVAL;
+
+		eb->batches[0] = &eb->vma[eb->buffer_count++];
+		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
+		eb->batches[0]->vma = i915_vma_get(batch);
 	}
 	return 0;
 
@@ -2158,19 +2186,18 @@ static int eb_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
-static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
+static int eb_request_submit(struct i915_execbuffer *eb,
+			     struct i915_request *rq,
+			     struct i915_vma *batch,
+			     u64 batch_len)
 {
 	int err;
 
-	if (intel_context_nopreempt(eb->context))
-		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
-
-	err = eb_move_to_gpu(eb);
-	if (err)
-		return err;
+	if (intel_context_nopreempt(rq->context))
+		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
 
 	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
-		err = i915_reset_gen7_sol_offsets(eb->request);
+		err = i915_reset_gen7_sol_offsets(rq);
 		if (err)
 			return err;
 	}
@@ -2181,26 +2208,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	 * allows us to determine if the batch is still waiting on the GPU
 	 * or actually running by checking the breadcrumb.
 	 */
-	if (eb->engine->emit_init_breadcrumb) {
-		err = eb->engine->emit_init_breadcrumb(eb->request);
+	if (rq->context->engine->emit_init_breadcrumb) {
+		err = rq->context->engine->emit_init_breadcrumb(rq);
 		if (err)
 			return err;
 	}
 
-	err = eb->engine->emit_bb_start(eb->request,
-					batch->node.start +
-					eb->batch_start_offset,
-					eb->batch_len,
-					eb->batch_flags);
+	err = rq->context->engine->emit_bb_start(rq,
+						 batch->node.start +
+						 eb->batch_start_offset,
+						 batch_len,
+						 eb->batch_flags);
 	if (err)
 		return err;
 
 	if (eb->trampoline) {
+		GEM_BUG_ON(intel_context_is_parallel(rq->context));
 		GEM_BUG_ON(eb->batch_start_offset);
-		err = eb->engine->emit_bb_start(eb->request,
-						eb->trampoline->node.start +
-						eb->batch_len,
-						0, 0);
+		err = rq->context->engine->emit_bb_start(rq,
+							 eb->trampoline->node.start +
+							 batch_len, 0, 0);
 		if (err)
 			return err;
 	}
@@ -2208,6 +2235,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	return 0;
 }
 
+static int eb_submit(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+	int err;
+
+	err = eb_move_to_gpu(eb);
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
+		if (!err)
+			err = eb_request_submit(eb, eb->requests[i],
+						eb->batches[i]->vma,
+						eb->batch_len[i]);
+	}
+
+	return err;
+}
+
 static int num_vcs_engines(const struct drm_i915_private *i915)
 {
 	return hweight_long(VDBOX_MASK(&i915->gt));
@@ -2273,26 +2321,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
 	return i915_request_get(rq);
 }
 
-static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
+static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
+			   bool throttle)
 {
-	struct intel_context *ce = eb->context;
 	struct intel_timeline *tl;
-	struct i915_request *rq = NULL;
-	int err;
-
-	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
-
-	if (unlikely(intel_context_is_banned(ce)))
-		return ERR_PTR(-EIO);
-
-	/*
-	 * Pinning the contexts may generate requests in order to acquire
-	 * GGTT space, so do this first before we reserve a seqno for
-	 * ourselves.
-	 */
-	err = intel_context_pin_ww(ce, &eb->ww);
-	if (err)
-		return ERR_PTR(err);
+	struct i915_request *rq;
 
 	/*
 	 * Take a local wakeref for preparing to dispatch the execbuf as
@@ -2303,33 +2336,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
 	 * taken on the engine, and the parent device.
 	 */
 	tl = intel_context_timeline_lock(ce);
-	if (IS_ERR(tl)) {
-		intel_context_unpin(ce);
-		return ERR_CAST(tl);
-	}
+	if (IS_ERR(tl))
+		return PTR_ERR(tl);
 
 	intel_context_enter(ce);
 	if (throttle)
 		rq = eb_throttle(eb, ce);
 	intel_context_timeline_unlock(tl);
 
+	if (rq) {
+		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
+		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
+
+		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
+				      timeout) < 0) {
+			i915_request_put(rq);
+
+			tl = intel_context_timeline_lock(ce);
+			intel_context_exit(ce);
+			intel_context_timeline_unlock(tl);
+
+			if (nonblock)
+				return -EWOULDBLOCK;
+			else
+				return -EINTR;
+		}
+		i915_request_put(rq);
+	}
+
+	return 0;
+}
+
+static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
+{
+	struct intel_context *ce = eb->context, *child;
+	int err;
+	int i = 0, j = 0;
+
+	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
+
+	if (unlikely(intel_context_is_banned(ce)))
+		return -EIO;
+
+	/*
+	 * Pinning the contexts may generate requests in order to acquire
+	 * GGTT space, so do this first before we reserve a seqno for
+	 * ourselves.
+	 */
+	err = intel_context_pin_ww(ce, &eb->ww);
+	if (err)
+		return err;
+	for_each_child(ce, child) {
+		err = intel_context_pin_ww(child, &eb->ww);
+		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
+	}
+
+	for_each_child(ce, child) {
+		err = eb_pin_timeline(eb, child, throttle);
+		if (err)
+			goto unwind;
+		++i;
+	}
+	err = eb_pin_timeline(eb, ce, throttle);
+	if (err)
+		goto unwind;
+
 	eb->args->flags |= __EXEC_ENGINE_PINNED;
-	return rq;
+	return 0;
+
+unwind:
+	for_each_child(ce, child) {
+		if (j++ < i) {
+			mutex_lock(&child->timeline->mutex);
+			intel_context_exit(child);
+			mutex_unlock(&child->timeline->mutex);
+		}
+	}
+	for_each_child(ce, child)
+		intel_context_unpin(child);
+	intel_context_unpin(ce);
+	return err;
 }
 
 static void eb_unpin_engine(struct i915_execbuffer *eb)
 {
-	struct intel_context *ce = eb->context;
-	struct intel_timeline *tl = ce->timeline;
+	struct intel_context *ce = eb->context, *child;
 
 	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
 		return;
 
 	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
 
-	mutex_lock(&tl->mutex);
+	for_each_child(ce, child) {
+		mutex_lock(&child->timeline->mutex);
+		intel_context_exit(child);
+		mutex_unlock(&child->timeline->mutex);
+
+		intel_context_unpin(child);
+	}
+
+	mutex_lock(&ce->timeline->mutex);
 	intel_context_exit(ce);
-	mutex_unlock(&tl->mutex);
+	mutex_unlock(&ce->timeline->mutex);
 
 	intel_context_unpin(ce);
 }
@@ -2380,7 +2488,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
 static int
 eb_select_engine(struct i915_execbuffer *eb)
 {
-	struct intel_context *ce;
+	struct intel_context *ce, *child;
 	unsigned int idx;
 	int err;
 
@@ -2393,6 +2501,20 @@ eb_select_engine(struct i915_execbuffer *eb)
 	if (IS_ERR(ce))
 		return PTR_ERR(ce);
 
+	if (intel_context_is_parallel(ce)) {
+		if (eb->buffer_count < ce->parallel.number_children + 1) {
+			intel_context_put(ce);
+			return -EINVAL;
+		}
+		if (eb->batch_start_offset || eb->args->batch_len) {
+			intel_context_put(ce);
+			return -EINVAL;
+		}
+	}
+	eb->num_batches = ce->parallel.number_children + 1;
+
+	for_each_child(ce, child)
+		intel_context_get(child);
 	intel_gt_pm_get(ce->engine->gt);
 
 	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
@@ -2400,6 +2522,13 @@ eb_select_engine(struct i915_execbuffer *eb)
 		if (err)
 			goto err;
 	}
+	for_each_child(ce, child) {
+		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
+			err = intel_context_alloc_state(child);
+			if (err)
+				goto err;
+		}
+	}
 
 	/*
 	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
@@ -2410,7 +2539,7 @@ eb_select_engine(struct i915_execbuffer *eb)
 		goto err;
 
 	eb->context = ce;
-	eb->engine = ce->engine;
+	eb->gt = ce->engine->gt;
 
 	/*
 	 * Make sure engine pool stays alive even if we call intel_context_put
@@ -2421,6 +2550,8 @@ eb_select_engine(struct i915_execbuffer *eb)
 
 err:
 	intel_gt_pm_put(ce->engine->gt);
+	for_each_child(ce, child)
+		intel_context_put(child);
 	intel_context_put(ce);
 	return err;
 }
@@ -2428,7 +2559,11 @@ eb_select_engine(struct i915_execbuffer *eb)
 static void
 eb_put_engine(struct i915_execbuffer *eb)
 {
-	intel_gt_pm_put(eb->engine->gt);
+	struct intel_context *child;
+
+	intel_gt_pm_put(eb->gt);
+	for_each_child(eb->context, child)
+		intel_context_put(child);
 	intel_context_put(eb->context);
 }
 
@@ -2651,7 +2786,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
 }
 
 static int
-await_fence_array(struct i915_execbuffer *eb)
+await_fence_array(struct i915_execbuffer *eb,
+		  struct i915_request *rq)
 {
 	unsigned int n;
 	int err;
@@ -2665,8 +2801,7 @@ await_fence_array(struct i915_execbuffer *eb)
 		if (!eb->fences[n].dma_fence)
 			continue;
 
-		err = i915_request_await_dma_fence(eb->request,
-						   eb->fences[n].dma_fence);
+		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
 		if (err < 0)
 			return err;
 	}
@@ -2674,9 +2809,9 @@ await_fence_array(struct i915_execbuffer *eb)
 	return 0;
 }
 
-static void signal_fence_array(const struct i915_execbuffer *eb)
+static void signal_fence_array(const struct i915_execbuffer *eb,
+			       struct dma_fence * const fence)
 {
-	struct dma_fence * const fence = &eb->request->fence;
 	unsigned int n;
 
 	for (n = 0; n < eb->num_fences; n++) {
@@ -2724,9 +2859,8 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
 			break;
 }
 
-static int eb_request_add(struct i915_execbuffer *eb, int err)
+static void eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
 {
-	struct i915_request *rq = eb->request;
 	struct intel_timeline * const tl = i915_request_timeline(rq);
 	struct i915_sched_attr attr = {};
 	struct i915_request *prev;
@@ -2741,11 +2875,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
 	/* Check that the context wasn't destroyed before submission */
 	if (likely(!intel_context_is_closed(eb->context))) {
 		attr = eb->gem_context->sched;
-	} else {
-		/* Serialise with context_close via the add_to_timeline */
-		i915_request_set_error_once(rq, -ENOENT);
-		__i915_request_skip(rq);
-		err = -ENOENT; /* override any transient errors */
 	}
 
 	__i915_request_queue(rq, &attr);
@@ -2755,6 +2884,42 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
 		retire_requests(tl, prev);
 
 	mutex_unlock(&tl->mutex);
+}
+
+static int eb_requests_add(struct i915_execbuffer *eb, int err)
+{
+	int i;
+
+	/*
+	 * We iterate in reverse order of creation to release timeline mutexes
+	 * in the same order.
+	 */
+	for_each_batch_add_order(eb, i) {
+		struct i915_request *rq = eb->requests[i];
+
+		if (!rq)
+			continue;
+
+		if (unlikely(intel_context_is_closed(eb->context))) {
+			/* Serialise with context_close via the add_to_timeline */
+			i915_request_set_error_once(rq, -ENOENT);
+			__i915_request_skip(rq);
+			err = -ENOENT; /* override any transient errors */
+		}
+
+		if (intel_context_is_parallel(eb->context)) {
+			if (err) {
+				__i915_request_skip(rq);
+				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
+					&rq->fence.flags);
+			}
+			if (i == 0)
+				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
+					&rq->fence.flags);
+		}
+
+		eb_request_add(eb, rq);
+	}
 
 	return err;
 }
@@ -2785,6 +2950,182 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
 				    eb);
 }
 
+static void eb_requests_get(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		i915_request_get(eb->requests[i]);
+	}
+}
+
+static void eb_requests_put(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		i915_request_put(eb->requests[i]);
+	}
+}
+
+static struct sync_file *
+eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	struct dma_fence_array *fence_array;
+	struct dma_fence **fences;
+	unsigned int i;
+
+	GEM_BUG_ON(!intel_context_is_parent(eb->context));
+
+	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
+	if (!fences)
+		return ERR_PTR(-ENOMEM);
+
+	for_each_batch_create_order(eb, i)
+		fences[i] = &eb->requests[i]->fence;
+
+	fence_array = dma_fence_array_create(eb->num_batches,
+					     fences,
+					     eb->context->parallel.fence_context,
+					     eb->context->parallel.seqno,
+					     false);
+	if (!fence_array) {
+		kfree(fences);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* Move ownership to the dma_fence_array created above */
+	for_each_batch_create_order(eb, i)
+		dma_fence_get(fences[i]);
+
+	if (out_fence_fd != -1) {
+		out_fence = sync_file_create(&fence_array->base);
+		/* sync_file now owns fence_array, drop creation ref */
+		dma_fence_put(&fence_array->base);
+		if (!out_fence)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	eb->composite_fence = &fence_array->base;
+
+	return out_fence;
+}
+
+static struct sync_file *
+eb_fences_add(struct i915_execbuffer *eb, struct i915_request *rq,
+	      struct dma_fence *in_fence, int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	int err;
+
+	if (unlikely(eb->gem_context->syncobj)) {
+		struct dma_fence *fence;
+
+		fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
+		err = i915_request_await_dma_fence(rq, fence);
+		dma_fence_put(fence);
+		if (err)
+			return ERR_PTR(err);
+	}
+
+	if (in_fence) {
+		if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
+			err = i915_request_await_execution(rq, in_fence);
+		else
+			err = i915_request_await_dma_fence(rq, in_fence);
+		if (err < 0)
+			return ERR_PTR(err);
+	}
+
+	if (eb->fences) {
+		err = await_fence_array(eb, rq);
+		if (err)
+			return ERR_PTR(err);
+	}
+
+	if (intel_context_is_parallel(eb->context)) {
+		out_fence = eb_composite_fence_create(eb, out_fence_fd);
+		if (IS_ERR(out_fence))
+			return ERR_PTR(-ENOMEM);
+	} else if (out_fence_fd != -1) {
+		out_fence = sync_file_create(&rq->fence);
+		if (!out_fence)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	return out_fence;
+}
+
+static struct intel_context *
+eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
+{
+	struct intel_context *child;
+
+	if (likely(context_number == 0))
+		return eb->context;
+
+	for_each_child(eb->context, child)
+		if (!--context_number)
+			return child;
+
+	GEM_BUG_ON("Context not found");
+
+	return NULL;
+}
+
+static struct sync_file *
+eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
+		   int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		/* Allocate a request for this batch buffer nice and early. */
+		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
+		if (IS_ERR(eb->requests[i])) {
+			out_fence = ERR_PTR(PTR_ERR(eb->requests[i]));
+			eb->requests[i] = NULL;
+			return out_fence;
+		}
+
+		/*
+		 * Only the first request added (committed to backend) has to
+		 * take the in-fences into account as all subsequent requests
+		 * will have fences inserted between them.
+		 */
+		if (i + 1 == eb->num_batches) {
+			out_fence = eb_fences_add(eb, eb->requests[i],
+						  in_fence, out_fence_fd);
+			if (IS_ERR(out_fence))
+				return out_fence;
+		}
+
+		/*
+		 * Whilst this request exists, batch_obj will be on the
+		 * active_list, and so will hold the active reference. Only when
+		 * this request is retired will the batch_obj be moved onto
+		 * the inactive_list and lose its active reference. Hence we do
+		 * not need to explicitly hold another reference here.
+		 */
+		eb->requests[i]->batch = eb->batches[i]->vma;
+		if (eb->batch_pool) {
+			GEM_BUG_ON(intel_context_is_parallel(eb->context));
+			intel_gt_buffer_pool_mark_active(eb->batch_pool,
+							 eb->requests[i]);
+		}
+	}
+
+	return out_fence;
+}
+
 static int
 i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_file *file,
@@ -2795,7 +3136,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	struct i915_execbuffer eb;
 	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
-	struct i915_vma *batch;
 	int out_fence_fd = -1;
 	int err;
 
@@ -2819,12 +3159,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	eb.buffer_count = args->buffer_count;
 	eb.batch_start_offset = args->batch_start_offset;
-	eb.batch_len = args->batch_len;
 	eb.trampoline = NULL;
 
 	eb.fences = NULL;
 	eb.num_fences = 0;
 
+	memset(eb.requests, 0, sizeof(struct i915_request *) *
+	       ARRAY_SIZE(eb.requests));
+	eb.composite_fence = NULL;
+
 	eb.batch_flags = 0;
 	if (args->flags & I915_EXEC_SECURE) {
 		if (GRAPHICS_VER(i915) >= 11)
@@ -2908,70 +3251,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	ww_acquire_done(&eb.ww.ctx);
 
-	batch = eb.batch->vma;
-
-	/* Allocate a request for this batch buffer nice and early. */
-	eb.request = i915_request_create(eb.context);
-	if (IS_ERR(eb.request)) {
-		err = PTR_ERR(eb.request);
-		goto err_vma;
-	}
-
-	if (unlikely(eb.gem_context->syncobj)) {
-		struct dma_fence *fence;
-
-		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
-		err = i915_request_await_dma_fence(eb.request, fence);
-		dma_fence_put(fence);
-		if (err)
-			goto err_ext;
-	}
-
-	if (in_fence) {
-		if (args->flags & I915_EXEC_FENCE_SUBMIT)
-			err = i915_request_await_execution(eb.request,
-							   in_fence);
-		else
-			err = i915_request_await_dma_fence(eb.request,
-							   in_fence);
-		if (err < 0)
-			goto err_request;
-	}
-
-	if (eb.fences) {
-		err = await_fence_array(&eb);
-		if (err)
+	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
+	if (IS_ERR(out_fence)) {
+		err = PTR_ERR(out_fence);
+		if (eb.requests[0])
 			goto err_request;
+		else
+			goto err_vma;
 	}
 
-	if (out_fence_fd != -1) {
-		out_fence = sync_file_create(&eb.request->fence);
-		if (!out_fence) {
-			err = -ENOMEM;
-			goto err_request;
-		}
-	}
-
-	/*
-	 * Whilst this request exists, batch_obj will be on the
-	 * active_list, and so will hold the active reference. Only when this
-	 * request is retired will the the batch_obj be moved onto the
-	 * inactive_list and lose its active reference. Hence we do not need
-	 * to explicitly hold another reference here.
-	 */
-	eb.request->batch = batch;
-	if (eb.batch_pool)
-		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
-
-	trace_i915_request_queue(eb.request, eb.batch_flags);
-	err = eb_submit(&eb, batch);
+	err = eb_submit(&eb);
 
 err_request:
-	i915_request_get(eb.request);
-	err = eb_request_add(&eb, err);
+	eb_requests_get(&eb);
+	err = eb_requests_add(&eb, err);
 
 	if (eb.fences)
-		signal_fence_array(&eb);
+		signal_fence_array(&eb, eb.composite_fence ?
+				   eb.composite_fence :
+				   &eb.requests[0]->fence);
 
 	if (out_fence) {
 		if (err == 0) {
@@ -2986,10 +3284,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	if (unlikely(eb.gem_context->syncobj)) {
 		drm_syncobj_replace_fence(eb.gem_context->syncobj,
-					  &eb.request->fence);
+					  eb.composite_fence ?
+					  eb.composite_fence :
+					  &eb.requests[0]->fence);
 	}
 
-	i915_request_put(eb.request);
+	if (!out_fence && eb.composite_fence)
+		dma_fence_put(eb.composite_fence);
+
+	eb_requests_put(&eb);
 
 err_vma:
 	eb_release_vmas(&eb, true);
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 1bc705f98e2a..1781419fa105 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -239,7 +239,13 @@ intel_context_timeline_lock(struct intel_context *ce)
 	struct intel_timeline *tl = ce->timeline;
 	int err;
 
-	err = mutex_lock_interruptible(&tl->mutex);
+	if (intel_context_is_parent(ce))
+		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
+	else if (intel_context_is_child(ce))
+		err = mutex_lock_interruptible_nested(&tl->mutex,
+						      ce->parallel.child_index + 1);
+	else
+		err = mutex_lock_interruptible(&tl->mutex);
 	if (err)
 		return ERR_PTR(err);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 95a5b94b4ece..9e0177dc5484 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -248,6 +248,16 @@ struct intel_context {
 		 * context
 		 */
 		struct i915_request *last_rq;
+		/**
+		 * @fence_context: fence context composite fence when doing
+		 * parallel submission
+		 */
+		u64 fence_context;
+		/**
+		 * @seqno: seqno for composite fence when doing parallel
+		 * submission
+		 */
+		u32 seqno;
 		/** @number_children: number of children if parent */
 		u8 number_children;
 		/** @child_index: index into child_list if child */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f28e36aa77c2..83b0d2a114af 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3094,6 +3094,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
 		}
 	}
 
+	parent->parallel.fence_context = dma_fence_context_alloc(1);
+
 	parent->engine->emit_bb_start =
 		emit_bb_start_parent_no_preempt_mid_batch;
 	parent->engine->emit_fini_breadcrumb =
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 8950785e55d6..24db8459376b 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -147,6 +147,15 @@ enum {
 	 * tail.
 	 */
 	I915_FENCE_FLAG_SUBMIT_PARALLEL,
+
+	/*
+	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
+	 * parent-child relationship (parallel submission, multi-lrc) that
+	 * hit an error while generating requests in the execbuf IOCTL.
+	 * Indicates this request should be skipped as another request in
+	 * the submission / relationship encountered an error.
+	 */
+	I915_FENCE_FLAG_SKIP_PARALLEL,
 };
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 4b7fc4647e46..90546fa58fc1 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
 	return i915_active_add_request(&vma->active, rq);
 }
 
-int i915_vma_move_to_active(struct i915_vma *vma,
-			    struct i915_request *rq,
-			    unsigned int flags)
+int _i915_vma_move_to_active(struct i915_vma *vma,
+			     struct i915_request *rq,
+			     struct dma_fence *fence,
+			     unsigned int flags)
 {
 	struct drm_i915_gem_object *obj = vma->obj;
 	int err;
@@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 			intel_frontbuffer_put(front);
 		}
 
-		dma_resv_add_excl_fence(vma->resv, &rq->fence);
-		obj->write_domain = I915_GEM_DOMAIN_RENDER;
-		obj->read_domains = 0;
+		if (fence) {
+			dma_resv_add_excl_fence(vma->resv, fence);
+			obj->write_domain = I915_GEM_DOMAIN_RENDER;
+			obj->read_domains = 0;
+		}
 	} else {
 		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
 			err = dma_resv_reserve_shared(vma->resv, 1);
@@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 				return err;
 		}
 
-		dma_resv_add_shared_fence(vma->resv, &rq->fence);
-		obj->write_domain = 0;
+		if (fence) {
+			dma_resv_add_shared_fence(vma->resv, fence);
+			obj->write_domain = 0;
+		}
 	}
 
 	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index ed69f66c7ab0..648dbe744c96 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
 
 int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
 					   struct i915_request *rq);
-int __must_check i915_vma_move_to_active(struct i915_vma *vma,
-					 struct i915_request *rq,
-					 unsigned int flags);
+int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
+					  struct i915_request *rq,
+					  struct dma_fence *fence,
+					  unsigned int flags);
+static inline int __must_check
+i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
+			unsigned int flags)
+{
+	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
+}
 
 #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [Intel-gfx] [PATCH 21/26] drm/i915: Multi-BB execbuf
@ 2021-10-04 22:06   ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Allow multiple batch buffers to be submitted in a single execbuf IOCTL
after a context has been configured with the 'set_parallel' extension.
The number batches is implicit based on the contexts configuration.

This is implemented with a series of loops. First a loop is used to find
all the batches, a loop to pin all the HW contexts, a loop to create all
the requests, a loop to submit (emit BB start, etc...) all the requests,
a loop to tie the requests to the VMAs they touch, and finally a loop to
commit the requests to the backend.

A composite fence is also created for the generated requests to return
to the user and to stick in dma resv slots.

No behavior from the existing IOCTL should be changed aside from when
throttling because the ring for a context is full, wait on the request
while holding the object locks.

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: https://github.com/intel/media-driver/pull/1252

v2:
 (Matthew Brost)
  - Return proper error value if i915_request_create fails
v3:
 (John Harrison)
  - Add comment explaining create / add order loops + locking
  - Update commit message explaining different in IOCTL behavior
  - Line wrap some comments
  - eb_add_request returns void
  - Return -EINVAL rather triggering BUG_ON if cmd parser used
 (Checkpatch)
  - Check eb->batch_len[*current_batch]

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 793 ++++++++++++------
 drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  10 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
 drivers/gpu/drm/i915/i915_request.h           |   9 +
 drivers/gpu/drm/i915/i915_vma.c               |  21 +-
 drivers/gpu/drm/i915/i915_vma.h               |  13 +-
 7 files changed, 599 insertions(+), 257 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 2f2434b52317..5c7fb6f68bbb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -244,17 +244,25 @@ struct i915_execbuffer {
 	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
 	struct eb_vma *vma;
 
-	struct intel_engine_cs *engine; /** engine to queue the request to */
+	struct intel_gt *gt; /* gt for the execbuf */
 	struct intel_context *context; /* logical state for the request */
 	struct i915_gem_context *gem_context; /** caller's context */
 
-	struct i915_request *request; /** our request to build */
-	struct eb_vma *batch; /** identity of the batch obj/vma */
+	/** our requests to build */
+	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
+	/** identity of the batch obj/vma */
+	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1];
 	struct i915_vma *trampoline; /** trampoline used for chaining */
 
+	/** used for excl fence in dma_resv objects when > 1 BB submitted */
+	struct dma_fence *composite_fence;
+
 	/** actual size of execobj[] as we may extend it for the cmdparser */
 	unsigned int buffer_count;
 
+	/* number of batches in execbuf IOCTL */
+	unsigned int num_batches;
+
 	/** list of vma not yet bound during reservation phase */
 	struct list_head unbound;
 
@@ -281,7 +289,8 @@ struct i915_execbuffer {
 
 	u64 invalid_flags; /** Set of execobj.flags that are invalid */
 
-	u64 batch_len; /** Length of batch within object */
+	/** Length of batch within object */
+	u64 batch_len[MAX_ENGINE_INSTANCE + 1];
 	u32 batch_start_offset; /** Location within object of batch */
 	u32 batch_flags; /** Flags composed for emit_bb_start() */
 	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
@@ -299,14 +308,13 @@ struct i915_execbuffer {
 };
 
 static int eb_parse(struct i915_execbuffer *eb);
-static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
-					  bool throttle);
+static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
 static void eb_unpin_engine(struct i915_execbuffer *eb);
 
 static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
 {
-	return intel_engine_requires_cmd_parser(eb->engine) ||
-		(intel_engine_using_cmd_parser(eb->engine) &&
+	return intel_engine_requires_cmd_parser(eb->context->engine) ||
+		(intel_engine_using_cmd_parser(eb->context->engine) &&
 		 eb->args->batch_len);
 }
 
@@ -544,11 +552,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
 	return 0;
 }
 
-static void
+static inline bool
+is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
+{
+	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
+		buffer_idx < eb->num_batches :
+		buffer_idx >= eb->args->buffer_count - eb->num_batches;
+}
+
+static int
 eb_add_vma(struct i915_execbuffer *eb,
-	   unsigned int i, unsigned batch_idx,
+	   unsigned int *current_batch,
+	   unsigned int i,
 	   struct i915_vma *vma)
 {
+	struct drm_i915_private *i915 = eb->i915;
 	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
 	struct eb_vma *ev = &eb->vma[i];
 
@@ -575,15 +593,41 @@ eb_add_vma(struct i915_execbuffer *eb,
 	 * Note that actual hangs have only been observed on gen7, but for
 	 * paranoia do it everywhere.
 	 */
-	if (i == batch_idx) {
+	if (is_batch_buffer(eb, i)) {
 		if (entry->relocation_count &&
 		    !(ev->flags & EXEC_OBJECT_PINNED))
 			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
 		if (eb->reloc_cache.has_fence)
 			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
 
-		eb->batch = ev;
+		eb->batches[*current_batch] = ev;
+
+		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
+			drm_dbg(&i915->drm,
+				"Attempting to use self-modifying batch buffer\n");
+			return -EINVAL;
+		}
+
+		if (range_overflows_t(u64,
+				      eb->batch_start_offset,
+				      eb->args->batch_len,
+				      ev->vma->size)) {
+			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
+			return -EINVAL;
+		}
+
+		if (eb->args->batch_len == 0)
+			eb->batch_len[*current_batch] = ev->vma->size -
+				eb->batch_start_offset;
+		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
+			drm_dbg(&i915->drm, "Invalid batch length\n");
+			return -EINVAL;
+		}
+
+		++*current_batch;
 	}
+
+	return 0;
 }
 
 static inline int use_cpu_reloc(const struct reloc_cache *cache,
@@ -727,14 +771,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 	} while (1);
 }
 
-static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
-{
-	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
-		return 0;
-	else
-		return eb->buffer_count - 1;
-}
-
 static int eb_select_context(struct i915_execbuffer *eb)
 {
 	struct i915_gem_context *ctx;
@@ -839,9 +875,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
 
 static int eb_lookup_vmas(struct i915_execbuffer *eb)
 {
-	struct drm_i915_private *i915 = eb->i915;
-	unsigned int batch = eb_batch_index(eb);
-	unsigned int i;
+	unsigned int i, current_batch = 0;
 	int err = 0;
 
 	INIT_LIST_HEAD(&eb->relocs);
@@ -861,7 +895,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 			goto err;
 		}
 
-		eb_add_vma(eb, i, batch, vma);
+		err = eb_add_vma(eb, &current_batch, i, vma);
+		if (err)
+			return err;
 
 		if (i915_gem_object_is_userptr(vma->obj)) {
 			err = i915_gem_object_userptr_submit_init(vma->obj);
@@ -884,26 +920,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 		}
 	}
 
-	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
-		drm_dbg(&i915->drm,
-			"Attempting to use self-modifying batch buffer\n");
-		return -EINVAL;
-	}
-
-	if (range_overflows_t(u64,
-			      eb->batch_start_offset, eb->batch_len,
-			      eb->batch->vma->size)) {
-		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
-		return -EINVAL;
-	}
-
-	if (eb->batch_len == 0)
-		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
-	if (unlikely(eb->batch_len == 0)) { /* impossible! */
-		drm_dbg(&i915->drm, "Invalid batch length\n");
-		return -EINVAL;
-	}
-
 	return 0;
 
 err:
@@ -1636,8 +1652,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
 	return 0;
 }
 
-static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
-					   struct i915_request *rq)
+static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
 {
 	bool have_copy = false;
 	struct eb_vma *ev;
@@ -1653,21 +1668,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	eb_release_vmas(eb, false);
 	i915_gem_ww_ctx_fini(&eb->ww);
 
-	if (rq) {
-		/* nonblocking is always false */
-		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
-				      MAX_SCHEDULE_TIMEOUT) < 0) {
-			i915_request_put(rq);
-			rq = NULL;
-
-			err = -EINTR;
-			goto err_relock;
-		}
-
-		i915_request_put(rq);
-		rq = NULL;
-	}
-
 	/*
 	 * We take 3 passes through the slowpatch.
 	 *
@@ -1694,28 +1694,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	if (!err)
 		err = eb_reinit_userptr(eb);
 
-err_relock:
 	i915_gem_ww_ctx_init(&eb->ww, true);
 	if (err)
 		goto out;
 
 	/* reacquire the objects */
 repeat_validate:
-	rq = eb_pin_engine(eb, false);
-	if (IS_ERR(rq)) {
-		err = PTR_ERR(rq);
-		rq = NULL;
+	err = eb_pin_engine(eb, false);
+	if (err)
 		goto err;
-	}
-
-	/* We didn't throttle, should be NULL */
-	GEM_WARN_ON(rq);
 
 	err = eb_validate_vmas(eb);
 	if (err)
 		goto err;
 
-	GEM_BUG_ON(!eb->batch);
+	GEM_BUG_ON(!eb->batches[0]);
 
 	list_for_each_entry(ev, &eb->relocs, reloc_link) {
 		if (!have_copy) {
@@ -1779,46 +1772,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 		}
 	}
 
-	if (rq)
-		i915_request_put(rq);
-
 	return err;
 }
 
 static int eb_relocate_parse(struct i915_execbuffer *eb)
 {
 	int err;
-	struct i915_request *rq = NULL;
 	bool throttle = true;
 
 retry:
-	rq = eb_pin_engine(eb, throttle);
-	if (IS_ERR(rq)) {
-		err = PTR_ERR(rq);
-		rq = NULL;
+	err = eb_pin_engine(eb, throttle);
+	if (err) {
 		if (err != -EDEADLK)
 			return err;
 
 		goto err;
 	}
 
-	if (rq) {
-		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
-
-		/* Need to drop all locks now for throttling, take slowpath */
-		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
-		if (err == -ETIME) {
-			if (nonblock) {
-				err = -EWOULDBLOCK;
-				i915_request_put(rq);
-				goto err;
-			}
-			goto slow;
-		}
-		i915_request_put(rq);
-		rq = NULL;
-	}
-
 	/* only throttle once, even if we didn't need to throttle */
 	throttle = false;
 
@@ -1858,7 +1828,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 
 slow:
-	err = eb_relocate_parse_slow(eb, rq);
+	err = eb_relocate_parse_slow(eb);
 	if (err)
 		/*
 		 * If the user expects the execobject.offset and
@@ -1872,11 +1842,40 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
+/*
+ * Two helper loops define the order in which requests / batches are created
+ * and added to the backend. Requests are created in order from the parent to
+ * the last child. Requests are added in the reverse order, from the last child
+ * to the parent. This is done for locking reasons: the timeline lock is
+ * acquired during request creation and released when the request is added to
+ * the backend. To make lockdep happy (see intel_context_timeline_lock) this
+ * must be the ordering.
+ */
+#define for_each_batch_create_order(_eb, _i) \
+	for (_i = 0; _i < (_eb)->num_batches; ++_i)
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
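/*
 * Editorial illustration, not part of the patch: with num_batches == 3
 * (a parent plus two children), the timeline mutexes taken by
 * i915_request_create() and dropped by i915_request_add() nest as:
 *
 *   create order: lock(parent->tl), lock(child0->tl), lock(child1->tl)
 *   add order:    unlock(child1->tl), unlock(child0->tl), unlock(parent->tl)
 *
 * i.e. a strict LIFO release, which matches the nested lock subclasses
 * (0 for the parent, child_index + 1 for each child) annotated in
 * intel_context_timeline_lock().
 */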
+
+static struct i915_request *
+eb_find_first_request_added(struct i915_execbuffer *eb)
+{
+	int i;
+
+	for_each_batch_add_order(eb, i)
+		if (eb->requests[i])
+			return eb->requests[i];
+
+	GEM_BUG_ON("Request not found");
+
+	return NULL;
+}
+
 static int eb_move_to_gpu(struct i915_execbuffer *eb)
 {
 	const unsigned int count = eb->buffer_count;
 	unsigned int i = count;
-	int err = 0;
+	int err = 0, j;
 
 	while (i--) {
 		struct eb_vma *ev = &eb->vma[i];
@@ -1889,11 +1888,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		if (flags & EXEC_OBJECT_CAPTURE) {
 			struct i915_capture_list *capture;
 
-			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
-			if (capture) {
-				capture->next = eb->request->capture_list;
-				capture->vma = vma;
-				eb->request->capture_list = capture;
+			for_each_batch_create_order(eb, j) {
+				if (!eb->requests[j])
+					break;
+
+				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
+				if (capture) {
+					capture->next =
+						eb->requests[j]->capture_list;
+					capture->vma = vma;
+					eb->requests[j]->capture_list = capture;
+				}
 			}
 		}
 
@@ -1914,14 +1919,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 				flags &= ~EXEC_OBJECT_ASYNC;
 		}
 
+		/* We only need to await on the first request */
 		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
 			err = i915_request_await_object
-				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
+				(eb_find_first_request_added(eb), obj,
+				 flags & EXEC_OBJECT_WRITE);
 		}
 
-		if (err == 0)
-			err = i915_vma_move_to_active(vma, eb->request,
-						      flags | __EXEC_OBJECT_NO_RESERVE);
+		for_each_batch_add_order(eb, j) {
+			if (err)
+				break;
+			if (!eb->requests[j])
+				continue;
+
+			err = _i915_vma_move_to_active(vma, eb->requests[j],
+						       j ? NULL :
+						       eb->composite_fence ?
+						       eb->composite_fence :
+						       &eb->requests[j]->fence,
+						       flags | __EXEC_OBJECT_NO_RESERVE);
+		}
 	}
 
 #ifdef CONFIG_MMU_NOTIFIER
@@ -1952,11 +1969,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		goto err_skip;
 
 	/* Unconditionally flush any chipset caches (for streaming writes). */
-	intel_gt_chipset_flush(eb->engine->gt);
+	intel_gt_chipset_flush(eb->gt);
 	return 0;
 
 err_skip:
-	i915_request_set_error_once(eb->request, err);
+	for_each_batch_create_order(eb, j) {
+		if (!eb->requests[j])
+			break;
+
+		i915_request_set_error_once(eb->requests[j], err);
+	}
 	return err;
 }
 
@@ -2051,14 +2073,17 @@ static int eb_parse(struct i915_execbuffer *eb)
 	int err;
 
 	if (!eb_use_cmdparser(eb)) {
-		batch = eb_dispatch_secure(eb, eb->batch->vma);
+		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
 		if (IS_ERR(batch))
 			return PTR_ERR(batch);
 
 		goto secure_batch;
 	}
 
-	len = eb->batch_len;
+	if (intel_context_is_parallel(eb->context))
+		return -EINVAL;
+
+	len = eb->batch_len[0];
 	if (!CMDPARSER_USES_GGTT(eb->i915)) {
 		/*
 		 * ppGTT backed shadow buffers must be mapped RO, to prevent
@@ -2072,11 +2097,11 @@ static int eb_parse(struct i915_execbuffer *eb)
 	} else {
 		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
 	}
-	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
+	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
 		return -EINVAL;
 
 	if (!pool) {
-		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
+		pool = intel_gt_get_buffer_pool(eb->gt, len,
 						I915_MAP_WB);
 		if (IS_ERR(pool))
 			return PTR_ERR(pool);
@@ -2101,7 +2126,7 @@ static int eb_parse(struct i915_execbuffer *eb)
 		trampoline = shadow;
 
 		shadow = shadow_batch_pin(eb, pool->obj,
-					  &eb->engine->gt->ggtt->vm,
+					  &eb->gt->ggtt->vm,
 					  PIN_GLOBAL);
 		if (IS_ERR(shadow)) {
 			err = PTR_ERR(shadow);
@@ -2123,26 +2148,29 @@ static int eb_parse(struct i915_execbuffer *eb)
 	if (err)
 		goto err_trampoline;
 
-	err = intel_engine_cmd_parser(eb->engine,
-				      eb->batch->vma,
+	err = intel_engine_cmd_parser(eb->context->engine,
+				      eb->batches[0]->vma,
 				      eb->batch_start_offset,
-				      eb->batch_len,
+				      eb->batch_len[0],
 				      shadow, trampoline);
 	if (err)
 		goto err_unpin_batch;
 
-	eb->batch = &eb->vma[eb->buffer_count++];
-	eb->batch->vma = i915_vma_get(shadow);
-	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
+	eb->batches[0] = &eb->vma[eb->buffer_count++];
+	eb->batches[0]->vma = i915_vma_get(shadow);
+	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
 
 	eb->trampoline = trampoline;
 	eb->batch_start_offset = 0;
 
 secure_batch:
 	if (batch) {
-		eb->batch = &eb->vma[eb->buffer_count++];
-		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
-		eb->batch->vma = i915_vma_get(batch);
+		if (intel_context_is_parallel(eb->context))
+			return -EINVAL;
+
+		eb->batches[0] = &eb->vma[eb->buffer_count++];
+		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
+		eb->batches[0]->vma = i915_vma_get(batch);
 	}
 	return 0;
 
@@ -2158,19 +2186,18 @@ static int eb_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
-static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
+static int eb_request_submit(struct i915_execbuffer *eb,
+			     struct i915_request *rq,
+			     struct i915_vma *batch,
+			     u64 batch_len)
 {
 	int err;
 
-	if (intel_context_nopreempt(eb->context))
-		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
-
-	err = eb_move_to_gpu(eb);
-	if (err)
-		return err;
+	if (intel_context_nopreempt(rq->context))
+		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
 
 	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
-		err = i915_reset_gen7_sol_offsets(eb->request);
+		err = i915_reset_gen7_sol_offsets(rq);
 		if (err)
 			return err;
 	}
@@ -2181,26 +2208,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	 * allows us to determine if the batch is still waiting on the GPU
 	 * or actually running by checking the breadcrumb.
 	 */
-	if (eb->engine->emit_init_breadcrumb) {
-		err = eb->engine->emit_init_breadcrumb(eb->request);
+	if (rq->context->engine->emit_init_breadcrumb) {
+		err = rq->context->engine->emit_init_breadcrumb(rq);
 		if (err)
 			return err;
 	}
 
-	err = eb->engine->emit_bb_start(eb->request,
-					batch->node.start +
-					eb->batch_start_offset,
-					eb->batch_len,
-					eb->batch_flags);
+	err = rq->context->engine->emit_bb_start(rq,
+						 batch->node.start +
+						 eb->batch_start_offset,
+						 batch_len,
+						 eb->batch_flags);
 	if (err)
 		return err;
 
 	if (eb->trampoline) {
+		GEM_BUG_ON(intel_context_is_parallel(rq->context));
 		GEM_BUG_ON(eb->batch_start_offset);
-		err = eb->engine->emit_bb_start(eb->request,
-						eb->trampoline->node.start +
-						eb->batch_len,
-						0, 0);
+		err = rq->context->engine->emit_bb_start(rq,
+							 eb->trampoline->node.start +
+							 batch_len, 0, 0);
 		if (err)
 			return err;
 	}
@@ -2208,6 +2235,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	return 0;
 }
 
+static int eb_submit(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+	int err;
+
+	err = eb_move_to_gpu(eb);
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
+		if (!err)
+			err = eb_request_submit(eb, eb->requests[i],
+						eb->batches[i]->vma,
+						eb->batch_len[i]);
+	}
+
+	return err;
+}
+
 static int num_vcs_engines(const struct drm_i915_private *i915)
 {
 	return hweight_long(VDBOX_MASK(&i915->gt));
@@ -2273,26 +2321,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
 	return i915_request_get(rq);
 }
 
-static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
+static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
+			   bool throttle)
 {
-	struct intel_context *ce = eb->context;
 	struct intel_timeline *tl;
-	struct i915_request *rq = NULL;
-	int err;
-
-	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
-
-	if (unlikely(intel_context_is_banned(ce)))
-		return ERR_PTR(-EIO);
-
-	/*
-	 * Pinning the contexts may generate requests in order to acquire
-	 * GGTT space, so do this first before we reserve a seqno for
-	 * ourselves.
-	 */
-	err = intel_context_pin_ww(ce, &eb->ww);
-	if (err)
-		return ERR_PTR(err);
+	struct i915_request *rq;
 
 	/*
 	 * Take a local wakeref for preparing to dispatch the execbuf as
@@ -2303,33 +2336,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
 	 * taken on the engine, and the parent device.
 	 */
 	tl = intel_context_timeline_lock(ce);
-	if (IS_ERR(tl)) {
-		intel_context_unpin(ce);
-		return ERR_CAST(tl);
-	}
+	if (IS_ERR(tl))
+		return PTR_ERR(tl);
 
 	intel_context_enter(ce);
 	if (throttle)
 		rq = eb_throttle(eb, ce);
 	intel_context_timeline_unlock(tl);
 
+	if (rq) {
+		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
+		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
+
+		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
+				      timeout) < 0) {
+			i915_request_put(rq);
+
+			tl = intel_context_timeline_lock(ce);
+			intel_context_exit(ce);
+			intel_context_timeline_unlock(tl);
+
+			if (nonblock)
+				return -EWOULDBLOCK;
+			else
+				return -EINTR;
+		}
+		i915_request_put(rq);
+	}
+
+	return 0;
+}
+
+static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
+{
+	struct intel_context *ce = eb->context, *child;
+	int err;
+	int i = 0, j = 0;
+
+	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
+
+	if (unlikely(intel_context_is_banned(ce)))
+		return -EIO;
+
+	/*
+	 * Pinning the contexts may generate requests in order to acquire
+	 * GGTT space, so do this first before we reserve a seqno for
+	 * ourselves.
+	 */
+	err = intel_context_pin_ww(ce, &eb->ww);
+	if (err)
+		return err;
+	for_each_child(ce, child) {
+		err = intel_context_pin_ww(child, &eb->ww);
+		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
+	}
+
+	for_each_child(ce, child) {
+		err = eb_pin_timeline(eb, child, throttle);
+		if (err)
+			goto unwind;
+		++i;
+	}
+	err = eb_pin_timeline(eb, ce, throttle);
+	if (err)
+		goto unwind;
+
 	eb->args->flags |= __EXEC_ENGINE_PINNED;
-	return rq;
+	return 0;
+
+unwind:
+	for_each_child(ce, child) {
+		if (j++ < i) {
+			mutex_lock(&child->timeline->mutex);
+			intel_context_exit(child);
+			mutex_unlock(&child->timeline->mutex);
+		}
+	}
+	for_each_child(ce, child)
+		intel_context_unpin(child);
+	intel_context_unpin(ce);
+	return err;
 }
 
 static void eb_unpin_engine(struct i915_execbuffer *eb)
 {
-	struct intel_context *ce = eb->context;
-	struct intel_timeline *tl = ce->timeline;
+	struct intel_context *ce = eb->context, *child;
 
 	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
 		return;
 
 	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
 
-	mutex_lock(&tl->mutex);
+	for_each_child(ce, child) {
+		mutex_lock(&child->timeline->mutex);
+		intel_context_exit(child);
+		mutex_unlock(&child->timeline->mutex);
+
+		intel_context_unpin(child);
+	}
+
+	mutex_lock(&ce->timeline->mutex);
 	intel_context_exit(ce);
-	mutex_unlock(&tl->mutex);
+	mutex_unlock(&ce->timeline->mutex);
 
 	intel_context_unpin(ce);
 }
@@ -2380,7 +2488,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
 static int
 eb_select_engine(struct i915_execbuffer *eb)
 {
-	struct intel_context *ce;
+	struct intel_context *ce, *child;
 	unsigned int idx;
 	int err;
 
@@ -2393,6 +2501,20 @@ eb_select_engine(struct i915_execbuffer *eb)
 	if (IS_ERR(ce))
 		return PTR_ERR(ce);
 
+	if (intel_context_is_parallel(ce)) {
+		if (eb->buffer_count < ce->parallel.number_children + 1) {
+			intel_context_put(ce);
+			return -EINVAL;
+		}
+		if (eb->batch_start_offset || eb->args->batch_len) {
+			intel_context_put(ce);
+			return -EINVAL;
+		}
+	}
+	eb->num_batches = ce->parallel.number_children + 1;
+
+	for_each_child(ce, child)
+		intel_context_get(child);
 	intel_gt_pm_get(ce->engine->gt);
 
 	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
@@ -2400,6 +2522,13 @@ eb_select_engine(struct i915_execbuffer *eb)
 		if (err)
 			goto err;
 	}
+	for_each_child(ce, child) {
+		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
+			err = intel_context_alloc_state(child);
+			if (err)
+				goto err;
+		}
+	}
 
 	/*
 	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
@@ -2410,7 +2539,7 @@ eb_select_engine(struct i915_execbuffer *eb)
 		goto err;
 
 	eb->context = ce;
-	eb->engine = ce->engine;
+	eb->gt = ce->engine->gt;
 
 	/*
 	 * Make sure engine pool stays alive even if we call intel_context_put
@@ -2421,6 +2550,8 @@ eb_select_engine(struct i915_execbuffer *eb)
 
 err:
 	intel_gt_pm_put(ce->engine->gt);
+	for_each_child(ce, child)
+		intel_context_put(child);
 	intel_context_put(ce);
 	return err;
 }
@@ -2428,7 +2559,11 @@ eb_select_engine(struct i915_execbuffer *eb)
 static void
 eb_put_engine(struct i915_execbuffer *eb)
 {
-	intel_gt_pm_put(eb->engine->gt);
+	struct intel_context *child;
+
+	intel_gt_pm_put(eb->gt);
+	for_each_child(eb->context, child)
+		intel_context_put(child);
 	intel_context_put(eb->context);
 }
 
@@ -2651,7 +2786,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
 }
 
 static int
-await_fence_array(struct i915_execbuffer *eb)
+await_fence_array(struct i915_execbuffer *eb,
+		  struct i915_request *rq)
 {
 	unsigned int n;
 	int err;
@@ -2665,8 +2801,7 @@ await_fence_array(struct i915_execbuffer *eb)
 		if (!eb->fences[n].dma_fence)
 			continue;
 
-		err = i915_request_await_dma_fence(eb->request,
-						   eb->fences[n].dma_fence);
+		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
 		if (err < 0)
 			return err;
 	}
@@ -2674,9 +2809,9 @@ await_fence_array(struct i915_execbuffer *eb)
 	return 0;
 }
 
-static void signal_fence_array(const struct i915_execbuffer *eb)
+static void signal_fence_array(const struct i915_execbuffer *eb,
+			       struct dma_fence * const fence)
 {
-	struct dma_fence * const fence = &eb->request->fence;
 	unsigned int n;
 
 	for (n = 0; n < eb->num_fences; n++) {
@@ -2724,9 +2859,8 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
 			break;
 }
 
-static int eb_request_add(struct i915_execbuffer *eb, int err)
+static void eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
 {
-	struct i915_request *rq = eb->request;
 	struct intel_timeline * const tl = i915_request_timeline(rq);
 	struct i915_sched_attr attr = {};
 	struct i915_request *prev;
@@ -2741,11 +2875,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
 	/* Check that the context wasn't destroyed before submission */
 	if (likely(!intel_context_is_closed(eb->context))) {
 		attr = eb->gem_context->sched;
-	} else {
-		/* Serialise with context_close via the add_to_timeline */
-		i915_request_set_error_once(rq, -ENOENT);
-		__i915_request_skip(rq);
-		err = -ENOENT; /* override any transient errors */
 	}
 
 	__i915_request_queue(rq, &attr);
@@ -2755,6 +2884,42 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
 		retire_requests(tl, prev);
 
 	mutex_unlock(&tl->mutex);
+}
+
+static int eb_requests_add(struct i915_execbuffer *eb, int err)
+{
+	int i;
+
+	/*
+	 * We iterate in reverse order of creation so the timeline mutexes are
+	 * released in the reverse order to which they were acquired.
+	 */
+	for_each_batch_add_order(eb, i) {
+		struct i915_request *rq = eb->requests[i];
+
+		if (!rq)
+			continue;
+
+		if (unlikely(intel_context_is_closed(eb->context))) {
+			/* Serialise with context_close via the add_to_timeline */
+			i915_request_set_error_once(rq, -ENOENT);
+			__i915_request_skip(rq);
+			err = -ENOENT; /* override any transient errors */
+		}
+
+		if (intel_context_is_parallel(eb->context)) {
+			if (err) {
+				__i915_request_skip(rq);
+				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
+					&rq->fence.flags);
+			}
+			if (i == 0)
+				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
+					&rq->fence.flags);
+		}
+
+		eb_request_add(eb, rq);
+	}
 
 	return err;
 }
@@ -2785,6 +2950,182 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
 				    eb);
 }
 
+static void eb_requests_get(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		i915_request_get(eb->requests[i]);
+	}
+}
+
+static void eb_requests_put(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		i915_request_put(eb->requests[i]);
+	}
+}
+
+static struct sync_file *
+eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	struct dma_fence_array *fence_array;
+	struct dma_fence **fences;
+	unsigned int i;
+
+	GEM_BUG_ON(!intel_context_is_parent(eb->context));
+
+	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
+	if (!fences)
+		return ERR_PTR(-ENOMEM);
+
+	for_each_batch_create_order(eb, i)
+		fences[i] = &eb->requests[i]->fence;
+
+	fence_array = dma_fence_array_create(eb->num_batches,
+					     fences,
+					     eb->context->parallel.fence_context,
+					     eb->context->parallel.seqno,
+					     false);
+	if (!fence_array) {
+		kfree(fences);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* Move ownership to the dma_fence_array created above */
+	for_each_batch_create_order(eb, i)
+		dma_fence_get(fences[i]);
+
+	if (out_fence_fd != -1) {
+		out_fence = sync_file_create(&fence_array->base);
+		/* sync_file now owns the fence_array, drop creation ref */
+		dma_fence_put(&fence_array->base);
+		if (!out_fence)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	eb->composite_fence = &fence_array->base;
+
+	return out_fence;
+}
+
+static struct sync_file *
+eb_fences_add(struct i915_execbuffer *eb, struct i915_request *rq,
+	      struct dma_fence *in_fence, int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	int err;
+
+	if (unlikely(eb->gem_context->syncobj)) {
+		struct dma_fence *fence;
+
+		fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
+		err = i915_request_await_dma_fence(rq, fence);
+		dma_fence_put(fence);
+		if (err)
+			return ERR_PTR(err);
+	}
+
+	if (in_fence) {
+		if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
+			err = i915_request_await_execution(rq, in_fence);
+		else
+			err = i915_request_await_dma_fence(rq, in_fence);
+		if (err < 0)
+			return ERR_PTR(err);
+	}
+
+	if (eb->fences) {
+		err = await_fence_array(eb, rq);
+		if (err)
+			return ERR_PTR(err);
+	}
+
+	if (intel_context_is_parallel(eb->context)) {
+		out_fence = eb_composite_fence_create(eb, out_fence_fd);
+		if (IS_ERR(out_fence))
+			return ERR_PTR(-ENOMEM);
+	} else if (out_fence_fd != -1) {
+		out_fence = sync_file_create(&rq->fence);
+		if (!out_fence)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	return out_fence;
+}
+
+static struct intel_context *
+eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
+{
+	struct intel_context *child;
+
+	if (likely(context_number == 0))
+		return eb->context;
+
+	for_each_child(eb->context, child)
+		if (!--context_number)
+			return child;
+
+	GEM_BUG_ON("Context not found");
+
+	return NULL;
+}
+
+static struct sync_file *
+eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
+		   int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		/* Allocate a request for this batch buffer nice and early. */
+		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
+		if (IS_ERR(eb->requests[i])) {
+			out_fence = ERR_PTR(PTR_ERR(eb->requests[i]));
+			eb->requests[i] = NULL;
+			return out_fence;
+		}
+
+		/*
+		 * Only the first request added (committed to backend) has to
+		 * take the in fences into account as all subsequent requests
+		 * will have fences inserted in between them.
+		 */
+		if (i + 1 == eb->num_batches) {
+			out_fence = eb_fences_add(eb, eb->requests[i],
+						  in_fence, out_fence_fd);
+			if (IS_ERR(out_fence))
+				return out_fence;
+		}
+
+		/*
+		 * Whilst this request exists, batch_obj will be on the
+		 * active_list, and so will hold the active reference. Only when
+		 * this request is retired will the batch_obj be moved onto
+		 * the inactive_list and lose its active reference. Hence we do
+		 * not need to explicitly hold another reference here.
+		 */
+		eb->requests[i]->batch = eb->batches[i]->vma;
+		if (eb->batch_pool) {
+			GEM_BUG_ON(intel_context_is_parallel(eb->context));
+			intel_gt_buffer_pool_mark_active(eb->batch_pool,
+							 eb->requests[i]);
+		}
+	}
+
+	return out_fence;
+}
+
 static int
 i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_file *file,
@@ -2795,7 +3136,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	struct i915_execbuffer eb;
 	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
-	struct i915_vma *batch;
 	int out_fence_fd = -1;
 	int err;
 
@@ -2819,12 +3159,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	eb.buffer_count = args->buffer_count;
 	eb.batch_start_offset = args->batch_start_offset;
-	eb.batch_len = args->batch_len;
 	eb.trampoline = NULL;
 
 	eb.fences = NULL;
 	eb.num_fences = 0;
 
+	memset(eb.requests, 0, sizeof(struct i915_request *) *
+	       ARRAY_SIZE(eb.requests));
+	eb.composite_fence = NULL;
+
 	eb.batch_flags = 0;
 	if (args->flags & I915_EXEC_SECURE) {
 		if (GRAPHICS_VER(i915) >= 11)
@@ -2908,70 +3251,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	ww_acquire_done(&eb.ww.ctx);
 
-	batch = eb.batch->vma;
-
-	/* Allocate a request for this batch buffer nice and early. */
-	eb.request = i915_request_create(eb.context);
-	if (IS_ERR(eb.request)) {
-		err = PTR_ERR(eb.request);
-		goto err_vma;
-	}
-
-	if (unlikely(eb.gem_context->syncobj)) {
-		struct dma_fence *fence;
-
-		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
-		err = i915_request_await_dma_fence(eb.request, fence);
-		dma_fence_put(fence);
-		if (err)
-			goto err_ext;
-	}
-
-	if (in_fence) {
-		if (args->flags & I915_EXEC_FENCE_SUBMIT)
-			err = i915_request_await_execution(eb.request,
-							   in_fence);
-		else
-			err = i915_request_await_dma_fence(eb.request,
-							   in_fence);
-		if (err < 0)
-			goto err_request;
-	}
-
-	if (eb.fences) {
-		err = await_fence_array(&eb);
-		if (err)
+	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
+	if (IS_ERR(out_fence)) {
+		err = PTR_ERR(out_fence);
+		if (eb.requests[0])
 			goto err_request;
+		else
+			goto err_vma;
 	}
 
-	if (out_fence_fd != -1) {
-		out_fence = sync_file_create(&eb.request->fence);
-		if (!out_fence) {
-			err = -ENOMEM;
-			goto err_request;
-		}
-	}
-
-	/*
-	 * Whilst this request exists, batch_obj will be on the
-	 * active_list, and so will hold the active reference. Only when this
-	 * request is retired will the the batch_obj be moved onto the
-	 * inactive_list and lose its active reference. Hence we do not need
-	 * to explicitly hold another reference here.
-	 */
-	eb.request->batch = batch;
-	if (eb.batch_pool)
-		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
-
-	trace_i915_request_queue(eb.request, eb.batch_flags);
-	err = eb_submit(&eb, batch);
+	err = eb_submit(&eb);
 
 err_request:
-	i915_request_get(eb.request);
-	err = eb_request_add(&eb, err);
+	eb_requests_get(&eb);
+	err = eb_requests_add(&eb, err);
 
 	if (eb.fences)
-		signal_fence_array(&eb);
+		signal_fence_array(&eb, eb.composite_fence ?
+				   eb.composite_fence :
+				   &eb.requests[0]->fence);
 
 	if (out_fence) {
 		if (err == 0) {
@@ -2986,10 +3284,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	if (unlikely(eb.gem_context->syncobj)) {
 		drm_syncobj_replace_fence(eb.gem_context->syncobj,
-					  &eb.request->fence);
+					  eb.composite_fence ?
+					  eb.composite_fence :
+					  &eb.requests[0]->fence);
 	}
 
-	i915_request_put(eb.request);
+	if (!out_fence && eb.composite_fence)
+		dma_fence_put(eb.composite_fence);
+
+	eb_requests_put(&eb);
 
 err_vma:
 	eb_release_vmas(&eb, true);
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 1bc705f98e2a..1781419fa105 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -239,7 +239,13 @@ intel_context_timeline_lock(struct intel_context *ce)
 	struct intel_timeline *tl = ce->timeline;
 	int err;
 
-	err = mutex_lock_interruptible(&tl->mutex);
+	if (intel_context_is_parent(ce))
+		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
+	else if (intel_context_is_child(ce))
+		err = mutex_lock_interruptible_nested(&tl->mutex,
+						      ce->parallel.child_index + 1);
+	else
+		err = mutex_lock_interruptible(&tl->mutex);
 	if (err)
 		return ERR_PTR(err);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 95a5b94b4ece..9e0177dc5484 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -248,6 +248,16 @@ struct intel_context {
 		 * context
 		 */
 		struct i915_request *last_rq;
+		/**
+		 * @fence_context: fence context for the composite fence when
+		 * doing parallel submission
+		 */
+		u64 fence_context;
+		/**
+		 * @seqno: seqno for composite fence when doing parallel
+		 * submission
+		 */
+		u32 seqno;
 		/** @number_children: number of children if parent */
 		u8 number_children;
 		/** @child_index: index into child_list if child */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f28e36aa77c2..83b0d2a114af 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3094,6 +3094,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
 		}
 	}
 
+	parent->parallel.fence_context = dma_fence_context_alloc(1);
+
 	parent->engine->emit_bb_start =
 		emit_bb_start_parent_no_preempt_mid_batch;
 	parent->engine->emit_fini_breadcrumb =
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 8950785e55d6..24db8459376b 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -147,6 +147,15 @@ enum {
 	 * tail.
 	 */
 	I915_FENCE_FLAG_SUBMIT_PARALLEL,
+
+	/*
+	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
+	 * parent-child relationship (parallel submission, multi-lrc) that
+	 * hit an error while generating requests in the execbuf IOCTL.
+	 * Indicates this request should be skipped as another request in
+	 * the submission / relationship encountered an error.
+	 */
+	I915_FENCE_FLAG_SKIP_PARALLEL,
 };
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 4b7fc4647e46..90546fa58fc1 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
 	return i915_active_add_request(&vma->active, rq);
 }
 
-int i915_vma_move_to_active(struct i915_vma *vma,
-			    struct i915_request *rq,
-			    unsigned int flags)
+int _i915_vma_move_to_active(struct i915_vma *vma,
+			     struct i915_request *rq,
+			     struct dma_fence *fence,
+			     unsigned int flags)
 {
 	struct drm_i915_gem_object *obj = vma->obj;
 	int err;
@@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 			intel_frontbuffer_put(front);
 		}
 
-		dma_resv_add_excl_fence(vma->resv, &rq->fence);
-		obj->write_domain = I915_GEM_DOMAIN_RENDER;
-		obj->read_domains = 0;
+		if (fence) {
+			dma_resv_add_excl_fence(vma->resv, fence);
+			obj->write_domain = I915_GEM_DOMAIN_RENDER;
+			obj->read_domains = 0;
+		}
 	} else {
 		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
 			err = dma_resv_reserve_shared(vma->resv, 1);
@@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 				return err;
 		}
 
-		dma_resv_add_shared_fence(vma->resv, &rq->fence);
-		obj->write_domain = 0;
+		if (fence) {
+			dma_resv_add_shared_fence(vma->resv, fence);
+			obj->write_domain = 0;
+		}
 	}
 
 	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index ed69f66c7ab0..648dbe744c96 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
 
 int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
 					   struct i915_request *rq);
-int __must_check i915_vma_move_to_active(struct i915_vma *vma,
-					 struct i915_request *rq,
-					 unsigned int flags);
+int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
+					  struct i915_request *rq,
+					  struct dma_fence *fence,
+					  unsigned int flags);
+static inline int __must_check
+i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
+			unsigned int flags)
+{
+	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
+}
 
 #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 22/26] drm/i915/guc: Handle errors in multi-lrc requests
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

If an error occurs in the front end while multi-lrc requests are being
generated we need to skip these in the backend, but we still need to
emit the breadcrumb seqno. An issue arises because with multi-lrc
breadcrumbs there is a handshake between the parent and children to make
forward progress. If not all of the requests are present this handshake
doesn't work. To work around this, if a multi-lrc request has an error we
skip the handshake but still emit the breadcrumb seqno.
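
For reference, the handshake being skipped looks roughly like this
(helper names as used in the code below; the exact MI command sequences
are elided):

  parent fini breadcrumb:
    for each child i:
      wait on get_children_join_addr(ce, i)     /* wait for child i */
    write get_children_go_addr(ce)              /* release the children */
    write rq->fence.seqno                       /* fini breadcrumb */

  child fini breadcrumb:
    write get_children_join_addr(parent, child_index)  /* signal parent */
    wait on get_children_go_addr(parent)               /* wait for go */
    write rq->fence.seqno                              /* fini breadcrumb */

If any request in the relationship is missing, its writes never land,
the semaphore waits never complete and the context would hang, hence
only the final seqno write is kept when skipping.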

v2:
 (John Harrison)
  - Add comment explaining the skipping of the handshake logic
  - Fix typos in the commit message

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 71 ++++++++++++++++++-
 1 file changed, 68 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 83b0d2a114af..05e8b199e4ce 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4072,8 +4072,8 @@ static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
 }
 
 static u32 *
-emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
-						 u32 *cs)
+__emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						   u32 *cs)
 {
 	struct intel_context *ce = rq->context;
 	u8 i;
@@ -4101,6 +4101,46 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
 				  get_children_go_addr(ce),
 				  0);
 
+	return cs;
+}
+
+/*
+ * If this is true, a submission of multi-lrc requests had an error and the
+ * requests need to be skipped. The front end (execbuf IOCTL) should've called
+ * i915_request_skip which squashes the BB but we still need to emit the fini
+ * breadcrumb seqno write. At this point we don't know how many of the
+ * requests in the multi-lrc submission were generated so we can't do the
+ * handshake between the parent and children (e.g. if 4 requests should be
+ * generated but the 2nd hit an error, only 1 would be seen by the GuC
+ * backend). Simply skip the handshake, but still emit the breadcrumb seqno,
+ * if an error has occurred on any of the requests in the submission /
+ * relationship.
+ */
+static inline bool skip_handshake(struct i915_request *rq)
+{
+	return test_bit(I915_FENCE_FLAG_SKIP_PARALLEL, &rq->fence.flags);
+}
+
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	if (unlikely(skip_handshake(rq))) {
+		/*
+		 * NOP everything in
+		 * __emit_fini_breadcrumb_parent_no_preempt_mid_batch, the -6
+		 * comes from the length of the emission below.
+		 */
+		memset(cs, 0, sizeof(u32) *
+		       (ce->engine->emit_fini_breadcrumb_dw - 6));
+		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
+	} else {
+		cs = __emit_fini_breadcrumb_parent_no_preempt_mid_batch(rq, cs);
+	}
+
 	/* Emit fini breadcrumb */
 	cs = gen8_emit_ggtt_write(cs,
 				  rq->fence.seqno,
@@ -4117,7 +4157,8 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
 }
 
 static u32 *
-emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
+__emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
+						  u32 *cs)
 {
 	struct intel_context *ce = rq->context;
 	struct intel_context *parent = intel_context_to_parent(ce);
@@ -4144,6 +4185,30 @@ emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs
 	*cs++ = get_children_go_addr(parent);
 	*cs++ = 0;
 
+	return cs;
+}
+
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
+						u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	if (unlikely(skip_handshake(rq))) {
+		/*
+		 * NOP everything in
+		 * __emit_fini_breadcrumb_child_no_preempt_mid_batch, the -6
+		 * comes from the length of the emission below.
+		 */
+		memset(cs, 0, sizeof(u32) *
+		       (ce->engine->emit_fini_breadcrumb_dw - 6));
+		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
+	} else {
+		cs = __emit_fini_breadcrumb_child_no_preempt_mid_batch(rq, cs);
+	}
+
 	/* Emit fini breadcrumb */
 	cs = gen8_emit_ggtt_write(cs,
 				  rq->fence.seqno,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 23/26] drm/i915: Make request conflict tracking understand parallel submits
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

If an object in the excl or shared slot is a composite fence from a
parallel submit and the current request in the conflict tracking is from
the same parallel context, there is no need to enforce ordering as the
ordering is already implicit. Make the request conflict tracking
understand this by comparing the requests' parent contexts and skipping
the conflict insertion if they match.
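
As an illustration, consider two back-to-back execbufs A and B on the
same 2-wide parallel context which both write object X (a sketch of the
check added below, not new uAPI):

  A: X's excl slot = composite fence { A.parent, A.child0 }
  B: for each fence F in X's slots:
         if (request_to_parent(B.rq) == request_to_parent(F.rq))
                 continue;  /* same parallel context, ordering implicit */
         await(F);          /* different context, keep the dependency */

B's requests are already ordered behind A's by the per-context timelines
and the submit fences between parent and children, so the await would be
redundant.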

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/i915_request.c | 43 +++++++++++++++++++----------
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index e9bfa32f9270..cf89624020ad 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1325,6 +1325,25 @@ i915_request_await_external(struct i915_request *rq, struct dma_fence *fence)
 	return err;
 }
 
+static inline bool is_parallel_rq(struct i915_request *rq)
+{
+	return intel_context_is_parallel(rq->context);
+}
+
+static inline struct intel_context *request_to_parent(struct i915_request *rq)
+{
+	return intel_context_to_parent(rq->context);
+}
+
+static bool is_same_parallel_context(struct i915_request *to,
+				     struct i915_request *from)
+{
+	if (is_parallel_rq(to))
+		return request_to_parent(to) == request_to_parent(from);
+
+	return false;
+}
+
 int
 i915_request_await_execution(struct i915_request *rq,
 			     struct dma_fence *fence)
@@ -1356,11 +1375,14 @@ i915_request_await_execution(struct i915_request *rq,
 		 * want to run our callback in all cases.
 		 */
 
-		if (dma_fence_is_i915(fence))
+		if (dma_fence_is_i915(fence)) {
+			if (is_same_parallel_context(rq, to_request(fence)))
+				continue;
 			ret = __i915_request_await_execution(rq,
 							     to_request(fence));
-		else
+		} else {
 			ret = i915_request_await_external(rq, fence);
+		}
 		if (ret < 0)
 			return ret;
 	} while (--nchild);
@@ -1461,10 +1483,13 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence)
 						 fence))
 			continue;
 
-		if (dma_fence_is_i915(fence))
+		if (dma_fence_is_i915(fence)) {
+			if (is_same_parallel_context(rq, to_request(fence)))
+				continue;
 			ret = i915_request_await_request(rq, to_request(fence));
-		else
+		} else {
 			ret = i915_request_await_external(rq, fence);
+		}
 		if (ret < 0)
 			return ret;
 
@@ -1539,16 +1564,6 @@ i915_request_await_object(struct i915_request *to,
 	return ret;
 }
 
-static inline bool is_parallel_rq(struct i915_request *rq)
-{
-	return intel_context_is_parallel(rq->context);
-}
-
-static inline struct intel_context *request_to_parent(struct i915_request *rq)
-{
-	return intel_context_to_parent(rq->context);
-}
-
 static struct i915_request *
 __i915_request_ensure_parallel_ordering(struct i915_request *rq,
 					struct intel_timeline *timeline)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 24/26] drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Parallel submission creates composite fences (dma_fence_array) for the
excl / shared slots in objects. The I915_GEM_BUSY IOCTL checks these
slots to determine the busyness of the object. Prior to this patch it
only checked whether the fence in the slot was an i915_request. Update
the check to understand composite fences and correctly report the
busyness.
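
For reference, a minimal userspace-side sketch of the uAPI affected
(fd / bo_handle are hypothetical; the encoding follows __busy_write_id()
and __busy_read_flag() below: writer engine class + 1 in the low 16
bits, reader classes as a bitmask in the high 16 bits):

  struct drm_i915_gem_busy busy = { .handle = bo_handle };

  if (ioctl(fd, DRM_IOCTL_I915_GEM_BUSY, &busy) == 0) {
          __u32 writer = busy.busy & 0xffff; /* engine class + 1, 0 if idle */
          __u32 readers = busy.busy >> 16;   /* bitmask of reading classes */
  }

With this patch, each i915 request inside a composite fence contributes
its engine class to the result instead of the object being reported
idle.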

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_busy.c      | 60 +++++++++++++++----
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  5 +-
 drivers/gpu/drm/i915/i915_request.h           |  6 ++
 3 files changed, 58 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_busy.c b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
index 6234e17259c1..b89d173c62eb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_busy.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
@@ -4,6 +4,8 @@
  * Copyright © 2014-2016 Intel Corporation
  */
 
+#include <linux/dma-fence-array.h>
+
 #include "gt/intel_engine.h"
 
 #include "i915_gem_ioctls.h"
@@ -36,7 +38,7 @@ static __always_inline u32 __busy_write_id(u16 id)
 }
 
 static __always_inline unsigned int
-__busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
+__busy_set_if_active(struct dma_fence *fence, u32 (*flag)(u16 id))
 {
 	const struct i915_request *rq;
 
@@ -46,29 +48,63 @@ __busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
 	 * to eventually flush us, but to minimise latency just ask the
 	 * hardware.
 	 *
-	 * Note we only report on the status of native fences.
+	 * Note we only report on the status of native fences and we currently
+	 * have two native fences:
+	 *
+	 * 1. A composite fence (dma_fence_array) constructed of i915 requests
+	 * created during a parallel submission. In this case we deconstruct the
+	 * composite fence into individual i915 requests and check the status of
+	 * each request.
+	 *
+	 * 2. A single i915 request.
 	 */
-	if (!dma_fence_is_i915(fence))
+	if (dma_fence_is_array(fence)) {
+		struct dma_fence_array *array = to_dma_fence_array(fence);
+		struct dma_fence **child = array->fences;
+		unsigned int nchild = array->num_fences;
+
+		do {
+			struct dma_fence *current_fence = *child++;
+
+			/* Not an i915 fence, can't be busy per above */
+			if (!dma_fence_is_i915(current_fence) ||
+			    !test_bit(I915_FENCE_FLAG_COMPOSITE,
+				      &current_fence->flags)) {
+				return 0;
+			}
+
+			rq = to_request(current_fence);
+			if (!i915_request_completed(rq)) {
+				BUILD_BUG_ON(!typecheck(u16,
+							rq->engine->uabi_class));
+				return flag(rq->engine->uabi_class);
+			}
+		} while (--nchild);
+
+		/* All requests in array complete, not busy */
 		return 0;
+	} else {
+		if (!dma_fence_is_i915(fence))
+			return 0;
 
-	/* opencode to_request() in order to avoid const warnings */
-	rq = container_of(fence, const struct i915_request, fence);
-	if (i915_request_completed(rq))
-		return 0;
+		rq = to_request(fence);
+		if (i915_request_completed(rq))
+			return 0;
 
-	/* Beware type-expansion follies! */
-	BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
-	return flag(rq->engine->uabi_class);
+		/* Beware type-expansion follies! */
+		BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
+		return flag(rq->engine->uabi_class);
+	}
 }
 
 static __always_inline unsigned int
-busy_check_reader(const struct dma_fence *fence)
+busy_check_reader(struct dma_fence *fence)
 {
 	return __busy_set_if_active(fence, __busy_read_flag);
 }
 
 static __always_inline unsigned int
-busy_check_writer(const struct dma_fence *fence)
+busy_check_writer(struct dma_fence *fence)
 {
 	if (!fence)
 		return 0;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 5c7fb6f68bbb..16276f406fd6 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -2988,8 +2988,11 @@ eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
 	if (!fences)
 		return ERR_PTR(-ENOMEM);
 
-	for_each_batch_create_order(eb, i)
+	for_each_batch_create_order(eb, i) {
 		fences[i] = &eb->requests[i]->fence;
+		__set_bit(I915_FENCE_FLAG_COMPOSITE,
+			  &eb->requests[i]->fence.flags);
+	}
 
 	fence_array = dma_fence_array_create(eb->num_batches,
 					     fences,
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 24db8459376b..dc359242d1ae 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -156,6 +156,12 @@ enum {
 	 * submission / relationship encoutered an error.
 	 */
 	I915_FENCE_FLAG_SKIP_PARALLEL,
+
+	/*
+	 * I915_FENCE_FLAG_COMPOSITE - Indicates fence is part of a composite
+	 * fence (dma_fence_array) generated by i915 for parallel submission.
+	 */
+	I915_FENCE_FLAG_COMPOSITE,
 };
 
 /**
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [Intel-gfx] [PATCH 24/26] drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
@ 2021-10-04 22:06   ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Parallel submission create composite fences (dma_fence_array) for excl /
shared slots in objects. The I915_GEM_BUSY IOCTL checks these slots to
determine the busyness of the object. Prior to patch it only check if
the fence in the slot was a i915_request. Update the check to understand
composite fences and correctly report the busyness.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_busy.c      | 60 +++++++++++++++----
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  5 +-
 drivers/gpu/drm/i915/i915_request.h           |  6 ++
 3 files changed, 58 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_busy.c b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
index 6234e17259c1..b89d173c62eb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_busy.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
@@ -4,6 +4,8 @@
  * Copyright © 2014-2016 Intel Corporation
  */
 
+#include <linux/dma-fence-array.h>
+
 #include "gt/intel_engine.h"
 
 #include "i915_gem_ioctls.h"
@@ -36,7 +38,7 @@ static __always_inline u32 __busy_write_id(u16 id)
 }
 
 static __always_inline unsigned int
-__busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
+__busy_set_if_active(struct dma_fence *fence, u32 (*flag)(u16 id))
 {
 	const struct i915_request *rq;
 
@@ -46,29 +48,63 @@ __busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
 	 * to eventually flush us, but to minimise latency just ask the
 	 * hardware.
 	 *
-	 * Note we only report on the status of native fences.
+	 * Note we only report on the status of native fences and we currently
+	 * have two native fences:
+	 *
+	 * 1. A composite fence (dma_fence_array) constructed of i915 requests
+	 * created during a parallel submission. In this case we deconstruct the
+	 * composite fence into individual i915 requests and check the status of
+	 * each request.
+	 *
+	 * 2. A single i915 request.
 	 */
-	if (!dma_fence_is_i915(fence))
+	if (dma_fence_is_array(fence)) {
+		struct dma_fence_array *array = to_dma_fence_array(fence);
+		struct dma_fence **child = array->fences;
+		unsigned int nchild = array->num_fences;
+
+		do {
+			struct dma_fence *current_fence = *child++;
+
+			/* Not an i915 fence, can't be busy per above */
+			if (!dma_fence_is_i915(current_fence) ||
+			    !test_bit(I915_FENCE_FLAG_COMPOSITE,
+				      &current_fence->flags)) {
+				return 0;
+			}
+
+			rq = to_request(current_fence);
+			if (!i915_request_completed(rq)) {
+				BUILD_BUG_ON(!typecheck(u16,
+							rq->engine->uabi_class));
+				return flag(rq->engine->uabi_class);
+			}
+		} while (--nchild);
+
+		/* All requests in array complete, not busy */
 		return 0;
+	} else {
+		if (!dma_fence_is_i915(fence))
+			return 0;
 
-	/* opencode to_request() in order to avoid const warnings */
-	rq = container_of(fence, const struct i915_request, fence);
-	if (i915_request_completed(rq))
-		return 0;
+		rq = to_request(fence);
+		if (i915_request_completed(rq))
+			return 0;
 
-	/* Beware type-expansion follies! */
-	BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
-	return flag(rq->engine->uabi_class);
+		/* Beware type-expansion follies! */
+		BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
+		return flag(rq->engine->uabi_class);
+	}
 }
 
 static __always_inline unsigned int
-busy_check_reader(const struct dma_fence *fence)
+busy_check_reader(struct dma_fence *fence)
 {
 	return __busy_set_if_active(fence, __busy_read_flag);
 }
 
 static __always_inline unsigned int
-busy_check_writer(const struct dma_fence *fence)
+busy_check_writer(struct dma_fence *fence)
 {
 	if (!fence)
 		return 0;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 5c7fb6f68bbb..16276f406fd6 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -2988,8 +2988,11 @@ eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
 	if (!fences)
 		return ERR_PTR(-ENOMEM);
 
-	for_each_batch_create_order(eb, i)
+	for_each_batch_create_order(eb, i) {
 		fences[i] = &eb->requests[i]->fence;
+		__set_bit(I915_FENCE_FLAG_COMPOSITE,
+			  &eb->requests[i]->fence.flags);
+	}
 
 	fence_array = dma_fence_array_create(eb->num_batches,
 					     fences,
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 24db8459376b..dc359242d1ae 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -156,6 +156,12 @@ enum {
 	 * submission / relationship encoutered an error.
 	 */
 	I915_FENCE_FLAG_SKIP_PARALLEL,
+
+	/*
+	 * I915_FENCE_FLAG_COMPOSITE - Indicates fence is part of a composite
+	 * fence (dma_fence_array) generated by i915 for parallel submission.
+	 */
+	I915_FENCE_FLAG_COMPOSITE,
 };
 
 /**
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 25/26] drm/i915: Enable multi-bb execbuf
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

Enable multi-bb execbuf by removing the early -ENODEV return that kept
the set_parallel extension disabled.
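
As a reference for UMD teams, a hedged sketch of opting in from userspace:
configure slot 0 of a context's engine map as a 2-wide parallel engine.
The defines and helper macros are from the new uAPI in i915_drm.h; the fd,
ctx_id, and the engine instances chosen are illustrative only:

#include <stdint.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* width = 2 BBs per execbuf, num_siblings = 1 placement per BB, using
 * the two video engine instances as an example pairing. */
static int setup_parallel(int fd, uint32_t ctx_id)
{
	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 2) = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.engine_index = 0,	/* slot in the engine map */
		.width = 2,
		.num_siblings = 1,
		.engines = { { I915_ENGINE_CLASS_VIDEO, 0 },
			     { I915_ENGINE_CLASS_VIDEO, 1 } },
	};
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
		.extensions = (uintptr_t)&parallel,
		/* slot 0 is filled in by the extension above */
		.engines = { { I915_ENGINE_CLASS_INVALID,
			       I915_ENGINE_CLASS_INVALID_NONE } },
	};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_ENGINES,
		.size = sizeof(engines),
		.value = (uintptr_t)&engines,
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);
}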

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 6290bc20ccb1..605440388988 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -536,9 +536,6 @@ set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
 	struct intel_engine_cs **siblings = NULL;
 	intel_engine_mask_t prev_mask;
 
-	/* Disabling for now */
-	return -ENODEV;
-
 	/* FIXME: This is NIY for execlists */
 	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
 		return -ENODEV;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH 26/26] drm/i915/execlists: Weak parallel submission support for execlists
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-04 22:06   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-04 22:06 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

A weak implementation of parallel submission (multi-bb execbuf IOCTL) for
execlists. This does as little as possible to support the interface:
submit fences are passed between the requests generated for each batch,
and virtual engines are not allowed. This is on par with what the
existing (hopefully soon to be deprecated) bonding interface provides.

We perma-pin these execlists contexts to align with the GuC
implementation.
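
Conceptually, the chaining reduces to the sketch below. This is simplified
and illustrative only: the real chaining lives in the execbuf code, and
the helper use shown here is an assumption. Each request after the first
awaits the *submission* of its predecessor rather than its completion, so
all N batch buffers can be in flight at once:

static int chain_parallel_requests(struct i915_request **rq, int width)
{
	int i, err;

	for (i = 1; i < width; i++) {
		/* submit fence: gates submission to HW, not completion */
		err = i915_request_await_execution(rq[i], &rq[i - 1]->fence);
		if (err)
			return err;
	}

	return 0;
}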

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 10 ++--
 drivers/gpu/drm/i915/gt/intel_context.c       |  4 +-
 .../drm/i915/gt/intel_execlists_submission.c  | 56 ++++++++++++++++++-
 drivers/gpu/drm/i915/gt/intel_lrc.c           |  2 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  2 -
 5 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 605440388988..732111457dd2 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -536,10 +536,6 @@ set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
 	struct intel_engine_cs **siblings = NULL;
 	intel_engine_mask_t prev_mask;
 
-	/* FIXME: This is NIY for execlists */
-	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
-		return -ENODEV;
-
 	if (get_user(slot, &ext->engine_index))
 		return -EFAULT;
 
@@ -549,6 +545,12 @@ set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
 	if (get_user(num_siblings, &ext->num_siblings))
 		return -EFAULT;
 
+	if (!intel_uc_uses_guc_submission(&i915->gt.uc) && num_siblings != 1) {
+		drm_dbg(&i915->drm, "Only 1 sibling (%d) supported in non-GuC mode\n",
+			num_siblings);
+		return -EINVAL;
+	}
+
 	if (slot >= set->num_engines) {
 		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
 			slot, set->num_engines);
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index ee84259959d0..3fc1c5155fd4 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -79,7 +79,8 @@ static int intel_context_active_acquire(struct intel_context *ce)
 
 	__i915_active_acquire(&ce->active);
 
-	if (intel_context_is_barrier(ce) || intel_engine_uses_guc(ce->engine))
+	if (intel_context_is_barrier(ce) || intel_engine_uses_guc(ce->engine) ||
+	    intel_context_is_parallel(ce))
 		return 0;
 
 	/* Preallocate tracking nodes */
@@ -562,7 +563,6 @@ void intel_context_bind_parent_child(struct intel_context *parent,
 	 * Callers responsibility to validate that this function is used
 	 * correctly but we use GEM_BUG_ON here ensure that they do.
 	 */
-	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
 	GEM_BUG_ON(intel_context_is_pinned(parent));
 	GEM_BUG_ON(intel_context_is_child(parent));
 	GEM_BUG_ON(intel_context_is_pinned(child));
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 8d7f571029df..a747fbbf18b5 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -927,8 +927,7 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
 
 static bool ctx_single_port_submission(const struct intel_context *ce)
 {
-	return (IS_ENABLED(CONFIG_DRM_I915_GVT) &&
-		intel_context_force_single_submission(ce));
+	return intel_context_force_single_submission(ce);
 }
 
 static bool can_merge_ctx(const struct intel_context *prev,
@@ -2598,6 +2597,58 @@ static void execlists_context_cancel_request(struct intel_context *ce,
 				      current->comm);
 }
 
+static struct intel_context *
+execlists_create_parallel(struct intel_engine_cs **engines,
+			  unsigned int num_siblings,
+			  unsigned int width)
+{
+	struct intel_engine_cs **siblings = NULL;
+	struct intel_context *parent = NULL, *ce, *err;
+	int i, j;
+
+	GEM_BUG_ON(num_siblings != 1);
+
+	siblings = kmalloc_array(num_siblings,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < width; ++i) {
+		for (j = 0; j < num_siblings; ++j)
+			siblings[j] = engines[i * num_siblings + j];
+
+		ce = intel_context_create(siblings[0]);
+		if (IS_ERR(ce)) {
+			err = ERR_CAST(ce);
+			goto unwind;
+		}
+
+		if (i == 0)
+			parent = ce;
+		else
+			intel_context_bind_parent_child(parent, ce);
+	}
+
+	parent->parallel.fence_context = dma_fence_context_alloc(1);
+
+	intel_context_set_nopreempt(parent);
+	intel_context_set_single_submission(parent);
+	for_each_child(parent, ce) {
+		intel_context_set_nopreempt(ce);
+		intel_context_set_single_submission(ce);
+	}
+
+	kfree(siblings);
+	return parent;
+
+unwind:
+	if (parent)
+		intel_context_put(parent);
+	kfree(siblings);
+	return err;
+}
+
 static const struct intel_context_ops execlists_context_ops = {
 	.flags = COPS_HAS_INFLIGHT,
 
@@ -2616,6 +2667,7 @@ static const struct intel_context_ops execlists_context_ops = {
 	.reset = lrc_reset,
 	.destroy = lrc_destroy,
 
+	.create_parallel = execlists_create_parallel,
 	.create_virtual = execlists_create_virtual,
 };
 
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 57339d5c1fc8..8137d0aabf99 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1065,6 +1065,8 @@ lrc_pin(struct intel_context *ce,
 
 void lrc_unpin(struct intel_context *ce)
 {
+	if (unlikely(ce->parallel.last_rq))
+		i915_request_put(ce->parallel.last_rq);
 	check_redzone((void *)ce->lrc_reg_state - LRC_STATE_OFFSET,
 		      ce->engine);
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 05e8b199e4ce..1f077024c4b8 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2961,8 +2961,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 	GEM_BUG_ON(!intel_context_is_parent(ce));
 	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
 
-	if (ce->parallel.last_rq)
-		i915_request_put(ce->parallel.last_rq);
 	unpin_guc_id(guc, ce);
 	lrc_unpin(ce);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev4)
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
                   ` (26 preceding siblings ...)
  (?)
@ 2021-10-04 22:21 ` Patchwork
  2021-10-12 22:15   ` John Harrison
  -1 siblings, 1 reply; 165+ messages in thread
From: Patchwork @ 2021-10-04 22:21 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev4)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
e2a47a99bf9d drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
f83d8f1539fa drm/i915/guc: Take GT PM ref when deregistering context
-:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

total: 0 errors, 0 warnings, 2 checks, 290 lines checked
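
The two CHECKs above are about double evaluation: the macro expands 'gt'
(and 'tmp') more than once, so an argument with side effects would run
twice, once in the get and once in the put. Illustrative only
(lookup_gt() and do_work() are made-up helpers):

	int tmp;

	with_intel_gt_pm(lookup_gt(i915, id++), tmp)	/* id++ evaluated twice */
		do_work();
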
93e5284929b3 drm/i915/guc: Take engine PM when a context is pinned with GuC submission
4dd6554d994d drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
8629b55f536c drm/i915: Add logical engine mapping
8117ec0a1ca7 drm/i915: Expose logical engine instance to user
aa8e1eb4dd4e drm/i915/guc: Introduce context parent-child relationship
aaf50eacc2fd drm/i915/guc: Add multi-lrc context registration
e5f6f50e66d1 drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
adf21ba138f3 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
40ef33318b81 drm/i915/guc: Implement parallel context pin / unpin functions
1ad560c70346 drm/i915/guc: Implement multi-lrc submission
-:364: CHECK:SPACING: spaces preferred around that '*' (ctx:ExV)
#364: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:771:
+		*wqi++ = child->ring->tail / sizeof(u64);
 		^

total: 0 errors, 0 warnings, 1 checks, 570 lines checked
466c01457dec drm/i915/guc: Insert submit fences between requests in parent-child relationship
2ece815c1f18 drm/i915/guc: Implement multi-lrc reset
7add5784199f drm/i915/guc: Update debugfs for GuC multi-lrc
-:23: CHECK:LINE_SPACING: Please don't use multiple blank lines
#23: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3707:
 
+

total: 0 errors, 0 warnings, 1 checks, 67 lines checked
966991d7bbed drm/i915: Fix bug in user proto-context creation that leaked contexts
0eb3d3bf0c84 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
68c6596b649a drm/i915/doc: Update parallel submit doc to point to i915_drm.h
-:13: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#13: 
deleted file mode 100644

total: 0 errors, 1 warnings, 0 checks, 10 lines checked
8290f5d15ca2 drm/i915/guc: Add basic GuC multi-lrc selftest
-:22: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#22: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 190 lines checked
ade3768c42d5 drm/i915/guc: Implement no mid batch preemption for multi-lrc
57882939d788 drm/i915: Multi-BB execbuf
-:369: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
#369: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1854:
+#define for_each_batch_create_order(_eb, _i) \
+	for (_i = 0; _i < (_eb)->num_batches; ++_i)

-:371: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)

-:371: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
#371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)

total: 1 errors, 0 warnings, 2 checks, 1298 lines checked
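
The MULTISTATEMENT error above is the substantive one: the macro expands
to two statements (the BUILD_BUG_ON plus the for loop), so under an
unbraced 'if' only the first statement is guarded and the loop runs
unconditionally. Illustrative only (flush_batch() is a made-up helper):

	if (do_flush)
		for_each_batch_add_order(eb, i)	/* expands to 2 statements */
			flush_batch(eb, i);	/* runs even when !do_flush */
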
28b699ece289 drm/i915/guc: Handle errors in multi-lrc requests
962e6b3dce59 drm/i915: Make request conflict tracking understand parallel submits
368ab12f5205 drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
b52570f01859 drm/i915: Enable multi-bb execbuf
8766155832d7 drm/i915/execlists: Weak parallel submission support for execlists



^ permalink raw reply	[flat|nested] 165+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Parallel submission aka multi-bb execbuf (rev4)
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
                   ` (27 preceding siblings ...)
  (?)
@ 2021-10-04 22:23 ` Patchwork
  -1 siblings, 0 replies; 165+ messages in thread
From: Patchwork @ 2021-10-04 22:23 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev4)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
+drivers/gpu/drm/i915/gt/intel_reset.c:1392:5: warning: context imbalance in 'intel_gt_reset_trylock' - different lock contexts for basic block
+drivers/gpu/drm/i915/i915_perf.c:1442:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/i915_perf.c:1496:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/intel_wakeref.c:137:19: warning: context imbalance in 'wakeref_auto_timeout' - unexpected unlock
+drivers/gpu/drm/i915/selftests/i915_syncmap.c:80:54: warning: dubious: x | !y
+./include/asm-generic/bitops/find.h:112:45: warning: shift count is negative (-262080)
+./include/asm-generic/bitops/find.h:32:31: warning: shift count is negative (-262080)
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write8' - different lock contexts for basic block



^ permalink raw reply	[flat|nested] 165+ messages in thread

* [Intel-gfx] ✗ Fi.CI.DOCS: warning for Parallel submission aka multi-bb execbuf (rev4)
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
                   ` (28 preceding siblings ...)
  (?)
@ 2021-10-04 22:26 ` Patchwork
  2021-10-12 22:15   ` John Harrison
  -1 siblings, 1 reply; 165+ messages in thread
From: Patchwork @ 2021-10-04 22:26 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev4)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ make htmldocs 2>&1 > /dev/null | grep i915
./drivers/gpu/drm/i915/gt/uc/intel_guc.h:166: warning: Function parameter or member 'submission_stall_reason' not described in 'intel_guc'
./drivers/gpu/drm/i915/gt/uc/intel_guc.h:166: warning: Function parameter or member 'submission_state' not described in 'intel_guc'
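
These warnings ask for kernel-doc on the intel_guc members added earlier
in the series; the fix is inline member comments along these lines
(wording illustrative only):

	/** @submission_state: sub-structure holding GuC submission state */
	/** @submission_stall_reason: reason GuC submission is stalled */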



^ permalink raw reply	[flat|nested] 165+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for Parallel submission aka multi-bb execbuf (rev4)
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
                   ` (29 preceding siblings ...)
  (?)
@ 2021-10-04 22:54 ` Patchwork
  -1 siblings, 0 replies; 165+ messages in thread
From: Patchwork @ 2021-10-04 22:54 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev4)
URL   : https://patchwork.freedesktop.org/series/92789/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_10681 -> Patchwork_21239
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_21239 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_21239, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/index.html

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_21239:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_render_tiled_blits@basic:
    - fi-ivb-3770:        [PASS][1] -> [FAIL][2] +2 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-ivb-3770/igt@gem_render_tiled_blits@basic.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-ivb-3770/igt@gem_render_tiled_blits@basic.html
    - fi-hsw-4770:        [PASS][3] -> [FAIL][4] +2 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-hsw-4770/igt@gem_render_tiled_blits@basic.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-hsw-4770/igt@gem_render_tiled_blits@basic.html

  
New tests
---------

  New tests have been introduced between CI_DRM_10681 and Patchwork_21239:

### New IGT tests (1) ###

  * igt@i915_selftest@live@guc_multi_lrc:
    - Statuses : 29 pass(s)
    - Exec time: [0.44, 3.89] s

  

Known issues
------------

  Here are the changes found in Patchwork_21239 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@query-info:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][5] ([fdo#109315])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-tgl-1115g4/igt@amdgpu/amd_basic@query-info.html

  * igt@amdgpu/amd_basic@semaphore:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][6] ([fdo#109271]) +27 similar issues
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-bdw-5557u/igt@amdgpu/amd_basic@semaphore.html

  * igt@amdgpu/amd_cs_nop@nop-gfx0:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][7] ([fdo#109315] / [i915#2575]) +16 similar issues
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-tgl-1115g4/igt@amdgpu/amd_cs_nop@nop-gfx0.html

  * igt@core_hotunplug@unbind-rebind:
    - fi-bdw-5557u:       NOTRUN -> [WARN][8] ([i915#3718])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-bdw-5557u/igt@core_hotunplug@unbind-rebind.html

  * igt@gem_exec_fence@basic-busy@bcs0:
    - fi-apl-guc:         NOTRUN -> [SKIP][9] ([fdo#109271]) +1 similar issue
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-apl-guc/igt@gem_exec_fence@basic-busy@bcs0.html

  * igt@gem_huc_copy@huc-copy:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][10] ([i915#2190])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-tgl-1115g4/igt@gem_huc_copy@huc-copy.html

  * igt@i915_hangman@error-state-basic:
    - fi-apl-guc:         NOTRUN -> [DMESG-WARN][11] ([i915#1610])
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-apl-guc/igt@i915_hangman@error-state-basic.html

  * igt@i915_pm_backlight@basic-brightness:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][12] ([i915#1155])
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-tgl-1115g4/igt@i915_pm_backlight@basic-brightness.html

  * igt@i915_selftest@live@hangcheck:
    - fi-snb-2600:        [PASS][13] -> [INCOMPLETE][14] ([i915#3921])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-snb-2600/igt@i915_selftest@live@hangcheck.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-snb-2600/igt@i915_selftest@live@hangcheck.html

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][15] ([fdo#111827]) +8 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-tgl-1115g4/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_chamelium@dp-crc-fast:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][16] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-bdw-5557u/igt@kms_chamelium@dp-crc-fast.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][17] ([i915#4103]) +1 similar issue
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-tgl-1115g4/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_force_connector_basic@force-load-detect:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][18] ([fdo#109285])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-tgl-1115g4/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_psr@primary_mmap_gtt:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][19] ([i915#1072]) +3 similar issues
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-tgl-1115g4/igt@kms_psr@primary_mmap_gtt.html

  * igt@prime_vgem@basic-userptr:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][20] ([i915#3301])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-tgl-1115g4/igt@prime_vgem@basic-userptr.html

  * igt@runner@aborted:
    - fi-apl-guc:         NOTRUN -> [FAIL][21] ([i915#2426] / [i915#3363])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-apl-guc/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-b:
    - fi-cfl-8109u:       [DMESG-WARN][22] ([i915#295]) -> [PASS][23] +12 similar issues
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-cfl-8109u/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-b.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/fi-cfl-8109u/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-b.html

  
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1155]: https://gitlab.freedesktop.org/drm/intel/issues/1155
  [i915#1610]: https://gitlab.freedesktop.org/drm/intel/issues/1610
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2426]: https://gitlab.freedesktop.org/drm/intel/issues/2426
  [i915#2575]: https://gitlab.freedesktop.org/drm/intel/issues/2575
  [i915#295]: https://gitlab.freedesktop.org/drm/intel/issues/295
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3363]: https://gitlab.freedesktop.org/drm/intel/issues/3363
  [i915#3718]: https://gitlab.freedesktop.org/drm/intel/issues/3718
  [i915#3921]: https://gitlab.freedesktop.org/drm/intel/issues/3921
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103


Participating hosts (36 -> 32)
------------------------------

  Additional (2): fi-tgl-1115g4 fi-apl-guc 
  Missing    (6): fi-kbl-soraka bat-dg1-6 fi-bsw-cyan bat-adlp-4 bat-jsl-2 bat-jsl-1 


Build changes
-------------

  * Linux: CI_DRM_10681 -> Patchwork_21239

  CI-20190529: 20190529
  CI_DRM_10681: fe9b639a95a08713c8ee4ef110ce6a6388c9f9f2 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6228: 22643ce4014a0b2dc52ce7916b2f657e2a7757c3 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_21239: 8766155832d72e53d811f8023a4d3e5545190ce9 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

8766155832d7 drm/i915/execlists: Weak parallel submission support for execlists
b52570f01859 drm/i915: Enable multi-bb execbuf
368ab12f5205 drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
962e6b3dce59 drm/i915: Make request conflict tracking understand parallel submits
28b699ece289 drm/i915/guc: Handle errors in multi-lrc requests
57882939d788 drm/i915: Multi-BB execbuf
ade3768c42d5 drm/i915/guc: Implement no mid batch preemption for multi-lrc
8290f5d15ca2 drm/i915/guc: Add basic GuC multi-lrc selftest
68c6596b649a drm/i915/doc: Update parallel submit doc to point to i915_drm.h
0eb3d3bf0c84 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
966991d7bbed drm/i915: Fix bug in user proto-context creation that leaked contexts
7add5784199f drm/i915/guc: Update debugfs for GuC multi-lrc
2ece815c1f18 drm/i915/guc: Implement multi-lrc reset
466c01457dec drm/i915/guc: Insert submit fences between requests in parent-child relationship
1ad560c70346 drm/i915/guc: Implement multi-lrc submission
40ef33318b81 drm/i915/guc: Implement parallel context pin / unpin functions
adf21ba138f3 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
e5f6f50e66d1 drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
aaf50eacc2fd drm/i915/guc: Add multi-lrc context registration
aa8e1eb4dd4e drm/i915/guc: Introduce context parent-child relationship
8117ec0a1ca7 drm/i915: Expose logical engine instance to user
8629b55f536c drm/i915: Add logical engine mapping
4dd6554d994d drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
93e5284929b3 drm/i915/guc: Take engine PM when a context is pinned with GuC submission
f83d8f1539fa drm/i915/guc: Take GT PM ref when deregistering context
e2a47a99bf9d drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21239/index.html

^ permalink raw reply	[flat|nested] 165+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev5)
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
                   ` (30 preceding siblings ...)
  (?)
@ 2021-10-05  1:49 ` Patchwork
  -1 siblings, 0 replies; 165+ messages in thread
From: Patchwork @ 2021-10-05  1:49 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev5)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
fc9c9fb6630a drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
fa11631fe33a drm/i915/guc: Take GT PM ref when deregistering context
-:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

total: 0 errors, 0 warnings, 2 checks, 290 lines checked
f26913441370 drm/i915/guc: Take engine PM when a context is pinned with GuC submission
00cd343ff096 drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
b94a2d8dd4a6 drm/i915: Add logical engine mapping
a352b3260782 drm/i915: Expose logical engine instance to user
b00df96b3c7a drm/i915/guc: Introduce context parent-child relationship
4a15247fee14 drm/i915/guc: Add multi-lrc context registration
d99a9b87a2b4 drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
94fb468f6a15 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
4fefc07d9141 drm/i915/guc: Implement parallel context pin / unpin functions
cce8ed09d2b3 drm/i915/guc: Implement multi-lrc submission
-:364: CHECK:SPACING: spaces preferred around that '*' (ctx:ExV)
#364: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:771:
+		*wqi++ = child->ring->tail / sizeof(u64);
 		^

total: 0 errors, 0 warnings, 1 checks, 570 lines checked
1327655fea5c drm/i915/guc: Insert submit fences between requests in parent-child relationship
faaaa22df6f9 drm/i915/guc: Implement multi-lrc reset
45f5266f4bc8 drm/i915/guc: Update debugfs for GuC multi-lrc
-:23: CHECK:LINE_SPACING: Please don't use multiple blank lines
#23: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3707:
 
+

total: 0 errors, 0 warnings, 1 checks, 67 lines checked
4121bd97a8b5 drm/i915: Fix bug in user proto-context creation that leaked contexts
1a690133eb25 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
2f9e9c7755e0 drm/i915/doc: Update parallel submit doc to point to i915_drm.h
-:13: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#13: 
deleted file mode 100644

total: 0 errors, 1 warnings, 0 checks, 10 lines checked
77f007a50c5c drm/i915/guc: Add basic GuC multi-lrc selftest
-:22: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#22: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 190 lines checked
12e50491ae0d drm/i915/guc: Implement no mid batch preemption for multi-lrc
a2d809b95c10 drm/i915: Multi-BB execbuf
-:369: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
#369: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1854:
+#define for_each_batch_create_order(_eb, _i) \
+	for (_i = 0; _i < (_eb)->num_batches; ++_i)

-:371: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)

-:371: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
#371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)

total: 1 errors, 0 warnings, 2 checks, 1298 lines checked
c45d7a15a6cc drm/i915/guc: Handle errors in multi-lrc requests
31ff5626db61 drm/i915: Make request conflict tracking understand parallel submits
b73b105b6c29 drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
54ef99f9936c drm/i915: Enable multi-bb execbuf
aef348f7f8e8 drm/i915/execlists: Weak parallel submission support for execlists



^ permalink raw reply	[flat|nested] 165+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Parallel submission aka multi-bb execbuf (rev5)
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
                   ` (31 preceding siblings ...)
  (?)
@ 2021-10-05  1:51 ` Patchwork
  -1 siblings, 0 replies; 165+ messages in thread
From: Patchwork @ 2021-10-05  1:51 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev5)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
+drivers/gpu/drm/i915/gt/intel_reset.c:1392:5: warning: context imbalance in 'intel_gt_reset_trylock' - different lock contexts for basic block
+drivers/gpu/drm/i915/i915_perf.c:1442:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/i915_perf.c:1496:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/intel_wakeref.c:137:19: warning: context imbalance in 'wakeref_auto_timeout' - unexpected unlock
+drivers/gpu/drm/i915/selftests/i915_syncmap.c:80:54: warning: dubious: x | !y
+./include/asm-generic/bitops/find.h:112:45: warning: shift count is negative (-262080)
+./include/asm-generic/bitops/find.h:32:31: warning: shift count is negative (-262080)
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write8' - different lock contexts for basic block



^ permalink raw reply	[flat|nested] 165+ messages in thread

* [Intel-gfx] ✗ Fi.CI.DOCS: warning for Parallel submission aka multi-bb execbuf (rev5)
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
                   ` (32 preceding siblings ...)
  (?)
@ 2021-10-05  1:54 ` Patchwork
  -1 siblings, 0 replies; 165+ messages in thread
From: Patchwork @ 2021-10-05  1:54 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev5)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ make htmldocs 2>&1 > /dev/null | grep i915
./drivers/gpu/drm/i915/gt/uc/intel_guc.h:166: warning: Function parameter or member 'submission_stall_reason' not described in 'intel_guc'
./drivers/gpu/drm/i915/gt/uc/intel_guc.h:166: warning: Function parameter or member 'submission_state' not described in 'intel_guc'



^ permalink raw reply	[flat|nested] 165+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for Parallel submission aka multi-bb execbuf (rev5)
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
                   ` (33 preceding siblings ...)
  (?)
@ 2021-10-05  2:21 ` Patchwork
  -1 siblings, 0 replies; 165+ messages in thread
From: Patchwork @ 2021-10-05  2:21 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev5)
URL   : https://patchwork.freedesktop.org/series/92789/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_10681 -> Patchwork_21241
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_21241 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_21241, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/index.html

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_21241:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_close_race@basic-threads:
    - fi-kbl-7500u:       [PASS][1] -> [DMESG-FAIL][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-kbl-7500u/igt@gem_close_race@basic-threads.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-kbl-7500u/igt@gem_close_race@basic-threads.html
    - fi-kbl-r:           [PASS][3] -> [DMESG-FAIL][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-kbl-r/igt@gem_close_race@basic-threads.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-kbl-r/igt@gem_close_race@basic-threads.html
    - fi-ivb-3770:        [PASS][5] -> [DMESG-FAIL][6]
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-ivb-3770/igt@gem_close_race@basic-threads.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-ivb-3770/igt@gem_close_race@basic-threads.html
    - fi-cml-u2:          [PASS][7] -> [DMESG-FAIL][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-cml-u2/igt@gem_close_race@basic-threads.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-cml-u2/igt@gem_close_race@basic-threads.html

  * igt@gem_render_linear_blits@basic:
    - fi-kbl-soraka:      [PASS][9] -> [INCOMPLETE][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-kbl-soraka/igt@gem_render_linear_blits@basic.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-kbl-soraka/igt@gem_render_linear_blits@basic.html

  * igt@gem_render_tiled_blits@basic:
    - fi-hsw-4770:        [PASS][11] -> [FAIL][12] +2 similar issues
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-hsw-4770/igt@gem_render_tiled_blits@basic.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-hsw-4770/igt@gem_render_tiled_blits@basic.html

  * igt@runner@aborted:
    - fi-ivb-3770:        NOTRUN -> [FAIL][13]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-ivb-3770/igt@runner@aborted.html

  
New tests
---------

  New tests have been introduced between CI_DRM_10681 and Patchwork_21241:

### New IGT tests (1) ###

  * igt@i915_selftest@live@guc_multi_lrc:
    - Statuses : 24 pass(s)
    - Exec time: [0.42, 3.86] s

  

Known issues
------------

  Here are the changes found in Patchwork_21241 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@query-info:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][14] ([fdo#109315])
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-tgl-1115g4/igt@amdgpu/amd_basic@query-info.html

  * igt@amdgpu/amd_basic@semaphore:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][15] ([fdo#109271]) +27 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-bdw-5557u/igt@amdgpu/amd_basic@semaphore.html

  * igt@amdgpu/amd_cs_nop@nop-gfx0:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][16] ([fdo#109315] / [i915#2575]) +16 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-tgl-1115g4/igt@amdgpu/amd_cs_nop@nop-gfx0.html

  * igt@core_hotunplug@unbind-rebind:
    - fi-bdw-5557u:       NOTRUN -> [WARN][17] ([i915#3718])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-bdw-5557u/igt@core_hotunplug@unbind-rebind.html

  * igt@gem_exec_fence@basic-busy@bcs0:
    - fi-apl-guc:         NOTRUN -> [SKIP][18] ([fdo#109271]) +1 similar issue
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-apl-guc/igt@gem_exec_fence@basic-busy@bcs0.html

  * igt@gem_huc_copy@huc-copy:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][19] ([i915#2190])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-tgl-1115g4/igt@gem_huc_copy@huc-copy.html

  * igt@i915_hangman@error-state-basic:
    - fi-apl-guc:         NOTRUN -> [DMESG-WARN][20] ([i915#1610])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-apl-guc/igt@i915_hangman@error-state-basic.html

  * igt@i915_pm_backlight@basic-brightness:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][21] ([i915#1155])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-tgl-1115g4/igt@i915_pm_backlight@basic-brightness.html

  * igt@i915_selftest@live@execlists:
    - fi-bsw-kefka:       [PASS][22] -> [INCOMPLETE][23] ([i915#2940])
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-bsw-kefka/igt@i915_selftest@live@execlists.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-bsw-kefka/igt@i915_selftest@live@execlists.html

  * igt@i915_selftest@live@gt_heartbeat:
    - fi-bdw-samus:       [PASS][24] -> [DMESG-FAIL][25] ([i915#2291] / [i915#541])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-bdw-samus/igt@i915_selftest@live@gt_heartbeat.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-bdw-samus/igt@i915_selftest@live@gt_heartbeat.html

  * igt@i915_selftest@live@hangcheck:
    - fi-snb-2600:        [PASS][26] -> [INCOMPLETE][27] ([i915#3921])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-snb-2600/igt@i915_selftest@live@hangcheck.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-snb-2600/igt@i915_selftest@live@hangcheck.html

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][28] ([fdo#111827]) +8 similar issues
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-tgl-1115g4/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_chamelium@dp-crc-fast:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][29] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-bdw-5557u/igt@kms_chamelium@dp-crc-fast.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][30] ([i915#4103]) +1 similar issue
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-tgl-1115g4/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_force_connector_basic@force-load-detect:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][31] ([fdo#109285])
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-tgl-1115g4/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_psr@primary_mmap_gtt:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][32] ([i915#1072]) +3 similar issues
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-tgl-1115g4/igt@kms_psr@primary_mmap_gtt.html

  * igt@prime_vgem@basic-userptr:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][33] ([i915#3301])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-tgl-1115g4/igt@prime_vgem@basic-userptr.html

  * igt@runner@aborted:
    - fi-bsw-kefka:       NOTRUN -> [FAIL][34] ([fdo#109271] / [i915#1436] / [i915#3428])
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-bsw-kefka/igt@runner@aborted.html
    - fi-apl-guc:         NOTRUN -> [FAIL][35] ([i915#2426] / [i915#3363])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-apl-guc/igt@runner@aborted.html
    - fi-kbl-r:           NOTRUN -> [FAIL][36] ([i915#2722] / [i915#3363])
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-kbl-r/igt@runner@aborted.html
    - fi-kbl-soraka:      NOTRUN -> [FAIL][37] ([i915#2426] / [i915#3363])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-kbl-soraka/igt@runner@aborted.html
    - fi-kbl-7500u:       NOTRUN -> [FAIL][38] ([i915#2722] / [i915#3363])
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-kbl-7500u/igt@runner@aborted.html
    - fi-cml-u2:          NOTRUN -> [FAIL][39] ([i915#2722] / [i915#3363])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-cml-u2/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-b:
    - fi-cfl-8109u:       [DMESG-WARN][40] ([i915#295]) -> [PASS][41] +12 similar issues
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10681/fi-cfl-8109u/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-b.html
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/fi-cfl-8109u/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-b.html

  
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1155]: https://gitlab.freedesktop.org/drm/intel/issues/1155
  [i915#1436]: https://gitlab.freedesktop.org/drm/intel/issues/1436
  [i915#1610]: https://gitlab.freedesktop.org/drm/intel/issues/1610
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2291]: https://gitlab.freedesktop.org/drm/intel/issues/2291
  [i915#2426]: https://gitlab.freedesktop.org/drm/intel/issues/2426
  [i915#2575]: https://gitlab.freedesktop.org/drm/intel/issues/2575
  [i915#2722]: https://gitlab.freedesktop.org/drm/intel/issues/2722
  [i915#2940]: https://gitlab.freedesktop.org/drm/intel/issues/2940
  [i915#295]: https://gitlab.freedesktop.org/drm/intel/issues/295
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3363]: https://gitlab.freedesktop.org/drm/intel/issues/3363
  [i915#3428]: https://gitlab.freedesktop.org/drm/intel/issues/3428
  [i915#3718]: https://gitlab.freedesktop.org/drm/intel/issues/3718
  [i915#3921]: https://gitlab.freedesktop.org/drm/intel/issues/3921
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#541]: https://gitlab.freedesktop.org/drm/intel/issues/541


Participating hosts (36 -> 33)
------------------------------

  Additional (2): fi-tgl-1115g4 fi-apl-guc 
  Missing    (5): bat-dg1-6 fi-bsw-cyan bat-adlp-4 bat-jsl-2 bat-jsl-1 


Build changes
-------------

  * Linux: CI_DRM_10681 -> Patchwork_21241

  CI-20190529: 20190529
  CI_DRM_10681: fe9b639a95a08713c8ee4ef110ce6a6388c9f9f2 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6228: 22643ce4014a0b2dc52ce7916b2f657e2a7757c3 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_21241: aef348f7f8e8286bddfd22314d98b1c1b39503eb @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

aef348f7f8e8 drm/i915/execlists: Weak parallel submission support for execlists
54ef99f9936c drm/i915: Enable multi-bb execbuf
b73b105b6c29 drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
31ff5626db61 drm/i915: Make request conflict tracking understand parallel submits
c45d7a15a6cc drm/i915/guc: Handle errors in multi-lrc requests
a2d809b95c10 drm/i915: Multi-BB execbuf
12e50491ae0d drm/i915/guc: Implement no mid batch preemption for multi-lrc
77f007a50c5c drm/i915/guc: Add basic GuC multi-lrc selftest
2f9e9c7755e0 drm/i915/doc: Update parallel submit doc to point to i915_drm.h
1a690133eb25 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
4121bd97a8b5 drm/i915: Fix bug in user proto-context creation that leaked contexts
45f5266f4bc8 drm/i915/guc: Update debugfs for GuC multi-lrc
faaaa22df6f9 drm/i915/guc: Implement multi-lrc reset
1327655fea5c drm/i915/guc: Insert submit fences between requests in parent-child relationship
cce8ed09d2b3 drm/i915/guc: Implement multi-lrc submission
4fefc07d9141 drm/i915/guc: Implement parallel context pin / unpin functions
94fb468f6a15 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
d99a9b87a2b4 drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
4a15247fee14 drm/i915/guc: Add multi-lrc context registration
b00df96b3c7a drm/i915/guc: Introduce context parent-child relationship
a352b3260782 drm/i915: Expose logical engine instance to user
b94a2d8dd4a6 drm/i915: Add logical engine mapping
00cd343ff096 drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
f26913441370 drm/i915/guc: Take engine PM when a context is pinned with GuC submission
fa11631fe33a drm/i915/guc: Take GT PM ref when deregistering context
fc9c9fb6630a drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21241/index.html

[-- Attachment #2: Type: text/html, Size: 15512 bytes --]

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 12/26] drm/i915/guc: Implement multi-lrc submission
  2021-10-04 22:06   ` Matthew Brost
@ 2021-10-05  7:55     ` kernel test robot
  -1 siblings, 0 replies; 165+ messages in thread
From: kernel test robot @ 2021-10-05  7:55 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: llvm, kbuild-all, john.c.harrison, daniele.ceraolospurio

[-- Attachment #1: Type: text/plain, Size: 3247 bytes --]

Hi Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-tip/drm-tip]
[cannot apply to drm-intel/for-linux-next drm-exynos/exynos-drm-next tegra-drm/drm/tegra/for-next linus/master airlied/drm-next v5.15-rc3 next-20210922]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20211005-061424
base:   git://anongit.freedesktop.org/drm/drm-tip drm-tip
config: i386-randconfig-a004-20211004 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c0039de2953d15815448b4b3c3bafb45607781e0)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/3bea3cc438df1105d0d8c1bcc01b96559d4bb78c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20211005-061424
        git checkout 3bea3cc438df1105d0d8c1bcc01b96559d4bb78c
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=i386 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1486:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
                   if (multi_lrc_submit(rq)) {
                       ^~~~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1499:9: note: uninitialized use occurs here
           return ret;
                  ^~~
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1486:3: note: remove the 'if' if its condition is always true
                   if (multi_lrc_submit(rq)) {
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1479:9: note: initialize the variable 'ret' to silence this warning
           int ret;
                  ^
                   = 0
   1 warning generated.


vim +1486 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c

  1475	
  1476	static int guc_bypass_tasklet_submit(struct intel_guc *guc,
  1477					     struct i915_request *rq)
  1478	{
  1479		int ret;
  1480	
  1481		__i915_request_submit(rq);
  1482	
  1483		trace_i915_request_in(rq, 0);
  1484	
  1485		if (is_multi_lrc_rq(rq)) {
> 1486			if (multi_lrc_submit(rq)) {
  1487				ret = guc_wq_item_append(guc, rq);
  1488				if (!ret)
  1489					ret = guc_add_request(guc, rq);
  1490			}
  1491		} else {
  1492			guc_set_lrc_tail(rq);
  1493			ret = guc_add_request(guc, rq);
  1494		}
  1495	
  1496		if (unlikely(ret == -EPIPE))
  1497			disable_submission(guc);
  1498	
  1499		return ret;
  1500	}
  1501	
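
A minimal resolution, in line with the compiler's note, is to initialize
'ret' so the multi-lrc path that only buffers the request (i.e. when
multi_lrc_submit() returns false) reports success. A sketch of such a
fix, not necessarily the exact change that lands:

@@ guc_bypass_tasklet_submit() @@ drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
-	int ret;
+	int ret = 0;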

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32105 bytes --]

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 21/26] drm/i915: Multi-BB execbuf
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-05  8:31     ` kernel test robot
  -1 siblings, 0 replies; 165+ messages in thread
From: kernel test robot @ 2021-10-05  8:31 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: llvm, kbuild-all, john.c.harrison, daniele.ceraolospurio

[-- Attachment #1: Type: text/plain, Size: 8330 bytes --]

Hi Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-tip/drm-tip]
[cannot apply to drm-intel/for-linux-next drm-exynos/exynos-drm-next tegra-drm/drm/tegra/for-next linus/master airlied/drm-next v5.15-rc3 next-20210922]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20211005-061424
base:   git://anongit.freedesktop.org/drm/drm-tip drm-tip
config: i386-randconfig-a004-20211004 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c0039de2953d15815448b4b3c3bafb45607781e0)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/758202922dad66c1b302eb34a141961acbefe417
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20211005-061424
        git checkout 758202922dad66c1b302eb34a141961acbefe417
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=i386 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:2361:6: warning: variable 'rq' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
           if (throttle)
               ^~~~~~~~
   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:2365:6: note: uninitialized use occurs here
           if (rq) {
               ^~
   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:2361:2: note: remove the 'if' if its condition is always true
           if (throttle)
           ^~~~~~~~~~~~~
   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:2346:25: note: initialize the variable 'rq' to silence this warning
           struct i915_request *rq;
                                  ^
                                   = NULL
   1 warning generated.


vim +2361 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c

e5dadff4b09376 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-15  2341  
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2342  static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2343  			   bool throttle)
8f2a1057d6ec21 drivers/gpu/drm/i915/i915_gem_execbuffer.c     Chris Wilson      2019-04-25  2344  {
e5dadff4b09376 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-15  2345  	struct intel_timeline *tl;
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2346  	struct i915_request *rq;
8f2a1057d6ec21 drivers/gpu/drm/i915/i915_gem_execbuffer.c     Chris Wilson      2019-04-25  2347  
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2348  	/*
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2349  	 * Take a local wakeref for preparing to dispatch the execbuf as
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2350  	 * we expect to access the hardware fairly frequently in the
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2351  	 * process, and require the engine to be kept awake between accesses.
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2352  	 * Upon dispatch, we acquire another prolonged wakeref that we hold
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2353  	 * until the timeline is idle, which in turn releases the wakeref
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2354  	 * taken on the engine, and the parent device.
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2355  	 */
e5dadff4b09376 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-15  2356  	tl = intel_context_timeline_lock(ce);
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2357  	if (IS_ERR(tl))
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2358  		return PTR_ERR(tl);
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2359  
a4e57f9031ccd5 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-04  2360  	intel_context_enter(ce);
2bf541ff6d06f4 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Maarten Lankhorst 2020-08-19 @2361  	if (throttle)
2bf541ff6d06f4 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Maarten Lankhorst 2020-08-19  2362  		rq = eb_throttle(eb, ce);
e5dadff4b09376 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-15  2363  	intel_context_timeline_unlock(tl);
e5dadff4b09376 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Chris Wilson      2019-08-15  2364  
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2365  	if (rq) {
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2366  		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2367  		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2368  
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2369  		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2370  				      timeout) < 0) {
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2371  			i915_request_put(rq);
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2372  
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2373  			tl = intel_context_timeline_lock(ce);
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2374  			intel_context_exit(ce);
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2375  			intel_context_timeline_unlock(tl);
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2376  
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2377  			if (nonblock)
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2378  				return -EWOULDBLOCK;
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2379  			else
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2380  				return -EINTR;
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2381  		}
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2382  		i915_request_put(rq);
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2383  	}
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2384  
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2385  	return 0;
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2386  }
758202922dad66 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c Matthew Brost     2021-10-04  2387  
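
As with the reports against patch 12, the simplest resolution is likely
the one the note suggests: initialize 'rq' to NULL so the non-throttled
path skips the wait entirely. A sketch of such a fix, not necessarily
the exact change that lands:

@@ eb_pin_timeline() @@ drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
 	struct intel_timeline *tl;
-	struct i915_request *rq;
+	struct i915_request *rq = NULL;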

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32105 bytes --]

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 12/26] drm/i915/guc: Implement multi-lrc submission
  2021-10-04 22:06   ` Matthew Brost
@ 2021-10-05 10:37     ` kernel test robot
  -1 siblings, 0 replies; 165+ messages in thread
From: kernel test robot @ 2021-10-05 10:37 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: llvm, kbuild-all, john.c.harrison, daniele.ceraolospurio

[-- Attachment #1: Type: text/plain, Size: 3247 bytes --]

Hi Matthew,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-tip/drm-tip]
[cannot apply to drm-intel/for-linux-next drm-exynos/exynos-drm-next tegra-drm/drm/tegra/for-next linus/master airlied/drm-next v5.15-rc3 next-20210922]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20211005-061424
base:   git://anongit.freedesktop.org/drm/drm-tip drm-tip
config: x86_64-randconfig-a003-20211004 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c0039de2953d15815448b4b3c3bafb45607781e0)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/3bea3cc438df1105d0d8c1bcc01b96559d4bb78c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20211005-061424
        git checkout 3bea3cc438df1105d0d8c1bcc01b96559d4bb78c
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1486:7: error: variable 'ret' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
                   if (multi_lrc_submit(rq)) {
                       ^~~~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1499:9: note: uninitialized use occurs here
           return ret;
                  ^~~
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1486:3: note: remove the 'if' if its condition is always true
                   if (multi_lrc_submit(rq)) {
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1479:9: note: initialize the variable 'ret' to silence this warning
           int ret;
                  ^
                   = 0
   1 error generated.


vim +1486 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c

  1475	
  1476	static int guc_bypass_tasklet_submit(struct intel_guc *guc,
  1477					     struct i915_request *rq)
  1478	{
  1479		int ret;
  1480	
  1481		__i915_request_submit(rq);
  1482	
  1483		trace_i915_request_in(rq, 0);
  1484	
  1485		if (is_multi_lrc_rq(rq)) {
> 1486			if (multi_lrc_submit(rq)) {
  1487				ret = guc_wq_item_append(guc, rq);
  1488				if (!ret)
  1489					ret = guc_add_request(guc, rq);
  1490			}
  1491		} else {
  1492			guc_set_lrc_tail(rq);
  1493			ret = guc_add_request(guc, rq);
  1494		}
  1495	
  1496		if (unlikely(ret == -EPIPE))
  1497			disable_submission(guc);
  1498	
  1499		return ret;
  1500	}
  1501	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36747 bytes --]

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 21/26] drm/i915: Multi-BB execbuf
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-05 17:02   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-05 17:02 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

On Mon, Oct 04, 2021 at 03:06:32PM -0700, Matthew Brost wrote:
> Allow multiple batch buffers to be submitted in a single execbuf IOCTL
> after a context has been configured with the 'set_parallel' extension.
> The number of batches is implicit, based on the context's configuration.
> 
> This is implemented with a series of loops. First a loop is used to find
> all the batches, a loop to pin all the HW contexts, a loop to create all
> the requests, a loop to submit (emit BB start, etc...) all the requests,
> a loop to tie the requests to the VMAs they touch, and finally a loop to
> commit the requests to the backend.
> 
> A composite fence is also created for the generated requests to return
> to the user and to stick in dma resv slots.
> 
> No behavior from the existing IOCTL should be changed, aside from one
> case: when throttling because the ring for a context is full, we now
> wait on the request while holding the object locks.
> 
> IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> media UMD: https://github.com/intel/media-driver/pull/1252
> 
> v2:
>  (Matthew Brost)
>   - Return proper error value if i915_request_create fails
> v3:
>  (John Harrison)
>   - Add comment explaining create / add order loops + locking
>   - Update commit message explaining different in IOCTL behavior
>   - Line wrap some comments
>   - eb_add_request returns void
>   - Return -EINVAL rather than triggering BUG_ON if cmd parser used
>  (Checkpatch)
>   - Check eb->batch_len[*current_batch]
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 793 ++++++++++++------
>  drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
>  drivers/gpu/drm/i915/gt/intel_context_types.h |  10 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
>  drivers/gpu/drm/i915/i915_request.h           |   9 +
>  drivers/gpu/drm/i915/i915_vma.c               |  21 +-
>  drivers/gpu/drm/i915/i915_vma.h               |  13 +-
>  7 files changed, 599 insertions(+), 257 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 2f2434b52317..5c7fb6f68bbb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -244,17 +244,25 @@ struct i915_execbuffer {
>  	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
>  	struct eb_vma *vma;
>  
> -	struct intel_engine_cs *engine; /** engine to queue the request to */
> +	struct intel_gt *gt; /* gt for the execbuf */
>  	struct intel_context *context; /* logical state for the request */
>  	struct i915_gem_context *gem_context; /** caller's context */
>  
> -	struct i915_request *request; /** our request to build */
> -	struct eb_vma *batch; /** identity of the batch obj/vma */
> +	/** our requests to build */
> +	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
> +	/** identity of the batch obj/vma */
> +	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1];
>  	struct i915_vma *trampoline; /** trampoline used for chaining */
>  
> +	/** used for excl fence in dma_resv objects when > 1 BB submitted */
> +	struct dma_fence *composite_fence;
> +
>  	/** actual size of execobj[] as we may extend it for the cmdparser */
>  	unsigned int buffer_count;
>  
> +	/* number of batches in execbuf IOCTL */
> +	unsigned int num_batches;
> +
>  	/** list of vma not yet bound during reservation phase */
>  	struct list_head unbound;
>  
> @@ -281,7 +289,8 @@ struct i915_execbuffer {
>  
>  	u64 invalid_flags; /** Set of execobj.flags that are invalid */
>  
> -	u64 batch_len; /** Length of batch within object */
> +	/** Length of batch within object */
> +	u64 batch_len[MAX_ENGINE_INSTANCE + 1];
>  	u32 batch_start_offset; /** Location within object of batch */
>  	u32 batch_flags; /** Flags composed for emit_bb_start() */
>  	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
> @@ -299,14 +308,13 @@ struct i915_execbuffer {
>  };
>  
>  static int eb_parse(struct i915_execbuffer *eb);
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
> -					  bool throttle);
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
>  static void eb_unpin_engine(struct i915_execbuffer *eb);
>  
>  static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
>  {
> -	return intel_engine_requires_cmd_parser(eb->engine) ||
> -		(intel_engine_using_cmd_parser(eb->engine) &&
> +	return intel_engine_requires_cmd_parser(eb->context->engine) ||
> +		(intel_engine_using_cmd_parser(eb->context->engine) &&
>  		 eb->args->batch_len);
>  }
>  
> @@ -544,11 +552,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
>  	return 0;
>  }
>  
> -static void
> +static inline bool
> +is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
> +{
> +	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
> +		buffer_idx < eb->num_batches :
> +		buffer_idx >= eb->args->buffer_count - eb->num_batches;
> +}
> +
> +static int
>  eb_add_vma(struct i915_execbuffer *eb,
> -	   unsigned int i, unsigned batch_idx,
> +	   unsigned int *current_batch,
> +	   unsigned int i,
>  	   struct i915_vma *vma)
>  {
> +	struct drm_i915_private *i915 = eb->i915;
>  	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
>  	struct eb_vma *ev = &eb->vma[i];
>  
> @@ -575,15 +593,41 @@ eb_add_vma(struct i915_execbuffer *eb,
>  	 * Note that actual hangs have only been observed on gen7, but for
>  	 * paranoia do it everywhere.
>  	 */
> -	if (i == batch_idx) {
> +	if (is_batch_buffer(eb, i)) {
>  		if (entry->relocation_count &&
>  		    !(ev->flags & EXEC_OBJECT_PINNED))
>  			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
>  		if (eb->reloc_cache.has_fence)
>  			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
>  
> -		eb->batch = ev;
> +		eb->batches[*current_batch] = ev;
> +
> +		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
> +			drm_dbg(&i915->drm,
> +				"Attempting to use self-modifying batch buffer\n");
> +			return -EINVAL;
> +		}
> +
> +		if (range_overflows_t(u64,
> +				      eb->batch_start_offset,
> +				      eb->args->batch_len,
> +				      ev->vma->size)) {
> +			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> +			return -EINVAL;
> +		}
> +
> +		if (eb->args->batch_len == 0)
> +			eb->batch_len[*current_batch] = ev->vma->size -
> +				eb->batch_start_offset;
		else
			eb->batch_len[*current_batch] = eb->args->batch_len;

The fix should resolve the BAT CI issues, as seen in the trybot series below:
https://patchwork.freedesktop.org/series/95436/

Matt 
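
For reference, the branch with that else applied would read as follows
(a sketch of the corrected hunk, not necessarily the exact code that
lands):

		if (eb->args->batch_len == 0)
			eb->batch_len[*current_batch] = ev->vma->size -
				eb->batch_start_offset;
		else
			eb->batch_len[*current_batch] = eb->args->batch_len;

		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
			drm_dbg(&i915->drm, "Invalid batch length\n");
			return -EINVAL;
		}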

> +		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
> +			drm_dbg(&i915->drm, "Invalid batch length\n");
> +			return -EINVAL;
> +		}
> +
> +		++*current_batch;
>  	}
> +
> +	return 0;
>  }
>  
>  static inline int use_cpu_reloc(const struct reloc_cache *cache,
> @@ -727,14 +771,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
>  	} while (1);
>  }
>  
> -static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
> -{
> -	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
> -		return 0;
> -	else
> -		return eb->buffer_count - 1;
> -}
> -
>  static int eb_select_context(struct i915_execbuffer *eb)
>  {
>  	struct i915_gem_context *ctx;
> @@ -839,9 +875,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
>  
>  static int eb_lookup_vmas(struct i915_execbuffer *eb)
>  {
> -	struct drm_i915_private *i915 = eb->i915;
> -	unsigned int batch = eb_batch_index(eb);
> -	unsigned int i;
> +	unsigned int i, current_batch = 0;
>  	int err = 0;
>  
>  	INIT_LIST_HEAD(&eb->relocs);
> @@ -861,7 +895,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>  			goto err;
>  		}
>  
> -		eb_add_vma(eb, i, batch, vma);
> +		err = eb_add_vma(eb, &current_batch, i, vma);
> +		if (err)
> +			return err;
>  
>  		if (i915_gem_object_is_userptr(vma->obj)) {
>  			err = i915_gem_object_userptr_submit_init(vma->obj);
> @@ -884,26 +920,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>  		}
>  	}
>  
> -	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
> -		drm_dbg(&i915->drm,
> -			"Attempting to use self-modifying batch buffer\n");
> -		return -EINVAL;
> -	}
> -
> -	if (range_overflows_t(u64,
> -			      eb->batch_start_offset, eb->batch_len,
> -			      eb->batch->vma->size)) {
> -		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> -		return -EINVAL;
> -	}
> -
> -	if (eb->batch_len == 0)
> -		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
> -	if (unlikely(eb->batch_len == 0)) { /* impossible! */
> -		drm_dbg(&i915->drm, "Invalid batch length\n");
> -		return -EINVAL;
> -	}
> -
>  	return 0;
>  
>  err:
> @@ -1636,8 +1652,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
>  	return 0;
>  }
>  
> -static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> -					   struct i915_request *rq)
> +static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
>  {
>  	bool have_copy = false;
>  	struct eb_vma *ev;
> @@ -1653,21 +1668,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  	eb_release_vmas(eb, false);
>  	i915_gem_ww_ctx_fini(&eb->ww);
>  
> -	if (rq) {
> -		/* nonblocking is always false */
> -		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> -				      MAX_SCHEDULE_TIMEOUT) < 0) {
> -			i915_request_put(rq);
> -			rq = NULL;
> -
> -			err = -EINTR;
> -			goto err_relock;
> -		}
> -
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>  	/*
>  	 * We take 3 passes through the slowpatch.
>  	 *
> @@ -1694,28 +1694,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  	if (!err)
>  		err = eb_reinit_userptr(eb);
>  
> -err_relock:
>  	i915_gem_ww_ctx_init(&eb->ww, true);
>  	if (err)
>  		goto out;
>  
>  	/* reacquire the objects */
>  repeat_validate:
> -	rq = eb_pin_engine(eb, false);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, false);
> +	if (err)
>  		goto err;
> -	}
> -
> -	/* We didn't throttle, should be NULL */
> -	GEM_WARN_ON(rq);
>  
>  	err = eb_validate_vmas(eb);
>  	if (err)
>  		goto err;
>  
> -	GEM_BUG_ON(!eb->batch);
> +	GEM_BUG_ON(!eb->batches[0]);
>  
>  	list_for_each_entry(ev, &eb->relocs, reloc_link) {
>  		if (!have_copy) {
> @@ -1779,46 +1772,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  		}
>  	}
>  
> -	if (rq)
> -		i915_request_put(rq);
> -
>  	return err;
>  }
>  
>  static int eb_relocate_parse(struct i915_execbuffer *eb)
>  {
>  	int err;
> -	struct i915_request *rq = NULL;
>  	bool throttle = true;
>  
>  retry:
> -	rq = eb_pin_engine(eb, throttle);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, throttle);
> +	if (err) {
>  		if (err != -EDEADLK)
>  			return err;
>  
>  		goto err;
>  	}
>  
> -	if (rq) {
> -		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> -
> -		/* Need to drop all locks now for throttling, take slowpath */
> -		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
> -		if (err == -ETIME) {
> -			if (nonblock) {
> -				err = -EWOULDBLOCK;
> -				i915_request_put(rq);
> -				goto err;
> -			}
> -			goto slow;
> -		}
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>  	/* only throttle once, even if we didn't need to throttle */
>  	throttle = false;
>  
> @@ -1858,7 +1828,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>  	return err;
>  
>  slow:
> -	err = eb_relocate_parse_slow(eb, rq);
> +	err = eb_relocate_parse_slow(eb);
>  	if (err)
>  		/*
>  		 * If the user expects the execobject.offset and
> @@ -1872,11 +1842,40 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>  	return err;
>  }
>  
> +/*
> + * Using two helper loops for the order in which requests / batches are
> + * created and added to the backend. Requests are created in order from the
> + * parent to the last child. Requests are added in the reverse order, from the
> + * last child to the parent. This is done for locking reasons, as the timeline
> + * lock is acquired during request creation and released when the request is
> + * added to the backend. To make lockdep happy (see
> + * intel_context_timeline_lock) this must be the ordering.
> + */
> +#define for_each_batch_create_order(_eb, _i) \
> +	for (_i = 0; _i < (_eb)->num_batches; ++_i)
> +#define for_each_batch_add_order(_eb, _i) \
> +	BUILD_BUG_ON(!typecheck(int, _i)); \
> +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
> +
> +static struct i915_request *
> +eb_find_first_request_added(struct i915_execbuffer *eb)
> +{
> +	int i;
> +
> +	for_each_batch_add_order(eb, i)
> +		if (eb->requests[i])
> +			return eb->requests[i];
> +
> +	GEM_BUG_ON("Request not found");
> +
> +	return NULL;
> +}
> +
>  static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  {
>  	const unsigned int count = eb->buffer_count;
>  	unsigned int i = count;
> -	int err = 0;
> +	int err = 0, j;
>  
>  	while (i--) {
>  		struct eb_vma *ev = &eb->vma[i];
> @@ -1889,11 +1888,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  		if (flags & EXEC_OBJECT_CAPTURE) {
>  			struct i915_capture_list *capture;
>  
> -			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> -			if (capture) {
> -				capture->next = eb->request->capture_list;
> -				capture->vma = vma;
> -				eb->request->capture_list = capture;
> +			for_each_batch_create_order(eb, j) {
> +				if (!eb->requests[j])
> +					break;
> +
> +				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> +				if (capture) {
> +					capture->next =
> +						eb->requests[j]->capture_list;
> +					capture->vma = vma;
> +					eb->requests[j]->capture_list = capture;
> +				}
>  			}
>  		}
>  
> @@ -1914,14 +1919,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  				flags &= ~EXEC_OBJECT_ASYNC;
>  		}
>  
> +		/* We only need to await on the first request */
>  		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
>  			err = i915_request_await_object
> -				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
> +				(eb_find_first_request_added(eb), obj,
> +				 flags & EXEC_OBJECT_WRITE);
>  		}
>  
> -		if (err == 0)
> -			err = i915_vma_move_to_active(vma, eb->request,
> -						      flags | __EXEC_OBJECT_NO_RESERVE);
> +		for_each_batch_add_order(eb, j) {
> +			if (err)
> +				break;
> +			if (!eb->requests[j])
> +				continue;
> +
> +			err = _i915_vma_move_to_active(vma, eb->requests[j],
> +						       j ? NULL :
> +						       eb->composite_fence ?
> +						       eb->composite_fence :
> +						       &eb->requests[j]->fence,
> +						       flags | __EXEC_OBJECT_NO_RESERVE);
> +		}
>  	}
>  
>  #ifdef CONFIG_MMU_NOTIFIER
> @@ -1952,11 +1969,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  		goto err_skip;
>  
>  	/* Unconditionally flush any chipset caches (for streaming writes). */
> -	intel_gt_chipset_flush(eb->engine->gt);
> +	intel_gt_chipset_flush(eb->gt);
>  	return 0;
>  
>  err_skip:
> -	i915_request_set_error_once(eb->request, err);
> +	for_each_batch_create_order(eb, j) {
> +		if (!eb->requests[j])
> +			break;
> +
> +		i915_request_set_error_once(eb->requests[j], err);
> +	}
>  	return err;
>  }
>  
> @@ -2051,14 +2073,17 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	int err;
>  
>  	if (!eb_use_cmdparser(eb)) {
> -		batch = eb_dispatch_secure(eb, eb->batch->vma);
> +		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
>  		if (IS_ERR(batch))
>  			return PTR_ERR(batch);
>  
>  		goto secure_batch;
>  	}
>  
> -	len = eb->batch_len;
> +	if (intel_context_is_parallel(eb->context))
> +		return -EINVAL;
> +
> +	len = eb->batch_len[0];
>  	if (!CMDPARSER_USES_GGTT(eb->i915)) {
>  		/*
>  		 * ppGTT backed shadow buffers must be mapped RO, to prevent
> @@ -2072,11 +2097,11 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	} else {
>  		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
>  	}
> -	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
> +	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
>  		return -EINVAL;
>  
>  	if (!pool) {
> -		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
> +		pool = intel_gt_get_buffer_pool(eb->gt, len,
>  						I915_MAP_WB);
>  		if (IS_ERR(pool))
>  			return PTR_ERR(pool);
> @@ -2101,7 +2126,7 @@ static int eb_parse(struct i915_execbuffer *eb)
>  		trampoline = shadow;
>  
>  		shadow = shadow_batch_pin(eb, pool->obj,
> -					  &eb->engine->gt->ggtt->vm,
> +					  &eb->gt->ggtt->vm,
>  					  PIN_GLOBAL);
>  		if (IS_ERR(shadow)) {
>  			err = PTR_ERR(shadow);
> @@ -2123,26 +2148,29 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	if (err)
>  		goto err_trampoline;
>  
> -	err = intel_engine_cmd_parser(eb->engine,
> -				      eb->batch->vma,
> +	err = intel_engine_cmd_parser(eb->context->engine,
> +				      eb->batches[0]->vma,
>  				      eb->batch_start_offset,
> -				      eb->batch_len,
> +				      eb->batch_len[0],
>  				      shadow, trampoline);
>  	if (err)
>  		goto err_unpin_batch;
>  
> -	eb->batch = &eb->vma[eb->buffer_count++];
> -	eb->batch->vma = i915_vma_get(shadow);
> -	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> +	eb->batches[0] = &eb->vma[eb->buffer_count++];
> +	eb->batches[0]->vma = i915_vma_get(shadow);
> +	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
>  
>  	eb->trampoline = trampoline;
>  	eb->batch_start_offset = 0;
>  
>  secure_batch:
>  	if (batch) {
> -		eb->batch = &eb->vma[eb->buffer_count++];
> -		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> -		eb->batch->vma = i915_vma_get(batch);
> +		if (intel_context_is_parallel(eb->context))
> +			return -EINVAL;
> +
> +		eb->batches[0] = &eb->vma[eb->buffer_count++];
> +		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
> +		eb->batches[0]->vma = i915_vma_get(batch);
>  	}
>  	return 0;
>  
> @@ -2158,19 +2186,18 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	return err;
>  }
>  
> -static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> +static int eb_request_submit(struct i915_execbuffer *eb,
> +			     struct i915_request *rq,
> +			     struct i915_vma *batch,
> +			     u64 batch_len)
>  {
>  	int err;
>  
> -	if (intel_context_nopreempt(eb->context))
> -		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
> -
> -	err = eb_move_to_gpu(eb);
> -	if (err)
> -		return err;
> +	if (intel_context_nopreempt(rq->context))
> +		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
>  
>  	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
> -		err = i915_reset_gen7_sol_offsets(eb->request);
> +		err = i915_reset_gen7_sol_offsets(rq);
>  		if (err)
>  			return err;
>  	}
> @@ -2181,26 +2208,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>  	 * allows us to determine if the batch is still waiting on the GPU
>  	 * or actually running by checking the breadcrumb.
>  	 */
> -	if (eb->engine->emit_init_breadcrumb) {
> -		err = eb->engine->emit_init_breadcrumb(eb->request);
> +	if (rq->context->engine->emit_init_breadcrumb) {
> +		err = rq->context->engine->emit_init_breadcrumb(rq);
>  		if (err)
>  			return err;
>  	}
>  
> -	err = eb->engine->emit_bb_start(eb->request,
> -					batch->node.start +
> -					eb->batch_start_offset,
> -					eb->batch_len,
> -					eb->batch_flags);
> +	err = rq->context->engine->emit_bb_start(rq,
> +						 batch->node.start +
> +						 eb->batch_start_offset,
> +						 batch_len,
> +						 eb->batch_flags);
>  	if (err)
>  		return err;
>  
>  	if (eb->trampoline) {
> +		GEM_BUG_ON(intel_context_is_parallel(rq->context));
>  		GEM_BUG_ON(eb->batch_start_offset);
> -		err = eb->engine->emit_bb_start(eb->request,
> -						eb->trampoline->node.start +
> -						eb->batch_len,
> -						0, 0);
> +		err = rq->context->engine->emit_bb_start(rq,
> +							 eb->trampoline->node.start +
> +							 batch_len, 0, 0);
>  		if (err)
>  			return err;
>  	}
> @@ -2208,6 +2235,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>  	return 0;
>  }
>  
> +static int eb_submit(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +	int err;
> +
> +	err = eb_move_to_gpu(eb);
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
> +		if (!err)
> +			err = eb_request_submit(eb, eb->requests[i],
> +						eb->batches[i]->vma,
> +						eb->batch_len[i]);
> +	}
> +
> +	return err;
> +}
> +
>  static int num_vcs_engines(const struct drm_i915_private *i915)
>  {
>  	return hweight_long(VDBOX_MASK(&i915->gt));
> @@ -2273,26 +2321,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
>  	return i915_request_get(rq);
>  }
>  
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
> +			   bool throttle)
>  {
> -	struct intel_context *ce = eb->context;
>  	struct intel_timeline *tl;
> -	struct i915_request *rq = NULL;
> -	int err;
> -
> -	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> -
> -	if (unlikely(intel_context_is_banned(ce)))
> -		return ERR_PTR(-EIO);
> -
> -	/*
> -	 * Pinning the contexts may generate requests in order to acquire
> -	 * GGTT space, so do this first before we reserve a seqno for
> -	 * ourselves.
> -	 */
> -	err = intel_context_pin_ww(ce, &eb->ww);
> -	if (err)
> -		return ERR_PTR(err);
> +	struct i915_request *rq;
>  
>  	/*
>  	 * Take a local wakeref for preparing to dispatch the execbuf as
> @@ -2303,33 +2336,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
>  	 * taken on the engine, and the parent device.
>  	 */
>  	tl = intel_context_timeline_lock(ce);
> -	if (IS_ERR(tl)) {
> -		intel_context_unpin(ce);
> -		return ERR_CAST(tl);
> -	}
> +	if (IS_ERR(tl))
> +		return PTR_ERR(tl);
>  
>  	intel_context_enter(ce);
>  	if (throttle)
>  		rq = eb_throttle(eb, ce);
>  	intel_context_timeline_unlock(tl);
>  
> +	if (rq) {
> +		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> +		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
> +
> +		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> +				      timeout) < 0) {
> +			i915_request_put(rq);
> +
> +			tl = intel_context_timeline_lock(ce);
> +			intel_context_exit(ce);
> +			intel_context_timeline_unlock(tl);
> +
> +			if (nonblock)
> +				return -EWOULDBLOCK;
> +			else
> +				return -EINTR;
> +		}
> +		i915_request_put(rq);
> +	}
> +
> +	return 0;
> +}
> +
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +{
> +	struct intel_context *ce = eb->context, *child;
> +	int err;
> +	int i = 0, j = 0;
> +
> +	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> +
> +	if (unlikely(intel_context_is_banned(ce)))
> +		return -EIO;
> +
> +	/*
> +	 * Pinning the contexts may generate requests in order to acquire
> +	 * GGTT space, so do this first before we reserve a seqno for
> +	 * ourselves.
> +	 */
> +	err = intel_context_pin_ww(ce, &eb->ww);
> +	if (err)
> +		return err;
> +	for_each_child(ce, child) {
> +		err = intel_context_pin_ww(child, &eb->ww);
> +		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
> +	}
> +
> +	for_each_child(ce, child) {
> +		err = eb_pin_timeline(eb, child, throttle);
> +		if (err)
> +			goto unwind;
> +		++i;
> +	}
> +	err = eb_pin_timeline(eb, ce, throttle);
> +	if (err)
> +		goto unwind;
> +
>  	eb->args->flags |= __EXEC_ENGINE_PINNED;
> -	return rq;
> +	return 0;
> +
> +unwind:
> +	for_each_child(ce, child) {
> +		if (j++ < i) {
> +			mutex_lock(&child->timeline->mutex);
> +			intel_context_exit(child);
> +			mutex_unlock(&child->timeline->mutex);
> +		}
> +	}
> +	for_each_child(ce, child)
> +		intel_context_unpin(child);
> +	intel_context_unpin(ce);
> +	return err;
>  }
>  
>  static void eb_unpin_engine(struct i915_execbuffer *eb)
>  {
> -	struct intel_context *ce = eb->context;
> -	struct intel_timeline *tl = ce->timeline;
> +	struct intel_context *ce = eb->context, *child;
>  
>  	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
>  		return;
>  
>  	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
>  
> -	mutex_lock(&tl->mutex);
> +	for_each_child(ce, child) {
> +		mutex_lock(&child->timeline->mutex);
> +		intel_context_exit(child);
> +		mutex_unlock(&child->timeline->mutex);
> +
> +		intel_context_unpin(child);
> +	}
> +
> +	mutex_lock(&ce->timeline->mutex);
>  	intel_context_exit(ce);
> -	mutex_unlock(&tl->mutex);
> +	mutex_unlock(&ce->timeline->mutex);
>  
>  	intel_context_unpin(ce);
>  }
> @@ -2380,7 +2488,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
>  static int
>  eb_select_engine(struct i915_execbuffer *eb)
>  {
> -	struct intel_context *ce;
> +	struct intel_context *ce, *child;
>  	unsigned int idx;
>  	int err;
>  
> @@ -2393,6 +2501,20 @@ eb_select_engine(struct i915_execbuffer *eb)
>  	if (IS_ERR(ce))
>  		return PTR_ERR(ce);
>  
> +	if (intel_context_is_parallel(ce)) {
> +		if (eb->buffer_count < ce->parallel.number_children + 1) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +		if (eb->batch_start_offset || eb->args->batch_len) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +	}
> +	eb->num_batches = ce->parallel.number_children + 1;
> +
> +	for_each_child(ce, child)
> +		intel_context_get(child);
>  	intel_gt_pm_get(ce->engine->gt);
>  
>  	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
> @@ -2400,6 +2522,13 @@ eb_select_engine(struct i915_execbuffer *eb)
>  		if (err)
>  			goto err;
>  	}
> +	for_each_child(ce, child) {
> +		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
> +			err = intel_context_alloc_state(child);
> +			if (err)
> +				goto err;
> +		}
> +	}
>  
>  	/*
>  	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
> @@ -2410,7 +2539,7 @@ eb_select_engine(struct i915_execbuffer *eb)
>  		goto err;
>  
>  	eb->context = ce;
> -	eb->engine = ce->engine;
> +	eb->gt = ce->engine->gt;
>  
>  	/*
>  	 * Make sure engine pool stays alive even if we call intel_context_put
> @@ -2421,6 +2550,8 @@ eb_select_engine(struct i915_execbuffer *eb)
>  
>  err:
>  	intel_gt_pm_put(ce->engine->gt);
> +	for_each_child(ce, child)
> +		intel_context_put(child);
>  	intel_context_put(ce);
>  	return err;
>  }
> @@ -2428,7 +2559,11 @@ eb_select_engine(struct i915_execbuffer *eb)
>  static void
>  eb_put_engine(struct i915_execbuffer *eb)
>  {
> -	intel_gt_pm_put(eb->engine->gt);
> +	struct intel_context *child;
> +
> +	intel_gt_pm_put(eb->gt);
> +	for_each_child(eb->context, child)
> +		intel_context_put(child);
>  	intel_context_put(eb->context);
>  }
>  
> @@ -2651,7 +2786,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
>  }
>  
>  static int
> -await_fence_array(struct i915_execbuffer *eb)
> +await_fence_array(struct i915_execbuffer *eb,
> +		  struct i915_request *rq)
>  {
>  	unsigned int n;
>  	int err;
> @@ -2665,8 +2801,7 @@ await_fence_array(struct i915_execbuffer *eb)
>  		if (!eb->fences[n].dma_fence)
>  			continue;
>  
> -		err = i915_request_await_dma_fence(eb->request,
> -						   eb->fences[n].dma_fence);
> +		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
>  		if (err < 0)
>  			return err;
>  	}
> @@ -2674,9 +2809,9 @@ await_fence_array(struct i915_execbuffer *eb)
>  	return 0;
>  }
>  
> -static void signal_fence_array(const struct i915_execbuffer *eb)
> +static void signal_fence_array(const struct i915_execbuffer *eb,
> +			       struct dma_fence * const fence)
>  {
> -	struct dma_fence * const fence = &eb->request->fence;
>  	unsigned int n;
>  
>  	for (n = 0; n < eb->num_fences; n++) {
> @@ -2724,9 +2859,8 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
>  			break;
>  }
>  
> -static int eb_request_add(struct i915_execbuffer *eb, int err)
> +static void eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
>  {
> -	struct i915_request *rq = eb->request;
>  	struct intel_timeline * const tl = i915_request_timeline(rq);
>  	struct i915_sched_attr attr = {};
>  	struct i915_request *prev;
> @@ -2741,11 +2875,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>  	/* Check that the context wasn't destroyed before submission */
>  	if (likely(!intel_context_is_closed(eb->context))) {
>  		attr = eb->gem_context->sched;
> -	} else {
> -		/* Serialise with context_close via the add_to_timeline */
> -		i915_request_set_error_once(rq, -ENOENT);
> -		__i915_request_skip(rq);
> -		err = -ENOENT; /* override any transient errors */
>  	}
>  
>  	__i915_request_queue(rq, &attr);
> @@ -2755,6 +2884,42 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>  		retire_requests(tl, prev);
>  
>  	mutex_unlock(&tl->mutex);
> +}
> +
> +static int eb_requests_add(struct i915_execbuffer *eb, int err)
> +{
> +	int i;
> +
> +	/*
> +	 * We iterate in reverse order of creation so that the timeline
> +	 * mutexes are released in the reverse order they were acquired.
> +	 */
> +	for_each_batch_add_order(eb, i) {
> +		struct i915_request *rq = eb->requests[i];
> +
> +		if (!rq)
> +			continue;
> +
> +		if (unlikely(intel_context_is_closed(eb->context))) {
> +			/* Serialise with context_close via the add_to_timeline */
> +			i915_request_set_error_once(rq, -ENOENT);
> +			__i915_request_skip(rq);
> +			err = -ENOENT; /* override any transient errors */
> +		}
> +
> +		if (intel_context_is_parallel(eb->context)) {
> +			if (err) {
> +				__i915_request_skip(rq);
> +				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
> +					&rq->fence.flags);
> +			}
> +			if (i == 0)
> +				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +					&rq->fence.flags);
> +		}
> +
> +		eb_request_add(eb, rq);
> +	}
>  
>  	return err;
>  }
> @@ -2785,6 +2950,182 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
>  				    eb);
>  }
>  
> +static void eb_requests_get(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_get(eb->requests[i]);
> +	}
> +}
> +
> +static void eb_requests_put(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_put(eb->requests[i]);
> +	}
> +}
> +
> +static struct sync_file *
> +eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	struct dma_fence_array *fence_array;
> +	struct dma_fence **fences;
> +	unsigned int i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(eb->context));
> +
> +	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
> +	if (!fences)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for_each_batch_create_order(eb, i)
> +		fences[i] = &eb->requests[i]->fence;
> +
> +	fence_array = dma_fence_array_create(eb->num_batches,
> +					     fences,
> +					     eb->context->parallel.fence_context,
> +					     eb->context->parallel.seqno,
> +					     false);
> +	if (!fence_array) {
> +		kfree(fences);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* Move ownership to the dma_fence_array created above */
> +	for_each_batch_create_order(eb, i)
> +		dma_fence_get(fences[i]);
> +
> +	if (out_fence_fd != -1) {
> +		out_fence = sync_file_create(&fence_array->base);
> +		/* sync_file now owns fence_array, drop creation ref */
> +		dma_fence_put(&fence_array->base);
> +		if (!out_fence)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
> +	eb->composite_fence = &fence_array->base;
> +
> +	return out_fence;
> +}
> +
> +static struct sync_file *
> +eb_fences_add(struct i915_execbuffer *eb, struct i915_request *rq,
> +	      struct dma_fence *in_fence, int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	int err;
> +
> +	if (unlikely(eb->gem_context->syncobj)) {
> +		struct dma_fence *fence;
> +
> +		fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
> +		err = i915_request_await_dma_fence(rq, fence);
> +		dma_fence_put(fence);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (in_fence) {
> +		if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
> +			err = i915_request_await_execution(rq, in_fence);
> +		else
> +			err = i915_request_await_dma_fence(rq, in_fence);
> +		if (err < 0)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (eb->fences) {
> +		err = await_fence_array(eb, rq);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (intel_context_is_parallel(eb->context)) {
> +		out_fence = eb_composite_fence_create(eb, out_fence_fd);
> +		if (IS_ERR(out_fence))
> +			return ERR_PTR(-ENOMEM);
> +	} else if (out_fence_fd != -1) {
> +		out_fence = sync_file_create(&rq->fence);
> +		if (!out_fence)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
> +	return out_fence;
> +}
> +
> +static struct intel_context *
> +eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
> +{
> +	struct intel_context *child;
> +
> +	if (likely(context_number == 0))
> +		return eb->context;
> +
> +	for_each_child(eb->context, child)
> +		if (!--context_number)
> +			return child;
> +
> +	GEM_BUG_ON("Context not found");
> +
> +	return NULL;
> +}
> +
> +static struct sync_file *
> +eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
> +		   int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		/* Allocate a request for this batch buffer nice and early. */
> +		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
> +		if (IS_ERR(eb->requests[i])) {
> +			out_fence = ERR_PTR(PTR_ERR(eb->requests[i]));
> +			eb->requests[i] = NULL;
> +			return out_fence;
> +		}
> +
> +		/*
> +		 * Only the first request added (committed to backend) has to
> +		 * take the in fences into account as all subsequent requests
> +		 * will have fences inserted in between them.
> +		 */
> +		if (i + 1 == eb->num_batches) {
> +			out_fence = eb_fences_add(eb, eb->requests[i],
> +						  in_fence, out_fence_fd);
> +			if (IS_ERR(out_fence))
> +				return out_fence;
> +		}
> +
> +		/*
> +		 * Whilst this request exists, batch_obj will be on the
> +		 * active_list, and so will hold the active reference. Only when
> +		 * this request is retired will the batch_obj be moved onto
> +		 * the inactive_list and lose its active reference. Hence we do
> +		 * not need to explicitly hold another reference here.
> +		 */
> +		eb->requests[i]->batch = eb->batches[i]->vma;
> +		if (eb->batch_pool) {
> +			GEM_BUG_ON(intel_context_is_parallel(eb->context));
> +			intel_gt_buffer_pool_mark_active(eb->batch_pool,
> +							 eb->requests[i]);
> +		}
> +	}
> +
> +	return out_fence;
> +}
> +
>  static int
>  i915_gem_do_execbuffer(struct drm_device *dev,
>  		       struct drm_file *file,
> @@ -2795,7 +3136,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	struct i915_execbuffer eb;
>  	struct dma_fence *in_fence = NULL;
>  	struct sync_file *out_fence = NULL;
> -	struct i915_vma *batch;
>  	int out_fence_fd = -1;
>  	int err;
>  
> @@ -2819,12 +3159,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	eb.buffer_count = args->buffer_count;
>  	eb.batch_start_offset = args->batch_start_offset;
> -	eb.batch_len = args->batch_len;
>  	eb.trampoline = NULL;
>  
>  	eb.fences = NULL;
>  	eb.num_fences = 0;
>  
> +	memset(eb.requests, 0, sizeof(struct i915_request *) *
> +	       ARRAY_SIZE(eb.requests));
> +	eb.composite_fence = NULL;
> +
>  	eb.batch_flags = 0;
>  	if (args->flags & I915_EXEC_SECURE) {
>  		if (GRAPHICS_VER(i915) >= 11)
> @@ -2908,70 +3251,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	ww_acquire_done(&eb.ww.ctx);
>  
> -	batch = eb.batch->vma;
> -
> -	/* Allocate a request for this batch buffer nice and early. */
> -	eb.request = i915_request_create(eb.context);
> -	if (IS_ERR(eb.request)) {
> -		err = PTR_ERR(eb.request);
> -		goto err_vma;
> -	}
> -
> -	if (unlikely(eb.gem_context->syncobj)) {
> -		struct dma_fence *fence;
> -
> -		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
> -		err = i915_request_await_dma_fence(eb.request, fence);
> -		dma_fence_put(fence);
> -		if (err)
> -			goto err_ext;
> -	}
> -
> -	if (in_fence) {
> -		if (args->flags & I915_EXEC_FENCE_SUBMIT)
> -			err = i915_request_await_execution(eb.request,
> -							   in_fence);
> -		else
> -			err = i915_request_await_dma_fence(eb.request,
> -							   in_fence);
> -		if (err < 0)
> -			goto err_request;
> -	}
> -
> -	if (eb.fences) {
> -		err = await_fence_array(&eb);
> -		if (err)
> +	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
> +	if (IS_ERR(out_fence)) {
> +		err = PTR_ERR(out_fence);
> +		if (eb.requests[0])
>  			goto err_request;
> +		else
> +			goto err_vma;
>  	}
>  
> -	if (out_fence_fd != -1) {
> -		out_fence = sync_file_create(&eb.request->fence);
> -		if (!out_fence) {
> -			err = -ENOMEM;
> -			goto err_request;
> -		}
> -	}
> -
> -	/*
> -	 * Whilst this request exists, batch_obj will be on the
> -	 * active_list, and so will hold the active reference. Only when this
> -	 * request is retired will the the batch_obj be moved onto the
> -	 * inactive_list and lose its active reference. Hence we do not need
> -	 * to explicitly hold another reference here.
> -	 */
> -	eb.request->batch = batch;
> -	if (eb.batch_pool)
> -		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
> -
> -	trace_i915_request_queue(eb.request, eb.batch_flags);
> -	err = eb_submit(&eb, batch);
> +	err = eb_submit(&eb);
>  
>  err_request:
> -	i915_request_get(eb.request);
> -	err = eb_request_add(&eb, err);
> +	eb_requests_get(&eb);
> +	err = eb_requests_add(&eb, err);
>  
>  	if (eb.fences)
> -		signal_fence_array(&eb);
> +		signal_fence_array(&eb, eb.composite_fence ?
> +				   eb.composite_fence :
> +				   &eb.requests[0]->fence);
>  
>  	if (out_fence) {
>  		if (err == 0) {
> @@ -2986,10 +3284,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	if (unlikely(eb.gem_context->syncobj)) {
>  		drm_syncobj_replace_fence(eb.gem_context->syncobj,
> -					  &eb.request->fence);
> +					  eb.composite_fence ?
> +					  eb.composite_fence :
> +					  &eb.requests[0]->fence);
>  	}
>  
> -	i915_request_put(eb.request);
> +	if (!out_fence && eb.composite_fence)
> +		dma_fence_put(eb.composite_fence);
> +
> +	eb_requests_put(&eb);
>  
>  err_vma:
>  	eb_release_vmas(&eb, true);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index 1bc705f98e2a..1781419fa105 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -239,7 +239,13 @@ intel_context_timeline_lock(struct intel_context *ce)
>  	struct intel_timeline *tl = ce->timeline;
>  	int err;
>  
> -	err = mutex_lock_interruptible(&tl->mutex);
> +	if (intel_context_is_parent(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
> +	else if (intel_context_is_child(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex,
> +						      ce->parallel.child_index + 1);
> +	else
> +		err = mutex_lock_interruptible(&tl->mutex);
>  	if (err)
>  		return ERR_PTR(err);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 95a5b94b4ece..9e0177dc5484 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -248,6 +248,16 @@ struct intel_context {
>  		 * context
>  		 */
>  		struct i915_request *last_rq;
> +		/**
> +		 * @fence_context: fence context composite fence when doing
> +		 * parallel submission
> +		 */
> +		u64 fence_context;
> +		/**
> +		 * @seqno: seqno for composite fence when doing parallel
> +		 * submission
> +		 */
> +		u32 seqno;
>  		/** @number_children: number of children if parent */
>  		u8 number_children;
>  		/** @child_index: index into child_list if child */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f28e36aa77c2..83b0d2a114af 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3094,6 +3094,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
>  		}
>  	}
>  
> +	parent->parallel.fence_context = dma_fence_context_alloc(1);
> +
>  	parent->engine->emit_bb_start =
>  		emit_bb_start_parent_no_preempt_mid_batch;
>  	parent->engine->emit_fini_breadcrumb =
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 8950785e55d6..24db8459376b 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -147,6 +147,15 @@ enum {
>  	 * tail.
>  	 */
>  	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +
> +	/*
> +	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
> +	 * parent-child relationship (parallel submission, multi-lrc) that
> +	 * hit an error while generating requests in the execbuf IOCTL.
> +	 * Indicates this request should be skipped as another request in
> +	 * submission / relationship encountered an error.
> +	 */
> +	I915_FENCE_FLAG_SKIP_PARALLEL,
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index 4b7fc4647e46..90546fa58fc1 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>  	return i915_active_add_request(&vma->active, rq);
>  }
>  
> -int i915_vma_move_to_active(struct i915_vma *vma,
> -			    struct i915_request *rq,
> -			    unsigned int flags)
> +int _i915_vma_move_to_active(struct i915_vma *vma,
> +			     struct i915_request *rq,
> +			     struct dma_fence *fence,
> +			     unsigned int flags)
>  {
>  	struct drm_i915_gem_object *obj = vma->obj;
>  	int err;
> @@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>  			intel_frontbuffer_put(front);
>  		}
>  
> -		dma_resv_add_excl_fence(vma->resv, &rq->fence);
> -		obj->write_domain = I915_GEM_DOMAIN_RENDER;
> -		obj->read_domains = 0;
> +		if (fence) {
> +			dma_resv_add_excl_fence(vma->resv, fence);
> +			obj->write_domain = I915_GEM_DOMAIN_RENDER;
> +			obj->read_domains = 0;
> +		}
>  	} else {
>  		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
>  			err = dma_resv_reserve_shared(vma->resv, 1);
> @@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>  				return err;
>  		}
>  
> -		dma_resv_add_shared_fence(vma->resv, &rq->fence);
> -		obj->write_domain = 0;
> +		if (fence) {
> +			dma_resv_add_shared_fence(vma->resv, fence);
> +			obj->write_domain = 0;
> +		}
>  	}
>  
>  	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
> diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> index ed69f66c7ab0..648dbe744c96 100644
> --- a/drivers/gpu/drm/i915/i915_vma.h
> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
>  
>  int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
>  					   struct i915_request *rq);
> -int __must_check i915_vma_move_to_active(struct i915_vma *vma,
> -					 struct i915_request *rq,
> -					 unsigned int flags);
> +int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
> +					  struct i915_request *rq,
> +					  struct dma_fence *fence,
> +					  unsigned int flags);
> +static inline int __must_check
> +i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
> +			unsigned int flags)
> +{
> +	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
> +}
>  
>  #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
>  
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 21/26] drm/i915: Multi-BB execbuf
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
                     ` (2 preceding siblings ...)
  (?)
@ 2021-10-06 20:46   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-06 20:46 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio

On Mon, Oct 04, 2021 at 03:06:32PM -0700, Matthew Brost wrote:
> Allow multiple batch buffers to be submitted in a single execbuf IOCTL
> after a context has been configured with the 'set_parallel' extension.
> The number of batches is implicit based on the context's configuration.
> 
> This is implemented with a series of loops. First a loop is used to find
> all the batches, a loop to pin all the HW contexts, a loop to create all
> the requests, a loop to submit (emit BB start, etc...) all the requests,
> a loop to tie the requests to the VMAs they touch, and finally a loop to
> commit the requests to the backend.
> 
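
A condensed sketch of that ordering, using the two helper macros this
patch adds further down (illustrative only -- see eb_requests_create()
and eb_requests_add() below for the real code, error handling omitted):

	/* Requests are created in order, parent first: */
	for_each_batch_create_order(eb, i)
		eb->requests[i] = i915_request_create(eb_find_context(eb, i));

	/*
	 * ... and added to the backend in reverse, last child first, so
	 * the timeline mutexes taken at creation are released in the
	 * opposite order they were acquired.
	 */
	for_each_batch_add_order(eb, i)
		eb_request_add(eb, eb->requests[i]);
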
> A composite fence is also created for the generated requests to return
> to the user and to stick in dma resv slots.
> 
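
The composite fence is simply a dma_fence_array over the per-request
fences, keyed by a fence context allocated once per parallel context.
Condensed from eb_composite_fence_create() below (allocation failure
handling and refcounting omitted):

	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
	for_each_batch_create_order(eb, i)
		fences[i] = &eb->requests[i]->fence;

	/* one fence context + seqno per parallel submission */
	fence_array = dma_fence_array_create(eb->num_batches, fences,
					     eb->context->parallel.fence_context,
					     eb->context->parallel.seqno,
					     false);

The array's base fence is what lands in the dma-resv excl slot and, via
sync_file, in the out-fence fd.
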
> No behavior from the existing IOCTL should be changed aside from
> throttling: when the ring for a context is full, we now wait on the
> request while holding the object locks.
> 
> IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> media UMD: https://github.com/intel/media-driver/pull/1252
> 
> v2:
>  (Matthew Brost)
>   - Return proper error value if i915_request_create fails
> v3:
>  (John Harrison)
>   - Add comment explaining create / add order loops + locking
>   - Update commit message explaining the difference in IOCTL behavior
>   - Line wrap some comments
>   - eb_add_request returns void
>   - Return -EINVAL rather than triggering BUG_ON if cmd parser used
>  (Checkpatch)
>   - Check eb->batch_len[*current_batch]
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 793 ++++++++++++------
>  drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
>  drivers/gpu/drm/i915/gt/intel_context_types.h |  10 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
>  drivers/gpu/drm/i915/i915_request.h           |   9 +
>  drivers/gpu/drm/i915/i915_vma.c               |  21 +-
>  drivers/gpu/drm/i915/i915_vma.h               |  13 +-
>  7 files changed, 599 insertions(+), 257 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 2f2434b52317..5c7fb6f68bbb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -244,17 +244,25 @@ struct i915_execbuffer {
>  	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
>  	struct eb_vma *vma;
>  
> -	struct intel_engine_cs *engine; /** engine to queue the request to */
> +	struct intel_gt *gt; /* gt for the execbuf */
>  	struct intel_context *context; /* logical state for the request */
>  	struct i915_gem_context *gem_context; /** caller's context */
>  
> -	struct i915_request *request; /** our request to build */
> -	struct eb_vma *batch; /** identity of the batch obj/vma */
> +	/** our requests to build */
> +	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
> +	/** identity of the batch obj/vma */
> +	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1];
>  	struct i915_vma *trampoline; /** trampoline used for chaining */
>  
> +	/** used for excl fence in dma_resv objects when > 1 BB submitted */
> +	struct dma_fence *composite_fence;
> +
>  	/** actual size of execobj[] as we may extend it for the cmdparser */
>  	unsigned int buffer_count;
>  
> +	/* number of batches in execbuf IOCTL */
> +	unsigned int num_batches;
> +
>  	/** list of vma not yet bound during reservation phase */
>  	struct list_head unbound;
>  
> @@ -281,7 +289,8 @@ struct i915_execbuffer {
>  
>  	u64 invalid_flags; /** Set of execobj.flags that are invalid */
>  
> -	u64 batch_len; /** Length of batch within object */
> +	/** Length of batch within object */
> +	u64 batch_len[MAX_ENGINE_INSTANCE + 1];
>  	u32 batch_start_offset; /** Location within object of batch */
>  	u32 batch_flags; /** Flags composed for emit_bb_start() */
>  	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
> @@ -299,14 +308,13 @@ struct i915_execbuffer {
>  };
>  
>  static int eb_parse(struct i915_execbuffer *eb);
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
> -					  bool throttle);
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
>  static void eb_unpin_engine(struct i915_execbuffer *eb);
>  
>  static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
>  {
> -	return intel_engine_requires_cmd_parser(eb->engine) ||
> -		(intel_engine_using_cmd_parser(eb->engine) &&
> +	return intel_engine_requires_cmd_parser(eb->context->engine) ||
> +		(intel_engine_using_cmd_parser(eb->context->engine) &&
>  		 eb->args->batch_len);
>  }
>  
> @@ -544,11 +552,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
>  	return 0;
>  }
>  
> -static void
> +static inline bool
> +is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
> +{
> +	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
> +		buffer_idx < eb->num_batches :
> +		buffer_idx >= eb->args->buffer_count - eb->num_batches;
> +}
> +
> +static int
>  eb_add_vma(struct i915_execbuffer *eb,
> -	   unsigned int i, unsigned batch_idx,
> +	   unsigned int *current_batch,
> +	   unsigned int i,
>  	   struct i915_vma *vma)
>  {
> +	struct drm_i915_private *i915 = eb->i915;
>  	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
>  	struct eb_vma *ev = &eb->vma[i];
>  
> @@ -575,15 +593,41 @@ eb_add_vma(struct i915_execbuffer *eb,
>  	 * Note that actual hangs have only been observed on gen7, but for
>  	 * paranoia do it everywhere.
>  	 */
> -	if (i == batch_idx) {
> +	if (is_batch_buffer(eb, i)) {
>  		if (entry->relocation_count &&
>  		    !(ev->flags & EXEC_OBJECT_PINNED))
>  			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
>  		if (eb->reloc_cache.has_fence)
>  			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
>  
> -		eb->batch = ev;
> +		eb->batches[*current_batch] = ev;
> +
> +		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
> +			drm_dbg(&i915->drm,
> +				"Attempting to use self-modifying batch buffer\n");
> +			return -EINVAL;
> +		}
> +
> +		if (range_overflows_t(u64,
> +				      eb->batch_start_offset,
> +				      eb->args->batch_len,
> +				      ev->vma->size)) {
> +			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> +			return -EINVAL;
> +		}
> +
> +		if (eb->args->batch_len == 0)
> +			eb->batch_len[*current_batch] = ev->vma->size -
> +				eb->batch_start_offset;
> +		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
> +			drm_dbg(&i915->drm, "Invalid batch length\n");
> +			return -EINVAL;
> +		}
> +
> +		++*current_batch;
>  	}
> +
> +	return 0;
>  }
>  
>  static inline int use_cpu_reloc(const struct reloc_cache *cache,
> @@ -727,14 +771,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
>  	} while (1);
>  }
>  
> -static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
> -{
> -	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
> -		return 0;
> -	else
> -		return eb->buffer_count - 1;
> -}
> -
>  static int eb_select_context(struct i915_execbuffer *eb)
>  {
>  	struct i915_gem_context *ctx;
> @@ -839,9 +875,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
>  
>  static int eb_lookup_vmas(struct i915_execbuffer *eb)
>  {
> -	struct drm_i915_private *i915 = eb->i915;
> -	unsigned int batch = eb_batch_index(eb);
> -	unsigned int i;
> +	unsigned int i, current_batch = 0;
>  	int err = 0;
>  
>  	INIT_LIST_HEAD(&eb->relocs);
> @@ -861,7 +895,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>  			goto err;
>  		}
>  
> -		eb_add_vma(eb, i, batch, vma);
> +		err = eb_add_vma(eb, &current_batch, i, vma);
> +		if (err)
> +			return err;
>  
>  		if (i915_gem_object_is_userptr(vma->obj)) {
>  			err = i915_gem_object_userptr_submit_init(vma->obj);
> @@ -884,26 +920,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>  		}
>  	}
>  
> -	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
> -		drm_dbg(&i915->drm,
> -			"Attempting to use self-modifying batch buffer\n");
> -		return -EINVAL;
> -	}
> -
> -	if (range_overflows_t(u64,
> -			      eb->batch_start_offset, eb->batch_len,
> -			      eb->batch->vma->size)) {
> -		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> -		return -EINVAL;
> -	}
> -
> -	if (eb->batch_len == 0)
> -		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
> -	if (unlikely(eb->batch_len == 0)) { /* impossible! */
> -		drm_dbg(&i915->drm, "Invalid batch length\n");
> -		return -EINVAL;
> -	}
> -
>  	return 0;
>  
>  err:
> @@ -1636,8 +1652,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
>  	return 0;
>  }
>  
> -static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> -					   struct i915_request *rq)
> +static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
>  {
>  	bool have_copy = false;
>  	struct eb_vma *ev;
> @@ -1653,21 +1668,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  	eb_release_vmas(eb, false);
>  	i915_gem_ww_ctx_fini(&eb->ww);
>  
> -	if (rq) {
> -		/* nonblocking is always false */
> -		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> -				      MAX_SCHEDULE_TIMEOUT) < 0) {
> -			i915_request_put(rq);
> -			rq = NULL;
> -
> -			err = -EINTR;
> -			goto err_relock;
> -		}
> -
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>  	/*
>  	 * We take 3 passes through the slowpatch.
>  	 *
> @@ -1694,28 +1694,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  	if (!err)
>  		err = eb_reinit_userptr(eb);
>  
> -err_relock:
>  	i915_gem_ww_ctx_init(&eb->ww, true);
>  	if (err)
>  		goto out;
>  
>  	/* reacquire the objects */
>  repeat_validate:
> -	rq = eb_pin_engine(eb, false);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, false);
> +	if (err)
>  		goto err;
> -	}
> -
> -	/* We didn't throttle, should be NULL */
> -	GEM_WARN_ON(rq);
>  
>  	err = eb_validate_vmas(eb);
>  	if (err)
>  		goto err;
>  
> -	GEM_BUG_ON(!eb->batch);
> +	GEM_BUG_ON(!eb->batches[0]);
>  
>  	list_for_each_entry(ev, &eb->relocs, reloc_link) {
>  		if (!have_copy) {
> @@ -1779,46 +1772,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  		}
>  	}
>  
> -	if (rq)
> -		i915_request_put(rq);
> -
>  	return err;
>  }
>  
>  static int eb_relocate_parse(struct i915_execbuffer *eb)
>  {
>  	int err;
> -	struct i915_request *rq = NULL;
>  	bool throttle = true;
>  
>  retry:
> -	rq = eb_pin_engine(eb, throttle);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, throttle);
> +	if (err) {
>  		if (err != -EDEADLK)
>  			return err;
>  
>  		goto err;
>  	}
>  
> -	if (rq) {
> -		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> -
> -		/* Need to drop all locks now for throttling, take slowpath */
> -		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
> -		if (err == -ETIME) {
> -			if (nonblock) {
> -				err = -EWOULDBLOCK;
> -				i915_request_put(rq);
> -				goto err;
> -			}
> -			goto slow;
> -		}
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>  	/* only throttle once, even if we didn't need to throttle */
>  	throttle = false;
>  
> @@ -1858,7 +1828,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>  	return err;
>  
>  slow:
> -	err = eb_relocate_parse_slow(eb, rq);
> +	err = eb_relocate_parse_slow(eb);
>  	if (err)
>  		/*
>  		 * If the user expects the execobject.offset and
> @@ -1872,11 +1842,40 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>  	return err;
>  }
>  
> +/*
> + * Using two helper loops for the order of which requests / batches are created
> + * and added the to backend. Requests are created in order from the parent to
> + * the last child. Requests are add in the reverse order, from the last child to
> + * parent. This is down from locking reasons as the timeline lock is acquired
> + * during request creation and released when the request is added to the
> + * backend. To make lockdep happy (see intel_context_timeline_lock) this must be
> + * the ordering.
> + */
> +#define for_each_batch_create_order(_eb, _i) \
> +	for (_i = 0; _i < (_eb)->num_batches; ++_i)
> +#define for_each_batch_add_order(_eb, _i) \
> +	BUILD_BUG_ON(!typecheck(int, _i)); \
> +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
> +
> +static struct i915_request *
> +eb_find_first_request_added(struct i915_execbuffer *eb)
> +{
> +	int i;
> +
> +	for_each_batch_add_order(eb, i)
> +		if (eb->requests[i])
> +			return eb->requests[i];
> +
> +	GEM_BUG_ON("Request not found");
> +
> +	return NULL;
> +}
> +
>  static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  {
>  	const unsigned int count = eb->buffer_count;
>  	unsigned int i = count;
> -	int err = 0;
> +	int err = 0, j;
>  
>  	while (i--) {
>  		struct eb_vma *ev = &eb->vma[i];
> @@ -1889,11 +1888,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  		if (flags & EXEC_OBJECT_CAPTURE) {
>  			struct i915_capture_list *capture;
>  
> -			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> -			if (capture) {
> -				capture->next = eb->request->capture_list;
> -				capture->vma = vma;
> -				eb->request->capture_list = capture;
> +			for_each_batch_create_order(eb, j) {
> +				if (!eb->requests[j])
> +					break;
> +
> +				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> +				if (capture) {
> +					capture->next =
> +						eb->requests[j]->capture_list;
> +					capture->vma = vma;
> +					eb->requests[j]->capture_list = capture;
> +				}
>  			}
>  		}
>  
> @@ -1914,14 +1919,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  				flags &= ~EXEC_OBJECT_ASYNC;
>  		}
>  
> +		/* We only need to await on the first request */
>  		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
>  			err = i915_request_await_object
> -				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
> +				(eb_find_first_request_added(eb), obj,
> +				 flags & EXEC_OBJECT_WRITE);
>  		}
>  
> -		if (err == 0)
> -			err = i915_vma_move_to_active(vma, eb->request,
> -						      flags | __EXEC_OBJECT_NO_RESERVE);
> +		for_each_batch_add_order(eb, j) {
> +			if (err)
> +				break;
> +			if (!eb->requests[j])
> +				continue;
> +
> +			err = _i915_vma_move_to_active(vma, eb->requests[j],
> +						       j ? NULL :
> +						       eb->composite_fence ?
> +						       eb->composite_fence :
> +						       &eb->requests[j]->fence,
> +						       flags | __EXEC_OBJECT_NO_RESERVE);
> +		}
>  	}
>  
>  #ifdef CONFIG_MMU_NOTIFIER
> @@ -1952,11 +1969,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  		goto err_skip;
>  
>  	/* Unconditionally flush any chipset caches (for streaming writes). */
> -	intel_gt_chipset_flush(eb->engine->gt);
> +	intel_gt_chipset_flush(eb->gt);
>  	return 0;
>  
>  err_skip:
> -	i915_request_set_error_once(eb->request, err);
> +	for_each_batch_create_order(eb, j) {
> +		if (!eb->requests[j])
> +			break;
> +
> +		i915_request_set_error_once(eb->requests[j], err);
> +	}
>  	return err;
>  }
>  
> @@ -2051,14 +2073,17 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	int err;
>  
>  	if (!eb_use_cmdparser(eb)) {
> -		batch = eb_dispatch_secure(eb, eb->batch->vma);
> +		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
>  		if (IS_ERR(batch))
>  			return PTR_ERR(batch);
>  
>  		goto secure_batch;
>  	}
>  
> -	len = eb->batch_len;
> +	if (intel_context_is_parallel(eb->context))
> +		return -EINVAL;
> +
> +	len = eb->batch_len[0];
>  	if (!CMDPARSER_USES_GGTT(eb->i915)) {
>  		/*
>  		 * ppGTT backed shadow buffers must be mapped RO, to prevent
> @@ -2072,11 +2097,11 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	} else {
>  		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
>  	}
> -	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
> +	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
>  		return -EINVAL;
>  
>  	if (!pool) {
> -		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
> +		pool = intel_gt_get_buffer_pool(eb->gt, len,
>  						I915_MAP_WB);
>  		if (IS_ERR(pool))
>  			return PTR_ERR(pool);
> @@ -2101,7 +2126,7 @@ static int eb_parse(struct i915_execbuffer *eb)
>  		trampoline = shadow;
>  
>  		shadow = shadow_batch_pin(eb, pool->obj,
> -					  &eb->engine->gt->ggtt->vm,
> +					  &eb->gt->ggtt->vm,
>  					  PIN_GLOBAL);
>  		if (IS_ERR(shadow)) {
>  			err = PTR_ERR(shadow);
> @@ -2123,26 +2148,29 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	if (err)
>  		goto err_trampoline;
>  
> -	err = intel_engine_cmd_parser(eb->engine,
> -				      eb->batch->vma,
> +	err = intel_engine_cmd_parser(eb->context->engine,
> +				      eb->batches[0]->vma,
>  				      eb->batch_start_offset,
> -				      eb->batch_len,
> +				      eb->batch_len[0],
>  				      shadow, trampoline);
>  	if (err)
>  		goto err_unpin_batch;
>  
> -	eb->batch = &eb->vma[eb->buffer_count++];
> -	eb->batch->vma = i915_vma_get(shadow);
> -	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> +	eb->batches[0] = &eb->vma[eb->buffer_count++];
> +	eb->batches[0]->vma = i915_vma_get(shadow);
> +	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
>  
>  	eb->trampoline = trampoline;
>  	eb->batch_start_offset = 0;
>  
>  secure_batch:
>  	if (batch) {
> -		eb->batch = &eb->vma[eb->buffer_count++];
> -		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> -		eb->batch->vma = i915_vma_get(batch);
> +		if (intel_context_is_parallel(eb->context))
> +			return -EINVAL;
> +
> +		eb->batches[0] = &eb->vma[eb->buffer_count++];
> +		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
> +		eb->batches[0]->vma = i915_vma_get(batch);
>  	}
>  	return 0;
>  
> @@ -2158,19 +2186,18 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	return err;
>  }
>  
> -static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> +static int eb_request_submit(struct i915_execbuffer *eb,
> +			     struct i915_request *rq,
> +			     struct i915_vma *batch,
> +			     u64 batch_len)
>  {
>  	int err;
>  
> -	if (intel_context_nopreempt(eb->context))
> -		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
> -
> -	err = eb_move_to_gpu(eb);
> -	if (err)
> -		return err;
> +	if (intel_context_nopreempt(rq->context))
> +		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
>  
>  	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
> -		err = i915_reset_gen7_sol_offsets(eb->request);
> +		err = i915_reset_gen7_sol_offsets(rq);
>  		if (err)
>  			return err;
>  	}
> @@ -2181,26 +2208,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>  	 * allows us to determine if the batch is still waiting on the GPU
>  	 * or actually running by checking the breadcrumb.
>  	 */
> -	if (eb->engine->emit_init_breadcrumb) {
> -		err = eb->engine->emit_init_breadcrumb(eb->request);
> +	if (rq->context->engine->emit_init_breadcrumb) {
> +		err = rq->context->engine->emit_init_breadcrumb(rq);
>  		if (err)
>  			return err;
>  	}
>  
> -	err = eb->engine->emit_bb_start(eb->request,
> -					batch->node.start +
> -					eb->batch_start_offset,
> -					eb->batch_len,
> -					eb->batch_flags);
> +	err = rq->context->engine->emit_bb_start(rq,
> +						 batch->node.start +
> +						 eb->batch_start_offset,
> +						 batch_len,
> +						 eb->batch_flags);
>  	if (err)
>  		return err;
>  
>  	if (eb->trampoline) {
> +		GEM_BUG_ON(intel_context_is_parallel(rq->context));
>  		GEM_BUG_ON(eb->batch_start_offset);
> -		err = eb->engine->emit_bb_start(eb->request,
> -						eb->trampoline->node.start +
> -						eb->batch_len,
> -						0, 0);
> +		err = rq->context->engine->emit_bb_start(rq,
> +							 eb->trampoline->node.start +
> +							 batch_len, 0, 0);
>  		if (err)
>  			return err;
>  	}
> @@ -2208,6 +2235,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>  	return 0;
>  }
>  
> +static int eb_submit(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +	int err;
> +
> +	err = eb_move_to_gpu(eb);
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
> +		if (!err)
> +			err = eb_request_submit(eb, eb->requests[i],
> +						eb->batches[i]->vma,
> +						eb->batch_len[i]);
> +	}
> +
> +	return err;
> +}
> +
>  static int num_vcs_engines(const struct drm_i915_private *i915)
>  {
>  	return hweight_long(VDBOX_MASK(&i915->gt));
> @@ -2273,26 +2321,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
>  	return i915_request_get(rq);
>  }
>  
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
> +			   bool throttle)
>  {
> -	struct intel_context *ce = eb->context;
>  	struct intel_timeline *tl;
> -	struct i915_request *rq = NULL;
> -	int err;
> -
> -	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> -
> -	if (unlikely(intel_context_is_banned(ce)))
> -		return ERR_PTR(-EIO);
> -
> -	/*
> -	 * Pinning the contexts may generate requests in order to acquire
> -	 * GGTT space, so do this first before we reserve a seqno for
> -	 * ourselves.
> -	 */
> -	err = intel_context_pin_ww(ce, &eb->ww);
> -	if (err)
> -		return ERR_PTR(err);
> +	struct i915_request *rq = NULL;
>  
>  	/*
>  	 * Take a local wakeref for preparing to dispatch the execbuf as
> @@ -2303,33 +2336,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
>  	 * taken on the engine, and the parent device.
>  	 */
>  	tl = intel_context_timeline_lock(ce);
> -	if (IS_ERR(tl)) {
> -		intel_context_unpin(ce);
> -		return ERR_CAST(tl);
> -	}
> +	if (IS_ERR(tl))
> +		return PTR_ERR(tl);
>  
>  	intel_context_enter(ce);
>  	if (throttle)
>  		rq = eb_throttle(eb, ce);
>  	intel_context_timeline_unlock(tl);
>  
> +	if (rq) {
> +		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> +		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
> +
> +		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> +				      timeout) < 0) {
> +			i915_request_put(rq);
> +
> +			tl = intel_context_timeline_lock(ce);
> +			intel_context_exit(ce);
> +			intel_context_timeline_unlock(tl);
> +
> +			if (nonblock)
> +				return -EWOULDBLOCK;
> +			else
> +				return -EINTR;
> +		}
> +		i915_request_put(rq);
> +	}
> +
> +	return 0;
> +}
> +
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +{
> +	struct intel_context *ce = eb->context, *child;
> +	int err;
> +	int i = 0, j = 0;
> +
> +	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> +
> +	if (unlikely(intel_context_is_banned(ce)))
> +		return -EIO;
> +
> +	/*
> +	 * Pinning the contexts may generate requests in order to acquire
> +	 * GGTT space, so do this first before we reserve a seqno for
> +	 * ourselves.
> +	 */
> +	err = intel_context_pin_ww(ce, &eb->ww);
> +	if (err)
> +		return err;
> +	for_each_child(ce, child) {
> +		err = intel_context_pin_ww(child, &eb->ww);
> +		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
> +	}
> +
> +	for_each_child(ce, child) {
> +		err = eb_pin_timeline(eb, child, throttle);
> +		if (err)
> +			goto unwind;
> +		++i;
> +	}
> +	err = eb_pin_timeline(eb, ce, throttle);
> +	if (err)
> +		goto unwind;
> +
>  	eb->args->flags |= __EXEC_ENGINE_PINNED;
> -	return rq;
> +	return 0;
> +
> +unwind:
> +	for_each_child(ce, child) {
> +		if (j++ < i) {
> +			mutex_lock(&child->timeline->mutex);
> +			intel_context_exit(child);
> +			mutex_unlock(&child->timeline->mutex);
> +		}
> +	}
> +	for_each_child(ce, child)
> +		intel_context_unpin(child);
> +	intel_context_unpin(ce);
> +	return err;
>  }
>  
>  static void eb_unpin_engine(struct i915_execbuffer *eb)
>  {
> -	struct intel_context *ce = eb->context;
> -	struct intel_timeline *tl = ce->timeline;
> +	struct intel_context *ce = eb->context, *child;
>  
>  	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
>  		return;
>  
>  	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
>  
> -	mutex_lock(&tl->mutex);
> +	for_each_child(ce, child) {
> +		mutex_lock(&child->timeline->mutex);
> +		intel_context_exit(child);
> +		mutex_unlock(&child->timeline->mutex);
> +
> +		intel_context_unpin(child);
> +	}
> +
> +	mutex_lock(&ce->timeline->mutex);
>  	intel_context_exit(ce);
> -	mutex_unlock(&tl->mutex);
> +	mutex_unlock(&ce->timeline->mutex);
>  
>  	intel_context_unpin(ce);
>  }
> @@ -2380,7 +2488,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
>  static int
>  eb_select_engine(struct i915_execbuffer *eb)
>  {
> -	struct intel_context *ce;
> +	struct intel_context *ce, *child;
>  	unsigned int idx;
>  	int err;
>  
> @@ -2393,6 +2501,20 @@ eb_select_engine(struct i915_execbuffer *eb)
>  	if (IS_ERR(ce))
>  		return PTR_ERR(ce);
>  
> +	if (intel_context_is_parallel(ce)) {
> +		if (eb->buffer_count < ce->parallel.number_children + 1) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +		if (eb->batch_start_offset || eb->args->batch_len) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +	}
> +	eb->num_batches = ce->parallel.number_children + 1;
> +
> +	for_each_child(ce, child)
> +		intel_context_get(child);
>  	intel_gt_pm_get(ce->engine->gt);
>  
>  	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
> @@ -2400,6 +2522,13 @@ eb_select_engine(struct i915_execbuffer *eb)
>  		if (err)
>  			goto err;
>  	}
> +	for_each_child(ce, child) {
> +		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
> +			err = intel_context_alloc_state(child);
> +			if (err)
> +				goto err;
> +		}
> +	}
>  
>  	/*
>  	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
> @@ -2410,7 +2539,7 @@ eb_select_engine(struct i915_execbuffer *eb)
>  		goto err;
>  
>  	eb->context = ce;
> -	eb->engine = ce->engine;
> +	eb->gt = ce->engine->gt;
>  
>  	/*
>  	 * Make sure engine pool stays alive even if we call intel_context_put
> @@ -2421,6 +2550,8 @@ eb_select_engine(struct i915_execbuffer *eb)
>  
>  err:
>  	intel_gt_pm_put(ce->engine->gt);
> +	for_each_child(ce, child)
> +		intel_context_put(child);
>  	intel_context_put(ce);
>  	return err;
>  }
> @@ -2428,7 +2559,11 @@ eb_select_engine(struct i915_execbuffer *eb)
>  static void
>  eb_put_engine(struct i915_execbuffer *eb)
>  {
> -	intel_gt_pm_put(eb->engine->gt);
> +	struct intel_context *child;
> +
> +	intel_gt_pm_put(eb->gt);
> +	for_each_child(eb->context, child)
> +		intel_context_put(child);
>  	intel_context_put(eb->context);
>  }
>  
> @@ -2651,7 +2786,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
>  }
>  
>  static int
> -await_fence_array(struct i915_execbuffer *eb)
> +await_fence_array(struct i915_execbuffer *eb,
> +		  struct i915_request *rq)
>  {
>  	unsigned int n;
>  	int err;
> @@ -2665,8 +2801,7 @@ await_fence_array(struct i915_execbuffer *eb)
>  		if (!eb->fences[n].dma_fence)
>  			continue;
>  
> -		err = i915_request_await_dma_fence(eb->request,
> -						   eb->fences[n].dma_fence);
> +		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
>  		if (err < 0)
>  			return err;
>  	}
> @@ -2674,9 +2809,9 @@ await_fence_array(struct i915_execbuffer *eb)
>  	return 0;
>  }
>  
> -static void signal_fence_array(const struct i915_execbuffer *eb)
> +static void signal_fence_array(const struct i915_execbuffer *eb,
> +			       struct dma_fence * const fence)
>  {
> -	struct dma_fence * const fence = &eb->request->fence;
>  	unsigned int n;
>  
>  	for (n = 0; n < eb->num_fences; n++) {
> @@ -2724,9 +2859,8 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
>  			break;
>  }
>  
> -static int eb_request_add(struct i915_execbuffer *eb, int err)
> +static void eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
>  {
> -	struct i915_request *rq = eb->request;
>  	struct intel_timeline * const tl = i915_request_timeline(rq);
>  	struct i915_sched_attr attr = {};
>  	struct i915_request *prev;
> @@ -2741,11 +2875,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>  	/* Check that the context wasn't destroyed before submission */
>  	if (likely(!intel_context_is_closed(eb->context))) {
>  		attr = eb->gem_context->sched;
> -	} else {
> -		/* Serialise with context_close via the add_to_timeline */
> -		i915_request_set_error_once(rq, -ENOENT);
> -		__i915_request_skip(rq);
> -		err = -ENOENT; /* override any transient errors */
>  	}

Moving this appears to be wrong too, as this blows up if the
__i915_request_skip is done before the __i915_request_commit. The right
solution appears to be to keep this code as is and pull the parallel
check code into this function.
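
Roughly this (untested sketch; the last_parallel parameter is just
illustrative, it would be passed as i == 0 from the add loop since the
parent is the last request added):

	static int eb_request_add(struct i915_execbuffer *eb,
				  struct i915_request *rq,
				  int err, bool last_parallel)
	{
		struct intel_timeline * const tl = i915_request_timeline(rq);
		struct i915_sched_attr attr = {};
		struct i915_request *prev;

		lockdep_assert_held(&tl->mutex);
		lockdep_unpin_lock(&tl->mutex, rq->cookie);

		trace_i915_request_add(rq);

		prev = __i915_request_commit(rq);

		/* Check that the context wasn't destroyed before submission */
		if (likely(!intel_context_is_closed(eb->context))) {
			attr = eb->gem_context->sched;
		} else {
			/* Serialise with context_close via the add_to_timeline */
			i915_request_set_error_once(rq, -ENOENT);
			__i915_request_skip(rq);
			err = -ENOENT; /* override any transient errors */
		}

		/* Now it is safe to skip: the request has been committed */
		if (intel_context_is_parallel(eb->context)) {
			if (err) {
				__i915_request_skip(rq);
				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
					&rq->fence.flags);
			}
			if (last_parallel)
				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
					&rq->fence.flags);
		}

		__i915_request_queue(rq, &attr);

		/* Try to clean up the client's timeline after submitting the request */
		if (prev)
			retire_requests(tl, prev);

		mutex_unlock(&tl->mutex);

		return err;
	}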

The blow-up is shown in the CI run below:
https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_8041/shard-iclb6/igt@gem_ctx_exec@basic-close-race.html

Matt

>  
>  	__i915_request_queue(rq, &attr);
> @@ -2755,6 +2884,42 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>  		retire_requests(tl, prev);
>  
>  	mutex_unlock(&tl->mutex);
> +}
> +
> +static int eb_requests_add(struct i915_execbuffer *eb, int err)
> +{
> +	int i;
> +
> +	/*
> +	 * We iterate in reverse order of creation so that the timeline
> +	 * mutexes are released in the reverse order they were acquired.
> +	 */
> +	for_each_batch_add_order(eb, i) {
> +		struct i915_request *rq = eb->requests[i];
> +
> +		if (!rq)
> +			continue;
> +
> +		if (unlikely(intel_context_is_closed(eb->context))) {
> +			/* Serialise with context_close via the add_to_timeline */
> +			i915_request_set_error_once(rq, -ENOENT);
> +			__i915_request_skip(rq);
> +			err = -ENOENT; /* override any transient errors */
> +		}
> +
> +		if (intel_context_is_parallel(eb->context)) {
> +			if (err) {
> +				__i915_request_skip(rq);
> +				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
> +					&rq->fence.flags);
> +			}
> +			if (i == 0)
> +				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +					&rq->fence.flags);
> +		}
> +
> +		eb_request_add(eb, rq);
> +	}
>  
>  	return err;
>  }
> @@ -2785,6 +2950,182 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
>  				    eb);
>  }
>  
> +static void eb_requests_get(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_get(eb->requests[i]);
> +	}
> +}
> +
> +static void eb_requests_put(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_put(eb->requests[i]);
> +	}
> +}
> +
> +static struct sync_file *
> +eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	struct dma_fence_array *fence_array;
> +	struct dma_fence **fences;
> +	unsigned int i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(eb->context));
> +
> +	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
> +	if (!fences)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for_each_batch_create_order(eb, i)
> +		fences[i] = &eb->requests[i]->fence;
> +
> +	fence_array = dma_fence_array_create(eb->num_batches,
> +					     fences,
> +					     eb->context->parallel.fence_context,
> +					     eb->context->parallel.seqno,
> +					     false);
> +	if (!fence_array) {
> +		kfree(fences);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* Move ownership to the dma_fence_array created above */
> +	for_each_batch_create_order(eb, i)
> +		dma_fence_get(fences[i]);
> +
> +	if (out_fence_fd != -1) {
> +		out_fence = sync_file_create(&fence_array->base);
> +		/* sync_file now owns fence_array, drop creation ref */
> +		dma_fence_put(&fence_array->base);
> +		if (!out_fence)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
> +	eb->composite_fence = &fence_array->base;
> +
> +	return out_fence;
> +}
> +
> +static struct sync_file *
> +eb_fences_add(struct i915_execbuffer *eb, struct i915_request *rq,
> +	      struct dma_fence *in_fence, int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	int err;
> +
> +	if (unlikely(eb->gem_context->syncobj)) {
> +		struct dma_fence *fence;
> +
> +		fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
> +		err = i915_request_await_dma_fence(rq, fence);
> +		dma_fence_put(fence);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (in_fence) {
> +		if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
> +			err = i915_request_await_execution(rq, in_fence);
> +		else
> +			err = i915_request_await_dma_fence(rq, in_fence);
> +		if (err < 0)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (eb->fences) {
> +		err = await_fence_array(eb, rq);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (intel_context_is_parallel(eb->context)) {
> +		out_fence = eb_composite_fence_create(eb, out_fence_fd);
> +		if (IS_ERR(out_fence))
> +			return ERR_PTR(-ENOMEM);
> +	} else if (out_fence_fd != -1) {
> +		out_fence = sync_file_create(&rq->fence);
> +		if (!out_fence)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
> +	return out_fence;
> +}
> +
> +static struct intel_context *
> +eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
> +{
> +	struct intel_context *child;
> +
> +	if (likely(context_number == 0))
> +		return eb->context;
> +
> +	for_each_child(eb->context, child)
> +		if (!--context_number)
> +			return child;
> +
> +	GEM_BUG_ON("Context not found");
> +
> +	return NULL;
> +}
> +
> +static struct sync_file *
> +eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
> +		   int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		/* Allocate a request for this batch buffer nice and early. */
> +		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
> +		if (IS_ERR(eb->requests[i])) {
> +			out_fence = ERR_PTR(PTR_ERR(eb->requests[i]));
> +			eb->requests[i] = NULL;
> +			return out_fence;
> +		}
> +
> +		/*
> +		 * Only the first request added (committed to backend) has to
> +		 * take the in fences into account as all subsequent requests
> +		 * will have fences inserted in between them.
> +		 */
> +		if (i + 1 == eb->num_batches) {
> +			out_fence = eb_fences_add(eb, eb->requests[i],
> +						  in_fence, out_fence_fd);
> +			if (IS_ERR(out_fence))
> +				return out_fence;
> +		}
> +
> +		/*
> +		 * Whilst this request exists, batch_obj will be on the
> +		 * active_list, and so will hold the active reference. Only when
> +		 * this request is retired will the batch_obj be moved onto
> +		 * the inactive_list and lose its active reference. Hence we do
> +		 * not need to explicitly hold another reference here.
> +		 */
> +		eb->requests[i]->batch = eb->batches[i]->vma;
> +		if (eb->batch_pool) {
> +			GEM_BUG_ON(intel_context_is_parallel(eb->context));
> +			intel_gt_buffer_pool_mark_active(eb->batch_pool,
> +							 eb->requests[i]);
> +		}
> +	}
> +
> +	return out_fence;
> +}
> +
>  static int
>  i915_gem_do_execbuffer(struct drm_device *dev,
>  		       struct drm_file *file,
> @@ -2795,7 +3136,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	struct i915_execbuffer eb;
>  	struct dma_fence *in_fence = NULL;
>  	struct sync_file *out_fence = NULL;
> -	struct i915_vma *batch;
>  	int out_fence_fd = -1;
>  	int err;
>  
> @@ -2819,12 +3159,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	eb.buffer_count = args->buffer_count;
>  	eb.batch_start_offset = args->batch_start_offset;
> -	eb.batch_len = args->batch_len;
>  	eb.trampoline = NULL;
>  
>  	eb.fences = NULL;
>  	eb.num_fences = 0;
>  
> +	memset(eb.requests, 0, sizeof(struct i915_request *) *
> +	       ARRAY_SIZE(eb.requests));
> +	eb.composite_fence = NULL;
> +
>  	eb.batch_flags = 0;
>  	if (args->flags & I915_EXEC_SECURE) {
>  		if (GRAPHICS_VER(i915) >= 11)
> @@ -2908,70 +3251,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	ww_acquire_done(&eb.ww.ctx);
>  
> -	batch = eb.batch->vma;
> -
> -	/* Allocate a request for this batch buffer nice and early. */
> -	eb.request = i915_request_create(eb.context);
> -	if (IS_ERR(eb.request)) {
> -		err = PTR_ERR(eb.request);
> -		goto err_vma;
> -	}
> -
> -	if (unlikely(eb.gem_context->syncobj)) {
> -		struct dma_fence *fence;
> -
> -		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
> -		err = i915_request_await_dma_fence(eb.request, fence);
> -		dma_fence_put(fence);
> -		if (err)
> -			goto err_ext;
> -	}
> -
> -	if (in_fence) {
> -		if (args->flags & I915_EXEC_FENCE_SUBMIT)
> -			err = i915_request_await_execution(eb.request,
> -							   in_fence);
> -		else
> -			err = i915_request_await_dma_fence(eb.request,
> -							   in_fence);
> -		if (err < 0)
> -			goto err_request;
> -	}
> -
> -	if (eb.fences) {
> -		err = await_fence_array(&eb);
> -		if (err)
> +	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
> +	if (IS_ERR(out_fence)) {
> +		err = PTR_ERR(out_fence);
> +		if (eb.requests[0])
>  			goto err_request;
> +		else
> +			goto err_vma;
>  	}
>  
> -	if (out_fence_fd != -1) {
> -		out_fence = sync_file_create(&eb.request->fence);
> -		if (!out_fence) {
> -			err = -ENOMEM;
> -			goto err_request;
> -		}
> -	}
> -
> -	/*
> -	 * Whilst this request exists, batch_obj will be on the
> -	 * active_list, and so will hold the active reference. Only when this
> -	 * request is retired will the the batch_obj be moved onto the
> -	 * inactive_list and lose its active reference. Hence we do not need
> -	 * to explicitly hold another reference here.
> -	 */
> -	eb.request->batch = batch;
> -	if (eb.batch_pool)
> -		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
> -
> -	trace_i915_request_queue(eb.request, eb.batch_flags);
> -	err = eb_submit(&eb, batch);
> +	err = eb_submit(&eb);
>  
>  err_request:
> -	i915_request_get(eb.request);
> -	err = eb_request_add(&eb, err);
> +	eb_requests_get(&eb);
> +	err = eb_requests_add(&eb, err);
>  
>  	if (eb.fences)
> -		signal_fence_array(&eb);
> +		signal_fence_array(&eb, eb.composite_fence ?
> +				   eb.composite_fence :
> +				   &eb.requests[0]->fence);
>  
>  	if (out_fence) {
>  		if (err == 0) {
> @@ -2986,10 +3284,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	if (unlikely(eb.gem_context->syncobj)) {
>  		drm_syncobj_replace_fence(eb.gem_context->syncobj,
> -					  &eb.request->fence);
> +					  eb.composite_fence ?
> +					  eb.composite_fence :
> +					  &eb.requests[0]->fence);
>  	}
>  
> -	i915_request_put(eb.request);
> +	if (!out_fence && eb.composite_fence)
> +		dma_fence_put(eb.composite_fence);
> +
> +	eb_requests_put(&eb);
>  
>  err_vma:
>  	eb_release_vmas(&eb, true);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index 1bc705f98e2a..1781419fa105 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -239,7 +239,13 @@ intel_context_timeline_lock(struct intel_context *ce)
>  	struct intel_timeline *tl = ce->timeline;
>  	int err;
>  
> -	err = mutex_lock_interruptible(&tl->mutex);
> +	if (intel_context_is_parent(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
> +	else if (intel_context_is_child(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex,
> +						      ce->parallel.child_index + 1);
> +	else
> +		err = mutex_lock_interruptible(&tl->mutex);
>  	if (err)
>  		return ERR_PTR(err);
>  
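
The _nested annotation in this hunk is the standard way of telling
lockdep that several locks of the same class (here, each context's
timeline mutex) are legitimately held at once. As a rough illustration
of the pattern only (hypothetical helper, not part of the patch; note
lockdep supports only a handful of subclasses, 8, which bounds the
number of children):

static void sketch_lock_parallel_timelines(struct intel_context *parent)
{
	struct intel_context *child;

	/* The parent timeline takes subclass 0... */
	mutex_lock_nested(&parent->timeline->mutex, 0);

	/*
	 * ...and every child a distinct non-zero subclass, mirroring
	 * the child_index + 1 scheme in the hunk above.
	 */
	for_each_child(parent, child)
		mutex_lock_nested(&child->timeline->mutex,
				  child->parallel.child_index + 1);
}
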
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 95a5b94b4ece..9e0177dc5484 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -248,6 +248,16 @@ struct intel_context {
>  		 * context
>  		 */
>  		struct i915_request *last_rq;
> +		/**
> +		 * @fence_context: fence context composite fence when doing
> +		 * parallel submission
> +		 */
> +		u64 fence_context;
> +		/**
> +		 * @seqno: seqno for composite fence when doing parallel
> +		 * submission
> +		 */
> +		u32 seqno;
>  		/** @number_children: number of children if parent */
>  		u8 number_children;
>  		/** @child_index: index into child_list if child */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f28e36aa77c2..83b0d2a114af 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3094,6 +3094,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
>  		}
>  	}
>  
> +	parent->parallel.fence_context = dma_fence_context_alloc(1);
> +
>  	parent->engine->emit_bb_start =
>  		emit_bb_start_parent_no_preempt_mid_batch;
>  	parent->engine->emit_fini_breadcrumb =
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 8950785e55d6..24db8459376b 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -147,6 +147,15 @@ enum {
>  	 * tail.
>  	 */
>  	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +
> +	/*
> +	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
> +	 * parent-child relationship (parallel submission, multi-lrc) that
> +	 * hit an error while generating requests in the execbuf IOCTL.
> +	 * Indicates this request should be skipped as another request in
> +	 * submission / relationship encoutered an error.
> +	 * submission / relationship encountered an error.
> +	I915_FENCE_FLAG_SKIP_PARALLEL,
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index 4b7fc4647e46..90546fa58fc1 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>  	return i915_active_add_request(&vma->active, rq);
>  }
>  
> -int i915_vma_move_to_active(struct i915_vma *vma,
> -			    struct i915_request *rq,
> -			    unsigned int flags)
> +int _i915_vma_move_to_active(struct i915_vma *vma,
> +			     struct i915_request *rq,
> +			     struct dma_fence *fence,
> +			     unsigned int flags)
>  {
>  	struct drm_i915_gem_object *obj = vma->obj;
>  	int err;
> @@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>  			intel_frontbuffer_put(front);
>  		}
>  
> -		dma_resv_add_excl_fence(vma->resv, &rq->fence);
> -		obj->write_domain = I915_GEM_DOMAIN_RENDER;
> -		obj->read_domains = 0;
> +		if (fence) {
> +			dma_resv_add_excl_fence(vma->resv, fence);
> +			obj->write_domain = I915_GEM_DOMAIN_RENDER;
> +			obj->read_domains = 0;
> +		}
>  	} else {
>  		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
>  			err = dma_resv_reserve_shared(vma->resv, 1);
> @@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>  				return err;
>  		}
>  
> -		dma_resv_add_shared_fence(vma->resv, &rq->fence);
> -		obj->write_domain = 0;
> +		if (fence) {
> +			dma_resv_add_shared_fence(vma->resv, fence);
> +			obj->write_domain = 0;
> +		}
>  	}
>  
>  	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
> diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> index ed69f66c7ab0..648dbe744c96 100644
> --- a/drivers/gpu/drm/i915/i915_vma.h
> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
>  
>  int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
>  					   struct i915_request *rq);
> -int __must_check i915_vma_move_to_active(struct i915_vma *vma,
> -					 struct i915_request *rq,
> -					 unsigned int flags);
> +int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
> +					  struct i915_request *rq,
> +					  struct dma_fence *fence,
> +					  unsigned int flags);
> +static inline int __must_check
> +i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
> +			unsigned int flags)
> +{
> +	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
> +}
>  
>  #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
>  
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread
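
A note on eb_composite_fence_create() used above: the series builds one
composite fence spanning all N per-batch requests, identified by the
parent context's parallel.fence_context / seqno pair. A rough sketch of
how such a fence can be assembled with dma_fence_array (illustrative
only; the helper below is not the patch's exact implementation):

#include <linux/dma-fence-array.h>
#include <linux/err.h>
#include <linux/slab.h>

static struct dma_fence *
sketch_composite_fence(struct i915_request **rqs, unsigned int count,
		       u64 fence_context, u32 seqno)
{
	struct dma_fence **fences;
	struct dma_fence_array *array;
	unsigned int i;

	fences = kmalloc_array(count, sizeof(*fences), GFP_KERNEL);
	if (!fences)
		return ERR_PTR(-ENOMEM);

	for (i = 0; i < count; i++)
		fences[i] = dma_fence_get(&rqs[i]->fence);

	/* On success the array takes ownership of the fences[] allocation. */
	array = dma_fence_array_create(count, fences, fence_context,
				       seqno, false);
	if (!array) {
		for (i = 0; i < count; i++)
			dma_fence_put(fences[i]);
		kfree(fences);
		return ERR_PTR(-ENOMEM);
	}

	return &array->base;
}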

* Re: [PATCH 01/26] drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-07  3:06     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07  3:06 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Move guc_id allocation under the submission state sub-struct as a future
> patch will reuse the spin lock as a global submission state lock. Moving
> this into a sub-struct makes ownership of the fields / lock clear.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  6 +-
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        | 26 +++++----
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 ++++++++++---------
>   3 files changed, 47 insertions(+), 41 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 12252c411159..e7e3984aab78 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -197,18 +197,18 @@ struct intel_context {
>   	struct {
>   		/**
>   		 * @id: handle which is used to uniquely identify this context
> -		 * with the GuC, protected by guc->contexts_lock
> +		 * with the GuC, protected by guc->submission_state.lock
>   		 */
>   		u16 id;
>   		/**
>   		 * @ref: the number of references to the guc_id, when
>   		 * transitioning in and out of zero protected by
> -		 * guc->contexts_lock
> +		 * guc->submission_state.lock
>   		 */
>   		atomic_t ref;
>   		/**
>   		 * @link: in guc->guc_id_list when the guc_id has no refs but is
> -		 * still valid, protected by guc->contexts_lock
> +		 * still valid, protected by guc->submission_state.lock
>   		 */
>   		struct list_head link;
>   	} guc_id;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 5dd174babf7a..65b5e8eeef96 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -70,17 +70,21 @@ struct intel_guc {
>   		void (*disable)(struct intel_guc *guc);
>   	} interrupts;
>   
> -	/**
> -	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
> -	 * ce->guc_id.ref when transitioning in and out of zero
> -	 */
> -	spinlock_t contexts_lock;
> -	/** @guc_ids: used to allocate unique ce->guc_id.id values */
> -	struct ida guc_ids;
> -	/**
> -	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> -	 */
> -	struct list_head guc_id_list;
> +	struct {
> +		/**
> +		 * @lock: protects everything in submission_state
> +		 */
> +		spinlock_t lock;
The old version also mentioned 'ce->guc_id.ref'. Should this not also 
mention that transition? Or was the old comment inaccurate? I'm not 
seeing any actual behaviour changes in the patch.
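
(For reference, the transition the old comment described is the usual
"refcount may only cross zero under the lock" pattern; a minimal sketch
of the idea, not the driver's exact code:)

static void sketch_unpin_guc_id(struct intel_guc *guc,
				struct intel_context *ce)
{
	unsigned long flags;

	spin_lock_irqsave(&guc->submission_state.lock, flags);
	/*
	 * The 1 -> 0 transition happens with the lock held, so a
	 * concurrent steal of the guc_id can never observe a
	 * half-unpinned id.
	 */
	if (atomic_dec_and_test(&ce->guc_id.ref))
		list_add_tail(&ce->guc_id.link,
			      &guc->submission_state.guc_id_list);
	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
}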


> +		/**
> +		 * @guc_ids: used to allocate new guc_ids
> +		 */
> +		struct ida guc_ids;
> +		/**
> +		 * @guc_id_list: list of intel_context with valid guc_ids but no
> +		 * refs
> +		 */
> +		struct list_head guc_id_list;
> +	} submission_state;
>   
>   	/**
>   	 * @submission_supported: tracks whether we support GuC submission on
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ba0de35f6323..ad5c18119d92 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -68,16 +68,16 @@
>    * fence is used to stall all requests associated with this guc_id until the
>    * corresponding G2H returns indicating the guc_id has been deregistered.
>    *
> - * guc_ids:
> + * submission_state.guc_ids:
>    * Unique number associated with private GuC context data passed in during
>    * context registration / submission / deregistration. 64k available. Simple ida
>    * is used for allocation.
>    *
> - * Stealing guc_ids:
> - * If no guc_ids are available they can be stolen from another context at
> - * request creation time if that context is unpinned. If a guc_id can't be found
> - * we punt this problem to the user as we believe this is near impossible to hit
> - * during normal use cases.
> + * Stealing submission_state.guc_ids:
> + * If no submission_state.guc_ids are available they can be stolen from another
I would abbreviate this instance as well; submission_state.guc_ids is 
quite the mouthful. Unless this somehow magically links back to the 
structure entry in the kerneldoc output?

John.
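
(On the kerneldoc question: assuming the standard kernel-doc markup, a
member reference of the form &intel_guc.submission_state should render
as a link back to the structure entry, e.g.:)

/*
 * Stealing guc_ids:
 * If no guc_ids (see &intel_guc.submission_state) are available, they
 * can be stolen from another context at request creation time.
 */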

> + * context at request creation time if that context is unpinned. If a guc_id
> + * can't be found we punt this problem to the user as we believe this is near
> + * impossible to hit during normal use cases.
>    *
>    * Locking:
>    * In the GuC submission code we have 3 basic spin locks which protect
> @@ -89,7 +89,7 @@
>    * sched_engine can be submitting at a time. Currently only one sched_engine is
>    * used for all of GuC submission but that could change in the future.
>    *
> - * guc->contexts_lock
> + * guc->submission_state.lock
>    * Protects guc_id allocation for the given GuC, i.e. only one context can be
>    * doing guc_id allocation operations at a time for each GuC in the system.
>    *
> @@ -103,7 +103,7 @@
>    *
>    * Lock ordering rules:
>    * sched_engine->lock -> ce->guc_state.lock
> - * guc->contexts_lock -> ce->guc_state.lock
> + * guc->submission_state.lock -> ce->guc_state.lock
>    *
>    * Reset races:
>    * When a full GT reset is triggered it is assumed that some G2H responses to
> @@ -1148,9 +1148,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   
>   	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
>   
> -	spin_lock_init(&guc->contexts_lock);
> -	INIT_LIST_HEAD(&guc->guc_id_list);
> -	ida_init(&guc->guc_ids);
> +	spin_lock_init(&guc->submission_state.lock);
> +	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> +	ida_init(&guc->submission_state.guc_ids);
>   
>   	return 0;
>   }
> @@ -1215,7 +1215,7 @@ static void guc_submit_request(struct i915_request *rq)
>   
>   static int new_guc_id(struct intel_guc *guc)
>   {
> -	return ida_simple_get(&guc->guc_ids, 0,
> +	return ida_simple_get(&guc->submission_state.guc_ids, 0,
>   			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
>   			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>   }
> @@ -1223,7 +1223,8 @@ static int new_guc_id(struct intel_guc *guc)
>   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	if (!context_guc_id_invalid(ce)) {
> -		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
> +		ida_simple_remove(&guc->submission_state.guc_ids,
> +				  ce->guc_id.id);
>   		reset_lrc_desc(guc, ce->guc_id.id);
>   		set_context_guc_id_invalid(ce);
>   	}
> @@ -1235,9 +1236,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	unsigned long flags;
>   
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   	__release_guc_id(guc, ce);
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
>   static int steal_guc_id(struct intel_guc *guc)
> @@ -1245,10 +1246,10 @@ static int steal_guc_id(struct intel_guc *guc)
>   	struct intel_context *ce;
>   	int guc_id;
>   
> -	lockdep_assert_held(&guc->contexts_lock);
> +	lockdep_assert_held(&guc->submission_state.lock);
>   
> -	if (!list_empty(&guc->guc_id_list)) {
> -		ce = list_first_entry(&guc->guc_id_list,
> +	if (!list_empty(&guc->submission_state.guc_id_list)) {
> +		ce = list_first_entry(&guc->submission_state.guc_id_list,
>   				      struct intel_context,
>   				      guc_id.link);
>   
> @@ -1273,7 +1274,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
>   {
>   	int ret;
>   
> -	lockdep_assert_held(&guc->contexts_lock);
> +	lockdep_assert_held(&guc->submission_state.lock);
>   
>   	ret = new_guc_id(guc);
>   	if (unlikely(ret < 0)) {
> @@ -1295,7 +1296,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
>   
>   try_again:
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   
>   	might_lock(&ce->guc_state.lock);
>   
> @@ -1310,7 +1311,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	atomic_inc(&ce->guc_id.ref);
>   
>   out_unlock:
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
>   	/*
>   	 * -EAGAIN indicates no guc_id are available, let's retire any
> @@ -1346,11 +1347,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	if (unlikely(context_guc_id_invalid(ce)))
>   		return;
>   
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
>   	    !atomic_read(&ce->guc_id.ref))
> -		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +		list_add_tail(&ce->guc_id.link,
> +			      &guc->submission_state.guc_id_list);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
>   static int __guc_action_register_context(struct intel_guc *guc,
> @@ -1921,16 +1923,16 @@ static void guc_context_destroy(struct kref *kref)
>   	 * returns indicating this context has been deregistered the guc_id is
>   	 * returned to the pool of available guc_id.
>   	 */
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   	if (context_guc_id_invalid(ce)) {
> -		spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   		__guc_context_destroy(ce);
>   		return;
>   	}
>   
>   	if (!list_empty(&ce->guc_id.link))
>   		list_del_init(&ce->guc_id.link);
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
>   	/* Seal race with Reset */
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);


^ permalink raw reply	[flat|nested] 165+ messages in thread
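
A quick aside on the allocator these hunks move around: guc_ids come
from a plain ida. Reduced to its essentials, the pattern is (standalone
sketch, names hypothetical):

#include <linux/idr.h>

static DEFINE_IDA(sketch_guc_ids);

static int sketch_new_guc_id(void)
{
	/* ids in [0, 64k); try hard but fail quietly under pressure */
	return ida_simple_get(&sketch_guc_ids, 0, 1 << 16,
			      GFP_KERNEL | __GFP_RETRY_MAYFAIL |
			      __GFP_NOWARN);
}

static void sketch_release_guc_id(int id)
{
	ida_simple_remove(&sketch_guc_ids, id);
}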

* Re: [PATCH 02/26] drm/i915/guc: Take GT PM ref when deregistering context
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-07  3:37     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07  3:37 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Taking a PM reference to prevent intel_gt_wait_for_idle from short
> circuiting while a deregister context H2G is in flight. To do this we
> must issue the deregister H2G from a worker as a context can be
> destroyed from an atomic context, and taking a GT PM ref there blows
> up. Previously we took a runtime PM ref from this atomic context, which
> worked but will stop working once runtime pm autosuspend is enabled.
>
> So this patch is twofold: stop intel_gt_wait_for_idle from short
> circuiting and fix runtime pm autosuspend.
>
> v2:
>   (John Harrison)
>    - Split structure changes out in different patch
>   (Tvrtko)
>    - Don't drop lock in deregister_destroyed_contexts
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
>   drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
>   drivers/gpu/drm/i915/gt/intel_gt_pm.h         |   4 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  11 ++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 146 +++++++++++-------
>   6 files changed, 121 insertions(+), 54 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index e9a0cad5c34d..1076066f41e0 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   	ce->guc_id.id = GUC_INVALID_LRC_ID;
>   	INIT_LIST_HEAD(&ce->guc_id.link);
>   
> +	INIT_LIST_HEAD(&ce->destroyed_link);
> +
>   	/*
>   	 * Initialize fence to be complete as this is expected to be complete
>   	 * unless there is a pending schedule disable outstanding.
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index e7e3984aab78..4613d027cbc3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -213,6 +213,13 @@ struct intel_context {
>   		struct list_head link;
>   	} guc_id;
>   
> +	/**
> +	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
> +	 * list when context is pending to be destroyed (deregistered with the
> +	 * GuC), protected by guc->submission_state.lock
> +	 */
> +	struct list_head destroyed_link;
> +
>   #ifdef CONFIG_DRM_I915_SELFTEST
>   	/**
>   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> index 8520c595f5e1..6fdeae668e6e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> @@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
>   	return intel_wakeref_is_active(&engine->wakeref);
>   }
>   
> +static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
> +{
> +	__intel_wakeref_get(&engine->wakeref);
> +}
> +
>   static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
>   {
>   	intel_wakeref_get(&engine->wakeref);
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> index d0588d8aaa44..05de6c1af25b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> @@ -41,6 +41,10 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
>   	intel_wakeref_put_async(&gt->wakeref);
>   }
>   
> +#define with_intel_gt_pm(gt, tmp) \
> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> +	     intel_gt_pm_put(gt), tmp = 0)
> +
>   static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
>   {
>   	return intel_wakeref_wait_for_idle(&gt->wakeref);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 65b5e8eeef96..25a598e2b6e8 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -84,6 +84,17 @@ struct intel_guc {
>   		 * refs
>   		 */
>   		struct list_head guc_id_list;
> +		/**
> +		 * @destroyed_contexts: list of contexts waiting to be destroyed
> +		 * (deregistered with the GuC)
> +		 */
> +		struct list_head destroyed_contexts;
> +		/**
> +		 * @destroyed_worker: worker to deregister contexts; needed as we
> +		 * must take a GT PM reference and can't do so from the destroy
> +		 * function since it might run in an atomic context (no sleeping)
> +		 */
> +		struct work_struct destroyed_worker;
>   	} submission_state;
>   
>   	/**
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ad5c18119d92..17da2fea1bff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -90,8 +90,8 @@
>    * used for all of GuC submission but that could change in the future.
>    *
>    * guc->submission_state.lock
> - * Protects guc_id allocation for the given GuC, i.e. only one context can be
> - * doing guc_id allocation operations at a time for each GuC in the system.
> + * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
> + * list.
Feels like this should not be removing explanations, only adding to 
them. The patch itself is only adding new features, not removing them. 
Either the details about id allocation are not worth mentioning and 
should not have been added in the previous patch, or they are and should 
be kept rather than removed in this patch. Either way works for me. The 
comment was valid information, though it arguably counts as obvious 
given that the guc_id member (and friends) are within a per-GuC-instance 
structure.

>    *
>    * ce->guc_state.lock
>    * Protects everything under ce->guc_state. Ensures that a context is in the
> @@ -719,6 +719,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   			if (deregister)
>   				guc_signal_context_fence(ce);
>   			if (destroyed) {
> +				intel_gt_pm_put_async(guc_to_gt(guc));
>   				release_guc_id(guc, ce);
>   				__guc_context_destroy(ce);
>   			}
> @@ -797,6 +798,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> +static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> +
>   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   {
>   	int i;
> @@ -815,6 +818,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
>   
>   	guc_flush_submissions(guc);
> +	guc_flush_destroyed_contexts(guc);
>   
>   	/*
>   	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> @@ -1126,6 +1130,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>   }
>   
> +static void destroyed_worker_func(struct work_struct *w);
> +
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
>    * at firmware loading time.
> @@ -1151,6 +1157,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	spin_lock_init(&guc->submission_state.lock);
>   	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
>   	ida_init(&guc->submission_state.guc_ids);
> +	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> +	INIT_WORK(&guc->submission_state.destroyed_worker,
> +		  destroyed_worker_func);
>   
>   	return 0;
>   }
> @@ -1161,6 +1170,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>   		return;
>   
>   	guc_lrc_desc_pool_destroy(guc);
> +	guc_flush_destroyed_contexts(guc);
Seems like these lines should be reversed. We should destroy the 
higher-level constructs before the lower-level ones they may be built on.

>   	i915_sched_engine_put(guc->sched_engine);
>   }
>   
> @@ -1859,11 +1869,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	unsigned long flags;
> +	bool disabled;
>   
> +	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
>   	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
>   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
>   	GEM_BUG_ON(context_enabled(ce));
>   
> +	/* Seal race with Reset */
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	disabled = submission_disabled(guc);
> +	if (likely(!disabled)) {
> +		__intel_gt_pm_get(gt);
> +		set_context_destroyed(ce);
> +		clr_context_registered(ce);
> +	}
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	if (unlikely(disabled)) {
> +		release_guc_id(guc, ce);
> +		__guc_context_destroy(ce);
> +		return;
> +	}
> +
>   	deregister_context(ce, ce->guc_id.id);
>   }
>   
> @@ -1891,78 +1920,86 @@ static void __guc_context_destroy(struct intel_context *ce)
>   	}
>   }
>   
> +static void guc_flush_destroyed_contexts(struct intel_guc *guc)
> +{
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	GEM_BUG_ON(!submission_disabled(guc) &&
> +		   guc_submission_initialized(guc));
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	list_for_each_entry_safe(ce, cn,
> +				 &guc->submission_state.destroyed_contexts,
> +				 destroyed_link) {
> +		list_del_init(&ce->destroyed_link);
> +		__release_guc_id(guc, ce);
> +		__guc_context_destroy(ce);
> +	}
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +}
> +
> +static void deregister_destroyed_contexts(struct intel_guc *guc)
> +{
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	list_for_each_entry_safe(ce, cn,
> +				 &guc->submission_state.destroyed_contexts,
> +				 destroyed_link) {
> +		list_del_init(&ce->destroyed_link);
> +		guc_lrc_desc_unpin(ce);
> +	}
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +}
> +
> +static void destroyed_worker_func(struct work_struct *w)
> +{
> +	struct intel_guc *guc = container_of(w, struct intel_guc,
> +					     submission_state.destroyed_worker);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	int tmp;
> +
> +	with_intel_gt_pm(gt, tmp)
> +		deregister_destroyed_contexts(guc);
> +}
> +
>   static void guc_context_destroy(struct kref *kref)
>   {
>   	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> -	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
>   	struct intel_guc *guc = ce_to_guc(ce);
> -	intel_wakeref_t wakeref;
>   	unsigned long flags;
> -	bool disabled;
> +	bool destroy;
>   
>   	/*
>   	 * If the guc_id is invalid this context has been stolen and we can free
>   	 * it immediately. Also can be freed immediately if the context is not
>   	 * registered with the GuC or the GuC is in the middle of a reset.
>   	 */
> -	if (context_guc_id_invalid(ce)) {
> -		__guc_context_destroy(ce);
> -		return;
> -	} else if (submission_disabled(guc) ||
> -		   !lrc_desc_registered(guc, ce->guc_id.id)) {
> -		release_guc_id(guc, ce);
> -		__guc_context_destroy(ce);
> -		return;
> -	}
> -
> -	/*
> -	 * We have to acquire the context spinlock and check guc_id again, if it
> -	 * is valid it hasn't been stolen and needs to be deregistered. We
> -	 * delete this context from the list of unpinned guc_id available to
> -	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
> -	 * returns indicating this context has been deregistered the guc_id is
> -	 * returned to the pool of available guc_id.
> -	 */
>   	spin_lock_irqsave(&guc->submission_state.lock, flags);
> -	if (context_guc_id_invalid(ce)) {
> -		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> -		__guc_context_destroy(ce);
> -		return;
> +	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
> +		!lrc_desc_registered(guc, ce->guc_id.id);
> +	if (likely(!destroy)) {
> +		if (!list_empty(&ce->guc_id.link))
> +			list_del_init(&ce->guc_id.link);
> +		list_add_tail(&ce->destroyed_link,
> +			      &guc->submission_state.destroyed_contexts);
> +	} else {
> +		__release_guc_id(guc, ce);
'destroy' can be true if the guc_id is invalid. Is it good to call 
release on an invalid id?

John.

>   	}
> -
> -	if (!list_empty(&ce->guc_id.link))
> -		list_del_init(&ce->guc_id.link);
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> -
> -	/* Seal race with Reset */
> -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> -	disabled = submission_disabled(guc);
> -	if (likely(!disabled)) {
> -		set_context_destroyed(ce);
> -		clr_context_registered(ce);
> -	}
> -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> -	if (unlikely(disabled)) {
> -		release_guc_id(guc, ce);
> +	if (unlikely(destroy)) {
>   		__guc_context_destroy(ce);
>   		return;
>   	}
>   
>   	/*
> -	 * We defer GuC context deregistration until the context is destroyed
> -	 * in order to save on CTBs. With this optimization ideally we only need
> -	 * 1 CTB to register the context during the first pin and 1 CTB to
> -	 * deregister the context when the context is destroyed. Without this
> -	 * optimization, a CTB would be needed every pin & unpin.
> -	 *
> -	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
> -	 * from context_free_worker when runtime wakeref is not held.
> -	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
> -	 * in H2G CTB to deregister the context. A future patch may defer this
> -	 * H2G CTB if the runtime wakeref is zero.
> +	 * We use a worker to issue the H2G to deregister the context as we may
> +	 * take the GT PM for the first time, which isn't allowed from an atomic
> +	 * context.
>   	 */
> -	with_intel_runtime_pm(runtime_pm, wakeref)
> -		guc_lrc_desc_unpin(ce);
> +	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
>   }
>   
>   static int guc_context_alloc(struct intel_context *ce)
> @@ -2798,6 +2835,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
>   		intel_context_put(ce);
>   	} else if (context_destroyed(ce)) {
>   		/* Context has been destroyed */
> +		intel_gt_pm_put_async(guc_to_gt(guc));
>   		release_guc_id(guc, ce);
>   		__guc_context_destroy(ce);
>   	}


^ permalink raw reply	[flat|nested] 165+ messages in thread
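
A note on the with_intel_gt_pm() macro this patch adds: it is the
familiar for-loop scope-guard idiom, taking the GT wakeref before a
single pass over the body and dropping it afterwards. A sketch of what
the expansion looks like in use (do_deregister_work() is a placeholder):

static void sketch_scoped_pm(struct intel_gt *gt)
{
	int tmp;

	/* with_intel_gt_pm(gt, tmp) do_deregister_work(gt); becomes: */
	for (tmp = 1, intel_gt_pm_get(gt); tmp;
	     intel_gt_pm_put(gt), tmp = 0)
		do_deregister_work(gt);
}

One caveat of the idiom: a bare break inside the body would skip the
intel_gt_pm_put() in the loop's increment expression, so the body must
run to completion.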

* Re: [Intel-gfx] [PATCH 02/26] drm/i915/guc: Take GT PM ref when deregistering context
@ 2021-10-07  3:37     ` John Harrison
  0 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07  3:37 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Taking a PM reference to prevent intel_gt_wait_for_idle from short
> circuiting while a deregister context H2G is in flight. To do this must
> issue the deregister H2G from a worker as context can be destroyed from
> an atomic context and taking GT PM ref blows up. Previously we took a
> runtime PM from this atomic context which worked but will stop working
> once runtime pm autosuspend in enabled.
>
> So this patch is two fold, stop intel_gt_wait_for_idle from short
> circuting and fix runtime pm autosuspend.
>
> v2:
>   (John Harrison)
>    - Split structure changes out in different patch
>   (Tvrtko)
>    - Don't drop lock in deregister_destroyed_contexts
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
>   drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
>   drivers/gpu/drm/i915/gt/intel_gt_pm.h         |   4 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  11 ++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 146 +++++++++++-------
>   6 files changed, 121 insertions(+), 54 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index e9a0cad5c34d..1076066f41e0 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   	ce->guc_id.id = GUC_INVALID_LRC_ID;
>   	INIT_LIST_HEAD(&ce->guc_id.link);
>   
> +	INIT_LIST_HEAD(&ce->destroyed_link);
> +
>   	/*
>   	 * Initialize fence to be complete as this is expected to be complete
>   	 * unless there is a pending schedule disable outstanding.
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index e7e3984aab78..4613d027cbc3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -213,6 +213,13 @@ struct intel_context {
>   		struct list_head link;
>   	} guc_id;
>   
> +	/**
> +	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
> +	 * list when context is pending to be destroyed (deregistered with the
> +	 * GuC), protected by guc->submission_state.lock
> +	 */
> +	struct list_head destroyed_link;
> +
>   #ifdef CONFIG_DRM_I915_SELFTEST
>   	/**
>   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> index 8520c595f5e1..6fdeae668e6e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> @@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
>   	return intel_wakeref_is_active(&engine->wakeref);
>   }
>   
> +static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
> +{
> +	__intel_wakeref_get(&engine->wakeref);
> +}
> +
>   static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
>   {
>   	intel_wakeref_get(&engine->wakeref);
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> index d0588d8aaa44..05de6c1af25b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> @@ -41,6 +41,10 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
>   	intel_wakeref_put_async(&gt->wakeref);
>   }
>   
> +#define with_intel_gt_pm(gt, tmp) \
> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> +	     intel_gt_pm_put(gt), tmp = 0)
> +
>   static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
>   {
>   	return intel_wakeref_wait_for_idle(&gt->wakeref);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 65b5e8eeef96..25a598e2b6e8 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -84,6 +84,17 @@ struct intel_guc {
>   		 * refs
>   		 */
>   		struct list_head guc_id_list;
> +		/**
> +		 * @destroyed_contexts: list of contexts waiting to be destroyed
> +		 * (deregistered with the GuC)
> +		 */
> +		struct list_head destroyed_contexts;
> +		/**
> +		 * @destroyed_worker: worker to deregister contexts, need as we
> +		 * need to take a GT PM reference and can't from destroy
> +		 * function as it might be in an atomic context (no sleeping)
> +		 */
> +		struct work_struct destroyed_worker;
>   	} submission_state;
>   
>   	/**
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ad5c18119d92..17da2fea1bff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -90,8 +90,8 @@
>    * used for all of GuC submission but that could change in the future.
>    *
>    * guc->submission_state.lock
> - * Protects guc_id allocation for the given GuC, i.e. only one context can be
> - * doing guc_id allocation operations at a time for each GuC in the system.
> + * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
> + * list.
Feels like this should not be removing explanations, only adding to 
them. The patch itself is only adding new features not removing them. 
Either the details about id allocation are not worth mentioning and 
should not have been added in the previous patch. Or they are and should 
be kept rather than removed in this patch. Either way works for me. The 
comment was valid information but does maybe count as obvious from the 
guc_id member (and friends) are within a per GuC instance structure.

>    *
>    * ce->guc_state.lock
>    * Protects everything under ce->guc_state. Ensures that a context is in the
> @@ -719,6 +719,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   			if (deregister)
>   				guc_signal_context_fence(ce);
>   			if (destroyed) {
> +				intel_gt_pm_put_async(guc_to_gt(guc));
>   				release_guc_id(guc, ce);
>   				__guc_context_destroy(ce);
>   			}
> @@ -797,6 +798,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> +static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> +
>   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   {
>   	int i;
> @@ -815,6 +818,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
>   
>   	guc_flush_submissions(guc);
> +	guc_flush_destroyed_contexts(guc);
>   
>   	/*
>   	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> @@ -1126,6 +1130,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>   }
>   
> +static void destroyed_worker_func(struct work_struct *w);
> +
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
>    * at firmware loading time.
> @@ -1151,6 +1157,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	spin_lock_init(&guc->submission_state.lock);
>   	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
>   	ida_init(&guc->submission_state.guc_ids);
> +	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> +	INIT_WORK(&guc->submission_state.destroyed_worker,
> +		  destroyed_worker_func);
>   
>   	return 0;
>   }
> @@ -1161,6 +1170,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>   		return;
>   
>   	guc_lrc_desc_pool_destroy(guc);
> +	guc_flush_destroyed_contexts(guc);
Seems like these lines should be reversed. We should destroy the higher 
level constructs before the lower level ones that they could be built on.

>   	i915_sched_engine_put(guc->sched_engine);
>   }
>   
> @@ -1859,11 +1869,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	unsigned long flags;
> +	bool disabled;
>   
> +	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
>   	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
>   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
>   	GEM_BUG_ON(context_enabled(ce));
>   
> +	/* Seal race with Reset */
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	disabled = submission_disabled(guc);
> +	if (likely(!disabled)) {
> +		__intel_gt_pm_get(gt);
> +		set_context_destroyed(ce);
> +		clr_context_registered(ce);
> +	}
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	if (unlikely(disabled)) {
> +		release_guc_id(guc, ce);
> +		__guc_context_destroy(ce);
> +		return;
> +	}
> +
>   	deregister_context(ce, ce->guc_id.id);
>   }
>   
> @@ -1891,78 +1920,86 @@ static void __guc_context_destroy(struct intel_context *ce)
>   	}
>   }
>   
> +static void guc_flush_destroyed_contexts(struct intel_guc *guc)
> +{
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	GEM_BUG_ON(!submission_disabled(guc) &&
> +		   guc_submission_initialized(guc));
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	list_for_each_entry_safe(ce, cn,
> +				 &guc->submission_state.destroyed_contexts,
> +				 destroyed_link) {
> +		list_del_init(&ce->destroyed_link);
> +		__release_guc_id(guc, ce);
> +		__guc_context_destroy(ce);
> +	}
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +}
> +
> +static void deregister_destroyed_contexts(struct intel_guc *guc)
> +{
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	list_for_each_entry_safe(ce, cn,
> +				 &guc->submission_state.destroyed_contexts,
> +				 destroyed_link) {
> +		list_del_init(&ce->destroyed_link);
> +		guc_lrc_desc_unpin(ce);
> +	}
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +}
> +
> +static void destroyed_worker_func(struct work_struct *w)
> +{
> +	struct intel_guc *guc = container_of(w, struct intel_guc,
> +					     submission_state.destroyed_worker);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	int tmp;
> +
> +	with_intel_gt_pm(gt, tmp)
> +		deregister_destroyed_contexts(guc);
> +}
> +
>   static void guc_context_destroy(struct kref *kref)
>   {
>   	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> -	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
>   	struct intel_guc *guc = ce_to_guc(ce);
> -	intel_wakeref_t wakeref;
>   	unsigned long flags;
> -	bool disabled;
> +	bool destroy;
>   
>   	/*
>   	 * If the guc_id is invalid this context has been stolen and we can free
>   	 * it immediately. Also can be freed immediately if the context is not
>   	 * registered with the GuC or the GuC is in the middle of a reset.
>   	 */
> -	if (context_guc_id_invalid(ce)) {
> -		__guc_context_destroy(ce);
> -		return;
> -	} else if (submission_disabled(guc) ||
> -		   !lrc_desc_registered(guc, ce->guc_id.id)) {
> -		release_guc_id(guc, ce);
> -		__guc_context_destroy(ce);
> -		return;
> -	}
> -
> -	/*
> -	 * We have to acquire the context spinlock and check guc_id again, if it
> -	 * is valid it hasn't been stolen and needs to be deregistered. We
> -	 * delete this context from the list of unpinned guc_id available to
> -	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
> -	 * returns indicating this context has been deregistered the guc_id is
> -	 * returned to the pool of available guc_id.
> -	 */
>   	spin_lock_irqsave(&guc->submission_state.lock, flags);
> -	if (context_guc_id_invalid(ce)) {
> -		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> -		__guc_context_destroy(ce);
> -		return;
> +	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
> +		!lrc_desc_registered(guc, ce->guc_id.id);
> +	if (likely(!destroy)) {
> +		if (!list_empty(&ce->guc_id.link))
> +			list_del_init(&ce->guc_id.link);
> +		list_add_tail(&ce->destroyed_link,
> +			      &guc->submission_state.destroyed_contexts);
> +	} else {
> +		__release_guc_id(guc, ce);
'destroy' can be true if the guc_id is invalid. Is it good to call 
release on an invalid id?

John.

>   	}
> -
> -	if (!list_empty(&ce->guc_id.link))
> -		list_del_init(&ce->guc_id.link);
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> -
> -	/* Seal race with Reset */
> -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> -	disabled = submission_disabled(guc);
> -	if (likely(!disabled)) {
> -		set_context_destroyed(ce);
> -		clr_context_registered(ce);
> -	}
> -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> -	if (unlikely(disabled)) {
> -		release_guc_id(guc, ce);
> +	if (unlikely(destroy)) {
>   		__guc_context_destroy(ce);
>   		return;
>   	}
>   
>   	/*
> -	 * We defer GuC context deregistration until the context is destroyed
> -	 * in order to save on CTBs. With this optimization ideally we only need
> -	 * 1 CTB to register the context during the first pin and 1 CTB to
> -	 * deregister the context when the context is destroyed. Without this
> -	 * optimization, a CTB would be needed every pin & unpin.
> -	 *
> -	 * XXX: Need to acquire the runtime wakeref as this can be triggered
> -	 * from context_free_worker when runtime wakeref is not held.
> -	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
> -	 * in H2G CTB to deregister the context. A future patch may defer this
> -	 * H2G CTB if the runtime wakeref is zero.
> +	 * We use a worker to issue the H2G to deregister the context as we can
> +	 * take the GT PM for the first time which isn't allowed from an atomic
> +	 * context.
>   	 */
> -	with_intel_runtime_pm(runtime_pm, wakeref)
> -		guc_lrc_desc_unpin(ce);
> +	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
>   }
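
To spell out the new flow: the final kref put can happen in atomic
context, so guc_context_destroy() now only queues the worker, and the
worker takes the (sleeping) GT PM wakeref before issuing the H2G. In
outline:

	/* possibly atomic context: just defer */
	queue_work(system_unbound_wq,
		   &guc->submission_state.destroyed_worker);

	/* worker, process context: safe to take the wakeref */
	with_intel_gt_pm(gt, tmp)
		deregister_destroyed_contexts(guc);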
>   
>   static int guc_context_alloc(struct intel_context *ce)
> @@ -2798,6 +2835,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
>   		intel_context_put(ce);
>   	} else if (context_destroyed(ce)) {
>   		/* Context has been destroyed */
> +		intel_gt_pm_put_async(guc_to_gt(guc));
>   		release_guc_id(guc, ce);
>   		__guc_context_destroy(ce);
>   	}
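
Presumably the intel_gt_pm_put_async() here pairs with a GT PM reference
taken when the deregistration H2G was issued (not visible in this hunk),
keeping the GT awake until this G2H confirmation arrives; the async put
is needed because this handler runs in a context where the wakeref mutex
cannot be taken directly.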



* Re: [PATCH 03/26] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-10-04 22:06   ` Matthew Brost
@ 2021-10-07  3:45     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07  3:45 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Taking a PM reference to prevent intel_gt_wait_for_idle from short
> circuiting while a scheduling of user context could be enabled.
I'm not sure what 'while a scheduling of user context could be enabled' 
means.

John.

> Returning GT idle when it is not can cause all sorts of issues
> throughout the stack.
>
> v2:
>   (Daniel Vetter)
>    - Add might_lock annotations to pin / unpin function
> v3:
>   (CI)
>    - Drop intel_engine_pm_might_put from unpin path as an async put is
>      used
> v4:
>   (John Harrison)
>    - Make intel_engine_pm_might_get/put work with GuC virtual engines
>    - Update commit message
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
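
The v3 note is worth spelling out: intel_engine_pm_put_async() defers the
final wakeref put (and hence the mutex) to a worker instead of taking it
in the caller's context, so a might_lock annotation on the unpin path
would presumably have lockdep complain about atomic callers -- likely
what CI caught. Roughly:

	/* pin: may take engine->wakeref.mutex inline, so annotate */
	intel_engine_pm_might_get(ce->engine);

	/* unpin: the final put is punted to a worker, the mutex is
	 * never taken on this path, so no annotation */
	intel_engine_pm_put_async(ce->engine);
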
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |  2 ++
>   drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 32 +++++++++++++++++
>   drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
>   drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
>   5 files changed, 89 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 1076066f41e0..f601323b939f 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
>   	if (err)
>   		goto err_post_unpin;
>   
> +	intel_engine_pm_might_get(ce->engine);
> +
>   	if (unlikely(intel_context_is_closed(ce))) {
>   		err = -ENOENT;
>   		goto err_unlock;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> index 6fdeae668e6e..d68675925b79 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> @@ -6,9 +6,11 @@
>   #ifndef INTEL_ENGINE_PM_H
>   #define INTEL_ENGINE_PM_H
>   
> +#include "i915_drv.h"
>   #include "i915_request.h"
>   #include "intel_engine_types.h"
>   #include "intel_wakeref.h"
> +#include "intel_gt_pm.h"
>   
>   static inline bool
>   intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> @@ -31,6 +33,21 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
>   	return intel_wakeref_get_if_active(&engine->wakeref);
>   }
>   
> +static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
> +{
> +	if (!intel_engine_is_virtual(engine)) {
> +		intel_wakeref_might_get(&engine->wakeref);
> +	} else {
> +		struct intel_gt *gt = engine->gt;
> +		struct intel_engine_cs *tengine;
> +		intel_engine_mask_t tmp, mask = engine->mask;
> +
> +		for_each_engine_masked(tengine, gt, mask, tmp)
> +			intel_wakeref_might_get(&tengine->wakeref);
> +	}
> +	intel_gt_pm_might_get(engine->gt);
> +}
> +
>   static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
>   {
>   	intel_wakeref_put(&engine->wakeref);
> @@ -52,6 +69,21 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
>   	intel_wakeref_unlock_wait(&engine->wakeref);
>   }
>   
> +static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
> +{
> +	if (!intel_engine_is_virtual(engine)) {
> +		intel_wakeref_might_put(&engine->wakeref);
> +	} else {
> +		struct intel_gt *gt = engine->gt;
> +		struct intel_engine_cs *tengine;
> +		intel_engine_mask_t tmp, mask = engine->mask;
> +
> +		for_each_engine_masked(tengine, gt, mask, tmp)
> +			intel_wakeref_might_put(&tengine->wakeref);
> +	}
> +	intel_gt_pm_might_put(engine->gt);
> +}
> +
>   static inline struct i915_request *
>   intel_engine_create_kernel_request(struct intel_engine_cs *engine)
>   {
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> index 05de6c1af25b..bc898df7a48c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> @@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
>   	return intel_wakeref_get_if_active(&gt->wakeref);
>   }
>   
> +static inline void intel_gt_pm_might_get(struct intel_gt *gt)
> +{
> +	intel_wakeref_might_get(&gt->wakeref);
> +}
> +
>   static inline void intel_gt_pm_put(struct intel_gt *gt)
>   {
>   	intel_wakeref_put(&gt->wakeref);
> @@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
>   	intel_wakeref_put_async(&gt->wakeref);
>   }
>   
> +static inline void intel_gt_pm_might_put(struct intel_gt *gt)
> +{
> +	intel_wakeref_might_put(&gt->wakeref);
> +}
> +
>   #define with_intel_gt_pm(gt, tmp) \
>   	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
>   	     intel_gt_pm_put(gt), tmp = 0)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 17da2fea1bff..8b82da50c2bc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1571,7 +1571,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
>   
>   static int guc_context_pin(struct intel_context *ce, void *vaddr)
>   {
> -	return __guc_context_pin(ce, ce->engine, vaddr);
> +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> +
> +	if (likely(!ret && !intel_context_is_barrier(ce)))
> +		intel_engine_pm_get(ce->engine);
> +
> +	return ret;
>   }
>   
>   static void guc_context_unpin(struct intel_context *ce)
> @@ -1580,6 +1585,9 @@ static void guc_context_unpin(struct intel_context *ce)
>   
>   	unpin_guc_id(guc, ce);
>   	lrc_unpin(ce);
> +
> +	if (likely(!intel_context_is_barrier(ce)))
> +		intel_engine_pm_put_async(ce->engine);
>   }
>   
>   static void guc_context_post_unpin(struct intel_context *ce)
> @@ -2341,8 +2349,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
>   static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
>   {
>   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> +	int ret = __guc_context_pin(ce, engine, vaddr);
> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> +
> +	if (likely(!ret))
> +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> +			intel_engine_pm_get(engine);
>   
> -	return __guc_context_pin(ce, engine, vaddr);
> +	return ret;
> +}
> +
> +static void guc_virtual_context_unpin(struct intel_context *ce)
> +{
> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> +	struct intel_engine_cs *engine;
> +	struct intel_guc *guc = ce_to_guc(ce);
> +
> +	GEM_BUG_ON(context_enabled(ce));
> +	GEM_BUG_ON(intel_context_is_barrier(ce));
> +
> +	unpin_guc_id(guc, ce);
> +	lrc_unpin(ce);
> +
> +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> +		intel_engine_pm_put_async(engine);
>   }
>   
>   static void guc_virtual_context_enter(struct intel_context *ce)
> @@ -2379,7 +2409,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
>   
>   	.pre_pin = guc_virtual_context_pre_pin,
>   	.pin = guc_virtual_context_pin,
> -	.unpin = guc_context_unpin,
> +	.unpin = guc_virtual_context_unpin,
>   	.post_unpin = guc_context_post_unpin,
>   
>   	.ban = guc_context_ban,
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> index 545c8f277c46..4f4c2e15e736 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.h
> +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> @@ -123,6 +123,12 @@ enum {
>   	__INTEL_WAKEREF_PUT_LAST_BIT__
>   };
>   
> +static inline void
> +intel_wakeref_might_get(struct intel_wakeref *wf)
> +{
> +	might_lock(&wf->mutex);
> +}
> +
>   /**
>    * intel_wakeref_put_flags: Release the wakeref
>    * @wf: the wakeref
> @@ -170,6 +176,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
>   			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
>   }
>   
> +static inline void
> +intel_wakeref_might_put(struct intel_wakeref *wf)
> +{
> +	might_lock(&wf->mutex);
> +}
> +
>   /**
>    * intel_wakeref_lock: Lock the wakeref (mutex)
>    * @wf: the wakeref


* Re: [PATCH 04/26] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-07  3:49     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07  3:49 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Calling switch_to_kernel_context isn't needed if the engine PM reference
> is taken while all user contexts are pinned: if we don't hold a PM
> reference, it is guaranteed that scheduling is disabled on all user
> contexts. By not calling
> switch_to_kernel_context we save on issuing a request to the engine.
>
> v2:
>   (Daniel Vetter)
>    - Add FIXME comment about pushing switch_to_kernel_context to backend
> v3:
>   (John Harrison)
>    - Update commit message
>    - Fix wording in comment
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> index dacd62773735..a1334b48dde7 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> @@ -162,6 +162,19 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>   	unsigned long flags;
>   	bool result = true;
>   
> +	/*
> +	 * This is execlist specific behaviour intended to ensure the GPU is
> +	 * idle by switching to a known 'safe' context. With GuC submission, the
> +	 * same idle guarantee is achieved by other means (disabling
> +	 * scheduling). Further, switching to a 'safe' context has no effect
> +	 * with GuC submission as the scheduler can just switch back again.
> +	 *
> +	 * FIXME: Move this backend scheduler specific behaviour into the
> +	 * scheduler backend.
> +	 */
> +	if (intel_engine_uses_guc(engine))
> +		return true;
> +
>   	/* GPU is pointing to the void, as good as in the kernel context. */
>   	if (intel_gt_is_wedged(engine->gt))
>   		return true;
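
As I read it, the 'other means' on the GuC side is the pinned-context PM
accounting from the previous patch plus the scheduling disable issued as
contexts go idle: the GPU ends up idle by construction rather than by
injecting a kernel-context request. A rough sketch of the contrast:

	/* execlists: park by switching to a known 'safe' context */
	switch_to_kernel_context(engine);

	/* GuC: nothing to submit -- the engine/GT PM references held
	 * while any user context is pinned already guarantee idleness */
	return true;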



* Re: [PATCH 01/26] drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
  2021-10-07  3:06     ` [Intel-gfx] " John Harrison
@ 2021-10-07 15:05       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-07 15:05 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Wed, Oct 06, 2021 at 08:06:41PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Move guc_id allocation under submission state sub-struct as a future
> > patch will reuse the spin lock as a global submission state lock. Moving
> > this into sub-struct makes ownership of fields / lock clear.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |  6 +-
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        | 26 +++++----
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 ++++++++++---------
> >   3 files changed, 47 insertions(+), 41 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 12252c411159..e7e3984aab78 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -197,18 +197,18 @@ struct intel_context {
> >   	struct {
> >   		/**
> >   		 * @id: handle which is used to uniquely identify this context
> > -		 * with the GuC, protected by guc->contexts_lock
> > +		 * with the GuC, protected by guc->submission_state.lock
> >   		 */
> >   		u16 id;
> >   		/**
> >   		 * @ref: the number of references to the guc_id, when
> >   		 * transitioning in and out of zero protected by
> > -		 * guc->contexts_lock
> > +		 * guc->submission_state.lock
> >   		 */
> >   		atomic_t ref;
> >   		/**
> >   		 * @link: in guc->guc_id_list when the guc_id has no refs but is
> > -		 * still valid, protected by guc->contexts_lock
> > +		 * still valid, protected by guc->submission_state.lock
> >   		 */
> >   		struct list_head link;
> >   	} guc_id;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 5dd174babf7a..65b5e8eeef96 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -70,17 +70,21 @@ struct intel_guc {
> >   		void (*disable)(struct intel_guc *guc);
> >   	} interrupts;
> > -	/**
> > -	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
> > -	 * ce->guc_id.ref when transitioning in and out of zero
> > -	 */
> > -	spinlock_t contexts_lock;
> > -	/** @guc_ids: used to allocate unique ce->guc_id.id values */
> > -	struct ida guc_ids;
> > -	/**
> > -	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> > -	 */
> > -	struct list_head guc_id_list;
> > +	struct {
> > +		/**
> > +		 * @lock: protects everything in submission_state
> > +		 */
> > +		spinlock_t lock;
> The old version also mentioned 'ce->guc_id.ref'. Should this not also
> mention that transition? Or was the old comment inaccurate? I'm not seeing
> any actual behaviour changes in the patch.
> 
> 

Can add that back in.

> > +		/**
> > +		 * @guc_ids: used to allocate new guc_ids
> > +		 */
> > +		struct ida guc_ids;
> > +		/**
> > +		 * @guc_id_list: list of intel_context with valid guc_ids but no
> > +		 * refs
> > +		 */
> > +		struct list_head guc_id_list;
> > +	} submission_state;
> >   	/**
> >   	 * @submission_supported: tracks whether we support GuC submission on
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index ba0de35f6323..ad5c18119d92 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -68,16 +68,16 @@
> >    * fence is used to stall all requests associated with this guc_id until the
> >    * corresponding G2H returns indicating the guc_id has been deregistered.
> >    *
> > - * guc_ids:
> > + * submission_state.guc_ids:
> >    * Unique number associated with private GuC context data passed in during
> >    * context registration / submission / deregistration. 64k available. Simple ida
> >    * is used for allocation.
> >    *
> > - * Stealing guc_ids:
> > - * If no guc_ids are available they can be stolen from another context at
> > - * request creation time if that context is unpinned. If a guc_id can't be found
> > - * we punt this problem to the user as we believe this is near impossible to hit
> > - * during normal use cases.
> > + * Stealing submission_state.guc_ids:
> > + * If no submission_state.guc_ids are available they can be stolen from another
> I would abbreviate this instance as well, submission_state.guc_id is quite
> the mouthful. Unless this somehow magically links back to the structure
> entry in the kerneldoc output?
>

It might, I'm not really sure, but I agree the submission_state prefix
should be dropped. I think it changed because of a global find & replace.
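
For what it's worth, kernel-doc does auto-highlight patterns like
function() and &struct foo, and (if memory serves) struct members written
as &intel_guc.submission_state, but a bare 'submission_state.guc_ids' in
running text stays plain text, so abbreviating it loses nothing. An
illustrative comment:

	/**
	 * @guc_ids: used to allocate new guc_ids; see
	 * &intel_guc.submission_state for the locking rules
	 */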

Matt

> John.
> 
> > + * context at request creation time if that context is unpinned. If a guc_id
> > + * can't be found we punt this problem to the user as we believe this is near
> > + * impossible to hit during normal use cases.
> >    *
> >    * Locking:
> >    * In the GuC submission code we have 3 basic spin locks which protect
> > @@ -89,7 +89,7 @@
> >    * sched_engine can be submitting at a time. Currently only one sched_engine is
> >    * used for all of GuC submission but that could change in the future.
> >    *
> > - * guc->contexts_lock
> > + * guc->submission_state.lock
> >    * Protects guc_id allocation for the given GuC, i.e. only one context can be
> >    * doing guc_id allocation operations at a time for each GuC in the system.
> >    *
> > @@ -103,7 +103,7 @@
> >    *
> >    * Lock ordering rules:
> >    * sched_engine->lock -> ce->guc_state.lock
> > - * guc->contexts_lock -> ce->guc_state.lock
> > + * guc->submission_state.lock -> ce->guc_state.lock
> >    *
> >    * Reset races:
> >    * When a full GT reset is triggered it is assumed that some G2H responses to
> > @@ -1148,9 +1148,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
> > -	spin_lock_init(&guc->contexts_lock);
> > -	INIT_LIST_HEAD(&guc->guc_id_list);
> > -	ida_init(&guc->guc_ids);
> > +	spin_lock_init(&guc->submission_state.lock);
> > +	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> > +	ida_init(&guc->submission_state.guc_ids);
> >   	return 0;
> >   }
> > @@ -1215,7 +1215,7 @@ static void guc_submit_request(struct i915_request *rq)
> >   static int new_guc_id(struct intel_guc *guc)
> >   {
> > -	return ida_simple_get(&guc->guc_ids, 0,
> > +	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> >   			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
> >   			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> >   }
> > @@ -1223,7 +1223,8 @@ static int new_guc_id(struct intel_guc *guc)
> >   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> >   	if (!context_guc_id_invalid(ce)) {
> > -		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
> > +		ida_simple_remove(&guc->submission_state.guc_ids,
> > +				  ce->guc_id.id);
> >   		reset_lrc_desc(guc, ce->guc_id.id);
> >   		set_context_guc_id_invalid(ce);
> >   	}
> > @@ -1235,9 +1236,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> >   	unsigned long flags;
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	__release_guc_id(guc, ce);
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> >   static int steal_guc_id(struct intel_guc *guc)
> > @@ -1245,10 +1246,10 @@ static int steal_guc_id(struct intel_guc *guc)
> >   	struct intel_context *ce;
> >   	int guc_id;
> > -	lockdep_assert_held(&guc->contexts_lock);
> > +	lockdep_assert_held(&guc->submission_state.lock);
> > -	if (!list_empty(&guc->guc_id_list)) {
> > -		ce = list_first_entry(&guc->guc_id_list,
> > +	if (!list_empty(&guc->submission_state.guc_id_list)) {
> > +		ce = list_first_entry(&guc->submission_state.guc_id_list,
> >   				      struct intel_context,
> >   				      guc_id.link);
> > @@ -1273,7 +1274,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
> >   {
> >   	int ret;
> > -	lockdep_assert_held(&guc->contexts_lock);
> > +	lockdep_assert_held(&guc->submission_state.lock);
> >   	ret = new_guc_id(guc);
> >   	if (unlikely(ret < 0)) {
> > @@ -1295,7 +1296,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> >   try_again:
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	might_lock(&ce->guc_state.lock);
> > @@ -1310,7 +1311,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	atomic_inc(&ce->guc_id.ref);
> >   out_unlock:
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   	/*
> >   	 * -EAGAIN indicates no guc_id are available, let's retire any
> > @@ -1346,11 +1347,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	if (unlikely(context_guc_id_invalid(ce)))
> >   		return;
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
> >   	    !atomic_read(&ce->guc_id.ref))
> > -		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +		list_add_tail(&ce->guc_id.link,
> > +			      &guc->submission_state.guc_id_list);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> >   static int __guc_action_register_context(struct intel_guc *guc,
> > @@ -1921,16 +1923,16 @@ static void guc_context_destroy(struct kref *kref)
> >   	 * returns indicating this context has been deregistered the guc_id is
> >   	 * returned to the pool of available guc_id.
> >   	 */
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	if (context_guc_id_invalid(ce)) {
> > -		spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   		__guc_context_destroy(ce);
> >   		return;
> >   	}
> >   	if (!list_empty(&ce->guc_id.link))
> >   		list_del_init(&ce->guc_id.link);
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   	/* Seal race with Reset */
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> 


* Re: [PATCH 03/26] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-10-07  3:45     ` [Intel-gfx] " John Harrison
@ 2021-10-07 15:19       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-07 15:19 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Wed, Oct 06, 2021 at 08:45:42PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > circuiting while a scheduling of user context could be enabled.
> I'm not sure what 'while a scheduling of user context could be enabled'
> means.
>

Not really sure how this isn't clear.

It means that if any user context has scheduling enabled, this function
cannot short-circuit and return idle.
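
Concretely (a simplified sketch, not the exact code): the wait bails out
early when the GT wakeref is not held, so if a user context could still
have scheduling enabled without contributing a PM reference, the wait
would wrongly report idle:

	/* simplified sketch of the short-circuit being closed */
	int intel_gt_wait_for_idle(struct intel_gt *gt, long timeout)
	{
		/* device asleep: assume nothing is outstanding */
		if (!intel_gt_pm_is_awake(gt))
			return 0;

		/* ... otherwise actually wait for requests ... */
	}

Holding an engine (and hence GT) PM reference for every pinned user
context closes that hole.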

Matt
 
> John.
> 
> > Returning GT idle when it is not can cause all sorts of issues
> > throughout the stack.
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add might_lock annotations to pin / unpin function
> > v3:
> >   (CI)
> >    - Drop intel_engine_pm_might_put from unpin path as an async put is
> >      used
> > v4:
> >   (John Harrison)
> >    - Make intel_engine_pm_might_get/put work with GuC virtual engines
> >    - Update commit message
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |  2 ++
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 32 +++++++++++++++++
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
> >   drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
> >   5 files changed, 89 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 1076066f41e0..f601323b939f 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> >   	if (err)
> >   		goto err_post_unpin;
> > +	intel_engine_pm_might_get(ce->engine);
> > +
> >   	if (unlikely(intel_context_is_closed(ce))) {
> >   		err = -ENOENT;
> >   		goto err_unlock;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > index 6fdeae668e6e..d68675925b79 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > @@ -6,9 +6,11 @@
> >   #ifndef INTEL_ENGINE_PM_H
> >   #define INTEL_ENGINE_PM_H
> > +#include "i915_drv.h"
> >   #include "i915_request.h"
> >   #include "intel_engine_types.h"
> >   #include "intel_wakeref.h"
> > +#include "intel_gt_pm.h"
> >   static inline bool
> >   intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> > @@ -31,6 +33,21 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
> >   	return intel_wakeref_get_if_active(&engine->wakeref);
> >   }
> > +static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
> > +{
> > +	if (!intel_engine_is_virtual(engine)) {
> > +		intel_wakeref_might_get(&engine->wakeref);
> > +	} else {
> > +		struct intel_gt *gt = engine->gt;
> > +		struct intel_engine_cs *tengine;
> > +		intel_engine_mask_t tmp, mask = engine->mask;
> > +
> > +		for_each_engine_masked(tengine, gt, mask, tmp)
> > +			intel_wakeref_might_get(&tengine->wakeref);
> > +	}
> > +	intel_gt_pm_might_get(engine->gt);
> > +}
> > +
> >   static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
> >   {
> >   	intel_wakeref_put(&engine->wakeref);
> > @@ -52,6 +69,21 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
> >   	intel_wakeref_unlock_wait(&engine->wakeref);
> >   }
> > +static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
> > +{
> > +	if (!intel_engine_is_virtual(engine)) {
> > +		intel_wakeref_might_put(&engine->wakeref);
> > +	} else {
> > +		struct intel_gt *gt = engine->gt;
> > +		struct intel_engine_cs *tengine;
> > +		intel_engine_mask_t tmp, mask = engine->mask;
> > +
> > +		for_each_engine_masked(tengine, gt, mask, tmp)
> > +			intel_wakeref_might_put(&tengine->wakeref);
> > +	}
> > +	intel_gt_pm_might_put(engine->gt);
> > +}
> > +
> >   static inline struct i915_request *
> >   intel_engine_create_kernel_request(struct intel_engine_cs *engine)
> >   {
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > index 05de6c1af25b..bc898df7a48c 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > @@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
> >   	return intel_wakeref_get_if_active(&gt->wakeref);
> >   }
> > +static inline void intel_gt_pm_might_get(struct intel_gt *gt)
> > +{
> > +	intel_wakeref_might_get(&gt->wakeref);
> > +}
> > +
> >   static inline void intel_gt_pm_put(struct intel_gt *gt)
> >   {
> >   	intel_wakeref_put(&gt->wakeref);
> > @@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
> >   	intel_wakeref_put_async(&gt->wakeref);
> >   }
> > +static inline void intel_gt_pm_might_put(struct intel_gt *gt)
> > +{
> > +	intel_wakeref_might_put(&gt->wakeref);
> > +}
> > +
> >   #define with_intel_gt_pm(gt, tmp) \
> >   	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> >   	     intel_gt_pm_put(gt), tmp = 0)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 17da2fea1bff..8b82da50c2bc 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1571,7 +1571,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
> >   static int guc_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> > -	return __guc_context_pin(ce, ce->engine, vaddr);
> > +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > +
> > +	if (likely(!ret && !intel_context_is_barrier(ce)))
> > +		intel_engine_pm_get(ce->engine);
> > +
> > +	return ret;
> >   }
> >   static void guc_context_unpin(struct intel_context *ce)
> > @@ -1580,6 +1585,9 @@ static void guc_context_unpin(struct intel_context *ce)
> >   	unpin_guc_id(guc, ce);
> >   	lrc_unpin(ce);
> > +
> > +	if (likely(!intel_context_is_barrier(ce)))
> > +		intel_engine_pm_put_async(ce->engine);
> >   }
> >   static void guc_context_post_unpin(struct intel_context *ce)
> > @@ -2341,8 +2349,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
> >   static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> >   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > +	int ret = __guc_context_pin(ce, engine, vaddr);
> > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > +
> > +	if (likely(!ret))
> > +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > +			intel_engine_pm_get(engine);
> > -	return __guc_context_pin(ce, engine, vaddr);
> > +	return ret;
> > +}
> > +
> > +static void guc_virtual_context_unpin(struct intel_context *ce)
> > +{
> > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > +	struct intel_engine_cs *engine;
> > +	struct intel_guc *guc = ce_to_guc(ce);
> > +
> > +	GEM_BUG_ON(context_enabled(ce));
> > +	GEM_BUG_ON(intel_context_is_barrier(ce));
> > +
> > +	unpin_guc_id(guc, ce);
> > +	lrc_unpin(ce);
> > +
> > +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > +		intel_engine_pm_put_async(engine);
> >   }
> >   static void guc_virtual_context_enter(struct intel_context *ce)
> > @@ -2379,7 +2409,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> >   	.pre_pin = guc_virtual_context_pre_pin,
> >   	.pin = guc_virtual_context_pin,
> > -	.unpin = guc_context_unpin,
> > +	.unpin = guc_virtual_context_unpin,
> >   	.post_unpin = guc_context_post_unpin,
> >   	.ban = guc_context_ban,
> > diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> > index 545c8f277c46..4f4c2e15e736 100644
> > --- a/drivers/gpu/drm/i915/intel_wakeref.h
> > +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> > @@ -123,6 +123,12 @@ enum {
> >   	__INTEL_WAKEREF_PUT_LAST_BIT__
> >   };
> > +static inline void
> > +intel_wakeref_might_get(struct intel_wakeref *wf)
> > +{
> > +	might_lock(&wf->mutex);
> > +}
> > +
> >   /**
> >    * intel_wakeref_put_flags: Release the wakeref
> >    * @wf: the wakeref
> > @@ -170,6 +176,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
> >   			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
> >   }
> > +static inline void
> > +intel_wakeref_might_put(struct intel_wakeref *wf)
> > +{
> > +	might_lock(&wf->mutex);
> > +}
> > +
> >   /**
> >    * intel_wakeref_lock: Lock the wakeref (mutex)
> >    * @wf: the wakeref
> 


* Re: [Intel-gfx] [PATCH 03/26] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
@ 2021-10-07 15:19       ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-07 15:19 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Wed, Oct 06, 2021 at 08:45:42PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > circuiting while a scheduling of user context could be enabled.
> I'm not sure what 'while a scheduling of user context could be enabled'
> means.
>

Not really sure how this isn't clear.

It means that if any user context has scheduling enabled, this function
cannot short-circuit and return idle.

Matt
 
> John.
> 
> > Returning GT idle when it is not can cause all sorts of issues
> > throughout the stack.
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add might_lock annotations to pin / unpin function
> > v3:
> >   (CI)
> >    - Drop intel_engine_pm_might_put from unpin path as an async put is
> >      used
> > v4:
> >   (John Harrison)
> >    - Make intel_engine_pm_might_get/put work with GuC virtual engines
> >    - Update commit message
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |  2 ++
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 32 +++++++++++++++++
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
> >   drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
> >   5 files changed, 89 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 1076066f41e0..f601323b939f 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> >   	if (err)
> >   		goto err_post_unpin;
> > +	intel_engine_pm_might_get(ce->engine);
> > +
> >   	if (unlikely(intel_context_is_closed(ce))) {
> >   		err = -ENOENT;
> >   		goto err_unlock;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > index 6fdeae668e6e..d68675925b79 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > @@ -6,9 +6,11 @@
> >   #ifndef INTEL_ENGINE_PM_H
> >   #define INTEL_ENGINE_PM_H
> > +#include "i915_drv.h"
> >   #include "i915_request.h"
> >   #include "intel_engine_types.h"
> >   #include "intel_wakeref.h"
> > +#include "intel_gt_pm.h"
> >   static inline bool
> >   intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> > @@ -31,6 +33,21 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
> >   	return intel_wakeref_get_if_active(&engine->wakeref);
> >   }
> > +static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
> > +{
> > +	if (!intel_engine_is_virtual(engine)) {
> > +		intel_wakeref_might_get(&engine->wakeref);
> > +	} else {
> > +		struct intel_gt *gt = engine->gt;
> > +		struct intel_engine_cs *tengine;
> > +		intel_engine_mask_t tmp, mask = engine->mask;
> > +
> > +		for_each_engine_masked(tengine, gt, mask, tmp)
> > +			intel_wakeref_might_get(&tengine->wakeref);
> > +	}
> > +	intel_gt_pm_might_get(engine->gt);
> > +}
> > +
> >   static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
> >   {
> >   	intel_wakeref_put(&engine->wakeref);
> > @@ -52,6 +69,21 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
> >   	intel_wakeref_unlock_wait(&engine->wakeref);
> >   }
> > +static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
> > +{
> > +	if (!intel_engine_is_virtual(engine)) {
> > +		intel_wakeref_might_put(&engine->wakeref);
> > +	} else {
> > +		struct intel_gt *gt = engine->gt;
> > +		struct intel_engine_cs *tengine;
> > +		intel_engine_mask_t tmp, mask = engine->mask;
> > +
> > +		for_each_engine_masked(tengine, gt, mask, tmp)
> > +			intel_wakeref_might_put(&tengine->wakeref);
> > +	}
> > +	intel_gt_pm_might_put(engine->gt);
> > +}
> > +
> >   static inline struct i915_request *
> >   intel_engine_create_kernel_request(struct intel_engine_cs *engine)
> >   {
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > index 05de6c1af25b..bc898df7a48c 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > @@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
> >   	return intel_wakeref_get_if_active(&gt->wakeref);
> >   }
> > +static inline void intel_gt_pm_might_get(struct intel_gt *gt)
> > +{
> > +	intel_wakeref_might_get(&gt->wakeref);
> > +}
> > +
> >   static inline void intel_gt_pm_put(struct intel_gt *gt)
> >   {
> >   	intel_wakeref_put(&gt->wakeref);
> > @@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
> >   	intel_wakeref_put_async(&gt->wakeref);
> >   }
> > +static inline void intel_gt_pm_might_put(struct intel_gt *gt)
> > +{
> > +	intel_wakeref_might_put(&gt->wakeref);
> > +}
> > +
> >   #define with_intel_gt_pm(gt, tmp) \
> >   	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> >   	     intel_gt_pm_put(gt), tmp = 0)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 17da2fea1bff..8b82da50c2bc 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1571,7 +1571,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
> >   static int guc_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> > -	return __guc_context_pin(ce, ce->engine, vaddr);
> > +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > +
> > +	if (likely(!ret && !intel_context_is_barrier(ce)))
> > +		intel_engine_pm_get(ce->engine);
> > +
> > +	return ret;
> >   }
> >   static void guc_context_unpin(struct intel_context *ce)
> > @@ -1580,6 +1585,9 @@ static void guc_context_unpin(struct intel_context *ce)
> >   	unpin_guc_id(guc, ce);
> >   	lrc_unpin(ce);
> > +
> > +	if (likely(!intel_context_is_barrier(ce)))
> > +		intel_engine_pm_put_async(ce->engine);
> >   }
> >   static void guc_context_post_unpin(struct intel_context *ce)
> > @@ -2341,8 +2349,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
> >   static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> >   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > +	int ret = __guc_context_pin(ce, engine, vaddr);
> > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > +
> > +	if (likely(!ret))
> > +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > +			intel_engine_pm_get(engine);
> > -	return __guc_context_pin(ce, engine, vaddr);
> > +	return ret;
> > +}
> > +
> > +static void guc_virtual_context_unpin(struct intel_context *ce)
> > +{
> > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > +	struct intel_engine_cs *engine;
> > +	struct intel_guc *guc = ce_to_guc(ce);
> > +
> > +	GEM_BUG_ON(context_enabled(ce));
> > +	GEM_BUG_ON(intel_context_is_barrier(ce));
> > +
> > +	unpin_guc_id(guc, ce);
> > +	lrc_unpin(ce);
> > +
> > +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > +		intel_engine_pm_put_async(engine);
> >   }
> >   static void guc_virtual_context_enter(struct intel_context *ce)
> > @@ -2379,7 +2409,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> >   	.pre_pin = guc_virtual_context_pre_pin,
> >   	.pin = guc_virtual_context_pin,
> > -	.unpin = guc_context_unpin,
> > +	.unpin = guc_virtual_context_unpin,
> >   	.post_unpin = guc_context_post_unpin,
> >   	.ban = guc_context_ban,
> > diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> > index 545c8f277c46..4f4c2e15e736 100644
> > --- a/drivers/gpu/drm/i915/intel_wakeref.h
> > +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> > @@ -123,6 +123,12 @@ enum {
> >   	__INTEL_WAKEREF_PUT_LAST_BIT__
> >   };
> > +static inline void
> > +intel_wakeref_might_get(struct intel_wakeref *wf)
> > +{
> > +	might_lock(&wf->mutex);
> > +}
> > +
> >   /**
> >    * intel_wakeref_put_flags: Release the wakeref
> >    * @wf: the wakeref
> > @@ -170,6 +176,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
> >   			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
> >   }
> > +static inline void
> > +intel_wakeref_might_put(struct intel_wakeref *wf)
> > +{
> > +	might_lock(&wf->mutex);
> > +}
> > +
> >   /**
> >    * intel_wakeref_lock: Lock the wakeref (mutex)
> >    * @wf: the wakeref
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread
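
The invariant being debated above, sketched standalone (a simplified model,
not the actual i915 implementation; do_pin() and
wait_for_outstanding_requests() are made-up placeholders, and the usual i915
headers such as intel_gt_pm.h and intel_engine_pm.h are assumed):

static int do_pin(struct intel_context *ce);
static long wait_for_outstanding_requests(struct intel_gt *gt, long timeout);

/* Simplified model: idleness is derived from the GT wakeref. */
static long gt_wait_for_idle(struct intel_gt *gt, long timeout)
{
	/* Short-circuits to "idle" when no one holds a GT wakeref. */
	if (!intel_gt_pm_is_awake(gt))
		return 0;

	return wait_for_outstanding_requests(gt, timeout);
}

static int context_pin(struct intel_context *ce)
{
	int ret = do_pin(ce);	/* placeholder for the real pin */

	/*
	 * Hold engine PM (and transitively GT PM) for the lifetime of
	 * the pin: while this context may have scheduling enabled, the
	 * short-circuit above can never fire and falsely report idle.
	 */
	if (!ret)
		intel_engine_pm_get(ce->engine);

	return ret;
}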

* Re: [PATCH 01/26] drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
  2021-10-07 15:05       ` [Intel-gfx] " Matthew Brost
@ 2021-10-07 18:13         ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07 18:13 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On 10/7/2021 08:05, Matthew Brost wrote:
> On Wed, Oct 06, 2021 at 08:06:41PM -0700, John Harrison wrote:
>> On 10/4/2021 15:06, Matthew Brost wrote:
>>> Move guc_id allocation under submission state sub-struct as a future
>>> patch will reuse the spin lock as a global submission state lock. Moving
>>> this into sub-struct makes ownership of fields / lock clear.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_context_types.h |  6 +-
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc.h        | 26 +++++----
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 ++++++++++---------
>>>    3 files changed, 47 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> index 12252c411159..e7e3984aab78 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> @@ -197,18 +197,18 @@ struct intel_context {
>>>    	struct {
>>>    		/**
>>>    		 * @id: handle which is used to uniquely identify this context
>>> -		 * with the GuC, protected by guc->contexts_lock
>>> +		 * with the GuC, protected by guc->submission_state.lock
>>>    		 */
>>>    		u16 id;
>>>    		/**
>>>    		 * @ref: the number of references to the guc_id, when
>>>    		 * transitioning in and out of zero protected by
>>> -		 * guc->contexts_lock
>>> +		 * guc->submission_state.lock
>>>    		 */
>>>    		atomic_t ref;
>>>    		/**
>>>    		 * @link: in guc->guc_id_list when the guc_id has no refs but is
>>> -		 * still valid, protected by guc->contexts_lock
>>> +		 * still valid, protected by guc->submission_state.lock
>>>    		 */
>>>    		struct list_head link;
>>>    	} guc_id;
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> index 5dd174babf7a..65b5e8eeef96 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> @@ -70,17 +70,21 @@ struct intel_guc {
>>>    		void (*disable)(struct intel_guc *guc);
>>>    	} interrupts;
>>> -	/**
>>> -	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
>>> -	 * ce->guc_id.ref when transitioning in and out of zero
>>> -	 */
>>> -	spinlock_t contexts_lock;
>>> -	/** @guc_ids: used to allocate unique ce->guc_id.id values */
>>> -	struct ida guc_ids;
>>> -	/**
>>> -	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
>>> -	 */
>>> -	struct list_head guc_id_list;
>>> +	struct {
>>> +		/**
>>> +		 * @lock: protects everything in submission_state
>>> +		 */
>>> +		spinlock_t lock;
>> The old version also mentioned 'ce->guc_id.ref'. Should this not also
>> mention that transition? Or was the old comment inaccurate? I'm not seeing
>> any actual behaviour changes in the patch.
>>
>>
> Can add that back in.
>
>>> +		/**
>>> +		 * @guc_ids: used to allocate new guc_ids
>>> +		 */
>>> +		struct ida guc_ids;
>>> +		/**
>>> +		 * @guc_id_list: list of intel_context with valid guc_ids but no
>>> +		 * refs
>>> +		 */
>>> +		struct list_head guc_id_list;
>>> +	} submission_state;
>>>    	/**
>>>    	 * @submission_supported: tracks whether we support GuC submission on
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index ba0de35f6323..ad5c18119d92 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -68,16 +68,16 @@
>>>     * fence is used to stall all requests associated with this guc_id until the
>>>     * corresponding G2H returns indicating the guc_id has been deregistered.
>>>     *
>>> - * guc_ids:
>>> + * submission_state.guc_ids:
>>>     * Unique number associated with private GuC context data passed in during
>>>     * context registration / submission / deregistration. 64k available. Simple ida
>>>     * is used for allocation.
>>>     *
>>> - * Stealing guc_ids:
>>> - * If no guc_ids are available they can be stolen from another context at
>>> - * request creation time if that context is unpinned. If a guc_id can't be found
>>> - * we punt this problem to the user as we believe this is near impossible to hit
>>> - * during normal use cases.
>>> + * Stealing submission_state.guc_ids:
>>> + * If no submission_state.guc_ids are available they can be stolen from another
>> I would abbreviate this instance as well, submission_state.guc_id is quite
>> the mouthful. Unless this somehow magically links back to the structure
>> entry in the kerneldoc output?
>>
> It might, I'm not really sure, but I agree the submission_state prefix
> should be dropped. I think it changed because of a global find-and-replace.
>
> Matt
Okay. With those nits fixed:
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

>> John.
>>
>>> + * context at request creation time if that context is unpinned. If a guc_id
>>> + * can't be found we punt this problem to the user as we believe this is near
>>> + * impossible to hit during normal use cases.
>>>     *
>>>     * Locking:
>>>     * In the GuC submission code we have 3 basic spin locks which protect
>>> @@ -89,7 +89,7 @@
>>>     * sched_engine can be submitting at a time. Currently only one sched_engine is
>>>     * used for all of GuC submission but that could change in the future.
>>>     *
>>> - * guc->contexts_lock
>>> + * guc->submission_state.lock
>>>     * Protects guc_id allocation for the given GuC, i.e. only one context can be
>>>     * doing guc_id allocation operations at a time for each GuC in the system.
>>>     *
>>> @@ -103,7 +103,7 @@
>>>     *
>>>     * Lock ordering rules:
>>>     * sched_engine->lock -> ce->guc_state.lock
>>> - * guc->contexts_lock -> ce->guc_state.lock
>>> + * guc->submission_state.lock -> ce->guc_state.lock
>>>     *
>>>     * Reset races:
>>>     * When a full GT reset is triggered it is assumed that some G2H responses to
>>> @@ -1148,9 +1148,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>>    	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
>>> -	spin_lock_init(&guc->contexts_lock);
>>> -	INIT_LIST_HEAD(&guc->guc_id_list);
>>> -	ida_init(&guc->guc_ids);
>>> +	spin_lock_init(&guc->submission_state.lock);
>>> +	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
>>> +	ida_init(&guc->submission_state.guc_ids);
>>>    	return 0;
>>>    }
>>> @@ -1215,7 +1215,7 @@ static void guc_submit_request(struct i915_request *rq)
>>>    static int new_guc_id(struct intel_guc *guc)
>>>    {
>>> -	return ida_simple_get(&guc->guc_ids, 0,
>>> +	return ida_simple_get(&guc->submission_state.guc_ids, 0,
>>>    			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
>>>    			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>>>    }
>>> @@ -1223,7 +1223,8 @@ static int new_guc_id(struct intel_guc *guc)
>>>    static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    {
>>>    	if (!context_guc_id_invalid(ce)) {
>>> -		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
>>> +		ida_simple_remove(&guc->submission_state.guc_ids,
>>> +				  ce->guc_id.id);
>>>    		reset_lrc_desc(guc, ce->guc_id.id);
>>>    		set_context_guc_id_invalid(ce);
>>>    	}
>>> @@ -1235,9 +1236,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    {
>>>    	unsigned long flags;
>>> -	spin_lock_irqsave(&guc->contexts_lock, flags);
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>>    	__release_guc_id(guc, ce);
>>> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    }
>>>    static int steal_guc_id(struct intel_guc *guc)
>>> @@ -1245,10 +1246,10 @@ static int steal_guc_id(struct intel_guc *guc)
>>>    	struct intel_context *ce;
>>>    	int guc_id;
>>> -	lockdep_assert_held(&guc->contexts_lock);
>>> +	lockdep_assert_held(&guc->submission_state.lock);
>>> -	if (!list_empty(&guc->guc_id_list)) {
>>> -		ce = list_first_entry(&guc->guc_id_list,
>>> +	if (!list_empty(&guc->submission_state.guc_id_list)) {
>>> +		ce = list_first_entry(&guc->submission_state.guc_id_list,
>>>    				      struct intel_context,
>>>    				      guc_id.link);
>>> @@ -1273,7 +1274,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
>>>    {
>>>    	int ret;
>>> -	lockdep_assert_held(&guc->contexts_lock);
>>> +	lockdep_assert_held(&guc->submission_state.lock);
>>>    	ret = new_guc_id(guc);
>>>    	if (unlikely(ret < 0)) {
>>> @@ -1295,7 +1296,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
>>>    try_again:
>>> -	spin_lock_irqsave(&guc->contexts_lock, flags);
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>>    	might_lock(&ce->guc_state.lock);
>>> @@ -1310,7 +1311,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	atomic_inc(&ce->guc_id.ref);
>>>    out_unlock:
>>> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    	/*
>>>    	 * -EAGAIN indicates no guc_id are available, let's retire any
>>> @@ -1346,11 +1347,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	if (unlikely(context_guc_id_invalid(ce)))
>>>    		return;
>>> -	spin_lock_irqsave(&guc->contexts_lock, flags);
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>>    	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
>>>    	    !atomic_read(&ce->guc_id.ref))
>>> -		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
>>> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> +		list_add_tail(&ce->guc_id.link,
>>> +			      &guc->submission_state.guc_id_list);
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    }
>>>    static int __guc_action_register_context(struct intel_guc *guc,
>>> @@ -1921,16 +1923,16 @@ static void guc_context_destroy(struct kref *kref)
>>>    	 * returns indicating this context has been deregistered the guc_id is
>>>    	 * returned to the pool of available guc_id.
>>>    	 */
>>> -	spin_lock_irqsave(&guc->contexts_lock, flags);
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>>    	if (context_guc_id_invalid(ce)) {
>>> -		spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> +		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    		__guc_context_destroy(ce);
>>>    		return;
>>>    	}
>>>    	if (!list_empty(&ce->guc_id.link))
>>>    		list_del_init(&ce->guc_id.link);
>>> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    	/* Seal race with Reset */
>>>    	spin_lock_irqsave(&ce->guc_state.lock, flags);


^ permalink raw reply	[flat|nested] 165+ messages in thread
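
On the kerneldoc question raised above: bare text like
"submission_state.guc_ids" is not linked back to the structure entry
automatically, but kernel-doc does have an explicit member cross-reference
syntax (documented in Documentation/doc-guide/kernel-doc.rst), so the prose
can stay short and still link where it matters. A minimal illustration:

/**
 * DOC: kernel-doc member cross-referencing
 *
 * Writing &intel_guc->submission_state (or &intel_guc.submission_state)
 * in a comment generates a cross-reference to that member in the rendered
 * output; plain "submission_state.guc_ids" stays as ordinary text.
 */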

* Re: [PATCH 03/26] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-10-07 15:19       ` [Intel-gfx] " Matthew Brost
@ 2021-10-07 18:15         ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07 18:15 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On 10/7/2021 08:19, Matthew Brost wrote:
> On Wed, Oct 06, 2021 at 08:45:42PM -0700, John Harrison wrote:
>> On 10/4/2021 15:06, Matthew Brost wrote:
>>> Taking a PM reference to prevent intel_gt_wait_for_idle from short
>>> circuiting while a scheduling of user context could be enabled.
>> I'm not sure what 'while a scheduling of user context could be enabled'
>> means.
>>
> Not really sure how this isn't clear.
>
> It means that if a user context has scheduling enabled, this function
> cannot short-circuit and return idle.
>
> Matt
Okay. The 'a scheduling' was throwing me off. And I was reading 'could 
be enabled' as saying something that might happen in the future. English 
is great at being ambiguous ;). Maybe 'while any user context has 
scheduling enabled' would be simpler?

John.

>   
>> John.
>>
>>> Returning GT idle when it is not can cause all sorts of issues
>>> throughout the stack.
>>>
>>> v2:
>>>    (Daniel Vetter)
>>>     - Add might_lock annotations to pin / unpin function
>>> v3:
>>>    (CI)
>>>     - Drop intel_engine_pm_might_put from unpin path as an async put is
>>>       used
>>> v4:
>>>    (John Harrison)
>>>     - Make intel_engine_pm_might_get/put work with GuC virtual engines
>>>     - Update commit message
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_context.c       |  2 ++
>>>    drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 32 +++++++++++++++++
>>>    drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
>>>    drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
>>>    5 files changed, 89 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
>>> index 1076066f41e0..f601323b939f 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>>> @@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
>>>    	if (err)
>>>    		goto err_post_unpin;
>>> +	intel_engine_pm_might_get(ce->engine);
>>> +
>>>    	if (unlikely(intel_context_is_closed(ce))) {
>>>    		err = -ENOENT;
>>>    		goto err_unlock;
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
>>> index 6fdeae668e6e..d68675925b79 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
>>> @@ -6,9 +6,11 @@
>>>    #ifndef INTEL_ENGINE_PM_H
>>>    #define INTEL_ENGINE_PM_H
>>> +#include "i915_drv.h"
>>>    #include "i915_request.h"
>>>    #include "intel_engine_types.h"
>>>    #include "intel_wakeref.h"
>>> +#include "intel_gt_pm.h"
>>>    static inline bool
>>>    intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
>>> @@ -31,6 +33,21 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
>>>    	return intel_wakeref_get_if_active(&engine->wakeref);
>>>    }
>>> +static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
>>> +{
>>> +	if (!intel_engine_is_virtual(engine)) {
>>> +		intel_wakeref_might_get(&engine->wakeref);
>>> +	} else {
>>> +		struct intel_gt *gt = engine->gt;
>>> +		struct intel_engine_cs *tengine;
>>> +		intel_engine_mask_t tmp, mask = engine->mask;
>>> +
>>> +		for_each_engine_masked(tengine, gt, mask, tmp)
>>> +			intel_wakeref_might_get(&tengine->wakeref);
>>> +	}
>>> +	intel_gt_pm_might_get(engine->gt);
>>> +}
>>> +
>>>    static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
>>>    {
>>>    	intel_wakeref_put(&engine->wakeref);
>>> @@ -52,6 +69,21 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
>>>    	intel_wakeref_unlock_wait(&engine->wakeref);
>>>    }
>>> +static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
>>> +{
>>> +	if (!intel_engine_is_virtual(engine)) {
>>> +		intel_wakeref_might_put(&engine->wakeref);
>>> +	} else {
>>> +		struct intel_gt *gt = engine->gt;
>>> +		struct intel_engine_cs *tengine;
>>> +		intel_engine_mask_t tmp, mask = engine->mask;
>>> +
>>> +		for_each_engine_masked(tengine, gt, mask, tmp)
>>> +			intel_wakeref_might_put(&tengine->wakeref);
>>> +	}
>>> +	intel_gt_pm_might_put(engine->gt);
>>> +}
>>> +
>>>    static inline struct i915_request *
>>>    intel_engine_create_kernel_request(struct intel_engine_cs *engine)
>>>    {
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
>>> index 05de6c1af25b..bc898df7a48c 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
>>> @@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
>>>    	return intel_wakeref_get_if_active(&gt->wakeref);
>>>    }
>>> +static inline void intel_gt_pm_might_get(struct intel_gt *gt)
>>> +{
>>> +	intel_wakeref_might_get(&gt->wakeref);
>>> +}
>>> +
>>>    static inline void intel_gt_pm_put(struct intel_gt *gt)
>>>    {
>>>    	intel_wakeref_put(&gt->wakeref);
>>> @@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
>>>    	intel_wakeref_put_async(&gt->wakeref);
>>>    }
>>> +static inline void intel_gt_pm_might_put(struct intel_gt *gt)
>>> +{
>>> +	intel_wakeref_might_put(&gt->wakeref);
>>> +}
>>> +
>>>    #define with_intel_gt_pm(gt, tmp) \
>>>    	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
>>>    	     intel_gt_pm_put(gt), tmp = 0)
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 17da2fea1bff..8b82da50c2bc 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -1571,7 +1571,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
>>>    static int guc_context_pin(struct intel_context *ce, void *vaddr)
>>>    {
>>> -	return __guc_context_pin(ce, ce->engine, vaddr);
>>> +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
>>> +
>>> +	if (likely(!ret && !intel_context_is_barrier(ce)))
>>> +		intel_engine_pm_get(ce->engine);
>>> +
>>> +	return ret;
>>>    }
>>>    static void guc_context_unpin(struct intel_context *ce)
>>> @@ -1580,6 +1585,9 @@ static void guc_context_unpin(struct intel_context *ce)
>>>    	unpin_guc_id(guc, ce);
>>>    	lrc_unpin(ce);
>>> +
>>> +	if (likely(!intel_context_is_barrier(ce)))
>>> +		intel_engine_pm_put_async(ce->engine);
>>>    }
>>>    static void guc_context_post_unpin(struct intel_context *ce)
>>> @@ -2341,8 +2349,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
>>>    static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
>>>    {
>>>    	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
>>> +	int ret = __guc_context_pin(ce, engine, vaddr);
>>> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
>>> +
>>> +	if (likely(!ret))
>>> +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
>>> +			intel_engine_pm_get(engine);
>>> -	return __guc_context_pin(ce, engine, vaddr);
>>> +	return ret;
>>> +}
>>> +
>>> +static void guc_virtual_context_unpin(struct intel_context *ce)
>>> +{
>>> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
>>> +	struct intel_engine_cs *engine;
>>> +	struct intel_guc *guc = ce_to_guc(ce);
>>> +
>>> +	GEM_BUG_ON(context_enabled(ce));
>>> +	GEM_BUG_ON(intel_context_is_barrier(ce));
>>> +
>>> +	unpin_guc_id(guc, ce);
>>> +	lrc_unpin(ce);
>>> +
>>> +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
>>> +		intel_engine_pm_put_async(engine);
>>>    }
>>>    static void guc_virtual_context_enter(struct intel_context *ce)
>>> @@ -2379,7 +2409,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
>>>    	.pre_pin = guc_virtual_context_pre_pin,
>>>    	.pin = guc_virtual_context_pin,
>>> -	.unpin = guc_context_unpin,
>>> +	.unpin = guc_virtual_context_unpin,
>>>    	.post_unpin = guc_context_post_unpin,
>>>    	.ban = guc_context_ban,
>>> diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
>>> index 545c8f277c46..4f4c2e15e736 100644
>>> --- a/drivers/gpu/drm/i915/intel_wakeref.h
>>> +++ b/drivers/gpu/drm/i915/intel_wakeref.h
>>> @@ -123,6 +123,12 @@ enum {
>>>    	__INTEL_WAKEREF_PUT_LAST_BIT__
>>>    };
>>> +static inline void
>>> +intel_wakeref_might_get(struct intel_wakeref *wf)
>>> +{
>>> +	might_lock(&wf->mutex);
>>> +}
>>> +
>>>    /**
>>>     * intel_wakeref_put_flags: Release the wakeref
>>>     * @wf: the wakeref
>>> @@ -170,6 +176,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
>>>    			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
>>>    }
>>> +static inline void
>>> +intel_wakeref_might_put(struct intel_wakeref *wf)
>>> +{
>>> +	might_lock(&wf->mutex);
>>> +}
>>> +
>>>    /**
>>>     * intel_wakeref_lock: Lock the wakeref (mutex)
>>>     * @wf: the wakeref


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 05/26] drm/i915: Add logical engine mapping
  2021-10-04 22:06   ` Matthew Brost
@ 2021-10-07 19:03     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07 19:03 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Add logical engine mapping. This is required for split-frame, as
> workloads need to be placed on engines in a logically contiguous manner.
>
> v2:
>   (Daniel Vetter)
>    - Add kernel doc for new fields
> v3
>   (Tvrtko)
>    - Update comment for new logical_mask field
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
>   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  7 +++
>   .../drm/i915/gt/intel_execlists_submission.c  |  1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
>   5 files changed, 62 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 2ae57e4656a3..2eb798ad068b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
>   	GEM_DEBUG_WARN_ON(iir);
>   }
>   
> -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
> +			      u8 logical_instance)
>   {
>   	const struct engine_info *info = &intel_engines[id];
>   	struct drm_i915_private *i915 = gt->i915;
> @@ -335,6 +336,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
>   
>   	engine->class = info->class;
>   	engine->instance = info->instance;
> +	engine->logical_mask = BIT(logical_instance);
>   	__sprint_engine_name(engine);
>   
>   	engine->props.heartbeat_interval_ms =
> @@ -588,6 +590,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
>   	return info->engine_mask;
>   }
>   
> +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
> +				 u8 class, const u8 *map, u8 num_instances)
> +{
> +	int i, j;
> +	u8 current_logical_id = 0;
> +
> +	for (j = 0; j < num_instances; ++j) {
> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> +			if (!HAS_ENGINE(gt, i) ||
> +			    intel_engines[i].class != class)
> +				continue;
> +
> +			if (intel_engines[i].instance == map[j]) {
> +				logical_ids[intel_engines[i].instance] =
> +					current_logical_id++;
> +				break;
> +			}
> +		}
> +	}
> +}
> +
> +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
> +{
> +	int i;
> +	u8 map[MAX_ENGINE_INSTANCE + 1];
> +
> +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
> +		map[i] = i;
> +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
> +}
> +
>   /**
>    * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
>    * @gt: pointer to struct intel_gt
> @@ -599,7 +632,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>   	struct drm_i915_private *i915 = gt->i915;
>   	const unsigned int engine_mask = init_engine_mask(gt);
>   	unsigned int mask = 0;
> -	unsigned int i;
> +	unsigned int i, class;
> +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
>   	int err;
>   
>   	drm_WARN_ON(&i915->drm, engine_mask == 0);
> @@ -609,15 +643,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>   	if (i915_inject_probe_failure(i915))
>   		return -ENODEV;
>   
> -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
> -		if (!HAS_ENGINE(gt, i))
> -			continue;
> +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
> +		setup_logical_ids(gt, logical_ids, class);
>   
> -		err = intel_engine_setup(gt, i);
> -		if (err)
> -			goto cleanup;
> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> +			u8 instance = intel_engines[i].instance;
> +
> +			if (intel_engines[i].class != class ||
> +			    !HAS_ENGINE(gt, i))
> +				continue;
>   
> -		mask |= BIT(i);
> +			err = intel_engine_setup(gt, i,
> +						 logical_ids[instance]);
> +			if (err)
> +				goto cleanup;
> +
> +			mask |= BIT(i);
> +		}
>   	}
>   
>   	/*
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 5ae1207c363b..68010da468a4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -269,6 +269,13 @@ struct intel_engine_cs {
>   	unsigned int guc_id;
>   
>   	intel_engine_mask_t mask;
> +	/**
> +	 * @logical_mask: logical mask of engine, reported to user space via
> +	 * query IOCTL and used to communicate with the GuC in logical space.
> +	 * The logical instance of a physical engine can change based on product
> +	 * / fusing and defined in the bspec.
I would use 'and' rather than '/' when it line wraps like that. 
Otherwise, it looks like you tried to end the comment, but failed and 
then kept typing!

Also, not sure about 'and defined in the bspec'. I would just drop that 
line. I think 'based on product and fusing' is sufficient. Otherwise, 
you should be including the bspec link.

With that tweaked:
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

John.

> +	 */
> +	intel_engine_mask_t logical_mask;
>   
>   	u8 class;
>   	u8 instance;
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 7147fe80919e..5ed1e222c308 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -3877,6 +3877,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>   
>   		ve->siblings[ve->num_siblings++] = sibling;
>   		ve->base.mask |= sibling->mask;
> +		ve->base.logical_mask |= sibling->logical_mask;
>   
>   		/*
>   		 * All physical engines must be compatible for their emission
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 2c6ea64af7ec..621c893a009f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
>   	for_each_engine(engine, gt, id) {
>   		u8 guc_class = engine_class_to_guc_class(engine->class);
>   
> -		system_info->mapping_table[guc_class][engine->instance] =
> +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
>   			engine->instance;
>   	}
>   }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 8b82da50c2bc..451d9ae861a6 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1423,23 +1423,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id)
>   	return __guc_action_deregister_context(guc, guc_id);
>   }
>   
> -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
> -{
> -	switch (class) {
> -	case RENDER_CLASS:
> -		return mask >> RCS0;
> -	case VIDEO_ENHANCEMENT_CLASS:
> -		return mask >> VECS0;
> -	case VIDEO_DECODE_CLASS:
> -		return mask >> VCS0;
> -	case COPY_ENGINE_CLASS:
> -		return mask >> BCS0;
> -	default:
> -		MISSING_CASE(class);
> -		return 0;
> -	}
> -}
> -
>   static void guc_context_policy_init(struct intel_engine_cs *engine,
>   				    struct guc_lrc_desc *desc)
>   {
> @@ -1481,8 +1464,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   
>   	desc = __get_lrc_desc(guc, desc_idx);
>   	desc->engine_class = engine_class_to_guc_class(engine->class);
> -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> -						      engine->mask);
> +	desc->engine_submit_mask = engine->logical_mask;
>   	desc->hw_context_desc = ce->lrc.lrca;
>   	desc->priority = ce->guc_state.prio;
>   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> @@ -3271,6 +3253,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>   		}
>   
>   		ve->base.mask |= sibling->mask;
> +		ve->base.logical_mask |= sibling->logical_mask;
>   
>   		if (n != 0 && ve->base.class != sibling->class) {
>   			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",


^ permalink raw reply	[flat|nested] 165+ messages in thread
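
To make the renumbering above concrete: logical ids compact the physical
instances of a class into a contiguous range, so with (hypothetically) only
VCS0 and VCS2 present the mapping is VCS0 -> logical 0, VCS2 -> logical 1.
A standalone userspace sketch of the same idea (the fusing pattern is
invented for illustration):

#include <stdio.h>

#define MAX_INSTANCE 4

/* 1 = engine present, 0 = fused off (hypothetical fusing). */
static const int vcs_present[MAX_INSTANCE] = { 1, 0, 1, 0 };

int main(void)
{
	int logical = 0;

	/*
	 * Walk physical instances in order; each present engine takes
	 * the next logical id, so logical ids are always contiguous
	 * regardless of which instances are fused off.
	 */
	for (int phys = 0; phys < MAX_INSTANCE; phys++) {
		if (!vcs_present[phys])
			continue;
		printf("VCS%d -> logical %d (logical_mask bit 0x%x)\n",
		       phys, logical, 1 << logical);
		logical++;
	}

	return 0;
}

This prints VCS0 -> logical 0 and VCS2 -> logical 1, which matches what
guc_mapping_table_init() consumes via ilog2(engine->logical_mask) in the
patch above.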

* Re: [PATCH 07/26] drm/i915/guc: Introduce context parent-child relationship
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-07 19:35     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07 19:35 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Introduce context parent-child relationship. Once this relationship is
> created, all pinning / unpinning operations are directed to the parent
> context. The parent context is responsible for pinning all of its'
No need for an apostrophe.

> children and itself.
>
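As an illustrative sketch (the creation loop below is hypothetical; only
the helpers come from this patch), a width-N configuration is wired up
once at creation time:

	struct intel_context *parent = contexts[0];
	int i;

	/* wire up the relationship; nothing may be pinned yet */
	for (i = 1; i < width; ++i)
		intel_context_bind_parent_child(parent, contexts[i]);

	/*
	 * From here on pinning the parent pins every child with it, and
	 * intel_context_to_parent() resolves a (pinned) child back to
	 * the context that owns pinning and scheduling.
	 */
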
> This is a precursor to the full GuC multi-lrc implementation but aligns
> to how the GuC multi-lrc interface is defined - a single H2G is used to
> register / deregister all of the contexts simultaneously.
>
> Subsequent patches in the series will implement the pinning / unpinning
> operations for parent / child contexts.
>
> v2:
>   (Daniel Vetter)
>    - Add kernel doc, add wrapper to access parent to ensure safety
> v3:
>   (John Harrison)
>    - Fix comment explaining GEM_BUG_ON in to_parent()
>    - Make variable names generic (non-GuC specific)
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       | 29 +++++++++++++
>   drivers/gpu/drm/i915/gt/intel_context.h       | 41 +++++++++++++++++++
>   drivers/gpu/drm/i915/gt/intel_context_types.h | 21 ++++++++++
>   3 files changed, 91 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index f601323b939f..c5bb7ccfb3f8 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -403,6 +403,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   
>   	INIT_LIST_HEAD(&ce->destroyed_link);
>   
> +	INIT_LIST_HEAD(&ce->parallel.child_list);
> +
>   	/*
>   	 * Initialize fence to be complete as this is expected to be complete
>   	 * unless there is a pending schedule disable outstanding.
> @@ -417,10 +419,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   
>   void intel_context_fini(struct intel_context *ce)
>   {
> +	struct intel_context *child, *next;
> +
>   	if (ce->timeline)
>   		intel_timeline_put(ce->timeline);
>   	i915_vm_put(ce->vm);
>   
> +	/* Need to put the creation ref for the children */
> +	if (intel_context_is_parent(ce))
> +		for_each_child_safe(ce, child, next)
> +			intel_context_put(child);
> +
>   	mutex_destroy(&ce->pin_mutex);
>   	i915_active_fini(&ce->active);
>   	i915_sw_fence_fini(&ce->guc_state.blocked);
> @@ -537,6 +546,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>   	return active;
>   }
>   
> +void intel_context_bind_parent_child(struct intel_context *parent,
> +				     struct intel_context *child)
> +{
> +	/*
> +	 * Caller's responsibility to validate that this function is used
> +	 * correctly but we use GEM_BUG_ON here to ensure that they do.
> +	 */
> +	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
> +	GEM_BUG_ON(intel_context_is_pinned(parent));
> +	GEM_BUG_ON(intel_context_is_child(parent));
> +	GEM_BUG_ON(intel_context_is_pinned(child));
> +	GEM_BUG_ON(intel_context_is_child(child));
> +	GEM_BUG_ON(intel_context_is_parent(child));
> +
> +	parent->parallel.number_children++;
> +	list_add_tail(&child->parallel.child_link,
> +		      &parent->parallel.child_list);
> +	child->parallel.parent = parent;
> +}
> +
>   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>   #include "selftest_context.c"
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index c41098950746..b63c10a144af 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -44,6 +44,47 @@ void intel_context_free(struct intel_context *ce);
>   int intel_context_reconfigure_sseu(struct intel_context *ce,
>   				   const struct intel_sseu sseu);
>   
> +static inline bool intel_context_is_child(struct intel_context *ce)
> +{
> +	return !!ce->parallel.parent;
> +}
> +
> +static inline bool intel_context_is_parent(struct intel_context *ce)
> +{
> +	return !!ce->parallel.number_children;
> +}
> +
> +static inline bool intel_context_is_pinned(struct intel_context *ce);
> +
> +static inline struct intel_context *
> +intel_context_to_parent(struct intel_context *ce)
> +{
> +	if (intel_context_is_child(ce)) {
> +		/*
> +		 * The parent holds ref count to the child so it is always safe
> +		 * for the parent to access the child, but the child has a
> +		 * pointer to the parent without a ref. To ensure this is safe
> +		 * the child should only access the parent pointer while the
> +		 * parent is pinned.
> +		 */
> +		GEM_BUG_ON(!intel_context_is_pinned(ce->parallel.parent));
> +
> +		return ce->parallel.parent;
> +	} else {
> +		return ce;
> +	}
> +}
> +
> +void intel_context_bind_parent_child(struct intel_context *parent,
> +				     struct intel_context *child);
> +
> +#define for_each_child(parent, ce)\
> +	list_for_each_entry(ce, &(parent)->parallel.child_list,\
> +			    parallel.child_link)
> +#define for_each_child_safe(parent, ce, cn)\
> +	list_for_each_entry_safe(ce, cn, &(parent)->parallel.child_list,\
> +				 parallel.child_link)
> +
>   /**
>    * intel_context_lock_pinned - Stabilises the 'pinned' status of the HW context
>    * @ce - the context
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 4613d027cbc3..76dfca57cb45 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -220,6 +220,27 @@ struct intel_context {
>   	 */
>   	struct list_head destroyed_link;
>   
> +	/** @parallel: sub-structure for parallel submission members */
> +	struct {
> +		union {
> +			/**
> +			 * @child_list: parent's list of children
> +			 * contexts, no protection as immutable after context
> +			 * creation
> +			 */
> +			struct list_head child_list;
> +			/**
> +			 * @child_link: child's link into parent's list of
> +			 * children
> +			 */
> +			struct list_head child_link;
> +		};
> +		/** @parent: pointer to parent if child */
> +		struct intel_context *parent;
> +		/** @number_children: number of children if parent */
> +		u8 number_children;
Is there any particular reason for using 'u8'? A simple 'int' can be 
much more efficient depending upon the host CPU architecture.

Not a blocker though. So with the typo above fixed:
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> +	} parallel;
> +
>   #ifdef CONFIG_DRM_I915_SELFTEST
>   	/**
>   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 08/26] drm/i915/guc: Add multi-lrc context registration
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-07 19:50     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07 19:50 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Add multi-lrc context registration H2G. In addition, a workqueue and
> process descriptor are set up during multi-lrc context registration as
> these data structures are needed for multi-lrc submission.
>
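For orientation, the registration H2G that results carries the parent's
descriptor plus one entry per child - roughly (sketch only, mirroring
__guc_action_register_multi_lrc() in the patch below):

	u32 action[] = {
		INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC,
		guc_id,					/* parent's id */
		ce->parallel.number_children + 1,	/* context count */
		offset,					/* parent's guc_lrc_desc */
		/* ...one guc_lrc_desc offset per child follows... */
	};
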
> v2:
>   (John Harrison)
>    - Move GuC specific fields into sub-struct
>    - Clean up WQ defines
>    - Add comment explaining math to derive WQ / PD address
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
>   drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 -
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 +++++++++++++++++-
>   5 files changed, 131 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 76dfca57cb45..48decb5ee954 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -239,6 +239,18 @@ struct intel_context {
>   		struct intel_context *parent;
>   		/** @number_children: number of children if parent */
>   		u8 number_children;
> +		/** @guc: GuC specific members for parallel submission */
> +		struct {
> +			/** @wqi_head: head pointer in work queue */
> +			u16 wqi_head;
> +			/** @wqi_tail: tail pointer in work queue */
> +			u16 wqi_tail;
> +			/**
> +			 * @parent_page: page in context state (ce->state) used
> +			 * by parent for work queue, process descriptor
> +			 */
> +			u8 parent_page;
> +		} guc;
>   	} parallel;
>   
>   #ifdef CONFIG_DRM_I915_SELFTEST
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index 3ef9eaf8c50e..57339d5c1fc8 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -942,6 +942,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
>   		context_size += PAGE_SIZE;
>   	}
>   
> +	if (intel_context_is_parent(ce) && intel_engine_uses_guc(engine)) {
> +		ce->parallel.guc.parent_page = context_size / PAGE_SIZE;
> +		context_size += PAGE_SIZE;
> +	}
> +
>   	obj = i915_gem_object_create_lmem(engine->i915, context_size,
>   					  I915_BO_ALLOC_PM_VOLATILE);
>   	if (IS_ERR(obj))
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index 8ff582222aff..ba10bd374cee 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -142,6 +142,7 @@ enum intel_guc_action {
>   	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
>   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
> +	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
>   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>   	INTEL_GUC_ACTION_LIMIT
>   };
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index fa4be13c8854..0eeb2a9feeed 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -52,8 +52,6 @@
>   
>   #define GUC_DOORBELL_INVALID		256
>   
> -#define GUC_WQ_SIZE			(PAGE_SIZE * 2)
> -
>   /* Work queue item header definitions */
>   #define WQ_STATUS_ACTIVE		1
>   #define WQ_STATUS_SUSPENDED		2
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 451d9ae861a6..ab6d7fc1b0b1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -344,6 +344,45 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
>   	return rb_entry(rb, struct i915_priolist, node);
>   }
>   
> +/*
> + * When using multi-lrc submission an extra page in the context state is
> + * reserved for the process descriptor and work queue.
> + *
> + * The layout of this page is below:
> + * 0						guc_process_desc
> + * ...						unused
> + * PAGE_SIZE / 2				work queue start
> + * ...						work queue
> + * PAGE_SIZE - 1				work queue end
> + */
> +#define WQ_SIZE			(PAGE_SIZE / 2)
> +#define WQ_OFFSET		(PAGE_SIZE - WQ_SIZE)
I thought you were going with '#define PARENT_SCRATCH_SIZE PAGE_SIZE' 
and then using that everywhere else? Unless there is a fundamental 
reason why the above must be exactly a page in size then I think the 
size should be defined once and re-used rather than assumed in multiple 
places (including in the description comment).
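
Something like the below, say (using the PARENT_SCRATCH_SIZE name from
that earlier discussion; illustrative only):

	#define PARENT_SCRATCH_SIZE	PAGE_SIZE
	#define WQ_SIZE			(PARENT_SCRATCH_SIZE / 2)
	#define WQ_OFFSET		(PARENT_SCRATCH_SIZE - WQ_SIZE)

so the scratch size is stated once and the layout comment,
__lrc_alloc_state() and the WQ maths all derive from it.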

> +static u32 __get_process_desc_offset(struct intel_context *ce)
> +{
> +	GEM_BUG_ON(!ce->parallel.guc.parent_page);
> +
> +	return ce->parallel.guc.parent_page * PAGE_SIZE;
> +}
> +
> +static u32 __get_wq_offset(struct intel_context *ce)
> +{
> +	return __get_process_desc_offset(ce) + WQ_OFFSET;
> +}
> +
> +static struct guc_process_desc *
> +__get_process_desc(struct intel_context *ce)
> +{
> +	/*
> +	 * Need to subtract LRC_STATE_OFFSET here as the
> +	 * parallel.guc.parent_page is the offset into ce->state while
> +	 * ce->lrc_reg_state is ce->state + LRC_STATE_OFFSET.
> +	 */
> +	return (struct guc_process_desc *)
> +		(ce->lrc_reg_state +
> +		 ((__get_process_desc_offset(ce) -
> +		   LRC_STATE_OFFSET) / sizeof(u32)));
> +}
> +
>   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>   {
>   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> @@ -1365,6 +1404,30 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
> +static int __guc_action_register_multi_lrc(struct intel_guc *guc,
> +					   struct intel_context *ce,
> +					   u32 guc_id,
> +					   u32 offset,
> +					   bool loop)
> +{
> +	struct intel_context *child;
> +	u32 action[4 + MAX_ENGINE_INSTANCE];
> +	int len = 0;
> +
> +	GEM_BUG_ON(ce->parallel.number_children > MAX_ENGINE_INSTANCE);
> +
> +	action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
> +	action[len++] = guc_id;
> +	action[len++] = ce->parallel.number_children + 1;
> +	action[len++] = offset;
> +	for_each_child(ce, child) {
> +		offset += sizeof(struct guc_lrc_desc);
> +		action[len++] = offset;
> +	}
> +
> +	return guc_submission_send_busy_loop(guc, action, len, 0, loop);
> +}
> +
>   static int __guc_action_register_context(struct intel_guc *guc,
>   					 u32 guc_id,
>   					 u32 offset,
> @@ -1387,9 +1450,15 @@ static int register_context(struct intel_context *ce, bool loop)
>   		ce->guc_id.id * sizeof(struct guc_lrc_desc);
>   	int ret;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   	trace_intel_context_register(ce);
>   
> -	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
> +	if (intel_context_is_parent(ce))
> +		ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
> +						      offset, loop);
> +	else
> +		ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
> +						    loop);
>   	if (likely(!ret)) {
>   		unsigned long flags;
>   
> @@ -1418,6 +1487,7 @@ static int deregister_context(struct intel_context *ce, u32 guc_id)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   	trace_intel_context_deregister(ce);
>   
>   	return __guc_action_deregister_context(guc, guc_id);
> @@ -1445,6 +1515,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	struct guc_lrc_desc *desc;
>   	bool context_registered;
>   	intel_wakeref_t wakeref;
> +	struct intel_context *child;
>   	int ret = 0;
>   
>   	GEM_BUG_ON(!engine->mask);
> @@ -1470,6 +1541,41 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>   	guc_context_policy_init(engine, desc);
>   
> +	/*
> +	 * Context is a parent, we need to register a process descriptor
> +	 * describing a work queue and register all child contexts.
> +	 */
Wasn't this meant to say 'If the context is a parent...'?

John.

> +	if (intel_context_is_parent(ce)) {
> +		struct guc_process_desc *pdesc;
> +
> +		ce->parallel.guc.wqi_tail = 0;
> +		ce->parallel.guc.wqi_head = 0;
> +
> +		desc->process_desc = i915_ggtt_offset(ce->state) +
> +			__get_process_desc_offset(ce);
> +		desc->wq_addr = i915_ggtt_offset(ce->state) +
> +			__get_wq_offset(ce);
> +		desc->wq_size = WQ_SIZE;
> +
> +		pdesc = __get_process_desc(ce);
> +		memset(pdesc, 0, sizeof(*(pdesc)));
> +		pdesc->stage_id = ce->guc_id.id;
> +		pdesc->wq_base_addr = desc->wq_addr;
> +		pdesc->wq_size_bytes = desc->wq_size;
> +		pdesc->wq_status = WQ_STATUS_ACTIVE;
> +
> +		for_each_child(ce, child) {
> +			desc = __get_lrc_desc(guc, child->guc_id.id);
> +
> +			desc->engine_class =
> +				engine_class_to_guc_class(engine->class);
> +			desc->hw_context_desc = child->lrc.lrca;
> +			desc->priority = ce->guc_state.prio;
> +			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> +			guc_context_policy_init(engine, desc);
> +		}
> +	}
> +
>   	/*
>   	 * The context_lookup xarray is used to determine if the hardware
>   	 * context is currently registered. There are two cases in which it
> @@ -2804,6 +2910,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>   		return NULL;
>   	}
>   
> +	if (unlikely(intel_context_is_child(ce))) {
> +		drm_err(&guc_to_gt(guc)->i915->drm,
> +			"Context is child, desc_idx %u", desc_idx);
> +		return NULL;
> +	}
> +
>   	return ce;
>   }
>   


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 09/26] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-07 20:23     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07 20:23 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> In GuC parent-child contexts the parent context controls the scheduling;
> ensure only the parent performs the scheduling operations.
>
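In effect (sketch, using the helper added below), any path keyed off a
request resolves to the scheduling context first:

	struct intel_context *ce = request_to_scheduling_context(rq);

	/* ce is rq->context itself, or its parent if rq ran on a child */
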
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 13 ++++++++++++-
>   1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ab6d7fc1b0b1..1f2809187513 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -324,6 +324,12 @@ static inline void decr_context_committed_requests(struct intel_context *ce)
>   	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
>   }
>   
> +static struct intel_context *
> +request_to_scheduling_context(struct i915_request *rq)
> +{
> +	return intel_context_to_parent(rq->context);
> +}
> +
>   static inline bool context_guc_id_invalid(struct intel_context *ce)
>   {
>   	return ce->guc_id.id == GUC_INVALID_LRC_ID;
> @@ -1710,6 +1716,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
>   
>   	GEM_BUG_ON(guc_id == GUC_INVALID_LRC_ID);
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   	trace_intel_context_sched_disable(ce);
>   
>   	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action),
> @@ -1935,6 +1942,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   	intel_wakeref_t wakeref;
>   	u16 guc_id;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   
>   	/*
> @@ -2303,6 +2312,8 @@ static void guc_signal_context_fence(struct intel_context *ce)
>   {
>   	unsigned long flags;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   	clr_context_wait_for_deregister_to_register(ce);
>   	__guc_signal_context_fence(ce);
> @@ -2333,7 +2344,7 @@ static void guc_context_init(struct intel_context *ce)
>   
>   static int guc_request_alloc(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	struct intel_guc *guc = ce_to_guc(ce);
>   	unsigned long flags;
>   	int ret;


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 10/26] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-07 22:03     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07 22:03 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Assign contexts in parent-child relationship consecutive guc_ids. This
> is accomplished by partitioning guc_id space between ones that need to
> be consecutive (1/16 of available guc_ids) and ones that do not (15/16 of
> available guc_ids). The consecutive search is implemented via the bitmap
> API.
>
> This is a precursor to the full GuC multi-lrc implementation but aligns
> to how the GuC multi-lrc interface is defined - guc_ids must be consecutive
> when using the GuC multi-lrc interface.
>
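As a worked example (illustrative only): a parent with three children
needs four consecutive guc_ids, so the allocation becomes a region of
order order_base_2(3 + 1) = 2:

	/* reserves 2^2 = 4 aligned, consecutive ids */
	ret = bitmap_find_free_region(guc_ids_bitmap,
				      NUMBER_MULTI_LRC_GUC_ID,
				      order_base_2(3 + 1));

Note that bitmap_find_free_region() hands out power-of-two sized,
naturally aligned regions, so a parent with two children also consumes
four ids.
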
> v2:
>   (Daniel Vetter)
>    - Explicitly state why we assign consecutive guc_ids
> v3:
>   (John Harrison)
>    - Bring back in spin lock
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 104 ++++++++++++++----
>   2 files changed, 86 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 25a598e2b6e8..a9f4ec972bfb 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -76,9 +76,13 @@ struct intel_guc {
>   		 */
>   		spinlock_t lock;
>   		/**
> -		 * @guc_ids: used to allocate new guc_ids
> +		 * @guc_ids: used to allocate new guc_ids, single-lrc
>   		 */
>   		struct ida guc_ids;
> +		/**
> +		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
> +		 */
> +		unsigned long *guc_ids_bitmap;
>   		/**
>   		 * @guc_id_list: list of intel_context with valid guc_ids but no
>   		 * refs
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 1f2809187513..79e7732e83b2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -128,6 +128,16 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
>   
>   #define GUC_REQUEST_SIZE 64 /* bytes */
>   
> +/*
> + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
> + * per the GuC submission interface. A different allocation algorithm is used
> + * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
> + * partition the guc_id space. We believe the number of multi-lrc contexts in
> + * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
> + * multi-lrc.
> + */
> +#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
> +
>   /*
>    * Below is a set of functions which control the GuC scheduling state which
>    * require a lock.
> @@ -1206,6 +1216,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_WORK(&guc->submission_state.destroyed_worker,
>   		  destroyed_worker_func);
>   
> +	guc->submission_state.guc_ids_bitmap =
> +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
> +	if (!guc->submission_state.guc_ids_bitmap)
> +		return -ENOMEM;
> +
>   	return 0;
>   }
>   
> @@ -1217,6 +1232,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>   	guc_lrc_desc_pool_destroy(guc);
>   	guc_flush_destroyed_contexts(guc);
>   	i915_sched_engine_put(guc->sched_engine);
> +	bitmap_free(guc->submission_state.guc_ids_bitmap);
>   }
>   
>   static inline void queue_request(struct i915_sched_engine *sched_engine,
> @@ -1268,18 +1284,43 @@ static void guc_submit_request(struct i915_request *rq)
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> -static int new_guc_id(struct intel_guc *guc)
> +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
> -	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> -			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
> -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +	int ret;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (intel_context_is_parent(ce))
> +		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
> +					      NUMBER_MULTI_LRC_GUC_ID,
> +					      order_base_2(ce->parallel.number_children
> +							   + 1));
> +	else
> +		ret = ida_simple_get(&guc->submission_state.guc_ids,
> +				     NUMBER_MULTI_LRC_GUC_ID,
> +				     GUC_MAX_LRC_DESCRIPTORS,
> +				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
> +				     __GFP_NOWARN);
> +	if (unlikely(ret < 0))
> +		return ret;
> +
> +	ce->guc_id.id = ret;
> +	return 0;
>   }
>   
>   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	if (!context_guc_id_invalid(ce)) {
> -		ida_simple_remove(&guc->submission_state.guc_ids,
> -				  ce->guc_id.id);
> +		if (intel_context_is_parent(ce))
> +			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
> +					      ce->guc_id.id,
> +					      order_base_2(ce->parallel.number_children
> +							   + 1));
There was a discussion on the previous revision about adding a BUG_ON to 
ensure that number_children cannot change between the bitmap alloc and 
the bitmap release. I'm not seeing the new BUG_ON mentioned in this patch.
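
One possible shape for it, as a sketch (the saved order field is
hypothetical, not in this patch):

	/* on alloc: ce->parallel.guc_id_order = order_base_2(n + 1); */
	GEM_BUG_ON(order_base_2(ce->parallel.number_children + 1) !=
		   ce->parallel.guc_id_order);

i.e. anything that proves the region being released is the same size as
the region that was allocated.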

John.


> +		else
> +			ida_simple_remove(&guc->submission_state.guc_ids,
> +					  ce->guc_id.id);
>   		reset_lrc_desc(guc, ce->guc_id.id);
>   		set_context_guc_id_invalid(ce);
>   	}
> @@ -1296,49 +1337,64 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
> -static int steal_guc_id(struct intel_guc *guc)
> +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
> -	struct intel_context *ce;
> -	int guc_id;
> +	struct intel_context *cn;
>   
>   	lockdep_assert_held(&guc->submission_state.lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(intel_context_is_parent(ce));
>   
>   	if (!list_empty(&guc->submission_state.guc_id_list)) {
> -		ce = list_first_entry(&guc->submission_state.guc_id_list,
> +		cn = list_first_entry(&guc->submission_state.guc_id_list,
>   				      struct intel_context,
>   				      guc_id.link);
>   
> -		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> -		GEM_BUG_ON(context_guc_id_invalid(ce));
> +		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
> +		GEM_BUG_ON(context_guc_id_invalid(cn));
> +		GEM_BUG_ON(intel_context_is_child(cn));
> +		GEM_BUG_ON(intel_context_is_parent(cn));
>   
> -		list_del_init(&ce->guc_id.link);
> -		guc_id = ce->guc_id.id;
> +		list_del_init(&cn->guc_id.link);
> +		ce->guc_id = cn->guc_id;
>   
>   		spin_lock(&ce->guc_state.lock);
> -		clr_context_registered(ce);
> +		clr_context_registered(cn);
>   		spin_unlock(&ce->guc_state.lock);
>   
> -		set_context_guc_id_invalid(ce);
> -		return guc_id;
> +		set_context_guc_id_invalid(cn);
> +
> +		return 0;
>   	} else {
>   		return -EAGAIN;
>   	}
>   }
>   
> -static int assign_guc_id(struct intel_guc *guc, u16 *out)
> +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	int ret;
>   
>   	lockdep_assert_held(&guc->submission_state.lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
> -	ret = new_guc_id(guc);
> +	ret = new_guc_id(guc, ce);
>   	if (unlikely(ret < 0)) {
> -		ret = steal_guc_id(guc);
> +		if (intel_context_is_parent(ce))
> +			return -ENOSPC;
> +
> +		ret = steal_guc_id(guc, ce);
>   		if (ret < 0)
>   			return ret;
>   	}
>   
> -	*out = ret;
> +	if (intel_context_is_parent(ce)) {
> +		struct intel_context *child;
> +		int i = 1;
> +
> +		for_each_child(ce, child)
> +			child->guc_id.id = ce->guc_id.id + i++;
> +	}
> +
>   	return 0;
>   }
>   
> @@ -1356,7 +1412,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	might_lock(&ce->guc_state.lock);
>   
>   	if (context_guc_id_invalid(ce)) {
> -		ret = assign_guc_id(guc, &ce->guc_id.id);
> +		ret = assign_guc_id(guc, ce);
>   		if (ret)
>   			goto out_unlock;
>   		ret = 1;	/* Indicates newly assigned guc_id */
> @@ -1398,8 +1454,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	unsigned long flags;
>   
>   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
> -	if (unlikely(context_guc_id_invalid(ce)))
> +	if (unlikely(context_guc_id_invalid(ce) ||
> +		     intel_context_is_parent(ce)))
>   		return;
>   
>   	spin_lock_irqsave(&guc->submission_state.lock, flags);


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 10/26] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
@ 2021-10-07 22:03     ` John Harrison
  0 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-07 22:03 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Assign contexts in parent-child relationship consecutive guc_ids. This
> is accomplished by partitioning guc_id space between ones that need to
> be consecutive (1/16 of available guc_ids) and ones that do not (15/16 of
> available guc_ids). The consecutive search is implemented via the bitmap
> API.
>
> This is a precursor to the full GuC multi-lrc implementation but aligns
> to how the GuC multi-lrc interface is defined - guc_ids must be consecutive
> when using the GuC multi-lrc interface.
>
> v2:
>   (Daniel Vetter)
>    - Explicitly state why we assign consecutive guc_ids
> v3:
>   (John Harrison)
>    - Bring back in spin lock
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 104 ++++++++++++++----
>   2 files changed, 86 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 25a598e2b6e8..a9f4ec972bfb 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -76,9 +76,13 @@ struct intel_guc {
>   		 */
>   		spinlock_t lock;
>   		/**
> -		 * @guc_ids: used to allocate new guc_ids
> +		 * @guc_ids: used to allocate new guc_ids, single-lrc
>   		 */
>   		struct ida guc_ids;
> +		/**
> +		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
> +		 */
> +		unsigned long *guc_ids_bitmap;
>   		/**
>   		 * @guc_id_list: list of intel_context with valid guc_ids but no
>   		 * refs
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 1f2809187513..79e7732e83b2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -128,6 +128,16 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
>   
>   #define GUC_REQUEST_SIZE 64 /* bytes */
>   
> +/*
> + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
> + * per the GuC submission interface. A different allocation algorithm is used
> + * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
> + * partition the guc_id space. We believe the number of multi-lrc contexts in
> + * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
> + * multi-lrc.
> + */
> +#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
> +
>   /*
>    * Below is a set of functions which control the GuC scheduling state which
>    * require a lock.
> @@ -1206,6 +1216,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_WORK(&guc->submission_state.destroyed_worker,
>   		  destroyed_worker_func);
>   
> +	guc->submission_state.guc_ids_bitmap =
> +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
> +	if (!guc->submission_state.guc_ids_bitmap)
> +		return -ENOMEM;
> +
>   	return 0;
>   }
>   
> @@ -1217,6 +1232,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>   	guc_lrc_desc_pool_destroy(guc);
>   	guc_flush_destroyed_contexts(guc);
>   	i915_sched_engine_put(guc->sched_engine);
> +	bitmap_free(guc->submission_state.guc_ids_bitmap);
>   }
>   
>   static inline void queue_request(struct i915_sched_engine *sched_engine,
> @@ -1268,18 +1284,43 @@ static void guc_submit_request(struct i915_request *rq)
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> -static int new_guc_id(struct intel_guc *guc)
> +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
> -	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> -			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
> -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +	int ret;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (intel_context_is_parent(ce))
> +		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
> +					      NUMBER_MULTI_LRC_GUC_ID,
> +					      order_base_2(ce->parallel.number_children
> +							   + 1));
> +	else
> +		ret = ida_simple_get(&guc->submission_state.guc_ids,
> +				     NUMBER_MULTI_LRC_GUC_ID,
> +				     GUC_MAX_LRC_DESCRIPTORS,
> +				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
> +				     __GFP_NOWARN);
> +	if (unlikely(ret < 0))
> +		return ret;
> +
> +	ce->guc_id.id = ret;
> +	return 0;
>   }
>   
>   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	if (!context_guc_id_invalid(ce)) {
> -		ida_simple_remove(&guc->submission_state.guc_ids,
> -				  ce->guc_id.id);
> +		if (intel_context_is_parent(ce))
> +			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
> +					      ce->guc_id.id,
> +					      order_base_2(ce->parallel.number_children
> +							   + 1));
There was a discussion on the previous revision about adding a BUG_ON to 
ensure that number_children cannot change between the bitmap alloc and 
the bitmap release. I'm not seeing the new BUG_ON mentioned in this patch.

John.


> +		else
> +			ida_simple_remove(&guc->submission_state.guc_ids,
> +					  ce->guc_id.id);
>   		reset_lrc_desc(guc, ce->guc_id.id);
>   		set_context_guc_id_invalid(ce);
>   	}
> @@ -1296,49 +1337,64 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
> -static int steal_guc_id(struct intel_guc *guc)
> +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
> -	struct intel_context *ce;
> -	int guc_id;
> +	struct intel_context *cn;
>   
>   	lockdep_assert_held(&guc->submission_state.lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(intel_context_is_parent(ce));
>   
>   	if (!list_empty(&guc->submission_state.guc_id_list)) {
> -		ce = list_first_entry(&guc->submission_state.guc_id_list,
> +		cn = list_first_entry(&guc->submission_state.guc_id_list,
>   				      struct intel_context,
>   				      guc_id.link);
>   
> -		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> -		GEM_BUG_ON(context_guc_id_invalid(ce));
> +		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
> +		GEM_BUG_ON(context_guc_id_invalid(cn));
> +		GEM_BUG_ON(intel_context_is_child(cn));
> +		GEM_BUG_ON(intel_context_is_parent(cn));
>   
> -		list_del_init(&ce->guc_id.link);
> -		guc_id = ce->guc_id.id;
> +		list_del_init(&cn->guc_id.link);
> +		ce->guc_id = cn->guc_id;
>   
>   		spin_lock(&ce->guc_state.lock);
> -		clr_context_registered(ce);
> +		clr_context_registered(cn);
>   		spin_unlock(&ce->guc_state.lock);
>   
> -		set_context_guc_id_invalid(ce);
> -		return guc_id;
> +		set_context_guc_id_invalid(cn);
> +
> +		return 0;
>   	} else {
>   		return -EAGAIN;
>   	}
>   }
>   
> -static int assign_guc_id(struct intel_guc *guc, u16 *out)
> +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	int ret;
>   
>   	lockdep_assert_held(&guc->submission_state.lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
> -	ret = new_guc_id(guc);
> +	ret = new_guc_id(guc, ce);
>   	if (unlikely(ret < 0)) {
> -		ret = steal_guc_id(guc);
> +		if (intel_context_is_parent(ce))
> +			return -ENOSPC;
> +
> +		ret = steal_guc_id(guc, ce);
>   		if (ret < 0)
>   			return ret;
>   	}
>   
> -	*out = ret;
> +	if (intel_context_is_parent(ce)) {
> +		struct intel_context *child;
> +		int i = 1;
> +
> +		for_each_child(ce, child)
> +			child->guc_id.id = ce->guc_id.id + i++;
> +	}
> +
>   	return 0;
>   }
>   
> @@ -1356,7 +1412,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	might_lock(&ce->guc_state.lock);
>   
>   	if (context_guc_id_invalid(ce)) {
> -		ret = assign_guc_id(guc, &ce->guc_id.id);
> +		ret = assign_guc_id(guc, ce);
>   		if (ret)
>   			goto out_unlock;
>   		ret = 1;	/* Indidcates newly assigned guc_id */
> @@ -1398,8 +1454,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	unsigned long flags;
>   
>   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
> -	if (unlikely(context_guc_id_invalid(ce)))
> +	if (unlikely(context_guc_id_invalid(ce) ||
> +		     intel_context_is_parent(ce)))
>   		return;
>   
>   	spin_lock_irqsave(&guc->submission_state.lock, flags);


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 10/26] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-10-07 22:03     ` [Intel-gfx] " John Harrison
@ 2021-10-08  1:21       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-08  1:21 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Thu, Oct 07, 2021 at 03:03:04PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Assign contexts in parent-child relationship consecutive guc_ids. This
> > is accomplished by partitioning guc_id space between ones that need to
> > be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
> > available guc_ids). The consecutive search is implemented via the bitmap
> > API.
> > 
> > This is a precursor to the full GuC multi-lrc implementation but aligns
> > to how GuC mutli-lrc interface is defined - guc_ids must be consecutive
> > when using the GuC multi-lrc interface.
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Explicitly state why we assign consecutive guc_ids
> > v3:
> >   (John Harrison)
> >    - Bring back in spin lock
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 104 ++++++++++++++----
> >   2 files changed, 86 insertions(+), 24 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 25a598e2b6e8..a9f4ec972bfb 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -76,9 +76,13 @@ struct intel_guc {
> >   		 */
> >   		spinlock_t lock;
> >   		/**
> > -		 * @guc_ids: used to allocate new guc_ids
> > +		 * @guc_ids: used to allocate new guc_ids, single-lrc
> >   		 */
> >   		struct ida guc_ids;
> > +		/**
> > +		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
> > +		 */
> > +		unsigned long *guc_ids_bitmap;
> >   		/**
> >   		 * @guc_id_list: list of intel_context with valid guc_ids but no
> >   		 * refs
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 1f2809187513..79e7732e83b2 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -128,6 +128,16 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> >   #define GUC_REQUEST_SIZE 64 /* bytes */
> > +/*
> > + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
> > + * per the GuC submission interface. A different allocation algorithm is used
> > + * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
> > + * partition the guc_id space. We believe the number of multi-lrc contexts in
> > + * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
> > + * multi-lrc.
> > + */
> > +#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
> > +
> >   /*
> >    * Below is a set of functions which control the GuC scheduling state which
> >    * require a lock.
> > @@ -1206,6 +1216,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	INIT_WORK(&guc->submission_state.destroyed_worker,
> >   		  destroyed_worker_func);
> > +	guc->submission_state.guc_ids_bitmap =
> > +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
> > +	if (!guc->submission_state.guc_ids_bitmap)
> > +		return -ENOMEM;
> > +
> >   	return 0;
> >   }
> > @@ -1217,6 +1232,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> >   	guc_lrc_desc_pool_destroy(guc);
> >   	guc_flush_destroyed_contexts(guc);
> >   	i915_sched_engine_put(guc->sched_engine);
> > +	bitmap_free(guc->submission_state.guc_ids_bitmap);
> >   }
> >   static inline void queue_request(struct i915_sched_engine *sched_engine,
> > @@ -1268,18 +1284,43 @@ static void guc_submit_request(struct i915_request *rq)
> >   	spin_unlock_irqrestore(&sched_engine->lock, flags);
> >   }
> > -static int new_guc_id(struct intel_guc *guc)
> > +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> > -	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> > -			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
> > -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > +	int ret;
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> > +	if (intel_context_is_parent(ce))
> > +		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
> > +					      NUMBER_MULTI_LRC_GUC_ID,
> > +					      order_base_2(ce->parallel.number_children
> > +							   + 1));
> > +	else
> > +		ret = ida_simple_get(&guc->submission_state.guc_ids,
> > +				     NUMBER_MULTI_LRC_GUC_ID,
> > +				     GUC_MAX_LRC_DESCRIPTORS,
> > +				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
> > +				     __GFP_NOWARN);
> > +	if (unlikely(ret < 0))
> > +		return ret;
> > +
> > +	ce->guc_id.id = ret;
> > +	return 0;
> >   }
> >   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >   	if (!context_guc_id_invalid(ce)) {
> > -		ida_simple_remove(&guc->submission_state.guc_ids,
> > -				  ce->guc_id.id);
> > +		if (intel_context_is_parent(ce))
> > +			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
> > +					      ce->guc_id.id,
> > +					      order_base_2(ce->parallel.number_children
> > +							   + 1));
> There was a discussion on the previous revision about adding a BUG_ON to
> ensure that number_children cannot change between the bitmap alloc and the
> bitmap release. I'm not seeing the new BUG_ON mentioned in this patch.
> 

I thought you meant to add a BUG_ON to ensure that a region / id is
occupied before we release it? I looked in both the bitmap API and the
ida API and neither has a function that checks whether a region / id is
occupied, so I can't really add a BUG_ON for that.

How would you add a BUG_ON to ensure the number of children cannot
change between alloc and release? I don't follow how that would work.

Matt 
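
One hypothetical way to implement the check John is asking for (not part of
the patch; the guc_id.region_order field and the helper names below are
invented purely for illustration) would be to remember the region order at
allocation time and assert it again at release:

	/* Sketch only: record the order used for bitmap_find_free_region(). */
	static int new_parent_guc_id(struct intel_guc *guc, struct intel_context *ce)
	{
		int order = order_base_2(ce->parallel.number_children + 1);
		int ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
						  NUMBER_MULTI_LRC_GUC_ID, order);

		if (ret < 0)
			return ret;

		ce->guc_id.id = ret;
		ce->guc_id.region_order = order;	/* hypothetical field */
		return 0;
	}

	static void release_parent_guc_id(struct intel_guc *guc, struct intel_context *ce)
	{
		int order = order_base_2(ce->parallel.number_children + 1);

		/* Fires if number_children changed between alloc and release. */
		GEM_BUG_ON(order != ce->guc_id.region_order);
		bitmap_release_region(guc->submission_state.guc_ids_bitmap,
				      ce->guc_id.id, order);
	}

This does not check that the region is occupied (as Matt notes, neither API
exposes that), but it would catch a mismatched release size.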

> John.
> 
> 
> > +		else
> > +			ida_simple_remove(&guc->submission_state.guc_ids,
> > +					  ce->guc_id.id);
> >   		reset_lrc_desc(guc, ce->guc_id.id);
> >   		set_context_guc_id_invalid(ce);
> >   	}
> > @@ -1296,49 +1337,64 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> > -static int steal_guc_id(struct intel_guc *guc)
> > +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> > -	struct intel_context *ce;
> > -	int guc_id;
> > +	struct intel_context *cn;
> >   	lockdep_assert_held(&guc->submission_state.lock);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +	GEM_BUG_ON(intel_context_is_parent(ce));
> >   	if (!list_empty(&guc->submission_state.guc_id_list)) {
> > -		ce = list_first_entry(&guc->submission_state.guc_id_list,
> > +		cn = list_first_entry(&guc->submission_state.guc_id_list,
> >   				      struct intel_context,
> >   				      guc_id.link);
> > -		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> > -		GEM_BUG_ON(context_guc_id_invalid(ce));
> > +		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
> > +		GEM_BUG_ON(context_guc_id_invalid(cn));
> > +		GEM_BUG_ON(intel_context_is_child(cn));
> > +		GEM_BUG_ON(intel_context_is_parent(cn));
> > -		list_del_init(&ce->guc_id.link);
> > -		guc_id = ce->guc_id.id;
> > +		list_del_init(&cn->guc_id.link);
> > +		ce->guc_id = cn->guc_id;
> >   		spin_lock(&ce->guc_state.lock);
> > -		clr_context_registered(ce);
> > +		clr_context_registered(cn);
> >   		spin_unlock(&ce->guc_state.lock);
> > -		set_context_guc_id_invalid(ce);
> > -		return guc_id;
> > +		set_context_guc_id_invalid(cn);
> > +
> > +		return 0;
> >   	} else {
> >   		return -EAGAIN;
> >   	}
> >   }
> > -static int assign_guc_id(struct intel_guc *guc, u16 *out)
> > +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> >   	int ret;
> >   	lockdep_assert_held(&guc->submission_state.lock);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > -	ret = new_guc_id(guc);
> > +	ret = new_guc_id(guc, ce);
> >   	if (unlikely(ret < 0)) {
> > -		ret = steal_guc_id(guc);
> > +		if (intel_context_is_parent(ce))
> > +			return -ENOSPC;
> > +
> > +		ret = steal_guc_id(guc, ce);
> >   		if (ret < 0)
> >   			return ret;
> >   	}
> > -	*out = ret;
> > +	if (intel_context_is_parent(ce)) {
> > +		struct intel_context *child;
> > +		int i = 1;
> > +
> > +		for_each_child(ce, child)
> > +			child->guc_id.id = ce->guc_id.id + i++;
> > +	}
> > +
> >   	return 0;
> >   }
> > @@ -1356,7 +1412,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	might_lock(&ce->guc_state.lock);
> >   	if (context_guc_id_invalid(ce)) {
> > -		ret = assign_guc_id(guc, &ce->guc_id.id);
> > +		ret = assign_guc_id(guc, ce);
> >   		if (ret)
> >   			goto out_unlock;
> >   		ret = 1;	/* Indidcates newly assigned guc_id */
> > @@ -1398,8 +1454,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	unsigned long flags;
> >   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > -	if (unlikely(context_guc_id_invalid(ce)))
> > +	if (unlikely(context_guc_id_invalid(ce) ||
> > +		     intel_context_is_parent(ce)))
> >   		return;
> >   	spin_lock_irqsave(&guc->submission_state.lock, flags);
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 03/26] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-10-07 18:15         ` [Intel-gfx] " John Harrison
@ 2021-10-08  1:23           ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-08  1:23 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Thu, Oct 07, 2021 at 11:15:51AM -0700, John Harrison wrote:
> On 10/7/2021 08:19, Matthew Brost wrote:
> > On Wed, Oct 06, 2021 at 08:45:42PM -0700, John Harrison wrote:
> > > On 10/4/2021 15:06, Matthew Brost wrote:
> > > > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > > > circuiting while a scheduling of user context could be enabled.
> > > I'm not sure what 'while a scheduling of user context could be enabled'
> > > means.
> > > 
> > Not really sure how this isn't clear.
> > 
> > It means that if a user context has scheduling enabled, this function
> > cannot short circuit and return idle.
> > 
> > Matt
> Okay. The 'a scheduling' was throwing me off. And I was reading 'could be
> enabled' as saying something that might happen in the future. English is
> great at being ambiguous ;). Maybe 'while any user context has scheduling
> enabled' would be simpler?
> 

Sure.

Matt
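
To make the short-circuit concern concrete, here is a toy user-space model
(an illustration only, not the driver code) of why the pin path takes a PM
reference: wait_for_idle() bails out early when no wakeref is held, so a
user context with scheduling enabled must hold one to keep the answer
honest.

	#include <stdbool.h>
	#include <stdio.h>

	static int gt_wakeref;			/* models the GT wakeref counter */
	static bool context_sched_enabled;	/* models a user context with
						 * GuC scheduling enabled */

	static bool gt_wait_for_idle(void)
	{
		/* Short-circuits when no wakeref is held. */
		if (gt_wakeref == 0)
			return true;	/* reports idle */
		/* ... otherwise it would wait for outstanding work ... */
		return false;
	}

	static void context_pin(void)
	{
		gt_wakeref++;		/* this patch: pin takes engine/GT PM */
		context_sched_enabled = true;
	}

	static void context_unpin(void)
	{
		context_sched_enabled = false;
		gt_wakeref--;
	}

	int main(void)
	{
		context_pin();
		/* With the ref held, wait_for_idle() cannot falsely report idle. */
		printf("idle=%d sched_enabled=%d\n",
		       gt_wait_for_idle(), context_sched_enabled);
		context_unpin();
		printf("idle=%d\n", gt_wait_for_idle());
		return 0;
	}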

> John.
> 
> > > John.
> > > 
> > > > Returning GT idle when it is not can cause all sorts of issues
> > > > throughout the stack.
> > > > 
> > > > v2:
> > > >    (Daniel Vetter)
> > > >     - Add might_lock annotations to pin / unpin function
> > > > v3:
> > > >    (CI)
> > > >     - Drop intel_engine_pm_might_put from unpin path as an async put is
> > > >       used
> > > > v4:
> > > >    (John Harrison)
> > > >     - Make intel_engine_pm_might_get/put work with GuC virtual engines
> > > >     - Update commit message
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/intel_context.c       |  2 ++
> > > >    drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 32 +++++++++++++++++
> > > >    drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
> > > >    drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
> > > >    5 files changed, 89 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > index 1076066f41e0..f601323b939f 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > @@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >    	if (err)
> > > >    		goto err_post_unpin;
> > > > +	intel_engine_pm_might_get(ce->engine);
> > > > +
> > > >    	if (unlikely(intel_context_is_closed(ce))) {
> > > >    		err = -ENOENT;
> > > >    		goto err_unlock;
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > > > index 6fdeae668e6e..d68675925b79 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > > > @@ -6,9 +6,11 @@
> > > >    #ifndef INTEL_ENGINE_PM_H
> > > >    #define INTEL_ENGINE_PM_H
> > > > +#include "i915_drv.h"
> > > >    #include "i915_request.h"
> > > >    #include "intel_engine_types.h"
> > > >    #include "intel_wakeref.h"
> > > > +#include "intel_gt_pm.h"
> > > >    static inline bool
> > > >    intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> > > > @@ -31,6 +33,21 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
> > > >    	return intel_wakeref_get_if_active(&engine->wakeref);
> > > >    }
> > > > +static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
> > > > +{
> > > > +	if (!intel_engine_is_virtual(engine)) {
> > > > +		intel_wakeref_might_get(&engine->wakeref);
> > > > +	} else {
> > > > +		struct intel_gt *gt = engine->gt;
> > > > +		struct intel_engine_cs *tengine;
> > > > +		intel_engine_mask_t tmp, mask = engine->mask;
> > > > +
> > > > +		for_each_engine_masked(tengine, gt, mask, tmp)
> > > > +			intel_wakeref_might_get(&tengine->wakeref);
> > > > +	}
> > > > +	intel_gt_pm_might_get(engine->gt);
> > > > +}
> > > > +
> > > >    static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
> > > >    {
> > > >    	intel_wakeref_put(&engine->wakeref);
> > > > @@ -52,6 +69,21 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
> > > >    	intel_wakeref_unlock_wait(&engine->wakeref);
> > > >    }
> > > > +static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
> > > > +{
> > > > +	if (!intel_engine_is_virtual(engine)) {
> > > > +		intel_wakeref_might_put(&engine->wakeref);
> > > > +	} else {
> > > > +		struct intel_gt *gt = engine->gt;
> > > > +		struct intel_engine_cs *tengine;
> > > > +		intel_engine_mask_t tmp, mask = engine->mask;
> > > > +
> > > > +		for_each_engine_masked(tengine, gt, mask, tmp)
> > > > +			intel_wakeref_might_put(&tengine->wakeref);
> > > > +	}
> > > > +	intel_gt_pm_might_put(engine->gt);
> > > > +}
> > > > +
> > > >    static inline struct i915_request *
> > > >    intel_engine_create_kernel_request(struct intel_engine_cs *engine)
> > > >    {
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > > > index 05de6c1af25b..bc898df7a48c 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > > > @@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
> > > >    	return intel_wakeref_get_if_active(&gt->wakeref);
> > > >    }
> > > > +static inline void intel_gt_pm_might_get(struct intel_gt *gt)
> > > > +{
> > > > +	intel_wakeref_might_get(&gt->wakeref);
> > > > +}
> > > > +
> > > >    static inline void intel_gt_pm_put(struct intel_gt *gt)
> > > >    {
> > > >    	intel_wakeref_put(&gt->wakeref);
> > > > @@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
> > > >    	intel_wakeref_put_async(&gt->wakeref);
> > > >    }
> > > > +static inline void intel_gt_pm_might_put(struct intel_gt *gt)
> > > > +{
> > > > +	intel_wakeref_might_put(&gt->wakeref);
> > > > +}
> > > > +
> > > >    #define with_intel_gt_pm(gt, tmp) \
> > > >    	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > > >    	     intel_gt_pm_put(gt), tmp = 0)
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 17da2fea1bff..8b82da50c2bc 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -1571,7 +1571,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
> > > >    static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > > >    {
> > > > -	return __guc_context_pin(ce, ce->engine, vaddr);
> > > > +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > > +
> > > > +	if (likely(!ret && !intel_context_is_barrier(ce)))
> > > > +		intel_engine_pm_get(ce->engine);
> > > > +
> > > > +	return ret;
> > > >    }
> > > >    static void guc_context_unpin(struct intel_context *ce)
> > > > @@ -1580,6 +1585,9 @@ static void guc_context_unpin(struct intel_context *ce)
> > > >    	unpin_guc_id(guc, ce);
> > > >    	lrc_unpin(ce);
> > > > +
> > > > +	if (likely(!intel_context_is_barrier(ce)))
> > > > +		intel_engine_pm_put_async(ce->engine);
> > > >    }
> > > >    static void guc_context_post_unpin(struct intel_context *ce)
> > > > @@ -2341,8 +2349,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > > >    static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > >    {
> > > >    	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > +	int ret = __guc_context_pin(ce, engine, vaddr);
> > > > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > > +
> > > > +	if (likely(!ret))
> > > > +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > +			intel_engine_pm_get(engine);
> > > > -	return __guc_context_pin(ce, engine, vaddr);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static void guc_virtual_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > > +	struct intel_engine_cs *engine;
> > > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > > +
> > > > +	GEM_BUG_ON(context_enabled(ce));
> > > > +	GEM_BUG_ON(intel_context_is_barrier(ce));
> > > > +
> > > > +	unpin_guc_id(guc, ce);
> > > > +	lrc_unpin(ce);
> > > > +
> > > > +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > +		intel_engine_pm_put_async(engine);
> > > >    }
> > > >    static void guc_virtual_context_enter(struct intel_context *ce)
> > > > @@ -2379,7 +2409,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> > > >    	.pre_pin = guc_virtual_context_pre_pin,
> > > >    	.pin = guc_virtual_context_pin,
> > > > -	.unpin = guc_context_unpin,
> > > > +	.unpin = guc_virtual_context_unpin,
> > > >    	.post_unpin = guc_context_post_unpin,
> > > >    	.ban = guc_context_ban,
> > > > diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> > > > index 545c8f277c46..4f4c2e15e736 100644
> > > > --- a/drivers/gpu/drm/i915/intel_wakeref.h
> > > > +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> > > > @@ -123,6 +123,12 @@ enum {
> > > >    	__INTEL_WAKEREF_PUT_LAST_BIT__
> > > >    };
> > > > +static inline void
> > > > +intel_wakeref_might_get(struct intel_wakeref *wf)
> > > > +{
> > > > +	might_lock(&wf->mutex);
> > > > +}
> > > > +
> > > >    /**
> > > >     * intel_wakeref_put_flags: Release the wakeref
> > > >     * @wf: the wakeref
> > > > @@ -170,6 +176,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
> > > >    			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
> > > >    }
> > > > +static inline void
> > > > +intel_wakeref_might_put(struct intel_wakeref *wf)
> > > > +{
> > > > +	might_lock(&wf->mutex);
> > > > +}
> > > > +
> > > >    /**
> > > >     * intel_wakeref_lock: Lock the wakeref (mutex)
> > > >     * @wf: the wakeref
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread
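
The intel_wakeref_might_get/put helpers in the patch above boil down to
might_lock(&wf->mutex). A rough user-space analogue of that annotation idea
(an assumption-level sketch, not how kernel lockdep is actually implemented)
is to take and immediately drop the mutex in debug paths, so misuse is
exercised even when the fast path never really acquires it:

	#include <pthread.h>
	#include <stdio.h>

	/* Debug stand-in for might_lock(): acquire and release at once so any
	 * deadlock potential shows up even when the slow path is never hit. */
	static void might_lock(pthread_mutex_t *m)
	{
		pthread_mutex_lock(m);
		pthread_mutex_unlock(m);
	}

	static pthread_mutex_t wakeref_mutex = PTHREAD_MUTEX_INITIALIZER;
	static int wakeref_count;

	static void wakeref_get(void)
	{
		might_lock(&wakeref_mutex);	/* annotate even the fast path */
		if (++wakeref_count == 1) {
			pthread_mutex_lock(&wakeref_mutex);
			/* power-up work would happen here */
			pthread_mutex_unlock(&wakeref_mutex);
		}
	}

	int main(void)
	{
		wakeref_get();
		printf("wakeref_count = %d\n", wakeref_count);
		return 0;
	}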

* Re: [PATCH 02/26] drm/i915/guc: Take GT PM ref when deregistering context
  2021-10-07  3:37     ` [Intel-gfx] " John Harrison
@ 2021-10-08  1:28       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-08  1:28 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Wed, Oct 06, 2021 at 08:37:03PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > circuiting while a deregister context H2G is in flight. To do this we must
> > issue the deregister H2G from a worker as the context can be destroyed from
> > an atomic context and taking a GT PM ref there blows up. Previously we took
> > a runtime PM ref from this atomic context, which worked but will stop
> > working once runtime PM autosuspend is enabled.
> > 
> > So this patch is twofold: stop intel_gt_wait_for_idle from short
> > circuiting and fix runtime PM autosuspend.
> > 
> > v2:
> >   (John Harrison)
> >    - Split structure changes out in different patch
> >   (Tvrtko)
> >    - Don't drop lock in deregister_destroyed_contexts
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.h         |   4 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  11 ++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 146 +++++++++++-------
> >   6 files changed, 121 insertions(+), 54 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index e9a0cad5c34d..1076066f41e0 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >   	ce->guc_id.id = GUC_INVALID_LRC_ID;
> >   	INIT_LIST_HEAD(&ce->guc_id.link);
> > +	INIT_LIST_HEAD(&ce->destroyed_link);
> > +
> >   	/*
> >   	 * Initialize fence to be complete as this is expected to be complete
> >   	 * unless there is a pending schedule disable outstanding.
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index e7e3984aab78..4613d027cbc3 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -213,6 +213,13 @@ struct intel_context {
> >   		struct list_head link;
> >   	} guc_id;
> > +	/**
> > +	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
> > +	 * list when context is pending to be destroyed (deregistered with the
> > +	 * GuC), protected by guc->submission_state.lock
> > +	 */
> > +	struct list_head destroyed_link;
> > +
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> >   	/**
> >   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > index 8520c595f5e1..6fdeae668e6e 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > @@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> >   	return intel_wakeref_is_active(&engine->wakeref);
> >   }
> > +static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
> > +{
> > +	__intel_wakeref_get(&engine->wakeref);
> > +}
> > +
> >   static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
> >   {
> >   	intel_wakeref_get(&engine->wakeref);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > index d0588d8aaa44..05de6c1af25b 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > @@ -41,6 +41,10 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
> >   	intel_wakeref_put_async(&gt->wakeref);
> >   }
> > +#define with_intel_gt_pm(gt, tmp) \
> > +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > +	     intel_gt_pm_put(gt), tmp = 0)
> > +
> >   static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
> >   {
> >   	return intel_wakeref_wait_for_idle(&gt->wakeref);
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 65b5e8eeef96..25a598e2b6e8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -84,6 +84,17 @@ struct intel_guc {
> >   		 * refs
> >   		 */
> >   		struct list_head guc_id_list;
> > +		/**
> > +		 * @destroyed_contexts: list of contexts waiting to be destroyed
> > +		 * (deregistered with the GuC)
> > +		 */
> > +		struct list_head destroyed_contexts;
> > +		/**
> > +		 * @destroyed_worker: worker to deregister contexts; needed as we
> > +		 * must take a GT PM reference, which we can't do from the destroy
> > +		 * function as it might be in an atomic context (no sleeping)
> > +		 */
> > +		struct work_struct destroyed_worker;
> >   	} submission_state;
> >   	/**
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index ad5c18119d92..17da2fea1bff 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -90,8 +90,8 @@
> >    * used for all of GuC submission but that could change in the future.
> >    *
> >    * guc->submission_state.lock
> > - * Protects guc_id allocation for the given GuC, i.e. only one context can be
> > - * doing guc_id allocation operations at a time for each GuC in the system.
> > + * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
> > + * list.
> Feels like this should not be removing explanations, only adding to them.
> The patch itself is only adding new features not removing them. Either the
> details about id allocation are not worth mentioning and should not have
> been added in the previous patch. Or they are and should be kept rather than
> removed in this patch. Either way works for me. The comment was valid
> information, but it maybe counts as obvious given that the guc_id member
> (and friends) sit within a per-GuC-instance structure.
> 

The comment before this patch is already in drm-tip, merged in a patch
prior to this series. Before this patch the lock was specific to guc_id
allocation; now it is a generic global submission lock, and the
explanation has been updated to reflect that.

Matt 

> >    *
> >    * ce->guc_state.lock
> >    * Protects everything under ce->guc_state. Ensures that a context is in the
> > @@ -719,6 +719,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   			if (deregister)
> >   				guc_signal_context_fence(ce);
> >   			if (destroyed) {
> > +				intel_gt_pm_put_async(guc_to_gt(guc));
> >   				release_guc_id(guc, ce);
> >   				__guc_context_destroy(ce);
> >   			}
> > @@ -797,6 +798,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
> >   	spin_unlock_irqrestore(&sched_engine->lock, flags);
> >   }
> > +static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> > +
> >   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   {
> >   	int i;
> > @@ -815,6 +818,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
> >   	guc_flush_submissions(guc);
> > +	guc_flush_destroyed_contexts(guc);
> >   	/*
> >   	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> > @@ -1126,6 +1130,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> >   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
> >   }
> > +static void destroyed_worker_func(struct work_struct *w);
> > +
> >   /*
> >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> >    * at firmware loading time.
> > @@ -1151,6 +1157,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	spin_lock_init(&guc->submission_state.lock);
> >   	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> >   	ida_init(&guc->submission_state.guc_ids);
> > +	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> > +	INIT_WORK(&guc->submission_state.destroyed_worker,
> > +		  destroyed_worker_func);
> >   	return 0;
> >   }
> > @@ -1161,6 +1170,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> >   		return;
> >   	guc_lrc_desc_pool_destroy(guc);
> > +	guc_flush_destroyed_contexts(guc);
> Seems like these lines should be reversed. We should destroy the higher
> level constructs before the lower level ones that they could be built on.
> 
> >   	i915_sched_engine_put(guc->sched_engine);
> >   }
> > @@ -1859,11 +1869,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
> >   static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	unsigned long flags;
> > +	bool disabled;
> > +	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
> >   	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
> >   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
> >   	GEM_BUG_ON(context_enabled(ce));
> > +	/* Seal race with Reset */
> > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +	disabled = submission_disabled(guc);
> > +	if (likely(!disabled)) {
> > +		__intel_gt_pm_get(gt);
> > +		set_context_destroyed(ce);
> > +		clr_context_registered(ce);
> > +	}
> > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	if (unlikely(disabled)) {
> > +		release_guc_id(guc, ce);
> > +		__guc_context_destroy(ce);
> > +		return;
> > +	}
> > +
> >   	deregister_context(ce, ce->guc_id.id);
> >   }
> > @@ -1891,78 +1920,86 @@ static void __guc_context_destroy(struct intel_context *ce)
> >   	}
> >   }
> > +static void guc_flush_destroyed_contexts(struct intel_guc *guc)
> > +{
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +
> > +	GEM_BUG_ON(!submission_disabled(guc) &&
> > +		   guc_submission_initialized(guc));
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	list_for_each_entry_safe(ce, cn,
> > +				 &guc->submission_state.destroyed_contexts,
> > +				 destroyed_link) {
> > +		list_del_init(&ce->destroyed_link);
> > +		__release_guc_id(guc, ce);
> > +		__guc_context_destroy(ce);
> > +	}
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +}
> > +
> > +static void deregister_destroyed_contexts(struct intel_guc *guc)
> > +{
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	list_for_each_entry_safe(ce, cn,
> > +				 &guc->submission_state.destroyed_contexts,
> > +				 destroyed_link) {
> > +		list_del_init(&ce->destroyed_link);
> > +		guc_lrc_desc_unpin(ce);
> > +	}
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +}
> > +
> > +static void destroyed_worker_func(struct work_struct *w)
> > +{
> > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +					     submission_state.destroyed_worker);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	int tmp;
> > +
> > +	with_intel_gt_pm(gt, tmp)
> > +		deregister_destroyed_contexts(guc);
> > +}
> > +
> >   static void guc_context_destroy(struct kref *kref)
> >   {
> >   	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> > -	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > -	intel_wakeref_t wakeref;
> >   	unsigned long flags;
> > -	bool disabled;
> > +	bool destroy;
> >   	/*
> >   	 * If the guc_id is invalid this context has been stolen and we can free
> >   	 * it immediately. Also can be freed immediately if the context is not
> >   	 * registered with the GuC or the GuC is in the middle of a reset.
> >   	 */
> > -	if (context_guc_id_invalid(ce)) {
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	} else if (submission_disabled(guc) ||
> > -		   !lrc_desc_registered(guc, ce->guc_id.id)) {
> > -		release_guc_id(guc, ce);
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	}
> > -
> > -	/*
> > -	 * We have to acquire the context spinlock and check guc_id again, if it
> > -	 * is valid it hasn't been stolen and needs to be deregistered. We
> > -	 * delete this context from the list of unpinned guc_id available to
> > -	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
> > -	 * returns indicating this context has been deregistered the guc_id is
> > -	 * returned to the pool of available guc_id.
> > -	 */
> >   	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > -	if (context_guc_id_invalid(ce)) {
> > -		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > -		__guc_context_destroy(ce);
> > -		return;
> > +	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
> > +		!lrc_desc_registered(guc, ce->guc_id.id);
> > +	if (likely(!destroy)) {
> > +		if (!list_empty(&ce->guc_id.link))
> > +			list_del_init(&ce->guc_id.link);
> > +		list_add_tail(&ce->destroyed_link,
> > +			      &guc->submission_state.destroyed_contexts);
> > +	} else {
> > +		__release_guc_id(guc, ce);
> 'destroy' can be true if the guc_id is invalid. Is it good to call release
> on an invalid id?
> 
> John.
> 
> >   	}
> > -
> > -	if (!list_empty(&ce->guc_id.link))
> > -		list_del_init(&ce->guc_id.link);
> >   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > -
> > -	/* Seal race with Reset */
> > -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > -	disabled = submission_disabled(guc);
> > -	if (likely(!disabled)) {
> > -		set_context_destroyed(ce);
> > -		clr_context_registered(ce);
> > -	}
> > -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > -	if (unlikely(disabled)) {
> > -		release_guc_id(guc, ce);
> > +	if (unlikely(destroy)) {
> >   		__guc_context_destroy(ce);
> >   		return;
> >   	}
> >   	/*
> > -	 * We defer GuC context deregistration until the context is destroyed
> > -	 * in order to save on CTBs. With this optimization ideally we only need
> > -	 * 1 CTB to register the context during the first pin and 1 CTB to
> > -	 * deregister the context when the context is destroyed. Without this
> > -	 * optimization, a CTB would be needed every pin & unpin.
> > -	 *
> > -	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
> > -	 * from context_free_worker when runtime wakeref is not held.
> > -	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
> > -	 * in H2G CTB to deregister the context. A future patch may defer this
> > -	 * H2G CTB if the runtime wakeref is zero.
> > +	 * We use a worker to issue the H2G to deregister the context as we can
> > +	 * take the GT PM for the first time which isn't allowed from an atomic
> > +	 * context.
> >   	 */
> > -	with_intel_runtime_pm(runtime_pm, wakeref)
> > -		guc_lrc_desc_unpin(ce);
> > +	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
> >   }
> >   static int guc_context_alloc(struct intel_context *ce)
> > @@ -2798,6 +2835,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
> >   		intel_context_put(ce);
> >   	} else if (context_destroyed(ce)) {
> >   		/* Context has been destroyed */
> > +		intel_gt_pm_put_async(guc_to_gt(guc));
> >   		release_guc_id(guc, ce);
> >   		__guc_context_destroy(ce);
> >   	}
> 
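
To summarize the new flow in one self-contained sketch (a generic
pattern only, with made-up names; the real code above additionally
manages GT PM references and guc_ids): freeing from atomic context just
queues the object onto a list, and a worker does the part that may
sleep.

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct deferred_obj {
	struct list_head link;
};

static LIST_HEAD(deferred_list);
static DEFINE_SPINLOCK(deferred_lock);

static void deferred_worker_func(struct work_struct *w)
{
	struct deferred_obj *obj, *next;

	/* Process context: safe to take a sleeping wakeref before this. */
	spin_lock_irq(&deferred_lock);
	list_for_each_entry_safe(obj, next, &deferred_list, link) {
		list_del_init(&obj->link);
		kfree(obj);
	}
	spin_unlock_irq(&deferred_lock);
}

static DECLARE_WORK(deferred_work, deferred_worker_func);

/* May be called from atomic context - never sleeps. */
static void deferred_free(struct deferred_obj *obj)
{
	unsigned long flags;

	spin_lock_irqsave(&deferred_lock, flags);
	list_add_tail(&obj->link, &deferred_list);
	spin_unlock_irqrestore(&deferred_lock, flags);

	queue_work(system_unbound_wq, &deferred_work);
}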

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 08/26] drm/i915/guc: Add multi-lrc context registration
  2021-10-07 19:50     ` [Intel-gfx] " John Harrison
@ 2021-10-08  1:31       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-08  1:31 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Thu, Oct 07, 2021 at 12:50:28PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Add multi-lrc context registration H2G. In addition, a workqueue and
> > process descriptor are set up during multi-lrc context registration as
> > these data structures are needed for multi-lrc submission.
> > 
> > v2:
> >   (John Harrison)
> >    - Move GuC specific fields into sub-struct
> >    - Clean up WQ defines
> >    - Add comment explaining math to derive WQ / PD address
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
> >   drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
> >   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 -
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 +++++++++++++++++-
> >   5 files changed, 131 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 76dfca57cb45..48decb5ee954 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -239,6 +239,18 @@ struct intel_context {
> >   		struct intel_context *parent;
> >   		/** @number_children: number of children if parent */
> >   		u8 number_children;
> > +		/** @guc: GuC specific members for parallel submission */
> > +		struct {
> > +			/** @wqi_head: head pointer in work queue */
> > +			u16 wqi_head;
> > +			/** @wqi_tail: tail pointer in work queue */
> > +			u16 wqi_tail;
> > +			/**
> > +			 * @parent_page: page in context state (ce->state) used
> > +			 * by parent for work queue, process descriptor
> > +			 */
> > +			u8 parent_page;
> > +		} guc;
> >   	} parallel;
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index 3ef9eaf8c50e..57339d5c1fc8 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -942,6 +942,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
> >   		context_size += PAGE_SIZE;
> >   	}
> > +	if (intel_context_is_parent(ce) && intel_engine_uses_guc(engine)) {
> > +		ce->parallel.guc.parent_page = context_size / PAGE_SIZE;
> > +		context_size += PAGE_SIZE;
> > +	}
> > +
> >   	obj = i915_gem_object_create_lmem(engine->i915, context_size,
> >   					  I915_BO_ALLOC_PM_VOLATILE);
> >   	if (IS_ERR(obj))
> > diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > index 8ff582222aff..ba10bd374cee 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > @@ -142,6 +142,7 @@ enum intel_guc_action {
> >   	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
> >   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
> >   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
> > +	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
> >   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> >   	INTEL_GUC_ACTION_LIMIT
> >   };
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > index fa4be13c8854..0eeb2a9feeed 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > @@ -52,8 +52,6 @@
> >   #define GUC_DOORBELL_INVALID		256
> > -#define GUC_WQ_SIZE			(PAGE_SIZE * 2)
> > -
> >   /* Work queue item header definitions */
> >   #define WQ_STATUS_ACTIVE		1
> >   #define WQ_STATUS_SUSPENDED		2
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 451d9ae861a6..ab6d7fc1b0b1 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -344,6 +344,45 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
> >   	return rb_entry(rb, struct i915_priolist, node);
> >   }
> > +/*
> > + * When using multi-lrc submission an extra page in the context state is
> > + * reserved for the process descriptor and work queue.
> > + *
> > + * The layout of this page is below:
> > + * 0						guc_process_desc
> > + * ...						unused
> > + * PAGE_SIZE / 2				work queue start
> > + * ...						work queue
> > + * PAGE_SIZE - 1				work queue end
> > + */
> > +#define WQ_SIZE			(PAGE_SIZE / 2)
> > +#define WQ_OFFSET		(PAGE_SIZE - WQ_SIZE)
> I thought you were going with '#define PARENT_SCRATCH_SIZE PAGE_SIZE' and
> then using that everywhere else? Unless there is a fundamental reason why
> the above must be exactly a page in size, I think the size should be
> defined once and re-used rather than assumed in multiple places (including
> in the description comment).
> 

Right, forgot about that. Will fix.
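
One possible shape for that fix (an illustrative sketch of John's
suggestion, not code from this series - PARENT_SCRATCH_SIZE is his
proposed name):

#define PARENT_SCRATCH_SIZE	PAGE_SIZE
#define WQ_SIZE			(PARENT_SCRATCH_SIZE / 2)
#define WQ_OFFSET		(PARENT_SCRATCH_SIZE - WQ_SIZE)

/* ...and __lrc_alloc_state() would then reserve the same amount: */
	if (intel_context_is_parent(ce) && intel_engine_uses_guc(engine)) {
		ce->parallel.guc.parent_page = context_size / PAGE_SIZE;
		context_size += PARENT_SCRATCH_SIZE;
	}

so the page-sized assumption lives in exactly one define.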

> > +static u32 __get_process_desc_offset(struct intel_context *ce)
> > +{
> > +	GEM_BUG_ON(!ce->parallel.guc.parent_page);
> > +
> > +	return ce->parallel.guc.parent_page * PAGE_SIZE;
> > +}
> > +
> > +static u32 __get_wq_offset(struct intel_context *ce)
> > +{
> > +	return __get_process_desc_offset(ce) + WQ_OFFSET;
> > +}
> > +
> > +static struct guc_process_desc *
> > +__get_process_desc(struct intel_context *ce)
> > +{
> > +	/*
> > +	 * Need to subtract LRC_STATE_OFFSET here as the
> > +	 * parallel.guc.parent_page is the offset into ce->state while
> > +	 * ce->lrc_reg_reg is ce->state + LRC_STATE_OFFSET.
> > +	 */
> > +	return (struct guc_process_desc *)
> > +		(ce->lrc_reg_state +
> > +		 ((__get_process_desc_offset(ce) -
> > +		   LRC_STATE_OFFSET) / sizeof(u32)));
> > +}
> > +
> >   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> >   {
> >   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > @@ -1365,6 +1404,30 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> > +static int __guc_action_register_multi_lrc(struct intel_guc *guc,
> > +					   struct intel_context *ce,
> > +					   u32 guc_id,
> > +					   u32 offset,
> > +					   bool loop)
> > +{
> > +	struct intel_context *child;
> > +	u32 action[4 + MAX_ENGINE_INSTANCE];
> > +	int len = 0;
> > +
> > +	GEM_BUG_ON(ce->parallel.number_children > MAX_ENGINE_INSTANCE);
> > +
> > +	action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
> > +	action[len++] = guc_id;
> > +	action[len++] = ce->parallel.number_children + 1;
> > +	action[len++] = offset;
> > +	for_each_child(ce, child) {
> > +		offset += sizeof(struct guc_lrc_desc);
> > +		action[len++] = offset;
> > +	}
> > +
> > +	return guc_submission_send_busy_loop(guc, action, len, 0, loop);
> > +}
> > +
> >   static int __guc_action_register_context(struct intel_guc *guc,
> >   					 u32 guc_id,
> >   					 u32 offset,
> > @@ -1387,9 +1450,15 @@ static int register_context(struct intel_context *ce, bool loop)
> >   		ce->guc_id.id * sizeof(struct guc_lrc_desc);
> >   	int ret;
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	trace_intel_context_register(ce);
> > -	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
> > +	if (intel_context_is_parent(ce))
> > +		ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
> > +						      offset, loop);
> > +	else
> > +		ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
> > +						    loop);
> >   	if (likely(!ret)) {
> >   		unsigned long flags;
> > @@ -1418,6 +1487,7 @@ static int deregister_context(struct intel_context *ce, u32 guc_id)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	trace_intel_context_deregister(ce);
> >   	return __guc_action_deregister_context(guc, guc_id);
> > @@ -1445,6 +1515,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	struct guc_lrc_desc *desc;
> >   	bool context_registered;
> >   	intel_wakeref_t wakeref;
> > +	struct intel_context *child;
> >   	int ret = 0;
> >   	GEM_BUG_ON(!engine->mask);
> > @@ -1470,6 +1541,41 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> >   	guc_context_policy_init(engine, desc);
> > +	/*
> > +	 * Context is a parent, we need to register a process descriptor
> > +	 * describing a work queue and register all child contexts.
> > +	 */
> This was now meant to say 'If the context is a parent...'?
> 

Yep, that is what it should say. Will fix.

Matt
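
For reference, working through the loop above for a hypothetical parent
with two children (offset is whatever register_context() computed for
the parent's guc_lrc_desc in the pool):

	action[0] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
	action[1] = parent guc_id;
	action[2] = 3;                      /* parent + 2 children */
	action[3] = offset;                 /* parent descriptor */
	action[4] = offset + 1 * sizeof(struct guc_lrc_desc); /* child 0 */
	action[5] = offset + 2 * sizeof(struct guc_lrc_desc); /* child 1 */

i.e. the GuC learns the contiguous run of descriptors in a single H2G.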

> John.
> 
> > +	if (intel_context_is_parent(ce)) {
> > +		struct guc_process_desc *pdesc;
> > +
> > +		ce->parallel.guc.wqi_tail = 0;
> > +		ce->parallel.guc.wqi_head = 0;
> > +
> > +		desc->process_desc = i915_ggtt_offset(ce->state) +
> > +			__get_process_desc_offset(ce);
> > +		desc->wq_addr = i915_ggtt_offset(ce->state) +
> > +			__get_wq_offset(ce);
> > +		desc->wq_size = WQ_SIZE;
> > +
> > +		pdesc = __get_process_desc(ce);
> > +		memset(pdesc, 0, sizeof(*(pdesc)));
> > +		pdesc->stage_id = ce->guc_id.id;
> > +		pdesc->wq_base_addr = desc->wq_addr;
> > +		pdesc->wq_size_bytes = desc->wq_size;
> > +		pdesc->wq_status = WQ_STATUS_ACTIVE;
> > +
> > +		for_each_child(ce, child) {
> > +			desc = __get_lrc_desc(guc, child->guc_id.id);
> > +
> > +			desc->engine_class =
> > +				engine_class_to_guc_class(engine->class);
> > +			desc->hw_context_desc = child->lrc.lrca;
> > +			desc->priority = ce->guc_state.prio;
> > +			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > +			guc_context_policy_init(engine, desc);
> > +		}
> > +	}
> > +
> >   	/*
> >   	 * The context_lookup xarray is used to determine if the hardware
> >   	 * context is currently registered. There are two cases in which it
> > @@ -2804,6 +2910,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> >   		return NULL;
> >   	}
> > +	if (unlikely(intel_context_is_child(ce))) {
> > +		drm_err(&guc_to_gt(guc)->i915->drm,
> > +			"Context is child, desc_idx %u", desc_idx);
> > +		return NULL;
> > +	}
> > +
> >   	return ce;
> >   }
> 
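
A quick check of the pointer math in __get_process_desc() above
(symbolic, no specific LRC_STATE_OFFSET value assumed): with S =
ce->state, the parent scratch page sits at S + parent_page * PAGE_SIZE,
while ce->lrc_reg_state is a u32 pointer at S + LRC_STATE_OFFSET.
Stepping that pointer by (parent_page * PAGE_SIZE - LRC_STATE_OFFSET) /
sizeof(u32) u32s therefore lands back on S + parent_page * PAGE_SIZE,
which is exactly __get_process_desc_offset(ce) into the context state.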

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 10/26] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
@ 2021-10-08 16:40         ` John Harrison
  0 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-08 16:40 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On 10/7/2021 18:21, Matthew Brost wrote:
> On Thu, Oct 07, 2021 at 03:03:04PM -0700, John Harrison wrote:
>> On 10/4/2021 15:06, Matthew Brost wrote:
>>> Assign contexts in parent-child relationship consecutive guc_ids. This
>>> is accomplished by partitioning guc_id space between ones that need to
>>> be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
>>> available guc_ids). The consecutive search is implemented via the bitmap
>>> API.
>>>
>>> This is a precursor to the full GuC multi-lrc implementation but aligns
>>> to how GuC mutli-lrc interface is defined - guc_ids must be consecutive
>>> when using the GuC multi-lrc interface.
>>>
>>> v2:
>>>    (Daniel Vetter)
>>>     - Explicitly state why we assign consecutive guc_ids
>>> v3:
>>>    (John Harrison)
>>>     - Bring back in spin lock
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 104 ++++++++++++++----
>>>    2 files changed, 86 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> index 25a598e2b6e8..a9f4ec972bfb 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> @@ -76,9 +76,13 @@ struct intel_guc {
>>>    		 */
>>>    		spinlock_t lock;
>>>    		/**
>>> -		 * @guc_ids: used to allocate new guc_ids
>>> +		 * @guc_ids: used to allocate new guc_ids, single-lrc
>>>    		 */
>>>    		struct ida guc_ids;
>>> +		/**
>>> +		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
>>> +		 */
>>> +		unsigned long *guc_ids_bitmap;
>>>    		/**
>>>    		 * @guc_id_list: list of intel_context with valid guc_ids but no
>>>    		 * refs
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 1f2809187513..79e7732e83b2 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -128,6 +128,16 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
>>>    #define GUC_REQUEST_SIZE 64 /* bytes */
>>> +/*
>>> + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
>>> + * per the GuC submission interface. A different allocation algorithm is used
>>> + * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
>>> + * partition the guc_id space. We believe the number of multi-lrc contexts in
>>> + * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
>>> + * multi-lrc.
>>> + */
>>> +#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
>>> +
>>>    /*
>>>     * Below is a set of functions which control the GuC scheduling state which
>>>     * require a lock.
>>> @@ -1206,6 +1216,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>>    	INIT_WORK(&guc->submission_state.destroyed_worker,
>>>    		  destroyed_worker_func);
>>> +	guc->submission_state.guc_ids_bitmap =
>>> +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
>>> +	if (!guc->submission_state.guc_ids_bitmap)
>>> +		return -ENOMEM;
>>> +
>>>    	return 0;
>>>    }
>>> @@ -1217,6 +1232,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>>>    	guc_lrc_desc_pool_destroy(guc);
>>>    	guc_flush_destroyed_contexts(guc);
>>>    	i915_sched_engine_put(guc->sched_engine);
>>> +	bitmap_free(guc->submission_state.guc_ids_bitmap);
>>>    }
>>>    static inline void queue_request(struct i915_sched_engine *sched_engine,
>>> @@ -1268,18 +1284,43 @@ static void guc_submit_request(struct i915_request *rq)
>>>    	spin_unlock_irqrestore(&sched_engine->lock, flags);
>>>    }
>>> -static int new_guc_id(struct intel_guc *guc)
>>> +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    {
>>> -	return ida_simple_get(&guc->submission_state.guc_ids, 0,
>>> -			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
>>> -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>>> +	int ret;
>>> +
>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>> +
>>> +	if (intel_context_is_parent(ce))
>>> +		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
>>> +					      NUMBER_MULTI_LRC_GUC_ID,
>>> +					      order_base_2(ce->parallel.number_children
>>> +							   + 1));
>>> +	else
>>> +		ret = ida_simple_get(&guc->submission_state.guc_ids,
>>> +				     NUMBER_MULTI_LRC_GUC_ID,
>>> +				     GUC_MAX_LRC_DESCRIPTORS,
>>> +				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
>>> +				     __GFP_NOWARN);
>>> +	if (unlikely(ret < 0))
>>> +		return ret;
>>> +
>>> +	ce->guc_id.id = ret;
>>> +	return 0;
>>>    }
>>>    static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    {
>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>> +
>>>    	if (!context_guc_id_invalid(ce)) {
>>> -		ida_simple_remove(&guc->submission_state.guc_ids,
>>> -				  ce->guc_id.id);
>>> +		if (intel_context_is_parent(ce))
>>> +			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
>>> +					      ce->guc_id.id,
>>> +					      order_base_2(ce->parallel.number_children
>>> +							   + 1));
>> There was a discussion on the previous revision about adding a BUG_ON to
>> ensure that number_children cannot change between the bitmap alloc and the
>> bitmap release. I'm not seeing the new BUG_ON mentioned in this patch.
>>
> I thought you meant to add a BUG_ON to ensure that a region / id is
> occupied before we release it? I looked in both the bitmap API and the
> ida API and neither has a function that checks whether a region / id
> is occupied, so I can't really add a BUG_ON for that.
>
> How would you add a BUG_ON to ensure the number of children cannot
> change between alloc and release? I don't follow how that would work.
>
> Matt
I was thinking that where number_children is modified, you have a 
BUG_ON(guc_id_is_valid). That would ensure that the release has to match 
the alloc. Hmm, you already have a BUG_ON about the parent/child not 
being pinned in intel_context_bind_parent_child(), which I guess covers 
it because you shouldn't have a guc_id if you aren't pinned, right? And 
that is the only function which can modify number_children, yes? So 
maybe it's all good?
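
Just to spell out what I had in mind - purely a sketch, and it assumes
intel_context_bind_parent_child() really is the only writer and that a
context_guc_id_invalid() style check is visible (or made visible) from
there:

	static void intel_context_bind_parent_child(struct intel_context *parent,
						    struct intel_context *child)
	{
		...
		/*
		 * A valid guc_id at this point would mean the block of ids
		 * handed out at alloc time could no longer match what gets
		 * released when the parent's guc_id is freed.
		 */
		GEM_BUG_ON(!context_guc_id_invalid(parent));

		parent->parallel.number_children++;
		...
	}

But given the existing pin BUG_ON, this is probably redundant anyway.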

John.

>
>> John.
>>
>>
>>> +		else
>>> +			ida_simple_remove(&guc->submission_state.guc_ids,
>>> +					  ce->guc_id.id);
>>>    		reset_lrc_desc(guc, ce->guc_id.id);
>>>    		set_context_guc_id_invalid(ce);
>>>    	}
>>> @@ -1296,49 +1337,64 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    }
>>> -static int steal_guc_id(struct intel_guc *guc)
>>> +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    {
>>> -	struct intel_context *ce;
>>> -	int guc_id;
>>> +	struct intel_context *cn;
>>>    	lockdep_assert_held(&guc->submission_state.lock);
>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>> +	GEM_BUG_ON(intel_context_is_parent(ce));
>>>    	if (!list_empty(&guc->submission_state.guc_id_list)) {
>>> -		ce = list_first_entry(&guc->submission_state.guc_id_list,
>>> +		cn = list_first_entry(&guc->submission_state.guc_id_list,
>>>    				      struct intel_context,
>>>    				      guc_id.link);
>>> -		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
>>> -		GEM_BUG_ON(context_guc_id_invalid(ce));
>>> +		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
>>> +		GEM_BUG_ON(context_guc_id_invalid(cn));
>>> +		GEM_BUG_ON(intel_context_is_child(cn));
>>> +		GEM_BUG_ON(intel_context_is_parent(cn));
>>> -		list_del_init(&ce->guc_id.link);
>>> -		guc_id = ce->guc_id.id;
>>> +		list_del_init(&cn->guc_id.link);
>>> +		ce->guc_id.id = cn->guc_id.id;
>>> -		spin_lock(&ce->guc_state.lock);
>>> -		clr_context_registered(ce);
>>> -		spin_unlock(&ce->guc_state.lock);
>>> +		spin_lock(&cn->guc_state.lock);
>>> +		clr_context_registered(cn);
>>> +		spin_unlock(&cn->guc_state.lock);
>>> -		set_context_guc_id_invalid(ce);
>>> -		return guc_id;
>>> +		set_context_guc_id_invalid(cn);
>>> +
>>> +		return 0;
>>>    	} else {
>>>    		return -EAGAIN;
>>>    	}
>>>    }
>>> -static int assign_guc_id(struct intel_guc *guc, u16 *out)
>>> +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    {
>>>    	int ret;
>>>    	lockdep_assert_held(&guc->submission_state.lock);
>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>> -	ret = new_guc_id(guc);
>>> +	ret = new_guc_id(guc, ce);
>>>    	if (unlikely(ret < 0)) {
>>> -		ret = steal_guc_id(guc);
>>> +		if (intel_context_is_parent(ce))
>>> +			return -ENOSPC;
>>> +
>>> +		ret = steal_guc_id(guc, ce);
>>>    		if (ret < 0)
>>>    			return ret;
>>>    	}
>>> -	*out = ret;
>>> +	if (intel_context_is_parent(ce)) {
>>> +		struct intel_context *child;
>>> +		int i = 1;
>>> +
>>> +		for_each_child(ce, child)
>>> +			child->guc_id.id = ce->guc_id.id + i++;
>>> +	}
>>> +
>>>    	return 0;
>>>    }
>>> @@ -1356,7 +1412,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	might_lock(&ce->guc_state.lock);
>>>    	if (context_guc_id_invalid(ce)) {
>>> -		ret = assign_guc_id(guc, &ce->guc_id.id);
>>> +		ret = assign_guc_id(guc, ce);
>>>    		if (ret)
>>>    			goto out_unlock;
>>>    		ret = 1;	/* Indicates newly assigned guc_id */
>>> @@ -1398,8 +1454,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	unsigned long flags;
>>>    	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>> -	if (unlikely(context_guc_id_invalid(ce)))
>>> +	if (unlikely(context_guc_id_invalid(ce) ||
>>> +		     intel_context_is_parent(ce)))
>>>    		return;
>>>    	spin_lock_irqsave(&guc->submission_state.lock, flags);


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 08/26] drm/i915/guc: Add multi-lrc context registration
  2021-10-07 19:50     ` [Intel-gfx] " John Harrison
  (?)
  (?)
@ 2021-10-08 17:20     ` John Harrison
  2021-10-08 17:29       ` Matthew Brost
  -1 siblings, 1 reply; 165+ messages in thread
From: John Harrison @ 2021-10-08 17:20 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/7/2021 12:50, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
>> Add multi-lrc context registration H2G. In addition, a workqueue and
>> process descriptor are set up during multi-lrc context registration as
>> these data structures are needed for multi-lrc submission.
>>
>> v2:
>>   (John Harrison)
>>    - Move GuC specific fields into sub-struct
>>    - Clean up WQ defines
>>    - Add comment explaining math to derive WQ / PD address
>>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
>>   drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
>>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 -
>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 +++++++++++++++++-
>>   5 files changed, 131 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h 
>> b/drivers/gpu/drm/i915/gt/intel_context_types.h
>> index 76dfca57cb45..48decb5ee954 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
>> @@ -239,6 +239,18 @@ struct intel_context {
>>           struct intel_context *parent;
>>           /** @number_children: number of children if parent */
>>           u8 number_children;
>> +        /** @guc: GuC specific members for parallel submission */
>> +        struct {
>> +            /** @wqi_head: head pointer in work queue */
>> +            u16 wqi_head;
>> +            /** @wqi_tail: tail pointer in work queue */
>> +            u16 wqi_tail;
PS: As per comments on the previous rev, something somewhere needs to
explicitly state what WQI means. One suggestion was to do that here,
ideally with a brief description of what the queue is, how it is used,
etc. Although probably it would be better kept in a GuC specific file,
e.g. added to guc_fwif.h in patch #12.
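
Something along these lines perhaps (only a sketch of the wording):

	/** @guc: GuC specific members for parallel submission */
	struct {
		/**
		 * @wqi_head: cached copy of the head pointer of the work
		 * queue (WQ), the ring buffer through which multi-lrc
		 * submissions are handed to the GuC as work queue items
		 * (WQIs); refreshed from the process descriptor when
		 * checking for space
		 */
		u16 wqi_head;
		/**
		 * @wqi_tail: local copy of the WQ tail pointer, written
		 * back to the process descriptor when new WQIs are
		 * published
		 */
		u16 wqi_tail;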

John.

>> +            /**
>> +             * @parent_page: page in context state (ce->state) used
>> +             * by parent for work queue, process descriptor
>> +             */
>> +            u8 parent_page;
>> +        } guc;
>>       } parallel;
>>     #ifdef CONFIG_DRM_I915_SELFTEST
>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
>> b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> index 3ef9eaf8c50e..57339d5c1fc8 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> @@ -942,6 +942,11 @@ __lrc_alloc_state(struct intel_context *ce, 
>> struct intel_engine_cs *engine)
>>           context_size += PAGE_SIZE;
>>       }
>>   +    if (intel_context_is_parent(ce) && 
>> intel_engine_uses_guc(engine)) {
>> +        ce->parallel.guc.parent_page = context_size / PAGE_SIZE;
>> +        context_size += PAGE_SIZE;
>> +    }
>> +
>>       obj = i915_gem_object_create_lmem(engine->i915, context_size,
>>                         I915_BO_ALLOC_PM_VOLATILE);
>>       if (IS_ERR(obj))
>> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h 
>> b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> index 8ff582222aff..ba10bd374cee 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> @@ -142,6 +142,7 @@ enum intel_guc_action {
>>       INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
>>       INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>>       INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>> +    INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
>>       INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>>       INTEL_GUC_ACTION_LIMIT
>>   };
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> index fa4be13c8854..0eeb2a9feeed 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> @@ -52,8 +52,6 @@
>>     #define GUC_DOORBELL_INVALID        256
>>   -#define GUC_WQ_SIZE            (PAGE_SIZE * 2)
>> -
>>   /* Work queue item header definitions */
>>   #define WQ_STATUS_ACTIVE        1
>>   #define WQ_STATUS_SUSPENDED        2
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 451d9ae861a6..ab6d7fc1b0b1 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -344,6 +344,45 @@ static inline struct i915_priolist 
>> *to_priolist(struct rb_node *rb)
>>       return rb_entry(rb, struct i915_priolist, node);
>>   }
>>   +/*
>> + * When using multi-lrc submission an extra page in the context 
>> state is
>> + * reserved for the process descriptor and work queue.
>> + *
>> + * The layout of this page is below:
>> + * 0                        guc_process_desc
>> + * ...                        unused
>> + * PAGE_SIZE / 2                work queue start
>> + * ...                        work queue
>> + * PAGE_SIZE - 1                work queue end
>> + */
>> +#define WQ_SIZE            (PAGE_SIZE / 2)
>> +#define WQ_OFFSET        (PAGE_SIZE - WQ_SIZE)
> I thought you were going with '#define PARENT_SCRATCH_SIZE PAGE_SIZE' 
> and then using that everywhere else? Unless there is a fundamental 
> reason why the above must be exactly a page in size then I think the 
> size should be defined once and re-used rather than assumed in 
> multiple places (including in the description comment).
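
I.e. something like this (sketch only):

	#define PARENT_SCRATCH_SIZE	PAGE_SIZE
	#define WQ_SIZE			(PARENT_SCRATCH_SIZE / 2)
	#define WQ_OFFSET		(PARENT_SCRATCH_SIZE - WQ_SIZE)

with the layout comment above then written in terms of
PARENT_SCRATCH_SIZE rather than hardcoding PAGE_SIZE.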
>
>> +static u32 __get_process_desc_offset(struct intel_context *ce)
>> +{
>> +    GEM_BUG_ON(!ce->parallel.guc.parent_page);
>> +
>> +    return ce->parallel.guc.parent_page * PAGE_SIZE;
>> +}
>> +
>> +static u32 __get_wq_offset(struct intel_context *ce)
>> +{
>> +    return __get_process_desc_offset(ce) + WQ_OFFSET;
>> +}
>> +
>> +static struct guc_process_desc *
>> +__get_process_desc(struct intel_context *ce)
>> +{
>> +    /*
>> +     * Need to subtract LRC_STATE_OFFSET here as the
>> +     * parallel.guc.parent_page is the offset into ce->state while
>> +     * ce->lrc_reg_state is ce->state + LRC_STATE_OFFSET.
>> +     */
>> +    return (struct guc_process_desc *)
>> +        (ce->lrc_reg_state +
>> +         ((__get_process_desc_offset(ce) -
>> +           LRC_STATE_OFFSET) / sizeof(u32)));
>> +}
>> +
>>   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, 
>> u32 index)
>>   {
>>       struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
>> @@ -1365,6 +1404,30 @@ static void unpin_guc_id(struct intel_guc 
>> *guc, struct intel_context *ce)
>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>   }
>>   +static int __guc_action_register_multi_lrc(struct intel_guc *guc,
>> +                       struct intel_context *ce,
>> +                       u32 guc_id,
>> +                       u32 offset,
>> +                       bool loop)
>> +{
>> +    struct intel_context *child;
>> +    u32 action[4 + MAX_ENGINE_INSTANCE];
>> +    int len = 0;
>> +
>> +    GEM_BUG_ON(ce->parallel.number_children > MAX_ENGINE_INSTANCE);
>> +
>> +    action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
>> +    action[len++] = guc_id;
>> +    action[len++] = ce->parallel.number_children + 1;
>> +    action[len++] = offset;
>> +    for_each_child(ce, child) {
>> +        offset += sizeof(struct guc_lrc_desc);
>> +        action[len++] = offset;
>> +    }
>> +
>> +    return guc_submission_send_busy_loop(guc, action, len, 0, loop);
>> +}
>> +
>>   static int __guc_action_register_context(struct intel_guc *guc,
>>                        u32 guc_id,
>>                        u32 offset,
>> @@ -1387,9 +1450,15 @@ static int register_context(struct 
>> intel_context *ce, bool loop)
>>           ce->guc_id.id * sizeof(struct guc_lrc_desc);
>>       int ret;
>>   +    GEM_BUG_ON(intel_context_is_child(ce));
>>       trace_intel_context_register(ce);
>>   -    ret = __guc_action_register_context(guc, ce->guc_id.id, 
>> offset, loop);
>> +    if (intel_context_is_parent(ce))
>> +        ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
>> +                              offset, loop);
>> +    else
>> +        ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
>> +                            loop);
>>       if (likely(!ret)) {
>>           unsigned long flags;
>>   @@ -1418,6 +1487,7 @@ static int deregister_context(struct 
>> intel_context *ce, u32 guc_id)
>>   {
>>       struct intel_guc *guc = ce_to_guc(ce);
>>   +    GEM_BUG_ON(intel_context_is_child(ce));
>>       trace_intel_context_deregister(ce);
>>         return __guc_action_deregister_context(guc, guc_id);
>> @@ -1445,6 +1515,7 @@ static int guc_lrc_desc_pin(struct 
>> intel_context *ce, bool loop)
>>       struct guc_lrc_desc *desc;
>>       bool context_registered;
>>       intel_wakeref_t wakeref;
>> +    struct intel_context *child;
>>       int ret = 0;
>>         GEM_BUG_ON(!engine->mask);
>> @@ -1470,6 +1541,41 @@ static int guc_lrc_desc_pin(struct 
>> intel_context *ce, bool loop)
>>       desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>>       guc_context_policy_init(engine, desc);
>>   +    /*
>> +     * Context is a parent, we need to register a process descriptor
>> +     * describing a work queue and register all child contexts.
>> +     */
> This was now meant to say 'If the context is a parent...'?
>
> John.
>
>> +    if (intel_context_is_parent(ce)) {
>> +        struct guc_process_desc *pdesc;
>> +
>> +        ce->parallel.guc.wqi_tail = 0;
>> +        ce->parallel.guc.wqi_head = 0;
>> +
>> +        desc->process_desc = i915_ggtt_offset(ce->state) +
>> +            __get_process_desc_offset(ce);
>> +        desc->wq_addr = i915_ggtt_offset(ce->state) +
>> +            __get_wq_offset(ce);
>> +        desc->wq_size = WQ_SIZE;
>> +
>> +        pdesc = __get_process_desc(ce);
>> +        memset(pdesc, 0, sizeof(*(pdesc)));
>> +        pdesc->stage_id = ce->guc_id.id;
>> +        pdesc->wq_base_addr = desc->wq_addr;
>> +        pdesc->wq_size_bytes = desc->wq_size;
>> +        pdesc->wq_status = WQ_STATUS_ACTIVE;
>> +
>> +        for_each_child(ce, child) {
>> +            desc = __get_lrc_desc(guc, child->guc_id.id);
>> +
>> +            desc->engine_class =
>> +                engine_class_to_guc_class(engine->class);
>> +            desc->hw_context_desc = child->lrc.lrca;
>> +            desc->priority = ce->guc_state.prio;
>> +            desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>> +            guc_context_policy_init(engine, desc);
>> +        }
>> +    }
>> +
>>       /*
>>        * The context_lookup xarray is used to determine if the hardware
>>        * context is currently registered. There are two cases in 
>> which it
>> @@ -2804,6 +2910,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 
>> desc_idx)
>>           return NULL;
>>       }
>>   +    if (unlikely(intel_context_is_child(ce))) {
>> +        drm_err(&guc_to_gt(guc)->i915->drm,
>> +            "Context is child, desc_idx %u", desc_idx);
>> +        return NULL;
>> +    }
>> +
>>       return ce;
>>   }
>


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 12/26] drm/i915/guc: Implement multi-lrc submission
  2021-10-04 22:06   ` Matthew Brost
@ 2021-10-08 17:20     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-08 17:20 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Implement multi-lrc submission via a single workqueue entry and single
> H2G. The workqueue entry contains an updated tail value for each
> request - one per context in the multi-lrc submission - and these
> values are updated simultaneously. As such, the tasklet and bypass
> path have been updated to coalesce requests into a single submission.
>
> v2:
>   (John Harrison)
>    - s/wqe/wqi
>    - Use FIELD_PREP macros
>    - Add GEM_BUG_ONs ensures length fits within field
>    - Add comment / white space to intel_guc_write_barrier
>   (Kernel test robot)
>    - Make need_tasklet a static function
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  26 ++
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  23 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 319 ++++++++++++++++--
>   drivers/gpu/drm/i915/i915_request.h           |   8 +
>   6 files changed, 335 insertions(+), 73 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> index 8f8182bf7c11..7191e8439290 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> @@ -756,3 +756,29 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
>   		}
>   	}
>   }
> +
> +void intel_guc_write_barrier(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +
> +	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> +		/*
> +		 * Ensure intel_uncore_write_fw can be used rather than
> +		 * intel_uncore_write.
> +		 */
> +		GEM_BUG_ON(guc->send_regs.fw_domains);
> +
> +		/*
> +		 * This register is used by the i915 and GuC for MMIO based
> +		 * communication. Once we are in this code CTBs are the only
> +		 * method the i915 uses to communicate with the GuC so it is
> +		 * safe to write to this register (a value of 0 is NOP for MMIO
> +		 * communication). If we ever start mixing CTBs and MMIOs a new
> +		 * register will have to be chosen.
> +		 */
Hmm, missed it before but this comment is very CTB centric and the
barrier function is now being used for parallel submission work queues.
Seems like an extra comment should be added to cover that case - just
something simple, e.g. that WQ usage is also guaranteed to happen after
the switch to CTBs.
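
E.g. a couple of extra lines on the end of that comment, roughly
(sketch):

		 * register will have to be chosen. The same barrier is also
		 * used before publishing work queue items for multi-lrc
		 * submission; WQs are only ever used after the switch to
		 * CTBs, so the reasoning above holds there too.
		 */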

> +		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> +	} else {
> +		/* wmb() sufficient for a barrier if in smem */
> +		wmb();
> +	}
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index a9f4ec972bfb..147f39cc0f2f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -46,6 +46,12 @@ struct intel_guc {
>   	 * submitted until the stalled request is processed.
>   	 */
>   	struct i915_request *stalled_request;
> +	enum {
> +		STALL_NONE,
> +		STALL_REGISTER_CONTEXT,
> +		STALL_MOVE_LRC_TAIL,
> +		STALL_ADD_REQUEST,
> +	} submission_stall_reason;
>   
>   	/* intel_guc_recv interrupt related state */
>   	/** @irq_lock: protects GuC irq state */
> @@ -361,4 +367,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
>   
>   void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
>   
> +void intel_guc_write_barrier(struct intel_guc *guc);
> +
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index 20c710a74498..10d1878d2826 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
>   	return ++ct->requests.last_fence;
>   }
>   
> -static void write_barrier(struct intel_guc_ct *ct)
> -{
> -	struct intel_guc *guc = ct_to_guc(ct);
> -	struct intel_gt *gt = guc_to_gt(guc);
> -
> -	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> -		GEM_BUG_ON(guc->send_regs.fw_domains);
> -		/*
> -		 * This register is used by the i915 and GuC for MMIO based
> -		 * communication. Once we are in this code CTBs are the only
> -		 * method the i915 uses to communicate with the GuC so it is
> -		 * safe to write to this register (a value of 0 is NOP for MMIO
> -		 * communication). If we ever start mixing CTBs and MMIOs a new
> -		 * register will have to be chosen.
> -		 */
> -		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> -	} else {
> -		/* wmb() sufficient for a barrier if in smem */
> -		wmb();
> -	}
> -}
> -
>   static int ct_write(struct intel_guc_ct *ct,
>   		    const u32 *action,
>   		    u32 len /* in dwords */,
> @@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
>   	 * make sure H2G buffer update and LRC tail update (if this triggering a
>   	 * submission) are visible before updating the descriptor tail
>   	 */
> -	write_barrier(ct);
> +	intel_guc_write_barrier(ct_to_guc(ct));
>   
>   	/* update local copies */
>   	ctb->tail = tail;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 0eeb2a9feeed..a00eeddc1449 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -58,19 +58,16 @@
>   #define WQ_STATUS_CMD_ERROR		3
>   #define WQ_STATUS_ENGINE_ID_NOT_USED	4
>   #define WQ_STATUS_SUSPENDED_FROM_RESET	5
> -#define WQ_TYPE_SHIFT			0
> -#define   WQ_TYPE_BATCH_BUF		(0x1 << WQ_TYPE_SHIFT)
> -#define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
> -#define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
> -#define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
> -#define WQ_TARGET_SHIFT			10
> -#define WQ_LEN_SHIFT			16
> -#define WQ_NO_WCFLUSH_WAIT		(1 << 27)
> -#define WQ_PRESENT_WORKLOAD		(1 << 28)
> -
> -#define WQ_RING_TAIL_SHIFT		20
> -#define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
> -#define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
> +#define WQ_TYPE_BATCH_BUF		0x1
> +#define WQ_TYPE_PSEUDO			0x2
> +#define WQ_TYPE_INORDER			0x3
> +#define WQ_TYPE_NOOP			0x4
> +#define WQ_TYPE_MULTI_LRC		0x5
> +#define WQ_TYPE_MASK			GENMASK(7, 0)
> +#define WQ_LEN_MASK			GENMASK(26, 16)
> +
> +#define WQ_GUC_ID_MASK			GENMASK(15, 0)
> +#define WQ_RING_TAIL_MASK		GENMASK(28, 18)
Another option for documenting WQ and WQI would be at the top of this 
block of definitions. I believe there is a one line comment of 'work 
queue item header definitions' but none of these defines actually use 
the WQI abbreviation. And some description of what the work queue is, 
how it is used, etc. would be good.
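
E.g. a block comment above the defines, roughly (sketch, exact wording
up to you):

	/*
	 * Work Queue (WQ) / Work Queue Item (WQI) definitions. The WQ is a
	 * circular buffer, shared between the i915 and the GuC, through
	 * which the i915 hands multi-lrc submissions (new ring tail values
	 * for each context) to the GuC; each entry in the buffer is a WQI.
	 */
	#define WQ_STATUS_ACTIVE		1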

>   
>   #define GUC_STAGE_DESC_ATTR_ACTIVE	BIT(0)
>   #define GUC_STAGE_DESC_ATTR_PENDING_DB	BIT(1)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 031b1bf5ba91..1610120e31a1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -399,6 +399,29 @@ __get_process_desc(struct intel_context *ce)
>   		   LRC_STATE_OFFSET) / sizeof(u32)));
>   }
>   
> +static u32 *get_wq_pointer(struct guc_process_desc *desc,
> +			   struct intel_context *ce,
> +			   u32 wqi_size)
> +{
> +	/*
> +	 * Check for space in the work queue. We cache the head pointer in
> +	 * the intel_context structure in order to reduce the number of
> +	 * accesses to shared GPU memory, which may be across a PCIe bus.
> +	 */
> +#define AVAILABLE_SPACE	\
> +	CIRC_SPACE(ce->parallel.guc.wqi_tail, ce->parallel.guc.wqi_head, WQ_SIZE)
> +	if (wqi_size > AVAILABLE_SPACE) {
> +		ce->parallel.guc.wqi_head = READ_ONCE(desc->head);
> +
> +		if (wqi_size > AVAILABLE_SPACE)
> +			return NULL;
> +	}
> +#undef AVAILABLE_SPACE
> +
> +	return ((u32 *)__get_process_desc(ce)) +
> +		((WQ_OFFSET + ce->parallel.guc.wqi_tail) / sizeof(u32));
> +}
> +
>   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>   {
>   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> @@ -558,10 +581,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
>   
>   static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
>   
> -static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> +static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   {
>   	int err = 0;
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	u32 action[3];
>   	int len = 0;
>   	u32 g2h_len_dw = 0;
> @@ -582,26 +605,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
>   	GEM_BUG_ON(context_guc_id_invalid(ce));
>   
> -	/*
> -	 * Corner case where the GuC firmware was blown away and reloaded while
> -	 * this context was pinned.
> -	 */
> -	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
> -		err = guc_lrc_desc_pin(ce, false);
> -		if (unlikely(err))
> -			return err;
> -	}
> -
>   	spin_lock(&ce->guc_state.lock);
>   
>   	/*
>   	 * The request / context will be run on the hardware when scheduling
> -	 * gets enabled in the unblock.
> +	 * gets enabled in the unblock. For multi-lrc we still submit the
> +	 * context to move the LRC tails.
>   	 */
> -	if (unlikely(context_blocked(ce)))
> +	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
>   		goto out;
>   
> -	enabled = context_enabled(ce);
> +	enabled = context_enabled(ce) || context_blocked(ce);
>   
>   	if (!enabled) {
>   		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
> @@ -620,6 +634,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   		trace_intel_context_sched_enable(ce);
>   		atomic_inc(&guc->outstanding_submission_g2h);
>   		set_context_enabled(ce);
> +
> +		/*
> +		 * Without multi-lrc KMD does the submission step (moving the
> +		 * lrc tail) so enabling scheduling is sufficient to submit the
> +		 * context. This isn't the case in multi-lrc submission as the
> +		 * GuC needs to move the tails, hence the need for another H2G
> +		 * to submit a multi-lrc context after enabling scheduling.
> +		 */
> +		if (intel_context_is_parent(ce)) {
> +			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
> +			err = intel_guc_send_nb(guc, action, len - 1, 0);
> +		}
>   	} else if (!enabled) {
>   		clr_context_pending_enable(ce);
>   		intel_context_put(ce);
> @@ -632,6 +658,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   	return err;
>   }
>   
> +static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> +{
> +	int ret = __guc_add_request(guc, rq);
> +
> +	if (unlikely(ret == -EBUSY)) {
> +		guc->stalled_request = rq;
> +		guc->submission_stall_reason = STALL_ADD_REQUEST;
> +	}
> +
> +	return ret;
> +}
> +
>   static inline void guc_set_lrc_tail(struct i915_request *rq)
>   {
>   	rq->context->lrc_reg_state[CTX_RING_TAIL] =
> @@ -643,6 +681,134 @@ static inline int rq_prio(const struct i915_request *rq)
>   	return rq->sched.attr.priority;
>   }
>   
> +static bool is_multi_lrc_rq(struct i915_request *rq)
> +{
> +	return intel_context_is_child(rq->context) ||
> +		intel_context_is_parent(rq->context);
> +}
> +
> +static bool can_merge_rq(struct i915_request *rq,
> +			 struct i915_request *last)
> +{
> +	return request_to_scheduling_context(rq) ==
> +		request_to_scheduling_context(last);
> +}
> +
> +static u32 wq_space_until_wrap(struct intel_context *ce)
> +{
> +	return (WQ_SIZE - ce->parallel.guc.wqi_tail);
> +}
> +
> +static void write_wqi(struct guc_process_desc *desc,
> +		      struct intel_context *ce,
> +		      u32 wqi_size)
> +{
> +	/*
> +	 * Ensure WQI are visible before updating tail
> +	 */
> +	intel_guc_write_barrier(ce_to_guc(ce));
> +
> +	ce->parallel.guc.wqi_tail = (ce->parallel.guc.wqi_tail + wqi_size) &
> +		(WQ_SIZE - 1);
This relies on WQ_SIZE being a power of two, right? Is it possible to 
add a BUILD_BUG_ON to ensure that?
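
Something like this at the top of write_wqi() would do it (sketch -
BUILD_BUG_ON_NOT_POWER_OF_2() lives in <linux/build_bug.h>):

	static void write_wqi(struct guc_process_desc *desc,
			      struct intel_context *ce,
			      u32 wqi_size)
	{
		BUILD_BUG_ON_NOT_POWER_OF_2(WQ_SIZE);
		...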

> +	WRITE_ONCE(desc->tail, ce->parallel.guc.wqi_tail);
> +}
> +
> +static int guc_wq_noop_append(struct intel_context *ce)
> +{
> +	struct guc_process_desc *desc = __get_process_desc(ce);
> +	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
> +	u32 len_dw = wq_space_until_wrap(ce) / sizeof(u32) - 1;
> +
> +	if (!wqi)
> +		return -EBUSY;
> +
> +	GEM_BUG_ON(!FIELD_FIT(WQ_LEN_MASK, len_dw));
> +
> +	*wqi = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
> +		FIELD_PREP(WQ_LEN_MASK, len_dw);
> +	ce->parallel.guc.wqi_tail = 0;
> +
> +	return 0;
> +}
> +
> +static int __guc_wq_item_append(struct i915_request *rq)
> +{
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +	struct intel_context *child;
> +	struct guc_process_desc *desc = __get_process_desc(ce);
> +	unsigned int wqi_size = (ce->parallel.number_children + 4) *
> +		sizeof(u32);
> +	u32 *wqi;
> +	u32 len_dw = (wqi_size / sizeof(u32)) - 1;
> +	int ret;
> +
> +	/* Ensure context is in correct state updating work queue */
> +	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
> +	GEM_BUG_ON(context_guc_id_invalid(ce));
> +	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
> +	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
> +
> +	/* Insert NOOP if this work queue item will wrap the tail pointer. */
> +	if (wqi_size > wq_space_until_wrap(ce)) {
> +		ret = guc_wq_noop_append(ce);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	wqi = get_wq_pointer(desc, ce, wqi_size);
> +	if (!wqi)
> +		return -EBUSY;
> +
> +	GEM_BUG_ON(!FIELD_FIT(WQ_LEN_MASK, len_dw));
> +
> +	*wqi++ = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_MULTI_LRC) |
> +		FIELD_PREP(WQ_LEN_MASK, len_dw);
> +	*wqi++ = ce->lrc.lrca;
> +	*wqi++ = FIELD_PREP(WQ_GUC_ID_MASK, ce->guc_id.id) |
> +	       FIELD_PREP(WQ_RING_TAIL_MASK, ce->ring->tail / sizeof(u64));
> +	*wqi++ = 0;	/* fence_id */
> +	for_each_child(ce, child)
> +		*wqi++ = child->ring->tail / sizeof(u64);
> +
> +	write_wqi(desc, ce, wqi_size);
> +
> +	return 0;
> +}
> +
> +static int guc_wq_item_append(struct intel_guc *guc,
> +			      struct i915_request *rq)
> +{
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +	int ret = 0;
> +
> +	if (likely(!intel_context_is_banned(ce))) {
> +		ret = __guc_wq_item_append(rq);
> +
> +		if (unlikely(ret == -EBUSY)) {
> +			guc->stalled_request = rq;
> +			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +static bool multi_lrc_submit(struct i915_request *rq)
> +{
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +
> +	intel_ring_set_tail(rq->ring, rq->tail);
> +
> +	/*
> +	 * We expect the front end (execbuf IOCTL) to set this flag on the last
> +	 * request generated from a multi-BB submission. This indicates to the
> +	 * backend (GuC interface) that we should submit this context thus
> +	 * submitting all the requests generated in parallel.
> +	 */
> +	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
FYI: Apparently the test_bit/set_bit/etc helpers are intended for use on
arbitrary sized bitfields. As in, they do all sorts of complicated
atomic operations to work on arbitrarily large, multi-word bitmaps and
such like. For single word flags, the guidance is to just use 'if(word &
BIT(bit))' instead.
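
I.e. here, something like (sketch):

	return (rq->fence.flags & BIT(I915_FENCE_FLAG_SUBMIT_PARALLEL)) ||
		intel_context_is_banned(ce);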

John.

> +		intel_context_is_banned(ce);
> +}
> +
>   static int guc_dequeue_one_context(struct intel_guc *guc)
>   {
>   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> @@ -656,7 +822,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>   	if (guc->stalled_request) {
>   		submit = true;
>   		last = guc->stalled_request;
> -		goto resubmit;
> +
> +		switch (guc->submission_stall_reason) {
> +		case STALL_REGISTER_CONTEXT:
> +			goto register_context;
> +		case STALL_MOVE_LRC_TAIL:
> +			goto move_lrc_tail;
> +		case STALL_ADD_REQUEST:
> +			goto add_request;
> +		default:
> +			MISSING_CASE(guc->submission_stall_reason);
> +		}
>   	}
>   
>   	while ((rb = rb_first_cached(&sched_engine->queue))) {
> @@ -664,8 +840,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>   		struct i915_request *rq, *rn;
>   
>   		priolist_for_each_request_consume(rq, rn, p) {
> -			if (last && rq->context != last->context)
> -				goto done;
> +			if (last && !can_merge_rq(rq, last))
> +				goto register_context;
>   
>   			list_del_init(&rq->sched.link);
>   
> @@ -673,33 +849,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>   
>   			trace_i915_request_in(rq, 0);
>   			last = rq;
> -			submit = true;
> +
> +			if (is_multi_lrc_rq(rq)) {
> +				/*
> +				 * We need to coalesce all multi-lrc requests in
> +				 * a relationship into a single H2G. We are
> +				 * guaranteed that all of these requests will be
> +				 * submitted sequentially.
> +				 */
> +				if (multi_lrc_submit(rq)) {
> +					submit = true;
> +					goto register_context;
> +				}
> +			} else {
> +				submit = true;
> +			}
>   		}
>   
>   		rb_erase_cached(&p->node, &sched_engine->queue);
>   		i915_priolist_free(p);
>   	}
> -done:
> +
> +register_context:
>   	if (submit) {
> -		guc_set_lrc_tail(last);
> -resubmit:
> +		struct intel_context *ce = request_to_scheduling_context(last);
> +
> +		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
> +			     !intel_context_is_banned(ce))) {
> +			ret = guc_lrc_desc_pin(ce, false);
> +			if (unlikely(ret == -EPIPE)) {
> +				goto deadlk;
> +			} else if (ret == -EBUSY) {
> +				guc->stalled_request = last;
> +				guc->submission_stall_reason =
> +					STALL_REGISTER_CONTEXT;
> +				goto schedule_tasklet;
> +			} else if (ret != 0) {
> +				GEM_WARN_ON(ret);	/* Unexpected */
> +				goto deadlk;
> +			}
> +		}
> +
> +move_lrc_tail:
> +		if (is_multi_lrc_rq(last)) {
> +			ret = guc_wq_item_append(guc, last);
> +			if (ret == -EBUSY) {
> +				goto schedule_tasklet;
> +			} else if (ret != 0) {
> +				GEM_WARN_ON(ret);	/* Unexpected */
> +				goto deadlk;
> +			}
> +		} else {
> +			guc_set_lrc_tail(last);
> +		}
> +
> +add_request:
>   		ret = guc_add_request(guc, last);
> -		if (unlikely(ret == -EPIPE))
> +		if (unlikely(ret == -EPIPE)) {
> +			goto deadlk;
> +		} else if (ret == -EBUSY) {
> +			goto schedule_tasklet;
> +		} else if (ret != 0) {
> +			GEM_WARN_ON(ret);	/* Unexpected */
>   			goto deadlk;
> -		else if (ret == -EBUSY) {
> -			tasklet_schedule(&sched_engine->tasklet);
> -			guc->stalled_request = last;
> -			return false;
>   		}
>   	}
>   
>   	guc->stalled_request = NULL;
> +	guc->submission_stall_reason = STALL_NONE;
>   	return submit;
>   
>   deadlk:
>   	sched_engine->tasklet.callback = NULL;
>   	tasklet_disable_nosync(&sched_engine->tasklet);
>   	return false;
> +
> +schedule_tasklet:
> +	tasklet_schedule(&sched_engine->tasklet);
> +	return false;
>   }
>   
>   static void guc_submission_tasklet(struct tasklet_struct *t)
> @@ -1255,10 +1482,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
>   
>   	trace_i915_request_in(rq, 0);
>   
> -	guc_set_lrc_tail(rq);
> -	ret = guc_add_request(guc, rq);
> -	if (ret == -EBUSY)
> -		guc->stalled_request = rq;
> +	if (is_multi_lrc_rq(rq)) {
> +		if (multi_lrc_submit(rq)) {
> +			ret = guc_wq_item_append(guc, rq);
> +			if (!ret)
> +				ret = guc_add_request(guc, rq);
> +		}
> +	} else {
> +		guc_set_lrc_tail(rq);
> +		ret = guc_add_request(guc, rq);
> +	}
>   
>   	if (unlikely(ret == -EPIPE))
>   		disable_submission(guc);
> @@ -1266,6 +1499,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
>   	return ret;
>   }
>   
> +static bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
> +{
> +	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +
> +	return submission_disabled(guc) || guc->stalled_request ||
> +		!i915_sched_engine_is_empty(sched_engine) ||
> +		!lrc_desc_registered(guc, ce->guc_id.id);
> +}
> +
>   static void guc_submit_request(struct i915_request *rq)
>   {
>   	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> @@ -1275,8 +1518,7 @@ static void guc_submit_request(struct i915_request *rq)
>   	/* Will be called from irq-context when using foreign fences. */
>   	spin_lock_irqsave(&sched_engine->lock, flags);
>   
> -	if (submission_disabled(guc) || guc->stalled_request ||
> -	    !i915_sched_engine_is_empty(sched_engine))
> +	if (need_tasklet(guc, rq))
>   		queue_request(sched_engine, rq, rq_prio(rq));
>   	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
>   		tasklet_hi_schedule(&sched_engine->tasklet);
> @@ -2258,9 +2500,10 @@ static inline bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
>   
>   static void add_to_context(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
>   
>   	spin_lock(&ce->guc_state.lock);
> @@ -2293,7 +2536,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
>   
>   static void remove_from_context(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
>   	spin_lock_irq(&ce->guc_state.lock);
>   
> @@ -2712,7 +2957,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
>   static void guc_bump_inflight_request_prio(struct i915_request *rq,
>   					   int prio)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
>   
>   	/* Short circuit function */
> @@ -2735,7 +2980,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
>   
>   static void guc_retire_inflight_request_prio(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   
>   	spin_lock(&ce->guc_state.lock);
>   	guc_prio_fini(rq, ce);
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 7bd9ed20623e..8950785e55d6 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -139,6 +139,14 @@ enum {
>   	 * the GPU. Here we track such boost requests on a per-request basis.
>   	 */
>   	I915_FENCE_FLAG_BOOST,
> +
> +	/*
> +	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
> +	 * parent-child relationship (parallel submission, multi-lrc) should
> +	 * trigger a submission to the GuC rather than just moving the context
> +	 * tail.
> +	 */
> +	I915_FENCE_FLAG_SUBMIT_PARALLEL,
>   };
>   
>   /**


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 12/26] drm/i915/guc: Implement multi-lrc submission
@ 2021-10-08 17:20     ` John Harrison
  0 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-08 17:20 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Implement multi-lrc submission via a single workqueue entry and single
> H2G. The workqueue entry contains an updated tail value for each
> request, of all the contexts in the multi-lrc submission, and updates
> these values simultaneously. As such, the tasklet and bypass path have
> been updated to coalesce requests into a single submission.
>
> v2:
>   (John Harrison)
>    - s/wqe/wqi
>    - Use FIELD_PREP macros
>    - Add GEM_BUG_ONs ensures length fits within field
>    - Add comment / white space to intel_guc_write_barrier
>   (Kernel test robot)
>    - Make need_tasklet a static function
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  26 ++
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  23 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 319 ++++++++++++++++--
>   drivers/gpu/drm/i915/i915_request.h           |   8 +
>   6 files changed, 335 insertions(+), 73 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> index 8f8182bf7c11..7191e8439290 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> @@ -756,3 +756,29 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
>   		}
>   	}
>   }
> +
> +void intel_guc_write_barrier(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +
> +	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> +		/*
> +		 * Ensure intel_uncore_write_fw can be used rather than
> +		 * intel_uncore_write.
> +		 */
> +		GEM_BUG_ON(guc->send_regs.fw_domains);
> +
> +		/*
> +		 * This register is used by the i915 and GuC for MMIO based
> +		 * communication. Once we are in this code CTBs are the only
> +		 * method the i915 uses to communicate with the GuC so it is
> +		 * safe to write to this register (a value of 0 is NOP for MMIO
> +		 * communication). If we ever start mixing CTBs and MMIOs a new
> +		 * register will have to be chosen.
> +		 */
Hmm, missed it before but this comment is very CTB centric and the 
barrier function is now being used for parallel submission work queues. 
Seems like an extra comment should be added to cover that case. Just 
something simple about WQ usage is also guaranteed to be post CTB switch 
over.

> +		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> +	} else {
> +		/* wmb() sufficient for a barrier if in smem */
> +		wmb();
> +	}
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index a9f4ec972bfb..147f39cc0f2f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -46,6 +46,12 @@ struct intel_guc {
>   	 * submitted until the stalled request is processed.
>   	 */
>   	struct i915_request *stalled_request;
> +	enum {
> +		STALL_NONE,
> +		STALL_REGISTER_CONTEXT,
> +		STALL_MOVE_LRC_TAIL,
> +		STALL_ADD_REQUEST,
> +	} submission_stall_reason;
>   
>   	/* intel_guc_recv interrupt related state */
>   	/** @irq_lock: protects GuC irq state */
> @@ -361,4 +367,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
>   
>   void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
>   
> +void intel_guc_write_barrier(struct intel_guc *guc);
> +
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index 20c710a74498..10d1878d2826 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
>   	return ++ct->requests.last_fence;
>   }
>   
> -static void write_barrier(struct intel_guc_ct *ct)
> -{
> -	struct intel_guc *guc = ct_to_guc(ct);
> -	struct intel_gt *gt = guc_to_gt(guc);
> -
> -	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> -		GEM_BUG_ON(guc->send_regs.fw_domains);
> -		/*
> -		 * This register is used by the i915 and GuC for MMIO based
> -		 * communication. Once we are in this code CTBs are the only
> -		 * method the i915 uses to communicate with the GuC so it is
> -		 * safe to write to this register (a value of 0 is NOP for MMIO
> -		 * communication). If we ever start mixing CTBs and MMIOs a new
> -		 * register will have to be chosen.
> -		 */
> -		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> -	} else {
> -		/* wmb() sufficient for a barrier if in smem */
> -		wmb();
> -	}
> -}
> -
>   static int ct_write(struct intel_guc_ct *ct,
>   		    const u32 *action,
>   		    u32 len /* in dwords */,
> @@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
>   	 * make sure H2G buffer update and LRC tail update (if this triggering a
>   	 * submission) are visible before updating the descriptor tail
>   	 */
> -	write_barrier(ct);
> +	intel_guc_write_barrier(ct_to_guc(ct));
>   
>   	/* update local copies */
>   	ctb->tail = tail;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 0eeb2a9feeed..a00eeddc1449 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -58,19 +58,16 @@
>   #define WQ_STATUS_CMD_ERROR		3
>   #define WQ_STATUS_ENGINE_ID_NOT_USED	4
>   #define WQ_STATUS_SUSPENDED_FROM_RESET	5
> -#define WQ_TYPE_SHIFT			0
> -#define   WQ_TYPE_BATCH_BUF		(0x1 << WQ_TYPE_SHIFT)
> -#define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
> -#define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
> -#define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
> -#define WQ_TARGET_SHIFT			10
> -#define WQ_LEN_SHIFT			16
> -#define WQ_NO_WCFLUSH_WAIT		(1 << 27)
> -#define WQ_PRESENT_WORKLOAD		(1 << 28)
> -
> -#define WQ_RING_TAIL_SHIFT		20
> -#define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
> -#define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
> +#define WQ_TYPE_BATCH_BUF		0x1
> +#define WQ_TYPE_PSEUDO			0x2
> +#define WQ_TYPE_INORDER			0x3
> +#define WQ_TYPE_NOOP			0x4
> +#define WQ_TYPE_MULTI_LRC		0x5
> +#define WQ_TYPE_MASK			GENMASK(7, 0)
> +#define WQ_LEN_MASK			GENMASK(26, 16)
> +
> +#define WQ_GUC_ID_MASK			GENMASK(15, 0)
> +#define WQ_RING_TAIL_MASK		GENMASK(28, 18)
Other option for documenting WQ and WQI would be at the top of this 
block of definitions. I believe there is a one line comment of 'work 
queue item header definitions' but none of these defines actually use 
the WQI abbreviation. And some description of what the work queue is, 
how it is used, etc. would be good.

>   
>   #define GUC_STAGE_DESC_ATTR_ACTIVE	BIT(0)
>   #define GUC_STAGE_DESC_ATTR_PENDING_DB	BIT(1)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 031b1bf5ba91..1610120e31a1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -399,6 +399,29 @@ __get_process_desc(struct intel_context *ce)
>   		   LRC_STATE_OFFSET) / sizeof(u32)));
>   }
>   
> +static u32 *get_wq_pointer(struct guc_process_desc *desc,
> +			   struct intel_context *ce,
> +			   u32 wqi_size)
> +{
> +	/*
> +	 * Check for space in work queue. Caching a value of head pointer in
> +	 * intel_context structure in order reduce the number accesses to shared
> +	 * GPU memory which may be across a PCIe bus.
> +	 */
> +#define AVAILABLE_SPACE	\
> +	CIRC_SPACE(ce->parallel.guc.wqi_tail, ce->parallel.guc.wqi_head, WQ_SIZE)
> +	if (wqi_size > AVAILABLE_SPACE) {
> +		ce->parallel.guc.wqi_head = READ_ONCE(desc->head);
> +
> +		if (wqi_size > AVAILABLE_SPACE)
> +			return NULL;
> +	}
> +#undef AVAILABLE_SPACE
> +
> +	return ((u32 *)__get_process_desc(ce)) +
> +		((WQ_OFFSET + ce->parallel.guc.wqi_tail) / sizeof(u32));
> +}
> +
>   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>   {
>   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> @@ -558,10 +581,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
>   
>   static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
>   
> -static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> +static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   {
>   	int err = 0;
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	u32 action[3];
>   	int len = 0;
>   	u32 g2h_len_dw = 0;
> @@ -582,26 +605,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
>   	GEM_BUG_ON(context_guc_id_invalid(ce));
>   
> -	/*
> -	 * Corner case where the GuC firmware was blown away and reloaded while
> -	 * this context was pinned.
> -	 */
> -	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
> -		err = guc_lrc_desc_pin(ce, false);
> -		if (unlikely(err))
> -			return err;
> -	}
> -
>   	spin_lock(&ce->guc_state.lock);
>   
>   	/*
>   	 * The request / context will be run on the hardware when scheduling
> -	 * gets enabled in the unblock.
> +	 * gets enabled in the unblock. For multi-lrc we still submit the
> +	 * context to move the LRC tails.
>   	 */
> -	if (unlikely(context_blocked(ce)))
> +	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
>   		goto out;
>   
> -	enabled = context_enabled(ce);
> +	enabled = context_enabled(ce) || context_blocked(ce);
>   
>   	if (!enabled) {
>   		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
> @@ -620,6 +634,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   		trace_intel_context_sched_enable(ce);
>   		atomic_inc(&guc->outstanding_submission_g2h);
>   		set_context_enabled(ce);
> +
> +		/*
> +		 * Without multi-lrc KMD does the submission step (moving the
> +		 * lrc tail) so enabling scheduling is sufficient to submit the
> +		 * context. This isn't the case in multi-lrc submission as the
> +		 * GuC needs to move the tails, hence the need for another H2G
> +		 * to submit a multi-lrc context after enabling scheduling.
> +		 */
> +		if (intel_context_is_parent(ce)) {
> +			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
> +			err = intel_guc_send_nb(guc, action, len - 1, 0);
> +		}
>   	} else if (!enabled) {
>   		clr_context_pending_enable(ce);
>   		intel_context_put(ce);
> @@ -632,6 +658,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   	return err;
>   }
>   
> +static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> +{
> +	int ret = __guc_add_request(guc, rq);
> +
> +	if (unlikely(ret == -EBUSY)) {
> +		guc->stalled_request = rq;
> +		guc->submission_stall_reason = STALL_ADD_REQUEST;
> +	}
> +
> +	return ret;
> +}
> +
>   static inline void guc_set_lrc_tail(struct i915_request *rq)
>   {
>   	rq->context->lrc_reg_state[CTX_RING_TAIL] =
> @@ -643,6 +681,134 @@ static inline int rq_prio(const struct i915_request *rq)
>   	return rq->sched.attr.priority;
>   }
>   
> +static bool is_multi_lrc_rq(struct i915_request *rq)
> +{
> +	return intel_context_is_child(rq->context) ||
> +		intel_context_is_parent(rq->context);
> +}
> +
> +static bool can_merge_rq(struct i915_request *rq,
> +			 struct i915_request *last)
> +{
> +	return request_to_scheduling_context(rq) ==
> +		request_to_scheduling_context(last);
> +}
> +
> +static u32 wq_space_until_wrap(struct intel_context *ce)
> +{
> +	return (WQ_SIZE - ce->parallel.guc.wqi_tail);
> +}
> +
> +static void write_wqi(struct guc_process_desc *desc,
> +		      struct intel_context *ce,
> +		      u32 wqi_size)
> +{
> +	/*
> +	 * Ensure WQI are visible before updating tail
> +	 */
> +	intel_guc_write_barrier(ce_to_guc(ce));
> +
> +	ce->parallel.guc.wqi_tail = (ce->parallel.guc.wqi_tail + wqi_size) &
> +		(WQ_SIZE - 1);
This relies on WQ_SIZE being a power of two, right? Is it possible to 
add a BUILD_BUG_ON to ensure that?

> +	WRITE_ONCE(desc->tail, ce->parallel.guc.wqi_tail);
> +}
> +
> +static int guc_wq_noop_append(struct intel_context *ce)
> +{
> +	struct guc_process_desc *desc = __get_process_desc(ce);
> +	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
> +	u32 len_dw = wq_space_until_wrap(ce) / sizeof(u32) - 1;
> +
> +	if (!wqi)
> +		return -EBUSY;
> +
> +	GEM_BUG_ON(!FIELD_FIT(WQ_LEN_MASK, len_dw));
> +
> +	*wqi = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
> +		FIELD_PREP(WQ_LEN_MASK, len_dw);
> +	ce->parallel.guc.wqi_tail = 0;
> +
> +	return 0;
> +}
> +
> +static int __guc_wq_item_append(struct i915_request *rq)
> +{
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +	struct intel_context *child;
> +	struct guc_process_desc *desc = __get_process_desc(ce);
> +	unsigned int wqi_size = (ce->parallel.number_children + 4) *
> +		sizeof(u32);
> +	u32 *wqi;
> +	u32 len_dw = (wqi_size / sizeof(u32)) - 1;
> +	int ret;
> +
> +	/* Ensure context is in correct state updating work queue */
> +	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
> +	GEM_BUG_ON(context_guc_id_invalid(ce));
> +	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
> +	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
> +
> +	/* Insert NOOP if this work queue item will wrap the tail pointer. */
> +	if (wqi_size > wq_space_until_wrap(ce)) {
> +		ret = guc_wq_noop_append(ce);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	wqi = get_wq_pointer(desc, ce, wqi_size);
> +	if (!wqi)
> +		return -EBUSY;
> +
> +	GEM_BUG_ON(!FIELD_FIT(WQ_LEN_MASK, len_dw));
> +
> +	*wqi++ = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_MULTI_LRC) |
> +		FIELD_PREP(WQ_LEN_MASK, len_dw);
> +	*wqi++ = ce->lrc.lrca;
> +	*wqi++ = FIELD_PREP(WQ_GUC_ID_MASK, ce->guc_id.id) |
> +	       FIELD_PREP(WQ_RING_TAIL_MASK, ce->ring->tail / sizeof(u64));
> +	*wqi++ = 0;	/* fence_id */
> +	for_each_child(ce, child)
> +		*wqi++ = child->ring->tail / sizeof(u64);
> +
> +	write_wqi(desc, ce, wqi_size);
> +
> +	return 0;
> +}
> +
> +static int guc_wq_item_append(struct intel_guc *guc,
> +			      struct i915_request *rq)
> +{
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +	int ret = 0;
> +
> +	if (likely(!intel_context_is_banned(ce))) {
> +		ret = __guc_wq_item_append(rq);
> +
> +		if (unlikely(ret == -EBUSY)) {
> +			guc->stalled_request = rq;
> +			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +static bool multi_lrc_submit(struct i915_request *rq)
> +{
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +
> +	intel_ring_set_tail(rq->ring, rq->tail);
> +
> +	/*
> +	 * We expect the front end (execbuf IOCTL) to set this flag on the last
> +	 * request generated from a multi-BB submission. This indicates to the
> +	 * backend (GuC interface) that we should submit this context thus
> +	 * submitting all the requests generated in parallel.
> +	 */
> +	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
FYI: Apparently the test_bit/set_bit/etc helpers are intended for use on 
arbitrary sized bitfields. As in, they do all sorts of complicated 
atomic operations to work on 164 bit words and such like. For single 
word flags, the guidance is to just use 'if(word & BIT(bit))' instead.

John.

> +		intel_context_is_banned(ce);
> +}
> +
>   static int guc_dequeue_one_context(struct intel_guc *guc)
>   {
>   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> @@ -656,7 +822,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>   	if (guc->stalled_request) {
>   		submit = true;
>   		last = guc->stalled_request;
> -		goto resubmit;
> +
> +		switch (guc->submission_stall_reason) {
> +		case STALL_REGISTER_CONTEXT:
> +			goto register_context;
> +		case STALL_MOVE_LRC_TAIL:
> +			goto move_lrc_tail;
> +		case STALL_ADD_REQUEST:
> +			goto add_request;
> +		default:
> +			MISSING_CASE(guc->submission_stall_reason);
> +		}
>   	}
>   
>   	while ((rb = rb_first_cached(&sched_engine->queue))) {
> @@ -664,8 +840,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>   		struct i915_request *rq, *rn;
>   
>   		priolist_for_each_request_consume(rq, rn, p) {
> -			if (last && rq->context != last->context)
> -				goto done;
> +			if (last && !can_merge_rq(rq, last))
> +				goto register_context;
>   
>   			list_del_init(&rq->sched.link);
>   
> @@ -673,33 +849,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>   
>   			trace_i915_request_in(rq, 0);
>   			last = rq;
> -			submit = true;
> +
> +			if (is_multi_lrc_rq(rq)) {
> +				/*
> +				 * We need to coalesce all multi-lrc requests in
> +				 * a relationship into a single H2G. We are
> +				 * guaranteed that all of these requests will be
> +				 * submitted sequentially.
> +				 */
> +				if (multi_lrc_submit(rq)) {
> +					submit = true;
> +					goto register_context;
> +				}
> +			} else {
> +				submit = true;
> +			}
>   		}
>   
>   		rb_erase_cached(&p->node, &sched_engine->queue);
>   		i915_priolist_free(p);
>   	}
> -done:
> +
> +register_context:
>   	if (submit) {
> -		guc_set_lrc_tail(last);
> -resubmit:
> +		struct intel_context *ce = request_to_scheduling_context(last);
> +
> +		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
> +			     !intel_context_is_banned(ce))) {
> +			ret = guc_lrc_desc_pin(ce, false);
> +			if (unlikely(ret == -EPIPE)) {
> +				goto deadlk;
> +			} else if (ret == -EBUSY) {
> +				guc->stalled_request = last;
> +				guc->submission_stall_reason =
> +					STALL_REGISTER_CONTEXT;
> +				goto schedule_tasklet;
> +			} else if (ret != 0) {
> +				GEM_WARN_ON(ret);	/* Unexpected */
> +				goto deadlk;
> +			}
> +		}
> +
> +move_lrc_tail:
> +		if (is_multi_lrc_rq(last)) {
> +			ret = guc_wq_item_append(guc, last);
> +			if (ret == -EBUSY) {
> +				goto schedule_tasklet;
> +			} else if (ret != 0) {
> +				GEM_WARN_ON(ret);	/* Unexpected */
> +				goto deadlk;
> +			}
> +		} else {
> +			guc_set_lrc_tail(last);
> +		}
> +
> +add_request:
>   		ret = guc_add_request(guc, last);
> -		if (unlikely(ret == -EPIPE))
> +		if (unlikely(ret == -EPIPE)) {
> +			goto deadlk;
> +		} else if (ret == -EBUSY) {
> +			goto schedule_tasklet;
> +		} else if (ret != 0) {
> +			GEM_WARN_ON(ret);	/* Unexpected */
>   			goto deadlk;
> -		else if (ret == -EBUSY) {
> -			tasklet_schedule(&sched_engine->tasklet);
> -			guc->stalled_request = last;
> -			return false;
>   		}
>   	}
>   
>   	guc->stalled_request = NULL;
> +	guc->submission_stall_reason = STALL_NONE;
>   	return submit;
>   
>   deadlk:
>   	sched_engine->tasklet.callback = NULL;
>   	tasklet_disable_nosync(&sched_engine->tasklet);
>   	return false;
> +
> +schedule_tasklet:
> +	tasklet_schedule(&sched_engine->tasklet);
> +	return false;
>   }
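
The resubmit flow above is effectively a resumable state machine: a step
that fails with -EBUSY records where it stalled, and the next tasklet run
jumps straight back to that step. A condensed sketch of the pattern
(simplified, not the verbatim code):

	/* on failure: remember the request and the step that stalled */
	guc->stalled_request = last;
	guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
	tasklet_schedule(&sched_engine->tasklet);

	/* on the next dequeue: resume at the recorded step */
	switch (guc->submission_stall_reason) {
	case STALL_REGISTER_CONTEXT:
		goto register_context;
	case STALL_MOVE_LRC_TAIL:
		goto move_lrc_tail;
	case STALL_ADD_REQUEST:
		goto add_request;
	}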
>   
>   static void guc_submission_tasklet(struct tasklet_struct *t)
> @@ -1255,10 +1482,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
>   
>   	trace_i915_request_in(rq, 0);
>   
> -	guc_set_lrc_tail(rq);
> -	ret = guc_add_request(guc, rq);
> -	if (ret == -EBUSY)
> -		guc->stalled_request = rq;
> +	if (is_multi_lrc_rq(rq)) {
> +		if (multi_lrc_submit(rq)) {
> +			ret = guc_wq_item_append(guc, rq);
> +			if (!ret)
> +				ret = guc_add_request(guc, rq);
> +		}
> +	} else {
> +		guc_set_lrc_tail(rq);
> +		ret = guc_add_request(guc, rq);
> +	}
>   
>   	if (unlikely(ret == -EPIPE))
>   		disable_submission(guc);
> @@ -1266,6 +1499,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
>   	return ret;
>   }
>   
> +static bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
> +{
> +	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +
> +	return submission_disabled(guc) || guc->stalled_request ||
> +		!i915_sched_engine_is_empty(sched_engine) ||
> +		!lrc_desc_registered(guc, ce->guc_id.id);
> +}
> +
>   static void guc_submit_request(struct i915_request *rq)
>   {
>   	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> @@ -1275,8 +1518,7 @@ static void guc_submit_request(struct i915_request *rq)
>   	/* Will be called from irq-context when using foreign fences. */
>   	spin_lock_irqsave(&sched_engine->lock, flags);
>   
> -	if (submission_disabled(guc) || guc->stalled_request ||
> -	    !i915_sched_engine_is_empty(sched_engine))
> +	if (need_tasklet(guc, rq))
>   		queue_request(sched_engine, rq, rq_prio(rq));
>   	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
>   		tasklet_hi_schedule(&sched_engine->tasklet);
> @@ -2258,9 +2500,10 @@ static inline bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
>   
>   static void add_to_context(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
>   
>   	spin_lock(&ce->guc_state.lock);
> @@ -2293,7 +2536,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
>   
>   static void remove_from_context(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
>   	spin_lock_irq(&ce->guc_state.lock);
>   
> @@ -2712,7 +2957,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
>   static void guc_bump_inflight_request_prio(struct i915_request *rq,
>   					   int prio)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
>   
>   	/* Short circuit function */
> @@ -2735,7 +2980,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
>   
>   static void guc_retire_inflight_request_prio(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   
>   	spin_lock(&ce->guc_state.lock);
>   	guc_prio_fini(rq, ce);
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 7bd9ed20623e..8950785e55d6 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -139,6 +139,14 @@ enum {
>   	 * the GPU. Here we track such boost requests on a per-request basis.
>   	 */
>   	I915_FENCE_FLAG_BOOST,
> +
> +	/*
> +	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
> +	 * parent-child relationship (parallel submission, multi-lrc) should
> +	 * trigger a submission to the GuC rather than just moving the context
> +	 * tail.
> +	 */
> +	I915_FENCE_FLAG_SUBMIT_PARALLEL,
>   };
>   
>   /**


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 08/26] drm/i915/guc: Add multi-lrc context registration
  2021-10-08 17:20     ` John Harrison
@ 2021-10-08 17:29       ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-08 17:29 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Fri, Oct 08, 2021 at 10:20:16AM -0700, John Harrison wrote:
> On 10/7/2021 12:50, John Harrison wrote:
> > On 10/4/2021 15:06, Matthew Brost wrote:
> > > Add multi-lrc context registration H2G. In addition a workqueue and
> > > process descriptor are setup during multi-lrc context registration as
> > > these data structures are needed for multi-lrc submission.
> > > 
> > > v2:
> > >   (John Harrison)
> > >    - Move GuC specific fields into sub-struct
> > >    - Clean up WQ defines
> > >    - Add comment explaining math to derive WQ / PD address
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
> > >   drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
> > >   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
> > >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 -
> > >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 +++++++++++++++++-
> > >   5 files changed, 131 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > index 76dfca57cb45..48decb5ee954 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > @@ -239,6 +239,18 @@ struct intel_context {
> > >           struct intel_context *parent;
> > >           /** @number_children: number of children if parent */
> > >           u8 number_children;
> > > +        /** @guc: GuC specific members for parallel submission */
> > > +        struct {
> > > +            /** @wqi_head: head pointer in work queue */
> > > +            u16 wqi_head;
> > > +            /** @wqi_tail: tail pointer in work queue */
> > > +            u16 wqi_tail;
> PS: As per comments on the previous rev, something somewhere needs to
> explicitly state what WQI means. One suggestion was to do that here,
> ideally with a brief description of what the queue is, how it is used,
> etc. Although it would probably be better kept in a GuC specific file,
> e.g. added to guc_fwif.h in patch #12.
> 

I think this should just be in the main GuC kernel doc. I can include an
update to the kernel doc in a patch at the end of the next rev of the
series. That patch doesn't necessarily have to be included in the initial
merge of parallel submission if it takes a bit more time to review.

Matt 

> John.
> 
> > > +            /**
> > > +             * @parent_page: page in context state (ce->state) used
> > > +             * by parent for work queue, process descriptor
> > > +             */
> > > +            u8 parent_page;
> > > +        } guc;
> > >       } parallel;
> > >     #ifdef CONFIG_DRM_I915_SELFTEST
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > index 3ef9eaf8c50e..57339d5c1fc8 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > @@ -942,6 +942,11 @@ __lrc_alloc_state(struct intel_context *ce,
> > > struct intel_engine_cs *engine)
> > >           context_size += PAGE_SIZE;
> > >       }
> > >   +    if (intel_context_is_parent(ce) &&
> > > intel_engine_uses_guc(engine)) {
> > > +        ce->parallel.guc.parent_page = context_size / PAGE_SIZE;
> > > +        context_size += PAGE_SIZE;
> > > +    }
> > > +
> > >       obj = i915_gem_object_create_lmem(engine->i915, context_size,
> > >                         I915_BO_ALLOC_PM_VOLATILE);
> > >       if (IS_ERR(obj))
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > index 8ff582222aff..ba10bd374cee 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > @@ -142,6 +142,7 @@ enum intel_guc_action {
> > >       INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
> > >       INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
> > >       INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
> > > +    INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
> > >       INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> > >       INTEL_GUC_ACTION_LIMIT
> > >   };
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > index fa4be13c8854..0eeb2a9feeed 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > @@ -52,8 +52,6 @@
> > >     #define GUC_DOORBELL_INVALID        256
> > >   -#define GUC_WQ_SIZE            (PAGE_SIZE * 2)
> > > -
> > >   /* Work queue item header definitions */
> > >   #define WQ_STATUS_ACTIVE        1
> > >   #define WQ_STATUS_SUSPENDED        2
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index 451d9ae861a6..ab6d7fc1b0b1 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -344,6 +344,45 @@ static inline struct i915_priolist
> > > *to_priolist(struct rb_node *rb)
> > >       return rb_entry(rb, struct i915_priolist, node);
> > >   }
> > >   +/*
> > > + * When using multi-lrc submission an extra page in the context
> > > state is
> > > + * reserved for the process descriptor and work queue.
> > > + *
> > > + * The layout of this page is below:
> > > + * 0                        guc_process_desc
> > > + * ...                        unused
> > > + * PAGE_SIZE / 2                work queue start
> > > + * ...                        work queue
> > > + * PAGE_SIZE - 1                work queue end
> > > + */
> > > +#define WQ_SIZE            (PAGE_SIZE / 2)
> > > +#define WQ_OFFSET        (PAGE_SIZE - WQ_SIZE)
> > I thought you were going with '#define PARENT_SCRATCH_SIZE PAGE_SIZE'
> > and then using that everywhere else? Unless there is a fundamental
> > reason why the above must be exactly a page in size then I think the
> > size should be defined once and re-used rather than assumed in multiple
> > places (including in the description comment).
> > 
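For illustration, the single-definition approach being suggested could
look like this -- a sketch using the proposed PARENT_SCRATCH_SIZE name,
not code from this series:

	#define PARENT_SCRATCH_SIZE	PAGE_SIZE
	#define WQ_SIZE			(PARENT_SCRATCH_SIZE / 2)
	#define WQ_OFFSET		(PARENT_SCRATCH_SIZE - WQ_SIZE)

so the scratch size is stated once and the context-size math, the layout
comment and the offsets all derive from it.
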
> > > +static u32 __get_process_desc_offset(struct intel_context *ce)
> > > +{
> > > +    GEM_BUG_ON(!ce->parallel.guc.parent_page);
> > > +
> > > +    return ce->parallel.guc.parent_page * PAGE_SIZE;
> > > +}
> > > +
> > > +static u32 __get_wq_offset(struct intel_context *ce)
> > > +{
> > > +    return __get_process_desc_offset(ce) + WQ_OFFSET;
> > > +}
> > > +
> > > +static struct guc_process_desc *
> > > +__get_process_desc(struct intel_context *ce)
> > > +{
> > > +    /*
> > > +     * Need to subtract LRC_STATE_OFFSET here as the
> > > +     * parallel.guc.parent_page is the offset into ce->state while
> > > +     * ce->lrc_reg_state is ce->state + LRC_STATE_OFFSET.
> > > +     */
> > > +    return (struct guc_process_desc *)
> > > +        (ce->lrc_reg_state +
> > > +         ((__get_process_desc_offset(ce) -
> > > +           LRC_STATE_OFFSET) / sizeof(u32)));
> > > +}
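
Restating the pointer math as a sketch, under the relationship the
comment spells out:

	/*
	 * lrc_reg_state points LRC_STATE_OFFSET bytes into the mapped
	 * context state, so a byte offset into that state converts as:
	 *   ptr = ce->lrc_reg_state + (offset - LRC_STATE_OFFSET) / sizeof(u32)
	 */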
> > > +
> > >   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc,
> > > u32 index)
> > >   {
> > >       struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > > @@ -1365,6 +1404,30 @@ static void unpin_guc_id(struct intel_guc
> > > *guc, struct intel_context *ce)
> > >       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > >   }
> > >   +static int __guc_action_register_multi_lrc(struct intel_guc *guc,
> > > +                       struct intel_context *ce,
> > > +                       u32 guc_id,
> > > +                       u32 offset,
> > > +                       bool loop)
> > > +{
> > > +    struct intel_context *child;
> > > +    u32 action[4 + MAX_ENGINE_INSTANCE];
> > > +    int len = 0;
> > > +
> > > +    GEM_BUG_ON(ce->parallel.number_children > MAX_ENGINE_INSTANCE);
> > > +
> > > +    action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
> > > +    action[len++] = guc_id;
> > > +    action[len++] = ce->parallel.number_children + 1;
> > > +    action[len++] = offset;
> > > +    for_each_child(ce, child) {
> > > +        offset += sizeof(struct guc_lrc_desc);
> > > +        action[len++] = offset;
> > > +    }
> > > +
> > > +    return guc_submission_send_busy_loop(guc, action, len, 0, loop);
> > > +}
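
For reference, the H2G message built above has this layout -- a sketch
derived from the code:

	/*
	 * REGISTER_CONTEXT_MULTI_LRC H2G, as built above:
	 *   action[0]  = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC
	 *   action[1]  = parent guc_id
	 *   action[2]  = number of contexts (children + 1)
	 *   action[3]  = GGTT offset of the parent's guc_lrc_desc
	 *   action[4+] = offsets of each child's desc, each a further
	 *                sizeof(struct guc_lrc_desc) into the pool
	 */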
> > > +
> > >   static int __guc_action_register_context(struct intel_guc *guc,
> > >                        u32 guc_id,
> > >                        u32 offset,
> > > @@ -1387,9 +1450,15 @@ static int register_context(struct
> > > intel_context *ce, bool loop)
> > >           ce->guc_id.id * sizeof(struct guc_lrc_desc);
> > >       int ret;
> > >   +    GEM_BUG_ON(intel_context_is_child(ce));
> > >       trace_intel_context_register(ce);
> > >   -    ret = __guc_action_register_context(guc, ce->guc_id.id,
> > > offset, loop);
> > > +    if (intel_context_is_parent(ce))
> > > +        ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
> > > +                              offset, loop);
> > > +    else
> > > +        ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
> > > +                            loop);
> > >       if (likely(!ret)) {
> > >           unsigned long flags;
> > >   @@ -1418,6 +1487,7 @@ static int deregister_context(struct
> > > intel_context *ce, u32 guc_id)
> > >   {
> > >       struct intel_guc *guc = ce_to_guc(ce);
> > >   +    GEM_BUG_ON(intel_context_is_child(ce));
> > >       trace_intel_context_deregister(ce);
> > >         return __guc_action_deregister_context(guc, guc_id);
> > > @@ -1445,6 +1515,7 @@ static int guc_lrc_desc_pin(struct
> > > intel_context *ce, bool loop)
> > >       struct guc_lrc_desc *desc;
> > >       bool context_registered;
> > >       intel_wakeref_t wakeref;
> > > +    struct intel_context *child;
> > >       int ret = 0;
> > >         GEM_BUG_ON(!engine->mask);
> > > @@ -1470,6 +1541,41 @@ static int guc_lrc_desc_pin(struct
> > > intel_context *ce, bool loop)
> > >       desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > >       guc_context_policy_init(engine, desc);
> > >   +    /*
> > > +     * Context is a parent, we need to register a process descriptor
> > > +     * describing a work queue and register all child contexts.
> > > +     */
> > This was meant to say 'If the context is a parent...'?
> > 
> > John.
> > 
> > > +    if (intel_context_is_parent(ce)) {
> > > +        struct guc_process_desc *pdesc;
> > > +
> > > +        ce->parallel.guc.wqi_tail = 0;
> > > +        ce->parallel.guc.wqi_head = 0;
> > > +
> > > +        desc->process_desc = i915_ggtt_offset(ce->state) +
> > > +            __get_process_desc_offset(ce);
> > > +        desc->wq_addr = i915_ggtt_offset(ce->state) +
> > > +            __get_wq_offset(ce);
> > > +        desc->wq_size = WQ_SIZE;
> > > +
> > > +        pdesc = __get_process_desc(ce);
> > > +        memset(pdesc, 0, sizeof(*(pdesc)));
> > > +        pdesc->stage_id = ce->guc_id.id;
> > > +        pdesc->wq_base_addr = desc->wq_addr;
> > > +        pdesc->wq_size_bytes = desc->wq_size;
> > > +        pdesc->wq_status = WQ_STATUS_ACTIVE;
> > > +
> > > +        for_each_child(ce, child) {
> > > +            desc = __get_lrc_desc(guc, child->guc_id.id);
> > > +
> > > +            desc->engine_class =
> > > +                engine_class_to_guc_class(engine->class);
> > > +            desc->hw_context_desc = child->lrc.lrca;
> > > +            desc->priority = ce->guc_state.prio;
> > > +            desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > > +            guc_context_policy_init(engine, desc);
> > > +        }
> > > +    }
> > > +
> > >       /*
> > >        * The context_lookup xarray is used to determine if the hardware
> > >        * context is currently registered. There are two cases in
> > > which it
> > > @@ -2804,6 +2910,12 @@ g2h_context_lookup(struct intel_guc *guc, u32
> > > desc_idx)
> > >           return NULL;
> > >       }
> > >   +    if (unlikely(intel_context_is_child(ce))) {
> > > +        drm_err(&guc_to_gt(guc)->i915->drm,
> > > +            "Context is child, desc_idx %u", desc_idx);
> > > +        return NULL;
> > > +    }
> > > +
> > >       return ce;
> > >   }
> > 
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 14/26] drm/i915/guc: Implement multi-lrc reset
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-08 17:39     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-08 17:39 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Update context and full GPU reset to work with multi-lrc. The idea is
> parent context tracks all the active requests inflight for itself and
> its' children. The parent context owns the reset replaying / canceling
Still its' should be its.

> requests as needed.
>
> v2:
>   (John Harrison)
>    - Simplify loop in find active request
>    - Add comments to find active request / reset loop
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       | 15 +++-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 69 ++++++++++++++-----
>   2 files changed, 63 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index c5bb7ccfb3f8..3b340eb59ada 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -528,20 +528,29 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
>   
>   struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>   {
> +	struct intel_context *parent = intel_context_to_parent(ce);
>   	struct i915_request *rq, *active = NULL;
>   	unsigned long flags;
>   
>   	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
>   
> -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> -	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
> +	/*
> +	 * We search the parent list to find an active request on the submitted
> +	 * context. The parent list contains the requests for all the contexts
> +	 * in the relationship so we have to do a compare of each request's
> +	 * context must be done.
"have to do ... must be done" - no need for both.

> +	 */
> +	spin_lock_irqsave(&parent->guc_state.lock, flags);
> +	list_for_each_entry_reverse(rq, &parent->guc_state.requests,
>   				    sched.link) {
> +		if (rq->context != ce)
> +			continue;
>   		if (i915_request_completed(rq))
>   			break;
>   
>   		active = rq;
>   	}
> -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>   
>   	return active;
>   }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 6be7adf89e4f..d661a69ef4f7 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -681,6 +681,11 @@ static inline int rq_prio(const struct i915_request *rq)
>   	return rq->sched.attr.priority;
>   }
>   
> +static inline bool is_multi_lrc(struct intel_context *ce)
> +{
> +	return intel_context_is_parallel(ce);
> +}
> +
>   static bool is_multi_lrc_rq(struct i915_request *rq)
>   {
>   	return intel_context_is_parallel(rq->context);
> @@ -1214,10 +1219,15 @@ __unwind_incomplete_requests(struct intel_context *ce)
>   
>   static void __guc_reset_context(struct intel_context *ce, bool stalled)
>   {
> +	bool local_stalled;
>   	struct i915_request *rq;
>   	unsigned long flags;
>   	u32 head;
> +	int i, number_children = ce->parallel.number_children;
>   	bool skip = false;
> +	struct intel_context *parent = ce;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
>   	intel_context_get(ce);
>   
> @@ -1243,25 +1253,38 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
>   	if (unlikely(skip))
>   		goto out_put;
>   
> -	rq = intel_context_find_active_request(ce);
> -	if (!rq) {
> -		head = ce->ring->tail;
> -		stalled = false;
> -		goto out_replay;
> -	}
> +	/*
> +	 * For each context in the relationship find the hanging request
> +	 * resetting each context / request as needed
> +	 */
> +	for (i = 0; i < number_children + 1; ++i) {
> +		if (!intel_context_is_pinned(ce))
> +			goto next_context;
> +
> +		local_stalled = false;
> +		rq = intel_context_find_active_request(ce);
> +		if (!rq) {
> +			head = ce->ring->tail;
> +			goto out_replay;
> +		}
>   
> -	if (!i915_request_started(rq))
> -		stalled = false;
> +		GEM_BUG_ON(i915_active_is_idle(&ce->active));
> +		head = intel_ring_wrap(ce->ring, rq->head);
>   
> -	GEM_BUG_ON(i915_active_is_idle(&ce->active));
> -	head = intel_ring_wrap(ce->ring, rq->head);
> -	__i915_request_reset(rq, stalled);
> +		if (i915_request_started(rq))
I didn't see an answer as to why the started test and the wrap call need 
to be reversed?

John.

> +			local_stalled = true;
>   
> +		__i915_request_reset(rq, local_stalled && stalled);
>   out_replay:
> -	guc_reset_state(ce, head, stalled);
> -	__unwind_incomplete_requests(ce);
> +		guc_reset_state(ce, head, local_stalled && stalled);
> +next_context:
> +		if (i != number_children)
> +			ce = list_next_entry(ce, parallel.child_link);
> +	}
> +
> +	__unwind_incomplete_requests(parent);
>   out_put:
> -	intel_context_put(ce);
> +	intel_context_put(parent);
>   }
>   
>   void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> @@ -1282,7 +1305,8 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
>   
>   		xa_unlock(&guc->context_lookup);
>   
> -		if (intel_context_is_pinned(ce))
> +		if (intel_context_is_pinned(ce) &&
> +		    !intel_context_is_child(ce))
>   			__guc_reset_context(ce, stalled);
>   
>   		intel_context_put(ce);
> @@ -1374,7 +1398,8 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
>   
>   		xa_unlock(&guc->context_lookup);
>   
> -		if (intel_context_is_pinned(ce))
> +		if (intel_context_is_pinned(ce) &&
> +		    !intel_context_is_child(ce))
>   			guc_cancel_context_requests(ce);
>   
>   		intel_context_put(ce);
> @@ -2067,6 +2092,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>   	u16 guc_id;
>   	bool enabled;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   
>   	incr_context_blocked(ce);
> @@ -2121,6 +2148,7 @@ static void guc_context_unblock(struct intel_context *ce)
>   	bool enable;
>   
>   	GEM_BUG_ON(context_enabled(ce));
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   
> @@ -2147,11 +2175,14 @@ static void guc_context_unblock(struct intel_context *ce)
>   static void guc_context_cancel_request(struct intel_context *ce,
>   				       struct i915_request *rq)
>   {
> +	struct intel_context *block_context =
> +		request_to_scheduling_context(rq);
> +
>   	if (i915_sw_fence_signaled(&rq->submit)) {
>   		struct i915_sw_fence *fence;
>   
>   		intel_context_get(ce);
> -		fence = guc_context_block(ce);
> +		fence = guc_context_block(block_context);
>   		i915_sw_fence_wait(fence);
>   		if (!i915_request_completed(rq)) {
>   			__i915_request_skip(rq);
> @@ -2165,7 +2196,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
>   		 */
>   		flush_work(&ce_to_guc(ce)->ct.requests.worker);
>   
> -		guc_context_unblock(ce);
> +		guc_context_unblock(block_context);
>   		intel_context_put(ce);
>   	}
>   }
> @@ -2191,6 +2222,8 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
>   	intel_wakeref_t wakeref;
>   	unsigned long flags;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	guc_flush_submissions(guc);
>   
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 15/26] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-08 17:46     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-08 17:46 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Display the workqueue status in debugfs for GuC contexts that are in
> a parent-child relationship.
>
> v2:
>   (John Harrison)
>    - Output number children in debugfs
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 53 ++++++++++++++-----
>   1 file changed, 39 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index d661a69ef4f7..f69e984683aa 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3704,6 +3704,26 @@ static inline void guc_log_context_priority(struct drm_printer *p,
>   	drm_printf(p, "\n");
>   }
>   
> +
> +static inline void guc_log_context(struct drm_printer *p,
> +				   struct intel_context *ce)
> +{
> +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
> +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> +		   ce->ring->head,
> +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> +		   ce->ring->tail,
> +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> +		   atomic_read(&ce->pin_count));
> +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> +		   atomic_read(&ce->guc_id.ref));
> +	drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> +		   ce->guc_state.sched_state);
> +}
> +
>   void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   					     struct drm_printer *p)
>   {
> @@ -3713,22 +3733,27 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   
>   	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
> -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> -			   ce->ring->head,
> -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> -			   ce->ring->tail,
> -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> -			   atomic_read(&ce->pin_count));
> -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> -			   atomic_read(&ce->guc_id.ref));
> -		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> -			   ce->guc_state.sched_state);
> +		GEM_BUG_ON(intel_context_is_child(ce));
>   
> +		guc_log_context(p, ce);
>   		guc_log_context_priority(p, ce);
> +
> +		if (intel_context_is_parent(ce)) {
> +			struct guc_process_desc *desc = __get_process_desc(ce);
> +			struct intel_context *child;
> +
> +			drm_printf(p, "\t\tNumber children: %u\n",
> +				   ce->parallel.number_children);
> +			drm_printf(p, "\t\tWQI Head: %u\n",
> +				   READ_ONCE(desc->head));
> +			drm_printf(p, "\t\tWQI Tail: %u\n",
> +				   READ_ONCE(desc->tail));
> +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> +				   READ_ONCE(desc->wq_status));
> +
> +			for_each_child(ce, child)
> +				guc_log_context(p, child);
> +		}
>   	}
>   	xa_unlock_irqrestore(&guc->context_lookup, flags);
>   }


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 16/26] drm/i915: Fix bug in user proto-context creation that leaked contexts
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-08 17:49     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-08 17:49 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Set number of engines before attempting to create contexts so the
> function free_engines can clean up properly. Also check return of
> alloc_engines for NULL.
>
> v2:
>   (Tvrtko)
>    - Send as stand alone patch
>   (John Harrison)
>    - Check for alloc_engines returning NULL
> v3:
>   (Checkpatch / Tvrtko)
>    - Remove braces around single line if statement
>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Fixes: d4433c7600f7 ("drm/i915/gem: Use the proto-context to handle create parameters (v5)")
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 8208fd5b72c3..8c7ea6e56262 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -898,6 +898,10 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   	unsigned int n;
>   
>   	e = alloc_engines(num_engines);
> +	if (!e)
> +		return ERR_PTR(-ENOMEM);
> +	e->num_engines = num_engines;
> +
>   	for (n = 0; n < num_engines; n++) {
>   		struct intel_context *ce;
>   		int ret;
> @@ -931,7 +935,6 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   			goto free_engines;
>   		}
>   	}
> -	e->num_engines = num_engines;
>   
>   	return e;
>   
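
The reason the ordering matters: free_engines() walks e->num_engines
entries to release contexts, so if a mid-loop failure happens before
num_engines is set, the contexts created so far are never put. A
paraphrased sketch of the cleanup it relies on (the array is kzalloc'd,
so unused slots are NULL):

	for (n = 0; n < e->num_engines; n++)
		if (e->engines[n])
			intel_context_put(e->engines[n]);
	kfree(e);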


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 14/26] drm/i915/guc: Implement multi-lrc reset
  2021-10-08 17:39     ` [Intel-gfx] " John Harrison
@ 2021-10-08 17:56       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-08 17:56 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Fri, Oct 08, 2021 at 10:39:35AM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Update context and full GPU reset to work with multi-lrc. The idea is
> > parent context tracks all the active requests inflight for itself and
> > its' children. The parent context owns the reset replaying / canceling
> Still its' should be its.
> 

Yea. Will fix.

> > requests as needed.
> > 
> > v2:
> >   (John Harrison)
> >    - Simplify loop in find active request
> >    - Add comments to find active request / reset loop
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       | 15 +++-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 69 ++++++++++++++-----
> >   2 files changed, 63 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index c5bb7ccfb3f8..3b340eb59ada 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -528,20 +528,29 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
> >   struct i915_request *intel_context_find_active_request(struct intel_context *ce)
> >   {
> > +	struct intel_context *parent = intel_context_to_parent(ce);
> >   	struct i915_request *rq, *active = NULL;
> >   	unsigned long flags;
> >   	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
> > -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > -	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
> > +	/*
> > +	 * We search the parent list to find an active request on the submitted
> > +	 * context. The parent list contains the requests for all the contexts
> > +	 * in the relationship so we have to do a compare of each request's
> > +	 * context must be done.
> "have to do ... must be done" - no need for both.
>

Right, will fix.
 
> > +	 */
> > +	spin_lock_irqsave(&parent->guc_state.lock, flags);
> > +	list_for_each_entry_reverse(rq, &parent->guc_state.requests,
> >   				    sched.link) {
> > +		if (rq->context != ce)
> > +			continue;
> >   		if (i915_request_completed(rq))
> >   			break;
> >   		active = rq;
> >   	}
> > -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
> >   	return active;
> >   }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 6be7adf89e4f..d661a69ef4f7 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -681,6 +681,11 @@ static inline int rq_prio(const struct i915_request *rq)
> >   	return rq->sched.attr.priority;
> >   }
> > +static inline bool is_multi_lrc(struct intel_context *ce)
> > +{
> > +	return intel_context_is_parallel(ce);
> > +}
> > +
> >   static bool is_multi_lrc_rq(struct i915_request *rq)
> >   {
> >   	return intel_context_is_parallel(rq->context);
> > @@ -1214,10 +1219,15 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >   static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >   {
> > +	bool local_stalled;
> >   	struct i915_request *rq;
> >   	unsigned long flags;
> >   	u32 head;
> > +	int i, number_children = ce->parallel.number_children;
> >   	bool skip = false;
> > +	struct intel_context *parent = ce;
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	intel_context_get(ce);
> > @@ -1243,25 +1253,38 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >   	if (unlikely(skip))
> >   		goto out_put;
> > -	rq = intel_context_find_active_request(ce);
> > -	if (!rq) {
> > -		head = ce->ring->tail;
> > -		stalled = false;
> > -		goto out_replay;
> > -	}
> > +	/*
> > +	 * For each context in the relationship find the hanging request
> > +	 * resetting each context / request as needed
> > +	 */
> > +	for (i = 0; i < number_children + 1; ++i) {
> > +		if (!intel_context_is_pinned(ce))
> > +			goto next_context;
> > +
> > +		local_stalled = false;
> > +		rq = intel_context_find_active_request(ce);
> > +		if (!rq) {
> > +			head = ce->ring->tail;
> > +			goto out_replay;
> > +		}
> > -	if (!i915_request_started(rq))
> > -		stalled = false;
> > +		GEM_BUG_ON(i915_active_is_idle(&ce->active));
> > +		head = intel_ring_wrap(ce->ring, rq->head);
> > -	GEM_BUG_ON(i915_active_is_idle(&ce->active));
> > -	head = intel_ring_wrap(ce->ring, rq->head);
> > -	__i915_request_reset(rq, stalled);
> > +		if (i915_request_started(rq))
> I didn't see an answer as to why the started test and the wrap call need to
> be reversed?
>

Sorry, they don't have to be. Can flip this back if you want but either
way works.

Matt
 
> John.
> 
> > +			local_stalled = true;
> > +		__i915_request_reset(rq, local_stalled && stalled);
> >   out_replay:
> > -	guc_reset_state(ce, head, stalled);
> > -	__unwind_incomplete_requests(ce);
> > +		guc_reset_state(ce, head, local_stalled && stalled);
> > +next_context:
> > +		if (i != number_children)
> > +			ce = list_next_entry(ce, parallel.child_link);
> > +	}
> > +
> > +	__unwind_incomplete_requests(parent);
> >   out_put:
> > -	intel_context_put(ce);
> > +	intel_context_put(parent);
> >   }
> >   void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> > @@ -1282,7 +1305,8 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> >   		xa_unlock(&guc->context_lookup);
> > -		if (intel_context_is_pinned(ce))
> > +		if (intel_context_is_pinned(ce) &&
> > +		    !intel_context_is_child(ce))
> >   			__guc_reset_context(ce, stalled);
> >   		intel_context_put(ce);
> > @@ -1374,7 +1398,8 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
> >   		xa_unlock(&guc->context_lookup);
> > -		if (intel_context_is_pinned(ce))
> > +		if (intel_context_is_pinned(ce) &&
> > +		    !intel_context_is_child(ce))
> >   			guc_cancel_context_requests(ce);
> >   		intel_context_put(ce);
> > @@ -2067,6 +2092,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> >   	u16 guc_id;
> >   	bool enabled;
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >   	incr_context_blocked(ce);
> > @@ -2121,6 +2148,7 @@ static void guc_context_unblock(struct intel_context *ce)
> >   	bool enable;
> >   	GEM_BUG_ON(context_enabled(ce));
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > @@ -2147,11 +2175,14 @@ static void guc_context_unblock(struct intel_context *ce)
> >   static void guc_context_cancel_request(struct intel_context *ce,
> >   				       struct i915_request *rq)
> >   {
> > +	struct intel_context *block_context =
> > +		request_to_scheduling_context(rq);
> > +
> >   	if (i915_sw_fence_signaled(&rq->submit)) {
> >   		struct i915_sw_fence *fence;
> >   		intel_context_get(ce);
> > -		fence = guc_context_block(ce);
> > +		fence = guc_context_block(block_context);
> >   		i915_sw_fence_wait(fence);
> >   		if (!i915_request_completed(rq)) {
> >   			__i915_request_skip(rq);
> > @@ -2165,7 +2196,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
> >   		 */
> >   		flush_work(&ce_to_guc(ce)->ct.requests.worker);
> > -		guc_context_unblock(ce);
> > +		guc_context_unblock(block_context);
> >   		intel_context_put(ce);
> >   	}
> >   }
> > @@ -2191,6 +2222,8 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
> >   	intel_wakeref_t wakeref;
> >   	unsigned long flags;
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >   	guc_flush_submissions(guc);
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 14/26] drm/i915/guc: Implement multi-lrc reset
@ 2021-10-08 17:56       ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-08 17:56 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Fri, Oct 08, 2021 at 10:39:35AM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Update context and full GPU reset to work with multi-lrc. The idea is
> > parent context tracks all the active requests inflight for itself and
> > its' children. The parent context owns the reset replaying / canceling
> Still its' should be its.
> 

Yea. Will fix.

> > requests as needed.
> > 
> > v2:
> >   (John Harrison)
> >    - Simplify loop in find active request
> >    - Add comments to find active request / reset loop
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       | 15 +++-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 69 ++++++++++++++-----
> >   2 files changed, 63 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index c5bb7ccfb3f8..3b340eb59ada 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -528,20 +528,29 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
> >   struct i915_request *intel_context_find_active_request(struct intel_context *ce)
> >   {
> > +	struct intel_context *parent = intel_context_to_parent(ce);
> >   	struct i915_request *rq, *active = NULL;
> >   	unsigned long flags;
> >   	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
> > -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > -	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
> > +	/*
> > +	 * We search the parent list to find an active request on the submitted
> > +	 * context. The parent list contains the requests for all the contexts
> > +	 * in the relationship so we have to do a compare of each request's
> > +	 * context must be done.
> "have to do ... must be done" - no need for both.
>

Right, will fix.
 
> > +	 */
> > +	spin_lock_irqsave(&parent->guc_state.lock, flags);
> > +	list_for_each_entry_reverse(rq, &parent->guc_state.requests,
> >   				    sched.link) {
> > +		if (rq->context != ce)
> > +			continue;
> >   		if (i915_request_completed(rq))
> >   			break;
> >   		active = rq;
> >   	}
> > -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
> >   	return active;
> >   }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 6be7adf89e4f..d661a69ef4f7 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -681,6 +681,11 @@ static inline int rq_prio(const struct i915_request *rq)
> >   	return rq->sched.attr.priority;
> >   }
> > +static inline bool is_multi_lrc(struct intel_context *ce)
> > +{
> > +	return intel_context_is_parallel(ce);
> > +}
> > +
> >   static bool is_multi_lrc_rq(struct i915_request *rq)
> >   {
> >   	return intel_context_is_parallel(rq->context);
> > @@ -1214,10 +1219,15 @@ __unwind_incomplete_requests(struct intel_context *ce)
> >   static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >   {
> > +	bool local_stalled;
> >   	struct i915_request *rq;
> >   	unsigned long flags;
> >   	u32 head;
> > +	int i, number_children = ce->parallel.number_children;
> >   	bool skip = false;
> > +	struct intel_context *parent = ce;
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	intel_context_get(ce);
> > @@ -1243,25 +1253,38 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
> >   	if (unlikely(skip))
> >   		goto out_put;
> > -	rq = intel_context_find_active_request(ce);
> > -	if (!rq) {
> > -		head = ce->ring->tail;
> > -		stalled = false;
> > -		goto out_replay;
> > -	}
> > +	/*
> > +	 * For each context in the relationship find the hanging request
> > +	 * resetting each context / request as needed
> > +	 */
> > +	for (i = 0; i < number_children + 1; ++i) {
> > +		if (!intel_context_is_pinned(ce))
> > +			goto next_context;
> > +
> > +		local_stalled = false;
> > +		rq = intel_context_find_active_request(ce);
> > +		if (!rq) {
> > +			head = ce->ring->tail;
> > +			goto out_replay;
> > +		}
> > -	if (!i915_request_started(rq))
> > -		stalled = false;
> > +		GEM_BUG_ON(i915_active_is_idle(&ce->active));
> > +		head = intel_ring_wrap(ce->ring, rq->head);
> > -	GEM_BUG_ON(i915_active_is_idle(&ce->active));
> > -	head = intel_ring_wrap(ce->ring, rq->head);
> > -	__i915_request_reset(rq, stalled);
> > +		if (i915_request_started(rq))
> I didn't see an answer as to why the started test and the wrap call need to
> be reversed?
>

Sorry, they don't have to be. Can flip this back if you want but either
way works.

Matt
 
> John.
> 
> > +			local_stalled = true;
> > +		__i915_request_reset(rq, local_stalled && stalled);
> >   out_replay:
> > -	guc_reset_state(ce, head, stalled);
> > -	__unwind_incomplete_requests(ce);
> > +		guc_reset_state(ce, head, local_stalled && stalled);
> > +next_context:
> > +		if (i != number_children)
> > +			ce = list_next_entry(ce, parallel.child_link);
> > +	}
> > +
> > +	__unwind_incomplete_requests(parent);
> >   out_put:
> > -	intel_context_put(ce);
> > +	intel_context_put(parent);
> >   }
> >   void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> > @@ -1282,7 +1305,8 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> >   		xa_unlock(&guc->context_lookup);
> > -		if (intel_context_is_pinned(ce))
> > +		if (intel_context_is_pinned(ce) &&
> > +		    !intel_context_is_child(ce))
> >   			__guc_reset_context(ce, stalled);
> >   		intel_context_put(ce);
> > @@ -1374,7 +1398,8 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
> >   		xa_unlock(&guc->context_lookup);
> > -		if (intel_context_is_pinned(ce))
> > +		if (intel_context_is_pinned(ce) &&
> > +		    !intel_context_is_child(ce))
> >   			guc_cancel_context_requests(ce);
> >   		intel_context_put(ce);
> > @@ -2067,6 +2092,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> >   	u16 guc_id;
> >   	bool enabled;
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >   	incr_context_blocked(ce);
> > @@ -2121,6 +2148,7 @@ static void guc_context_unblock(struct intel_context *ce)
> >   	bool enable;
> >   	GEM_BUG_ON(context_enabled(ce));
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > @@ -2147,11 +2175,14 @@ static void guc_context_unblock(struct intel_context *ce)
> >   static void guc_context_cancel_request(struct intel_context *ce,
> >   				       struct i915_request *rq)
> >   {
> > +	struct intel_context *block_context =
> > +		request_to_scheduling_context(rq);
> > +
> >   	if (i915_sw_fence_signaled(&rq->submit)) {
> >   		struct i915_sw_fence *fence;
> >   		intel_context_get(ce);
> > -		fence = guc_context_block(ce);
> > +		fence = guc_context_block(block_context);
> >   		i915_sw_fence_wait(fence);
> >   		if (!i915_request_completed(rq)) {
> >   			__i915_request_skip(rq);
> > @@ -2165,7 +2196,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
> >   		 */
> >   		flush_work(&ce_to_guc(ce)->ct.requests.worker);
> > -		guc_context_unblock(ce);
> > +		guc_context_unblock(block_context);
> >   		intel_context_put(ce);
> >   	}
> >   }
> > @@ -2191,6 +2222,8 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
> >   	intel_wakeref_t wakeref;
> >   	unsigned long flags;
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >   	guc_flush_submissions(guc);
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 02/26] drm/i915/guc: Take GT PM ref when deregistering context
  2021-10-07  3:37     ` [Intel-gfx] " John Harrison
@ 2021-10-08 18:23       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-08 18:23 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Wed, Oct 06, 2021 at 08:37:03PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Take a PM reference to prevent intel_gt_wait_for_idle from short
> > circuiting while a deregister context H2G is in flight. To do this we
> > must issue the deregister H2G from a worker, as a context can be
> > destroyed from an atomic context and taking a GT PM ref there blows up.
> > Previously we took a runtime PM ref from this atomic context, which
> > worked but will stop working once runtime pm autosuspend is enabled.
> > 
> > So this patch is twofold: stop intel_gt_wait_for_idle from short
> > circuiting and fix runtime pm autosuspend.
> > 
> > v2:
> >   (John Harrison)
> >    - Split structure changes out in different patch
> >   (Tvrtko)
> >    - Don't drop lock in deregister_destroyed_contexts
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.h         |   4 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  11 ++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 146 +++++++++++-------
> >   6 files changed, 121 insertions(+), 54 deletions(-)
> > 
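The deferral the commit message above describes - a destroy path that must
not sleep handing the actual teardown to a worker that may - can be modeled
in userspace with pthreads standing in for the kernel workqueue (all names
here are invented for the sketch):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t kick = PTHREAD_COND_INITIALIZER;
static int pending;

/* Stand-in for guc_context_destroy(): may run in a context that cannot
 * sleep, so it only queues the context and signals the worker. */
static void context_destroy_atomic(void)
{
	pthread_mutex_lock(&lock);
	pending++;			/* list_add_tail() in the patch */
	pthread_cond_signal(&kick);	/* queue_work() in the patch */
	pthread_mutex_unlock(&lock);
}

/* Stand-in for destroyed_worker_func(): may sleep, so it can take the
 * GT PM reference before issuing the deregister. */
static void *destroyed_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!pending)
		pthread_cond_wait(&kick, &lock);
	while (pending > 0) {
		pending--;
		printf("GT PM get -> deregister H2G -> GT PM put\n");
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, destroyed_worker, NULL);
	context_destroy_atomic();
	pthread_join(t, NULL);
	return 0;
}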
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index e9a0cad5c34d..1076066f41e0 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >   	ce->guc_id.id = GUC_INVALID_LRC_ID;
> >   	INIT_LIST_HEAD(&ce->guc_id.link);
> > +	INIT_LIST_HEAD(&ce->destroyed_link);
> > +
> >   	/*
> >   	 * Initialize fence to be complete as this is expected to be complete
> >   	 * unless there is a pending schedule disable outstanding.
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index e7e3984aab78..4613d027cbc3 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -213,6 +213,13 @@ struct intel_context {
> >   		struct list_head link;
> >   	} guc_id;
> > +	/**
> > +	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
> > +	 * list when context is pending to be destroyed (deregistered with the
> > +	 * GuC), protected by guc->submission_state.lock
> > +	 */
> > +	struct list_head destroyed_link;
> > +
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> >   	/**
> >   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > index 8520c595f5e1..6fdeae668e6e 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > @@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> >   	return intel_wakeref_is_active(&engine->wakeref);
> >   }
> > +static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
> > +{
> > +	__intel_wakeref_get(&engine->wakeref);
> > +}
> > +
> >   static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
> >   {
> >   	intel_wakeref_get(&engine->wakeref);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > index d0588d8aaa44..05de6c1af25b 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > @@ -41,6 +41,10 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
> >   	intel_wakeref_put_async(&gt->wakeref);
> >   }
> > +#define with_intel_gt_pm(gt, tmp) \
> > +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > +	     intel_gt_pm_put(gt), tmp = 0)
> > +
> >   static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
> >   {
> >   	return intel_wakeref_wait_for_idle(&gt->wakeref);
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 65b5e8eeef96..25a598e2b6e8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -84,6 +84,17 @@ struct intel_guc {
> >   		 * refs
> >   		 */
> >   		struct list_head guc_id_list;
> > +		/**
> > +		 * @destroyed_contexts: list of contexts waiting to be destroyed
> > +		 * (deregistered with the GuC)
> > +		 */
> > +		struct list_head destroyed_contexts;
> > +		/**
> > +		 * @destroyed_worker: worker to deregister contexts; needed as we
> > +		 * must take a GT PM reference and can't from the destroy
> > +		 * function as it might be called from an atomic context (no
> > +		 * sleeping)
> > +		 */
> > +		struct work_struct destroyed_worker;
> >   	} submission_state;
> >   	/**
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index ad5c18119d92..17da2fea1bff 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -90,8 +90,8 @@
> >    * used for all of GuC submission but that could change in the future.
> >    *
> >    * guc->submission_state.lock
> > - * Protects guc_id allocation for the given GuC, i.e. only one context can be
> > - * doing guc_id allocation operations at a time for each GuC in the system.
> > + * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
> > + * list.
> Feels like this should not be removing explanations, only adding to them.
> The patch itself is only adding new features not removing them. Either the
> details about id allocation are not worth mentioning and should not have
> been added in the previous patch. Or they are and should be kept rather than
> removed in this patch. Either way works for me. The comment was valid
> information but maybe counts as obvious given that the guc_id member (and
> friends) are within a per-GuC instance structure.
> 
> >    *
> >    * ce->guc_state.lock
> >    * Protects everything under ce->guc_state. Ensures that a context is in the
> > @@ -719,6 +719,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   			if (deregister)
> >   				guc_signal_context_fence(ce);
> >   			if (destroyed) {
> > +				intel_gt_pm_put_async(guc_to_gt(guc));
> >   				release_guc_id(guc, ce);
> >   				__guc_context_destroy(ce);
> >   			}
> > @@ -797,6 +798,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
> >   	spin_unlock_irqrestore(&sched_engine->lock, flags);
> >   }
> > +static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> > +
> >   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   {
> >   	int i;
> > @@ -815,6 +818,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
> >   	guc_flush_submissions(guc);
> > +	guc_flush_destroyed_contexts(guc);
> >   	/*
> >   	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
> > @@ -1126,6 +1130,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> >   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
> >   }
> > +static void destroyed_worker_func(struct work_struct *w);
> > +
> >   /*
> >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> >    * at firmware loading time.
> > @@ -1151,6 +1157,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	spin_lock_init(&guc->submission_state.lock);
> >   	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> >   	ida_init(&guc->submission_state.guc_ids);
> > +	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> > +	INIT_WORK(&guc->submission_state.destroyed_worker,
> > +		  destroyed_worker_func);
> >   	return 0;
> >   }
> > @@ -1161,6 +1170,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> >   		return;
> >   	guc_lrc_desc_pool_destroy(guc);
> > +	guc_flush_destroyed_contexts(guc);
> Seems like these lines should be reversed. We should destroy the higher
> level constructs before the lower level ones that they could be built on.
> 

Missed a few comments. Sure, will reverse.

> >   	i915_sched_engine_put(guc->sched_engine);
> >   }
> > @@ -1859,11 +1869,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
> >   static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	unsigned long flags;
> > +	bool disabled;
> > +	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
> >   	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
> >   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
> >   	GEM_BUG_ON(context_enabled(ce));
> > +	/* Seal race with Reset */
> > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +	disabled = submission_disabled(guc);
> > +	if (likely(!disabled)) {
> > +		__intel_gt_pm_get(gt);
> > +		set_context_destroyed(ce);
> > +		clr_context_registered(ce);
> > +	}
> > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	if (unlikely(disabled)) {
> > +		release_guc_id(guc, ce);
> > +		__guc_context_destroy(ce);
> > +		return;
> > +	}
> > +
> >   	deregister_context(ce, ce->guc_id.id);
> >   }
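The "Seal race with Reset" block above is the sample-and-publish pattern:
read the reset state and set our own state inside one critical section, then
branch on the sampled value afterwards, so a concurrent reset either sees
the destroyed bit or we see submission disabled - never neither. A
compilable sketch, with a mutex standing in for the irqsave spinlock and all
names invented:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t guc_state_lock = PTHREAD_MUTEX_INITIALIZER;
static bool submission_disabled;
static bool context_destroyed;

static void lrc_desc_unpin_sketch(void)
{
	bool disabled;

	pthread_mutex_lock(&guc_state_lock);
	disabled = submission_disabled;		/* sample */
	if (!disabled)
		context_destroyed = true;	/* publish */
	pthread_mutex_unlock(&guc_state_lock);

	if (disabled) {
		printf("reset in flight: release id and free now\n");
		return;
	}
	printf("send deregister H2G, free on the G2H reply\n");
}

int main(void)
{
	lrc_desc_unpin_sketch();	/* normal path */
	submission_disabled = true;
	lrc_desc_unpin_sketch();	/* reset path */
	return 0;
}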
> > @@ -1891,78 +1920,86 @@ static void __guc_context_destroy(struct intel_context *ce)
> >   	}
> >   }
> > +static void guc_flush_destroyed_contexts(struct intel_guc *guc)
> > +{
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +
> > +	GEM_BUG_ON(!submission_disabled(guc) &&
> > +		   guc_submission_initialized(guc));
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	list_for_each_entry_safe(ce, cn,
> > +				 &guc->submission_state.destroyed_contexts,
> > +				 destroyed_link) {
> > +		list_del_init(&ce->destroyed_link);
> > +		__release_guc_id(guc, ce);
> > +		__guc_context_destroy(ce);
> > +	}
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +}
> > +
> > +static void deregister_destroyed_contexts(struct intel_guc *guc)
> > +{
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	list_for_each_entry_safe(ce, cn,
> > +				 &guc->submission_state.destroyed_contexts,
> > +				 destroyed_link) {
> > +		list_del_init(&ce->destroyed_link);
> > +		guc_lrc_desc_unpin(ce);
> > +	}
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +}
> > +
> > +static void destroyed_worker_func(struct work_struct *w)
> > +{
> > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +					     submission_state.destroyed_worker);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	int tmp;
> > +
> > +	with_intel_gt_pm(gt, tmp)
> > +		deregister_destroyed_contexts(guc);
> > +}
> > +
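The with_intel_gt_pm() used by destroyed_worker_func() above is the for-loop
scope-guard idiom: the for statement runs the get once, executes the body
once, then runs the put on the way out. A standalone model of the shape
(gt_pm_get/gt_pm_put are stand-ins invented for the sketch; note that a bare
break inside the body would skip the put, a property of the idiom itself):

#include <stdio.h>

static void gt_pm_get(void) { printf("GT PM get\n"); }
static void gt_pm_put(void) { printf("GT PM put\n"); }

/* Same shape as with_intel_gt_pm(gt, tmp) in the patch above. */
#define with_gt_pm(tmp) \
	for ((tmp) = 1, gt_pm_get(); (tmp); gt_pm_put(), (tmp) = 0)

int main(void)
{
	int tmp;

	with_gt_pm(tmp)
		printf("body runs once with the PM reference held\n");
	return 0;
}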
> >   static void guc_context_destroy(struct kref *kref)
> >   {
> >   	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> > -	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > -	intel_wakeref_t wakeref;
> >   	unsigned long flags;
> > -	bool disabled;
> > +	bool destroy;
> >   	/*
> >   	 * If the guc_id is invalid this context has been stolen and we can free
> >   	 * it immediately. Also can be freed immediately if the context is not
> >   	 * registered with the GuC or the GuC is in the middle of a reset.
> >   	 */
> > -	if (context_guc_id_invalid(ce)) {
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	} else if (submission_disabled(guc) ||
> > -		   !lrc_desc_registered(guc, ce->guc_id.id)) {
> > -		release_guc_id(guc, ce);
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	}
> > -
> > -	/*
> > -	 * We have to acquire the context spinlock and check guc_id again, if it
> > -	 * is valid it hasn't been stolen and needs to be deregistered. We
> > -	 * delete this context from the list of unpinned guc_id available to
> > -	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
> > -	 * returns indicating this context has been deregistered the guc_id is
> > -	 * returned to the pool of available guc_id.
> > -	 */
> >   	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > -	if (context_guc_id_invalid(ce)) {
> > -		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > -		__guc_context_destroy(ce);
> > -		return;
> > +	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
> > +		!lrc_desc_registered(guc, ce->guc_id.id);
> > +	if (likely(!destroy)) {
> > +		if (!list_empty(&ce->guc_id.link))
> > +			list_del_init(&ce->guc_id.link);
> > +		list_add_tail(&ce->destroyed_link,
> > +			      &guc->submission_state.destroyed_contexts);
> > +	} else {
> > +		__release_guc_id(guc, ce);
> 'destroy' can be true if the guc_id is invalid. Is it good to call release
> on an invalid id?
> 

__release_guc_id protects against that; it is harmless to call.

Matt
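The idempotency Matt describes is the usual guard-and-return shape. A guess
at it for illustration only - not the driver's actual body:

#include <stdio.h>

#define INVALID_ID 0xffffu

struct ctx { unsigned int guc_id; };

/* Harmless to call with an invalid id: it simply returns. */
static void release_id_sketch(struct ctx *ce)
{
	if (ce->guc_id == INVALID_ID)
		return;
	printf("returning guc_id %u to the pool\n", ce->guc_id);
	ce->guc_id = INVALID_ID;
}

int main(void)
{
	struct ctx ce = { .guc_id = 42 };

	release_id_sketch(&ce);	/* frees 42 */
	release_id_sketch(&ce);	/* no-op */
	return 0;
}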

> John.
> 
> >   	}
> > -
> > -	if (!list_empty(&ce->guc_id.link))
> > -		list_del_init(&ce->guc_id.link);
> >   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > -
> > -	/* Seal race with Reset */
> > -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > -	disabled = submission_disabled(guc);
> > -	if (likely(!disabled)) {
> > -		set_context_destroyed(ce);
> > -		clr_context_registered(ce);
> > -	}
> > -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > -	if (unlikely(disabled)) {
> > -		release_guc_id(guc, ce);
> > +	if (unlikely(destroy)) {
> >   		__guc_context_destroy(ce);
> >   		return;
> >   	}
> >   	/*
> > -	 * We defer GuC context deregistration until the context is destroyed
> > -	 * in order to save on CTBs. With this optimization ideally we only need
> > -	 * 1 CTB to register the context during the first pin and 1 CTB to
> > -	 * deregister the context when the context is destroyed. Without this
> > -	 * optimization, a CTB would be needed every pin & unpin.
> > -	 *
> > -	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
> > -	 * from context_free_worker when runtime wakeref is not held.
> > -	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
> > -	 * in H2G CTB to deregister the context. A future patch may defer this
> > -	 * H2G CTB if the runtime wakeref is zero.
> > +	 * We use a worker to issue the H2G to deregister the context as we may
> > +	 * take the GT PM for the first time, which isn't allowed from an atomic
> > +	 * context.
> >   	 */
> > -	with_intel_runtime_pm(runtime_pm, wakeref)
> > -		guc_lrc_desc_unpin(ce);
> > +	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
> >   }
> >   static int guc_context_alloc(struct intel_context *ce)
> > @@ -2798,6 +2835,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
> >   		intel_context_put(ce);
> >   	} else if (context_destroyed(ce)) {
> >   		/* Context has been destroyed */
> > +		intel_gt_pm_put_async(guc_to_gt(guc));
> >   		release_guc_id(guc, ce);
> >   		__guc_context_destroy(ce);
> >   	}
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 07/26] drm/i915/guc: Introduce context parent-child relationship
  2021-10-07 19:35     ` [Intel-gfx] " John Harrison
@ 2021-10-08 18:33       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-08 18:33 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Thu, Oct 07, 2021 at 12:35:08PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Introduce context parent-child relationship. Once this relationship is
> > created all pinning / unpinning operations are directed to the parent
> > context. The parent context is responsible for pinning all of its'
> No need for an apostrophe.
> 

Fixed.

> > children and itself.
> > 
> > This is a precursor to the full GuC multi-lrc implementation but aligns
> > to how the GuC multi-lrc interface is defined - a single H2G is used to
> > register / deregister all of the contexts simultaneously.
> > 
> > Subsequent patches in the series will implement the pinning / unpinning
> > operations for parent / child contexts.
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add kernel doc, add wrapper to access parent to ensure safety
> > v3:
> >   (John Harrison)
> >    - Fix comment explaining GEM_BUG_ON in to_parent()
> >    - Make variable names generic (non-GuC specific)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       | 29 +++++++++++++
> >   drivers/gpu/drm/i915/gt/intel_context.h       | 41 +++++++++++++++++++
> >   drivers/gpu/drm/i915/gt/intel_context_types.h | 21 ++++++++++
> >   3 files changed, 91 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index f601323b939f..c5bb7ccfb3f8 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -403,6 +403,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >   	INIT_LIST_HEAD(&ce->destroyed_link);
> > +	INIT_LIST_HEAD(&ce->parallel.child_list);
> > +
> >   	/*
> >   	 * Initialize fence to be complete as this is expected to be complete
> >   	 * unless there is a pending schedule disable outstanding.
> > @@ -417,10 +419,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >   void intel_context_fini(struct intel_context *ce)
> >   {
> > +	struct intel_context *child, *next;
> > +
> >   	if (ce->timeline)
> >   		intel_timeline_put(ce->timeline);
> >   	i915_vm_put(ce->vm);
> > +	/* Need to put the creation ref for the children */
> > +	if (intel_context_is_parent(ce))
> > +		for_each_child_safe(ce, child, next)
> > +			intel_context_put(child);
> > +
> >   	mutex_destroy(&ce->pin_mutex);
> >   	i915_active_fini(&ce->active);
> >   	i915_sw_fence_fini(&ce->guc_state.blocked);
> > @@ -537,6 +546,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
> >   	return active;
> >   }
> > +void intel_context_bind_parent_child(struct intel_context *parent,
> > +				     struct intel_context *child)
> > +{
> > +	/*
> > +	 * It is the caller's responsibility to validate that this function is
> > +	 * used correctly, but we use GEM_BUG_ON here to ensure that they do.
> > +	 */
> > +	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
> > +	GEM_BUG_ON(intel_context_is_pinned(parent));
> > +	GEM_BUG_ON(intel_context_is_child(parent));
> > +	GEM_BUG_ON(intel_context_is_pinned(child));
> > +	GEM_BUG_ON(intel_context_is_child(child));
> > +	GEM_BUG_ON(intel_context_is_parent(child));
> > +
> > +	parent->parallel.number_children++;
> > +	list_add_tail(&child->parallel.child_link,
> > +		      &parent->parallel.child_list);
> > +	child->parallel.parent = parent;
> > +}
> > +
> >   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> >   #include "selftest_context.c"
> >   #endif
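The child_list/child_link union declared further down in
intel_context_types.h works because a context is never both parent and
child, so the parent's list head and a child's link can share storage. A
self-contained userspace model of the binding above and a for_each_child()
style walk (hand-rolled list and invented ctx type, for illustration only):

#include <stdio.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev; n->next = h;
	h->prev->next = n; h->prev = n;
}
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct ctx {
	int id;
	struct ctx *parent;	/* NULL unless child */
	int number_children;	/* nonzero only on parent */
	union {
		struct list_head child_list;	/* parent: list head */
		struct list_head child_link;	/* child: link in parent */
	};
};

static void bind_parent_child(struct ctx *parent, struct ctx *child)
{
	parent->number_children++;
	list_add_tail(&child->child_link, &parent->child_list);
	child->parent = parent;
}

int main(void)
{
	struct ctx p = { .id = 0 }, c1 = { .id = 1 }, c2 = { .id = 2 };
	struct list_head *pos;

	list_init(&p.child_list);
	bind_parent_child(&p, &c1);
	bind_parent_child(&p, &c2);

	for (pos = p.child_list.next; pos != &p.child_list; pos = pos->next)
		printf("child %d of parent %d\n",
		       container_of(pos, struct ctx, child_link)->id, p.id);
	return 0;
}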
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index c41098950746..b63c10a144af 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -44,6 +44,47 @@ void intel_context_free(struct intel_context *ce);
> >   int intel_context_reconfigure_sseu(struct intel_context *ce,
> >   				   const struct intel_sseu sseu);
> > +static inline bool intel_context_is_child(struct intel_context *ce)
> > +{
> > +	return !!ce->parallel.parent;
> > +}
> > +
> > +static inline bool intel_context_is_parent(struct intel_context *ce)
> > +{
> > +	return !!ce->parallel.number_children;
> > +}
> > +
> > +static inline bool intel_context_is_pinned(struct intel_context *ce);
> > +
> > +static inline struct intel_context *
> > +intel_context_to_parent(struct intel_context *ce)
> > +{
> > +	if (intel_context_is_child(ce)) {
> > +		/*
> > +		 * The parent holds ref count to the child so it is always safe
> > +		 * for the parent to access the child, but the child has a
> > +		 * pointer to the parent without a ref. To ensure this is safe
> > +		 * the child should only access the parent pointer while the
> > +		 * parent is pinned.
> > +		 */
> > +		GEM_BUG_ON(!intel_context_is_pinned(ce->parallel.parent));
> > +
> > +		return ce->parallel.parent;
> > +	} else {
> > +		return ce;
> > +	}
> > +}
> > +
> > +void intel_context_bind_parent_child(struct intel_context *parent,
> > +				     struct intel_context *child);
> > +
> > +#define for_each_child(parent, ce)\
> > +	list_for_each_entry(ce, &(parent)->parallel.child_list,\
> > +			    parallel.child_link)
> > +#define for_each_child_safe(parent, ce, cn)\
> > +	list_for_each_entry_safe(ce, cn, &(parent)->parallel.child_list,\
> > +				 parallel.child_link)
> > +
> >   /**
> >    * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
> >    * @ce - the context
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 4613d027cbc3..76dfca57cb45 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -220,6 +220,27 @@ struct intel_context {
> >   	 */
> >   	struct list_head destroyed_link;
> > +	/** @parallel: sub-structure for parallel submission members */
> > +	struct {
> > +		union {
> > +			/**
> > +			 * @child_list: parent's list of children
> > +			 * contexts, no protection as immutable after context
> > +			 * creation
> > +			 */
> > +			struct list_head child_list;
> > +			/**
> > +			 * @child_link: child's link into parent's list of
> > +			 * children
> > +			 */
> > +			struct list_head child_link;
> > +		};
> > +		/** @parent: pointer to parent if child */
> > +		struct intel_context *parent;
> > +		/** @number_children: number of children if parent */
> > +		u8 number_children;
> Is there any particular reason for using 'u8'? A simple 'int' can be much
> more efficient depending upon the host CPU architecture.
> 

Several other fields in the struct are u8 as well; I guess it saves a
few bytes in the struct if they are packed together. Going to leave it
as is - if we want to change to all natural sizes we can do that in a
simple follow-up patch.

Matt

> Not a blocker though. So with the typo above fixed:
> Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
> 
> > +	} parallel;
> > +
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> >   	/**
> >   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 17/26] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-11 22:09     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-11 22:09 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Introduce 'set parallel submit' extension to connect UAPI to GuC
> multi-lrc interface. Kernel doc in new uAPI should explain it all.
>
> IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> media UMD: https://github.com/intel/media-driver/pull/1252
>
> v2:
>   (Daniel Vetter)
>    - Add IGT link and placeholder for media UMD link
> v3:
>   (Kernel test robot)
>    - Fix warning in unpin engines call
>   (John Harrison)
>    - Reword a bunch of the kernel doc
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c   | 221 +++++++++++++++++-
>   .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +-
>   drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
>   .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
>   drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 ++++++++-
>   include/uapi/drm/i915_drm.h                   | 131 +++++++++++
>   9 files changed, 489 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 8c7ea6e56262..6290bc20ccb1 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -522,9 +522,150 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
>   	return 0;
>   }
>   
> +static int
> +set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
> +				      void *data)
> +{
> +	struct i915_context_engines_parallel_submit __user *ext =
> +		container_of_user(base, typeof(*ext), base);
> +	const struct set_proto_ctx_engines *set = data;
> +	struct drm_i915_private *i915 = set->i915;
> +	u64 flags;
> +	int err = 0, n, i, j;
> +	u16 slot, width, num_siblings;
> +	struct intel_engine_cs **siblings = NULL;
> +	intel_engine_mask_t prev_mask;
> +
> +	/* Disabling for now */
> +	return -ENODEV;
> +
> +	/* FIXME: This is NIY for execlists */
> +	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
> +		return -ENODEV;
> +
> +	if (get_user(slot, &ext->engine_index))
> +		return -EFAULT;
> +
> +	if (get_user(width, &ext->width))
> +		return -EFAULT;
> +
> +	if (get_user(num_siblings, &ext->num_siblings))
> +		return -EFAULT;
> +
> +	if (slot >= set->num_engines) {
> +		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
> +			slot, set->num_engines);
> +		return -EINVAL;
> +	}
> +
> +	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
> +		drm_dbg(&i915->drm,
> +			"Invalid placement[%d], already occupied\n", slot);
> +		return -EINVAL;
> +	}
> +
> +	if (get_user(flags, &ext->flags))
> +		return -EFAULT;
> +
> +	if (flags) {
> +		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
> +		return -EINVAL;
> +	}
> +
> +	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
> +		err = check_user_mbz(&ext->mbz64[n]);
> +		if (err)
> +			return err;
> +	}
> +
> +	if (width < 2) {
> +		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
> +		return -EINVAL;
> +	}
> +
> +	if (num_siblings < 1) {
> +		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
> +			num_siblings);
> +		return -EINVAL;
> +	}
> +
> +	siblings = kmalloc_array(num_siblings * width,
> +				 sizeof(*siblings),
> +				 GFP_KERNEL);
> +	if (!siblings)
> +		return -ENOMEM;
> +
> +	/* Create contexts / engines */
> +	for (i = 0; i < width; ++i) {
> +		intel_engine_mask_t current_mask = 0;
> +		struct i915_engine_class_instance prev_engine;
> +
> +		for (j = 0; j < num_siblings; ++j) {
> +			struct i915_engine_class_instance ci;
> +
> +			n = i * num_siblings + j;
> +			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
> +				err = -EFAULT;
> +				goto out_err;
> +			}
> +
> +			siblings[n] =
> +				intel_engine_lookup_user(i915, ci.engine_class,
> +							 ci.engine_instance);
> +			if (!siblings[n]) {
> +				drm_dbg(&i915->drm,
> +					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
> +					n, ci.engine_class, ci.engine_instance);
> +				err = -EINVAL;
> +				goto out_err;
> +			}
> +
> +			if (n) {
> +				if (prev_engine.engine_class !=
> +				    ci.engine_class) {
> +					drm_dbg(&i915->drm,
> +						"Mismatched class %d, %d\n",
> +						prev_engine.engine_class,
> +						ci.engine_class);
> +					err = -EINVAL;
> +					goto out_err;
> +				}
> +			}
> +
> +			prev_engine = ci;
> +			current_mask |= siblings[n]->logical_mask;
> +		}
> +
> +		if (i > 0) {
> +			if (current_mask != prev_mask << 1) {
> +				drm_dbg(&i915->drm,
> +					"Non contiguous logical mask 0x%x, 0x%x\n",
> +					prev_mask, current_mask);
> +				err = -EINVAL;
> +				goto out_err;
> +			}
> +		}
> +		prev_mask = current_mask;
> +	}
> +
> +	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
> +	set->engines[slot].num_siblings = num_siblings;
> +	set->engines[slot].width = width;
> +	set->engines[slot].siblings = siblings;
> +
> +	return 0;
> +
> +out_err:
> +	kfree(siblings);
> +
> +	return err;
> +}
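The logical-mask check above is easiest to follow with the worked examples
from the new kernel doc. A condensed sketch, assuming purely for illustration
that logical instance X contributes logical_mask BIT(X):

    width=2, num_siblings=2, engines=CS[0],CS[2],CS[1],CS[3]
    row 0: CS[0]|CS[2] -> current_mask = 0b0101
    row 1: CS[1]|CS[3] -> current_mask = 0b1010 == 0b0101 << 1   /* accepted */

    width=2, num_siblings=2, engines=CS[0],CS[1],CS[1],CS[3]
    row 0: CS[0]|CS[1] -> current_mask = 0b0011
    row 1: CS[1]|CS[3] -> current_mask = 0b1010 != 0b0011 << 1   /* -EINVAL */
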
> +
>   static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
>   	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
>   	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
> +	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
> +		set_proto_ctx_engines_parallel_submit,
>   };
>   
>   static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
> @@ -775,6 +916,25 @@ static int intel_context_set_gem(struct intel_context *ce,
>   	return ret;
>   }
>   
> +static void __unpin_engines(struct i915_gem_engines *e, unsigned int count)
> +{
> +	while (count--) {
> +		struct intel_context *ce = e->engines[count], *child;
> +
> +		if (!ce || !test_bit(CONTEXT_PERMA_PIN, &ce->flags))
> +			continue;
> +
> +		for_each_child(ce, child)
> +			intel_context_unpin(child);
> +		intel_context_unpin(ce);
> +	}
> +}
> +
> +static void unpin_engines(struct i915_gem_engines *e)
> +{
> +	__unpin_engines(e, e->num_engines);
> +}
> +
>   static void __free_engines(struct i915_gem_engines *e, unsigned int count)
>   {
>   	while (count--) {
> @@ -890,6 +1050,40 @@ static struct i915_gem_engines *default_engines(struct i915_gem_context *ctx,
>   	return err;
>   }
>   
> +static int perma_pin_contexts(struct intel_context *ce)
> +{
> +	struct intel_context *child;
> +	int i = 0, j = 0, ret;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	ret = intel_context_pin(ce);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	for_each_child(ce, child) {
> +		ret = intel_context_pin(child);
> +		if (unlikely(ret))
> +			goto unwind;
> +		++i;
> +	}
> +
> +	set_bit(CONTEXT_PERMA_PIN, &ce->flags);
> +
> +	return 0;
> +
> +unwind:
> +	intel_context_unpin(ce);
> +	for_each_child(ce, child) {
> +		if (j++ < i)
> +			intel_context_unpin(child);
> +		else
> +			break;
> +	}
> +
> +	return ret;
> +}
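For reference, the perma-pin taken here is dropped by the
unpin_engines()/__unpin_engines() helpers added above, keyed off the new
CONTEXT_PERMA_PIN bit. A condensed view of the lifetime as this patch wires
it up (the call names are the ones in this series):

    user_engines()
      -> intel_engine_create_parallel()
      -> perma_pin_contexts()      /* pins parent + children, sets
                                    * CONTEXT_PERMA_PIN on the parent */
    ...
    context_close()
      -> unpin_engines()           /* unpins parent + children */
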
> +
>   static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   					     unsigned int num_engines,
>   					     struct i915_gem_proto_engine *pe)
> @@ -903,7 +1097,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   	e->num_engines = num_engines;
>   
>   	for (n = 0; n < num_engines; n++) {
> -		struct intel_context *ce;
> +		struct intel_context *ce, *child;
>   		int ret;
>   
>   		switch (pe[n].type) {
> @@ -913,7 +1107,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   
>   		case I915_GEM_ENGINE_TYPE_BALANCED:
>   			ce = intel_engine_create_virtual(pe[n].siblings,
> -							 pe[n].num_siblings);
> +							 pe[n].num_siblings, 0);
> +			break;
> +
> +		case I915_GEM_ENGINE_TYPE_PARALLEL:
> +			ce = intel_engine_create_parallel(pe[n].siblings,
> +							  pe[n].num_siblings,
> +							  pe[n].width);
>   			break;
>   
>   		case I915_GEM_ENGINE_TYPE_INVALID:
> @@ -934,6 +1134,22 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   			err = ERR_PTR(ret);
>   			goto free_engines;
>   		}
> +		for_each_child(ce, child) {
> +			ret = intel_context_set_gem(child, ctx, pe->sseu);
> +			if (ret) {
> +				err = ERR_PTR(ret);
> +				goto free_engines;
> +			}
> +		}
> +
> +		/* XXX: Must be done after setting gem context */
There is still no explanation of this comment, either here or in the 
commit message. It needs to say why it is a problem that the perma-pin 
must be done after the above set_gem call, what must be done to fix the 
problem, and what issues can be expected because of it.

> +		if (pe[n].type == I915_GEM_ENGINE_TYPE_PARALLEL) {
> +			ret = perma_pin_contexts(ce);
> +			if (ret) {
> +				err = ERR_PTR(ret);
> +				goto free_engines;
> +			}
> +		}
>   	}
>   
>   	return e;
> @@ -1173,6 +1389,7 @@ static void context_close(struct i915_gem_context *ctx)
>   
>   	/* Flush any concurrent set_engines() */
>   	mutex_lock(&ctx->engines_mutex);
> +	unpin_engines(__context_engines_static(ctx));
>   	engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
>   	i915_gem_context_set_closed(ctx);
>   	mutex_unlock(&ctx->engines_mutex);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> index c4617e4d9fa9..eb5f9b4f2d19 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> @@ -78,6 +78,9 @@ enum i915_gem_engine_type {
>   
>   	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
>   	I915_GEM_ENGINE_TYPE_BALANCED,
> +
> +	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
> +	I915_GEM_ENGINE_TYPE_PARALLEL,
>   };
>   
>   /**
> @@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
>   	/** @num_siblings: Number of balanced siblings */
Should this be updated to say 'number of balanced or parallel siblings'?

John.

>   	unsigned int num_siblings;
>   
> +	/** @width: Width of each sibling */
> +	unsigned int width;
> +
>   	/** @siblings: Balanced siblings */
>   	struct intel_engine_cs **siblings;
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 8309d1141d0a..1d880303a7e4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -55,9 +55,13 @@ struct intel_context_ops {
>   	void (*reset)(struct intel_context *ce);
>   	void (*destroy)(struct kref *kref);
>   
> -	/* virtual engine/context interface */
> +	/* virtual/parallel engine/context interface */
>   	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
> -						unsigned int count);
> +						unsigned int count,
> +						unsigned long flags);
> +	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
> +						 unsigned int num_siblings,
> +						 unsigned int width);
>   	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
>   					       unsigned int sibling);
>   };
> @@ -113,6 +117,7 @@ struct intel_context {
>   #define CONTEXT_NOPREEMPT		8
>   #define CONTEXT_LRCA_DIRTY		9
>   #define CONTEXT_GUC_INIT		10
> +#define CONTEXT_PERMA_PIN		11
>   
>   	struct {
>   		u64 timeout_us;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 87579affb952..43f16a8347ee 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
>   	return intel_engine_has_preemption(engine);
>   }
>   
> +#define FORCE_VIRTUAL	BIT(0)
>   struct intel_context *
>   intel_engine_create_virtual(struct intel_engine_cs **siblings,
> -			    unsigned int count);
> +			    unsigned int count, unsigned long flags);
> +
> +static inline struct intel_context *
> +intel_engine_create_parallel(struct intel_engine_cs **engines,
> +			     unsigned int num_engines,
> +			     unsigned int width)
> +{
> +	GEM_BUG_ON(!engines[0]->cops->create_parallel);
> +	return engines[0]->cops->create_parallel(engines, num_engines, width);
> +}
>   
>   static inline bool
>   intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 2eb798ad068b..ff6753ccb129 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1953,16 +1953,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>   
>   struct intel_context *
>   intel_engine_create_virtual(struct intel_engine_cs **siblings,
> -			    unsigned int count)
> +			    unsigned int count, unsigned long flags)
>   {
>   	if (count == 0)
>   		return ERR_PTR(-EINVAL);
>   
> -	if (count == 1)
> +	if (count == 1 && !(flags & FORCE_VIRTUAL))
>   		return intel_context_create(siblings[0]);
>   
>   	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
> -	return siblings[0]->cops->create_virtual(siblings, count);
> +	return siblings[0]->cops->create_virtual(siblings, count, flags);
>   }
>   
>   struct i915_request *
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 5ed1e222c308..8d7f571029df 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
>   }
>   
>   static struct intel_context *
> -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +			 unsigned long flags);
>   
>   static struct i915_request *
>   __active_request(const struct intel_timeline * const tl,
> @@ -3784,7 +3785,8 @@ static void virtual_submit_request(struct i915_request *rq)
>   }
>   
>   static struct intel_context *
> -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +			 unsigned long flags)
>   {
>   	struct virtual_engine *ve;
>   	unsigned int n;
> diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> index b3863abc51f5..74986b094b96 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> @@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
>   	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
>   
>   	for (n = 0; n < nctx; n++) {
> -		ve[n] = intel_engine_create_virtual(siblings, nsibling);
> +		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
>   		if (IS_ERR(ve[n])) {
>   			err = PTR_ERR(ve[n]);
>   			nctx = n;
> @@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
>   	 * restrict it to our desired engine within the virtual engine.
>   	 */
>   
> -	ve = intel_engine_create_virtual(siblings, nsibling);
> +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
>   	if (IS_ERR(ve)) {
>   		err = PTR_ERR(ve);
>   		goto out_close;
> @@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
>   		i915_request_add(rq);
>   	}
>   
> -	ce = intel_engine_create_virtual(siblings, nsibling);
> +	ce = intel_engine_create_virtual(siblings, nsibling, 0);
>   	if (IS_ERR(ce)) {
>   		err = PTR_ERR(ce);
>   		goto out;
> @@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
>   
>   	/* XXX We do not handle oversubscription and fairness with normal rq */
>   	for (n = 0; n < nsibling; n++) {
> -		ce = intel_engine_create_virtual(siblings, nsibling);
> +		ce = intel_engine_create_virtual(siblings, nsibling, 0);
>   		if (IS_ERR(ce)) {
>   			err = PTR_ERR(ce);
>   			goto out;
> @@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
>   	if (err)
>   		goto out_scratch;
>   
> -	ve = intel_engine_create_virtual(siblings, nsibling);
> +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
>   	if (IS_ERR(ve)) {
>   		err = PTR_ERR(ve);
>   		goto out_scratch;
> @@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
>   	if (igt_spinner_init(&spin, gt))
>   		return -ENOMEM;
>   
> -	ve = intel_engine_create_virtual(siblings, nsibling);
> +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
>   	if (IS_ERR(ve)) {
>   		err = PTR_ERR(ve);
>   		goto out_spin;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f69e984683aa..9b19e0d830a2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -124,7 +124,13 @@ struct guc_virtual_engine {
>   };
>   
>   static struct intel_context *
> -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +		   unsigned long flags);
> +
> +static struct intel_context *
> +guc_create_parallel(struct intel_engine_cs **engines,
> +		    unsigned int num_siblings,
> +		    unsigned int width);
>   
>   #define GUC_REQUEST_SIZE 64 /* bytes */
>   
> @@ -2611,6 +2617,7 @@ static const struct intel_context_ops guc_context_ops = {
>   	.destroy = guc_context_destroy,
>   
>   	.create_virtual = guc_create_virtual,
> +	.create_parallel = guc_create_parallel,
>   };
>   
>   static void submit_work_cb(struct irq_work *wrk)
> @@ -2860,8 +2867,6 @@ static const struct intel_context_ops virtual_guc_context_ops = {
>   	.get_sibling = guc_virtual_get_sibling,
>   };
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
>   {
>   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> @@ -2878,8 +2883,6 @@ static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
>   	return __guc_context_pin(ce, engine, vaddr);
>   }
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
>   {
>   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> @@ -2891,8 +2894,6 @@ static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
>   	return __guc_context_pin(ce, engine, vaddr);
>   }
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static void guc_parent_context_unpin(struct intel_context *ce)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
> @@ -2908,8 +2909,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
>   	lrc_unpin(ce);
>   }
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static void guc_child_context_unpin(struct intel_context *ce)
>   {
>   	GEM_BUG_ON(context_enabled(ce));
> @@ -2920,8 +2919,6 @@ static void guc_child_context_unpin(struct intel_context *ce)
>   	lrc_unpin(ce);
>   }
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static void guc_child_context_post_unpin(struct intel_context *ce)
>   {
>   	GEM_BUG_ON(!intel_context_is_child(ce));
> @@ -2932,6 +2929,98 @@ static void guc_child_context_post_unpin(struct intel_context *ce)
>   	intel_context_unpin(ce->parallel.parent);
>   }
>   
> +static void guc_child_context_destroy(struct kref *kref)
> +{
> +	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> +
> +	__guc_context_destroy(ce);
> +}
> +
> +static const struct intel_context_ops virtual_parent_context_ops = {
> +	.alloc = guc_virtual_context_alloc,
> +
> +	.pre_pin = guc_context_pre_pin,
> +	.pin = guc_parent_context_pin,
> +	.unpin = guc_parent_context_unpin,
> +	.post_unpin = guc_context_post_unpin,
> +
> +	.ban = guc_context_ban,
> +
> +	.cancel_request = guc_context_cancel_request,
> +
> +	.enter = guc_virtual_context_enter,
> +	.exit = guc_virtual_context_exit,
> +
> +	.sched_disable = guc_context_sched_disable,
> +
> +	.destroy = guc_context_destroy,
> +
> +	.get_sibling = guc_virtual_get_sibling,
> +};
> +
> +static const struct intel_context_ops virtual_child_context_ops = {
> +	.alloc = guc_virtual_context_alloc,
> +
> +	.pre_pin = guc_context_pre_pin,
> +	.pin = guc_child_context_pin,
> +	.unpin = guc_child_context_unpin,
> +	.post_unpin = guc_child_context_post_unpin,
> +
> +	.cancel_request = guc_context_cancel_request,
> +
> +	.enter = guc_virtual_context_enter,
> +	.exit = guc_virtual_context_exit,
> +
> +	.destroy = guc_child_context_destroy,
> +
> +	.get_sibling = guc_virtual_get_sibling,
> +};
> +
> +static struct intel_context *
> +guc_create_parallel(struct intel_engine_cs **engines,
> +		    unsigned int num_siblings,
> +		    unsigned int width)
> +{
> +	struct intel_engine_cs **siblings = NULL;
> +	struct intel_context *parent = NULL, *ce, *err;
> +	int i, j;
> +
> +	siblings = kmalloc_array(num_siblings,
> +				 sizeof(*siblings),
> +				 GFP_KERNEL);
> +	if (!siblings)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for (i = 0; i < width; ++i) {
> +		for (j = 0; j < num_siblings; ++j)
> +			siblings[j] = engines[i * num_siblings + j];
> +
> +		ce = intel_engine_create_virtual(siblings, num_siblings,
> +						 FORCE_VIRTUAL);
> +		if (IS_ERR(ce)) {
> +			err = ERR_CAST(ce);
> +			goto unwind;
> +		}
> +
> +		if (i == 0) {
> +			parent = ce;
> +			parent->ops = &virtual_parent_context_ops;
> +		} else {
> +			ce->ops = &virtual_child_context_ops;
> +			intel_context_bind_parent_child(parent, ce);
> +		}
> +	}
> +
> +	kfree(siblings);
> +	return parent;
> +
> +unwind:
> +	if (parent)
> +		intel_context_put(parent);
> +	kfree(siblings);
> +	return err;
> +}
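To make the resulting shape concrete: with the Example 2 input from the uAPI
doc (width=2, num_siblings=2, engines=CS[0],CS[2],CS[1],CS[3]), the loop
above creates one FORCE_VIRTUAL engine per batch position and links them:

    parent = VE(CS[0], CS[2])   /* i == 0, virtual_parent_context_ops */
    child  = VE(CS[1], CS[3])   /* i == 1, virtual_child_context_ops,
                                 * bound to parent */
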
> +
>   static bool
>   guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
>   {
> @@ -3759,7 +3848,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   }
>   
>   static struct intel_context *
> -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +		   unsigned long flags)
>   {
>   	struct guc_virtual_engine *ve;
>   	struct intel_guc *guc;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index b1248a67b4f8..f7c19e5464ae 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1824,6 +1824,7 @@ struct drm_i915_gem_context_param {
>    * Extensions:
>    *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
>    *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
>    */
>   #define I915_CONTEXT_PARAM_ENGINES	0xa
>   
> @@ -2049,6 +2050,135 @@ struct i915_context_engines_bond {
>   	struct i915_engine_class_instance engines[N__]; \
>   } __attribute__((packed)) name__
>   
> +/**
> + * struct i915_context_engines_parallel_submit - Configure engine for
> + * parallel submission.
> + *
> + * Set up a slot in the context engine map to allow multiple BBs to be
> + * submitted in a single execbuf IOCTL. Those BBs will then be scheduled to run
> + * on the GPU in parallel. Multiple hardware contexts are created internally in
> + * the i915 to run these BBs. Once a slot is configured for N BBs, only N BBs
> + * can be submitted in each execbuf IOCTL; this is implicit behavior, i.e. the
> + * user doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows
> + * how many BBs there are based on the slot's configuration. The N BBs are the
> + * last N buffer objects, or the first N if I915_EXEC_BATCH_FIRST is set.
> + *
> + * The default placement behavior is to create implicit bonds between each
> + * context if each context maps to more than 1 physical engine (e.g. the
> + * context is a virtual engine). Also, we only allow contexts of the same
> + * engine class, and these contexts must be in logically contiguous order.
> + * Examples of the placement behavior are described below. Lastly, the default
> + * is to not allow BBs to be preempted mid-batch; rather, coordinated
> + * preemption points are inserted on all hardware contexts between each set of
> + * BBs. Flags could be added in the future to change both default behaviors.
> + *
> + * Returns -EINVAL if hardware context placement configuration is invalid or if
> + * the placement configuration isn't supported on the platform / submission
> + * interface.
> + * Returns -ENODEV if extension isn't supported on the platform / submission
> + * interface.
> + *
> + * .. code-block:: none
> + *
> + *	Example syntax:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + *
> + *	Example 1 pseudo code:
> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=1,
> + *		     engines=CS[0],CS[1])
> + *
> + *	Results in the following valid placement:
> + *	CS[0], CS[1]
> + *
> + *	Example 2 pseudo code:
> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> + *		     engines=CS[0],CS[2],CS[1],CS[3])
> + *
> + *	Results in the following valid placements:
> + *	CS[0], CS[1]
> + *	CS[2], CS[3]
> + *
> + *	This can be thought of as two virtual engines, each containing two
> + *	engines thereby making a 2D array. However, there are bonds tying the
> + *	entries together and placing restrictions on how they can be scheduled.
> + *	Specifically, the scheduler can choose only vertical columns from the 2D
> + *	array. That is, CS[0] is bonded to CS[1] and CS[2] to CS[3]. So if the
> + *	scheduler wants to submit to CS[0], it must also choose CS[1] and vice
> + *	versa. Likewise, submitting to CS[2] requires also using CS[3].
> + *	VE[0] = CS[0], CS[2]
> + *	VE[1] = CS[1], CS[3]
> + *
> + *	Example 3 pseudo code:
> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> + *		     engines=CS[0],CS[1],CS[1],CS[3])
> + *
> + *	Results in the following valid and invalid placements:
> + *	CS[0], CS[1]
> + *	CS[1], CS[3] - Not logically contiguous, return -EINVAL
> + */
> +struct i915_context_engines_parallel_submit {
> +	/**
> +	 * @base: base user extension.
> +	 */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @engine_index: slot for parallel engine
> +	 */
> +	__u16 engine_index;
> +
> +	/**
> +	 * @width: number of contexts per parallel engine or in other words the
> +	 * number of batches in each submission
> +	 */
> +	__u16 width;
> +
> +	/**
> +	 * @num_siblings: number of siblings per context or in other words the
> +	 * number of possible placements for each submission
> +	 */
> +	__u16 num_siblings;
> +
> +	/**
> +	 * @mbz16: reserved for future use; must be zero
> +	 */
> +	__u16 mbz16;
> +
> +	/**
> +	 * @flags: all undefined flags must be zero; currently no flags are defined
> +	 */
> +	__u64 flags;
> +
> +	/**
> +	 * @mbz64: reserved for future use; must be zero
> +	 */
> +	__u64 mbz64[3];
> +
> +	/**
> +	 * @engines: 2-d array of engine instances to configure parallel engine
> +	 *
> +	 * length = width (i) * num_siblings (j)
> +	 * index = j + i * num_siblings
> +	 */
> +	struct i915_engine_class_instance engines[0];
> +
> +} __packed;
> +
> +#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
> +	struct i915_user_extension base; \
> +	__u16 engine_index; \
> +	__u16 width; \
> +	__u16 num_siblings; \
> +	__u16 mbz16; \
> +	__u64 flags; \
> +	__u64 mbz64[3]; \
> +	struct i915_engine_class_instance engines[N__]; \
> +} __attribute__((packed)) name__
> +
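A minimal userspace sketch of Example 1 above at context-create time. The
structs and ioctls are existing uAPI plus the ones added by this patch; the
fd and the two render-class engine instances are illustrative assumptions:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* inside some setup function; fd = open()ed DRM device (assumed) */
    I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 2) = {
        .base = { .name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT },
        .engine_index = 0,
        .width = 2,            /* two BBs per execbuf */
        .num_siblings = 1,     /* one placement per BB */
        .engines = {           /* index = j + i * num_siblings */
            { I915_ENGINE_CLASS_RENDER, 0 },   /* CS[0], assumed */
            { I915_ENGINE_CLASS_RENDER, 1 },   /* CS[1], assumed */
        },
    };
    I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
        .extensions = (uintptr_t)&parallel,
        /* slot 0 starts INVALID; the extension configures it */
        .engines = { { I915_ENGINE_CLASS_INVALID,
                       I915_ENGINE_CLASS_INVALID_NONE } },
    };
    struct drm_i915_gem_context_create_ext_setparam p_param = {
        .base = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
        .param = {
            .param = I915_CONTEXT_PARAM_ENGINES,
            .size = sizeof(engines),
            .value = (uintptr_t)&engines,
        },
    };
    struct drm_i915_gem_context_create_ext create = {
        .flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
        .extensions = (uintptr_t)&p_param,
    };

    /* the new context id comes back in create.ctx_id on success */
    int ret = ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
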
>   /**
>    * DOC: Context Engine Map uAPI
>    *
> @@ -2108,6 +2238,7 @@ struct i915_context_param_engines {
>   	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
>   #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
>   #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
>   	struct i915_engine_class_instance engines[0];
>   } __attribute__((packed));
>   


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 24/26] drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-11 22:15     ` Daniele Ceraolo Spurio
  -1 siblings, 0 replies; 165+ messages in thread
From: Daniele Ceraolo Spurio @ 2021-10-11 22:15 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: john.c.harrison


On 10/4/2021 3:06 PM, Matthew Brost wrote:
> Parallel submission creates composite fences (dma_fence_array) for the
> excl / shared slots in objects. The I915_GEM_BUSY IOCTL checks these
> slots to determine the busyness of the object. Prior to this patch it
> only checked whether the fence in the slot was an i915_request. Update
> the check to understand composite fences and correctly report busyness.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> ---
>   drivers/gpu/drm/i915/gem/i915_gem_busy.c      | 60 +++++++++++++++----
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  5 +-
>   drivers/gpu/drm/i915/i915_request.h           |  6 ++
>   3 files changed, 58 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_busy.c b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
> index 6234e17259c1..b89d173c62eb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_busy.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
> @@ -4,6 +4,8 @@
>    * Copyright © 2014-2016 Intel Corporation
>    */
>   
> +#include <linux/dma-fence-array.h>
> +
>   #include "gt/intel_engine.h"
>   
>   #include "i915_gem_ioctls.h"
> @@ -36,7 +38,7 @@ static __always_inline u32 __busy_write_id(u16 id)
>   }
>   
>   static __always_inline unsigned int
> -__busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
> +__busy_set_if_active(struct dma_fence *fence, u32 (*flag)(u16 id))
>   {
>   	const struct i915_request *rq;
>   
> @@ -46,29 +48,63 @@ __busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
>   	 * to eventually flush us, but to minimise latency just ask the
>   	 * hardware.
>   	 *
> -	 * Note we only report on the status of native fences.
> +	 * Note we only report on the status of native fences, and we
> +	 * currently have two types of native fence:
> +	 *
> +	 * 1. A composite fence (dma_fence_array) constructed of i915 requests
> +	 * created during a parallel submission. In this case we deconstruct the
> +	 * composite fence into individual i915 requests and check the status of
> +	 * each request.
> +	 *
> +	 * 2. A single i915 request.
>   	 */
> -	if (!dma_fence_is_i915(fence))
> +	if (dma_fence_is_array(fence)) {
> +		struct dma_fence_array *array = to_dma_fence_array(fence);
> +		struct dma_fence **child = array->fences;
> +		unsigned int nchild = array->num_fences;
> +
> +		do {
> +			struct dma_fence *current_fence = *child++;
> +
> +			/* Not an i915 fence, can't be busy per above */
> +			if (!dma_fence_is_i915(current_fence) ||
> +			    !test_bit(I915_FENCE_FLAG_COMPOSITE,
> +				      &current_fence->flags)) {
> +				return 0;
> +			}
> +
> +			rq = to_request(current_fence);
> +			if (!i915_request_completed(rq)) {
> +				BUILD_BUG_ON(!typecheck(u16,
> +							rq->engine->uabi_class));
> +				return flag(rq->engine->uabi_class);
> +			}
> +		} while (--nchild);
> +
> +		/* All requests in array complete, not busy */
>   		return 0;
> +	} else {
> +		if (!dma_fence_is_i915(fence))
> +			return 0;
>   
> -	/* opencode to_request() in order to avoid const warnings */
> -	rq = container_of(fence, const struct i915_request, fence);
> -	if (i915_request_completed(rq))
> -		return 0;
> +		rq = to_request(fence);
> +		if (i915_request_completed(rq))
> +			return 0;
>   
> -	/* Beware type-expansion follies! */
> -	BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
> -	return flag(rq->engine->uabi_class);
> +		/* Beware type-expansion follies! */
> +		BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
> +		return flag(rq->engine->uabi_class);
> +	}
>   }
>   
>   static __always_inline unsigned int
> -busy_check_reader(const struct dma_fence *fence)
> +busy_check_reader(struct dma_fence *fence)
>   {
>   	return __busy_set_if_active(fence, __busy_read_flag);
>   }
>   
>   static __always_inline unsigned int
> -busy_check_writer(const struct dma_fence *fence)
> +busy_check_writer(struct dma_fence *fence)
>   {
>   	if (!fence)
>   		return 0;
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 5c7fb6f68bbb..16276f406fd6 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -2988,8 +2988,11 @@ eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
>   	if (!fences)
>   		return ERR_PTR(-ENOMEM);
>   
> -	for_each_batch_create_order(eb, i)
> +	for_each_batch_create_order(eb, i) {
>   		fences[i] = &eb->requests[i]->fence;
> +		__set_bit(I915_FENCE_FLAG_COMPOSITE,
> +			  &eb->requests[i]->fence.flags);
> +	}
>   
>   	fence_array = dma_fence_array_create(eb->num_batches,
>   					     fences,
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 24db8459376b..dc359242d1ae 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -156,6 +156,12 @@ enum {
>   	 * submission / relationship encountered an error.
>   	 */
>   	I915_FENCE_FLAG_SKIP_PARALLEL,
> +
> +	/*
> +	 * I915_FENCE_FLAG_COMPOSITE - Indicates the fence is part of a
> +	 * composite fence (dma_fence_array) generated by the i915 for
> +	 * parallel submission.
> +	 */
> +	I915_FENCE_FLAG_COMPOSITE,
>   };
>   
>   /**


^ permalink raw reply	[flat|nested] 165+ messages in thread
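
For context, a minimal userspace sketch of how the reported busyness
might be consumed (illustrative only, not code from this series; error
handling elided). The composite-fence handling above is invisible to
userspace, which still sees the existing encoding: the last writer's
engine class, offset by 1, in bits 0:15 and a bitmask of reader engine
classes in bits 16:31.

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int report_busy(int drm_fd, unsigned int handle)
{
	struct drm_i915_gem_busy busy;

	memset(&busy, 0, sizeof(busy));
	busy.handle = handle;

	if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_BUSY, &busy))
		return -1;	/* errno set by the kernel */

	if (busy.busy & 0xffff)	/* last writer, class offset by 1 */
		printf("written on engine class %u\n",
		       (busy.busy & 0xffff) - 1);
	if (busy.busy >> 16)	/* readers, one bit per engine class */
		printf("read on engine class mask 0x%x\n", busy.busy >> 16);

	return busy.busy != 0;
}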


* Re: [PATCH 17/26] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
  2021-10-11 22:09     ` [Intel-gfx] " John Harrison
@ 2021-10-11 22:59       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-11 22:59 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Mon, Oct 11, 2021 at 03:09:43PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Introduce 'set parallel submit' extension to connect UAPI to GuC
> > multi-lrc interface. Kernel doc in new uAPI should explain it all.
> > 
> > IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> > media UMD: https://github.com/intel/media-driver/pull/1252
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add IGT link and placeholder for media UMD link
> > v3:
> >   (Kernel test robot)
> >    - Fix warning in unpin engines call
> >   (John Harrison)
> >    - Reword a bunch of the kernel doc
> > 
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_context.c   | 221 +++++++++++++++++-
> >   .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +-
> >   drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
> >   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
> >   .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
> >   drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 ++++++++-
> >   include/uapi/drm/i915_drm.h                   | 131 +++++++++++
> >   9 files changed, 489 insertions(+), 28 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index 8c7ea6e56262..6290bc20ccb1 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -522,9 +522,150 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
> >   	return 0;
> >   }
> > +static int
> > +set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
> > +				      void *data)
> > +{
> > +	struct i915_context_engines_parallel_submit __user *ext =
> > +		container_of_user(base, typeof(*ext), base);
> > +	const struct set_proto_ctx_engines *set = data;
> > +	struct drm_i915_private *i915 = set->i915;
> > +	u64 flags;
> > +	int err = 0, n, i, j;
> > +	u16 slot, width, num_siblings;
> > +	struct intel_engine_cs **siblings = NULL;
> > +	intel_engine_mask_t prev_mask;
> > +
> > +	/* Disabling for now */
> > +	return -ENODEV;
> > +
> > +	/* FIXME: This is NIY for execlists */
> > +	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
> > +		return -ENODEV;
> > +
> > +	if (get_user(slot, &ext->engine_index))
> > +		return -EFAULT;
> > +
> > +	if (get_user(width, &ext->width))
> > +		return -EFAULT;
> > +
> > +	if (get_user(num_siblings, &ext->num_siblings))
> > +		return -EFAULT;
> > +
> > +	if (slot >= set->num_engines) {
> > +		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
> > +			slot, set->num_engines);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
> > +		drm_dbg(&i915->drm,
> > +			"Invalid placement[%d], already occupied\n", slot);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (get_user(flags, &ext->flags))
> > +		return -EFAULT;
> > +
> > +	if (flags) {
> > +		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
> > +		return -EINVAL;
> > +	}
> > +
> > +	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
> > +		err = check_user_mbz(&ext->mbz64[n]);
> > +		if (err)
> > +			return err;
> > +	}
> > +
> > +	if (width < 2) {
> > +		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (num_siblings < 1) {
> > +		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
> > +			num_siblings);
> > +		return -EINVAL;
> > +	}
> > +
> > +	siblings = kmalloc_array(num_siblings * width,
> > +				 sizeof(*siblings),
> > +				 GFP_KERNEL);
> > +	if (!siblings)
> > +		return -ENOMEM;
> > +
> > +	/* Create contexts / engines */
> > +	for (i = 0; i < width; ++i) {
> > +		intel_engine_mask_t current_mask = 0;
> > +		struct i915_engine_class_instance prev_engine;
> > +
> > +		for (j = 0; j < num_siblings; ++j) {
> > +			struct i915_engine_class_instance ci;
> > +
> > +			n = i * num_siblings + j;
> > +			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
> > +				err = -EFAULT;
> > +				goto out_err;
> > +			}
> > +
> > +			siblings[n] =
> > +				intel_engine_lookup_user(i915, ci.engine_class,
> > +							 ci.engine_instance);
> > +			if (!siblings[n]) {
> > +				drm_dbg(&i915->drm,
> > +					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
> > +					n, ci.engine_class, ci.engine_instance);
> > +				err = -EINVAL;
> > +				goto out_err;
> > +			}
> > +
> > +			if (n) {
> > +				if (prev_engine.engine_class !=
> > +				    ci.engine_class) {
> > +					drm_dbg(&i915->drm,
> > +						"Mismatched class %d, %d\n",
> > +						prev_engine.engine_class,
> > +						ci.engine_class);
> > +					err = -EINVAL;
> > +					goto out_err;
> > +				}
> > +			}
> > +
> > +			prev_engine = ci;
> > +			current_mask |= siblings[n]->logical_mask;
> > +		}
> > +
> > +		if (i > 0) {
> > +			if (current_mask != prev_mask << 1) {
> > +				drm_dbg(&i915->drm,
> > +					"Non contiguous logical mask 0x%x, 0x%x\n",
> > +					prev_mask, current_mask);
> > +				err = -EINVAL;
> > +				goto out_err;
> > +			}
> > +		}
> > +		prev_mask = current_mask;
> > +	}
> > +
> > +	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
> > +	set->engines[slot].num_siblings = num_siblings;
> > +	set->engines[slot].width = width;
> > +	set->engines[slot].siblings = siblings;
> > +
> > +	return 0;
> > +
> > +out_err:
> > +	kfree(siblings);
> > +
> > +	return err;
> > +}
> > +
> >   static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
> >   	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
> >   	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
> > +	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
> > +		set_proto_ctx_engines_parallel_submit,
> >   };
> >   static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
> > @@ -775,6 +916,25 @@ static int intel_context_set_gem(struct intel_context *ce,
> >   	return ret;
> >   }
> > +static void __unpin_engines(struct i915_gem_engines *e, unsigned int count)
> > +{
> > +	while (count--) {
> > +		struct intel_context *ce = e->engines[count], *child;
> > +
> > +		if (!ce || !test_bit(CONTEXT_PERMA_PIN, &ce->flags))
> > +			continue;
> > +
> > +		for_each_child(ce, child)
> > +			intel_context_unpin(child);
> > +		intel_context_unpin(ce);
> > +	}
> > +}
> > +
> > +static void unpin_engines(struct i915_gem_engines *e)
> > +{
> > +	__unpin_engines(e, e->num_engines);
> > +}
> > +
> >   static void __free_engines(struct i915_gem_engines *e, unsigned int count)
> >   {
> >   	while (count--) {
> > @@ -890,6 +1050,40 @@ static struct i915_gem_engines *default_engines(struct i915_gem_context *ctx,
> >   	return err;
> >   }
> > +static int perma_pin_contexts(struct intel_context *ce)
> > +{
> > +	struct intel_context *child;
> > +	int i = 0, j = 0, ret;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	ret = intel_context_pin(ce);
> > +	if (unlikely(ret))
> > +		return ret;
> > +
> > +	for_each_child(ce, child) {
> > +		ret = intel_context_pin(child);
> > +		if (unlikely(ret))
> > +			goto unwind;
> > +		++i;
> > +	}
> > +
> > +	set_bit(CONTEXT_PERMA_PIN, &ce->flags);
> > +
> > +	return 0;
> > +
> > +unwind:
> > +	intel_context_unpin(ce);
> > +	for_each_child(ce, child) {
> > +		if (j++ < i)
> > +			intel_context_unpin(child);
> > +		else
> > +			break;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >   static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   					     unsigned int num_engines,
> >   					     struct i915_gem_proto_engine *pe)
> > @@ -903,7 +1097,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   	e->num_engines = num_engines;
> >   	for (n = 0; n < num_engines; n++) {
> > -		struct intel_context *ce;
> > +		struct intel_context *ce, *child;
> >   		int ret;
> >   		switch (pe[n].type) {
> > @@ -913,7 +1107,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   		case I915_GEM_ENGINE_TYPE_BALANCED:
> >   			ce = intel_engine_create_virtual(pe[n].siblings,
> > -							 pe[n].num_siblings);
> > +							 pe[n].num_siblings, 0);
> > +			break;
> > +
> > +		case I915_GEM_ENGINE_TYPE_PARALLEL:
> > +			ce = intel_engine_create_parallel(pe[n].siblings,
> > +							  pe[n].num_siblings,
> > +							  pe[n].width);
> >   			break;
> >   		case I915_GEM_ENGINE_TYPE_INVALID:
> > @@ -934,6 +1134,22 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   			err = ERR_PTR(ret);
> >   			goto free_engines;
> >   		}
> > +		for_each_child(ce, child) {
> > +			ret = intel_context_set_gem(child, ctx, pe->sseu);
> > +			if (ret) {
> > +				err = ERR_PTR(ret);
> > +				goto free_engines;
> > +			}
> > +		}
> > +
> > +		/* XXX: Must be done after setting gem context */
> There is still no explanation of this comment either here or in the commit
> message. It needs to say why it is a problem that the perma-pin must be done
> after the above set_gem call. And what must be done to fix this problem. And
> what issues could be expected because of this problem.
> 

I think the issue is that intel_context_set_gem changes the ring_size
to 16k (default 4k) and the pin allocates the context state (including
the ring), so if the alloc is done before calling intel_context_set_gem
we have a mismatch between ring_size and what was allocated. I think
this results in hangs. With some reordering we could likely push the
perma-pinning into the backend function intel_engine_create_parallel.
It is fine as is, but at some point we may want to do this reordering,
so that's why I have the XXX. Will add a comment here explaining this.
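
Schematically, the ordering constraint looks like this (an illustrative
sketch based on the explanation above, not code from the patch; sseu
stands in for pe->sseu):

	ce = intel_engine_create_parallel(siblings, num_siblings, width);

	/* bumps ce->ring_size from the 4k default to 16k */
	ret = intel_context_set_gem(ce, ctx, sseu);

	/* the first pin allocates the ring from ce->ring_size */
	if (!ret)
		ret = perma_pin_contexts(ce);

Swapping the last two steps would allocate a 4k ring for a context that
expects 16k.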

> > +		if (pe[n].type == I915_GEM_ENGINE_TYPE_PARALLEL) {
> > +			ret = perma_pin_contexts(ce);
> > +			if (ret) {
> > +				err = ERR_PTR(ret);
> > +				goto free_engines;
> > +			}
> > +		}
> >   	}
> >   	return e;
> > @@ -1173,6 +1389,7 @@ static void context_close(struct i915_gem_context *ctx)
> >   	/* Flush any concurrent set_engines() */
> >   	mutex_lock(&ctx->engines_mutex);
> > +	unpin_engines(__context_engines_static(ctx));
> >   	engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
> >   	i915_gem_context_set_closed(ctx);
> >   	mutex_unlock(&ctx->engines_mutex);
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> > index c4617e4d9fa9..eb5f9b4f2d19 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> > @@ -78,6 +78,9 @@ enum i915_gem_engine_type {
> >   	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
> >   	I915_GEM_ENGINE_TYPE_BALANCED,
> > +
> > +	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
> > +	I915_GEM_ENGINE_TYPE_PARALLEL,
> >   };
> >   /**
> > @@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
> >   	/** @num_siblings: Number of balanced siblings */
> Should this be updated to say 'number of balanced or parallel siblings'?
> 

Yes it should. Will fix.

Matt

> John.
> 
> >   	unsigned int num_siblings;
> > +	/** @width: Width of each sibling */
> > +	unsigned int width;
> > +
> >   	/** @siblings: Balanced siblings */
> >   	struct intel_engine_cs **siblings;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 8309d1141d0a..1d880303a7e4 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -55,9 +55,13 @@ struct intel_context_ops {
> >   	void (*reset)(struct intel_context *ce);
> >   	void (*destroy)(struct kref *kref);
> > -	/* virtual engine/context interface */
> > +	/* virtual/parallel engine/context interface */
> >   	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
> > -						unsigned int count);
> > +						unsigned int count,
> > +						unsigned long flags);
> > +	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
> > +						 unsigned int num_siblings,
> > +						 unsigned int width);
> >   	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
> >   					       unsigned int sibling);
> >   };
> > @@ -113,6 +117,7 @@ struct intel_context {
> >   #define CONTEXT_NOPREEMPT		8
> >   #define CONTEXT_LRCA_DIRTY		9
> >   #define CONTEXT_GUC_INIT		10
> > +#define CONTEXT_PERMA_PIN		11
> >   	struct {
> >   		u64 timeout_us;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> > index 87579affb952..43f16a8347ee 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> > @@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
> >   	return intel_engine_has_preemption(engine);
> >   }
> > +#define FORCE_VIRTUAL	BIT(0)
> >   struct intel_context *
> >   intel_engine_create_virtual(struct intel_engine_cs **siblings,
> > -			    unsigned int count);
> > +			    unsigned int count, unsigned long flags);
> > +
> > +static inline struct intel_context *
> > +intel_engine_create_parallel(struct intel_engine_cs **engines,
> > +			     unsigned int num_engines,
> > +			     unsigned int width)
> > +{
> > +	GEM_BUG_ON(!engines[0]->cops->create_parallel);
> > +	return engines[0]->cops->create_parallel(engines, num_engines, width);
> > +}
> >   static inline bool
> >   intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > index 2eb798ad068b..ff6753ccb129 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > @@ -1953,16 +1953,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
> >   struct intel_context *
> >   intel_engine_create_virtual(struct intel_engine_cs **siblings,
> > -			    unsigned int count)
> > +			    unsigned int count, unsigned long flags)
> >   {
> >   	if (count == 0)
> >   		return ERR_PTR(-EINVAL);
> > -	if (count == 1)
> > +	if (count == 1 && !(flags & FORCE_VIRTUAL))
> >   		return intel_context_create(siblings[0]);
> >   	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
> > -	return siblings[0]->cops->create_virtual(siblings, count);
> > +	return siblings[0]->cops->create_virtual(siblings, count, flags);
> >   }
> >   struct i915_request *
> > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > index 5ed1e222c308..8d7f571029df 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > @@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
> >   }
> >   static struct intel_context *
> > -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> > +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> > +			 unsigned long flags);
> >   static struct i915_request *
> >   __active_request(const struct intel_timeline * const tl,
> > @@ -3784,7 +3785,8 @@ static void virtual_submit_request(struct i915_request *rq)
> >   }
> >   static struct intel_context *
> > -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> > +			 unsigned long flags)
> >   {
> >   	struct virtual_engine *ve;
> >   	unsigned int n;
> > diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> > index b3863abc51f5..74986b094b96 100644
> > --- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
> > +++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> > @@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
> >   	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
> >   	for (n = 0; n < nctx; n++) {
> > -		ve[n] = intel_engine_create_virtual(siblings, nsibling);
> > +		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
> >   		if (IS_ERR(ve[n])) {
> >   			err = PTR_ERR(ve[n]);
> >   			nctx = n;
> > @@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
> >   	 * restrict it to our desired engine within the virtual engine.
> >   	 */
> > -	ve = intel_engine_create_virtual(siblings, nsibling);
> > +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
> >   	if (IS_ERR(ve)) {
> >   		err = PTR_ERR(ve);
> >   		goto out_close;
> > @@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
> >   		i915_request_add(rq);
> >   	}
> > -	ce = intel_engine_create_virtual(siblings, nsibling);
> > +	ce = intel_engine_create_virtual(siblings, nsibling, 0);
> >   	if (IS_ERR(ce)) {
> >   		err = PTR_ERR(ce);
> >   		goto out;
> > @@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
> >   	/* XXX We do not handle oversubscription and fairness with normal rq */
> >   	for (n = 0; n < nsibling; n++) {
> > -		ce = intel_engine_create_virtual(siblings, nsibling);
> > +		ce = intel_engine_create_virtual(siblings, nsibling, 0);
> >   		if (IS_ERR(ce)) {
> >   			err = PTR_ERR(ce);
> >   			goto out;
> > @@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
> >   	if (err)
> >   		goto out_scratch;
> > -	ve = intel_engine_create_virtual(siblings, nsibling);
> > +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
> >   	if (IS_ERR(ve)) {
> >   		err = PTR_ERR(ve);
> >   		goto out_scratch;
> > @@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
> >   	if (igt_spinner_init(&spin, gt))
> >   		return -ENOMEM;
> > -	ve = intel_engine_create_virtual(siblings, nsibling);
> > +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
> >   	if (IS_ERR(ve)) {
> >   		err = PTR_ERR(ve);
> >   		goto out_spin;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index f69e984683aa..9b19e0d830a2 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -124,7 +124,13 @@ struct guc_virtual_engine {
> >   };
> >   static struct intel_context *
> > -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> > +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> > +		   unsigned long flags);
> > +
> > +static struct intel_context *
> > +guc_create_parallel(struct intel_engine_cs **engines,
> > +		    unsigned int num_siblings,
> > +		    unsigned int width);
> >   #define GUC_REQUEST_SIZE 64 /* bytes */
> > @@ -2611,6 +2617,7 @@ static const struct intel_context_ops guc_context_ops = {
> >   	.destroy = guc_context_destroy,
> >   	.create_virtual = guc_create_virtual,
> > +	.create_parallel = guc_create_parallel,
> >   };
> >   static void submit_work_cb(struct irq_work *wrk)
> > @@ -2860,8 +2867,6 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> >   	.get_sibling = guc_virtual_get_sibling,
> >   };
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> >   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > @@ -2878,8 +2883,6 @@ static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
> >   	return __guc_context_pin(ce, engine, vaddr);
> >   }
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> >   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > @@ -2891,8 +2894,6 @@ static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
> >   	return __guc_context_pin(ce, engine, vaddr);
> >   }
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static void guc_parent_context_unpin(struct intel_context *ce)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > @@ -2908,8 +2909,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
> >   	lrc_unpin(ce);
> >   }
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static void guc_child_context_unpin(struct intel_context *ce)
> >   {
> >   	GEM_BUG_ON(context_enabled(ce));
> > @@ -2920,8 +2919,6 @@ static void guc_child_context_unpin(struct intel_context *ce)
> >   	lrc_unpin(ce);
> >   }
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static void guc_child_context_post_unpin(struct intel_context *ce)
> >   {
> >   	GEM_BUG_ON(!intel_context_is_child(ce));
> > @@ -2932,6 +2929,98 @@ static void guc_child_context_post_unpin(struct intel_context *ce)
> >   	intel_context_unpin(ce->parallel.parent);
> >   }
> > +static void guc_child_context_destroy(struct kref *kref)
> > +{
> > +	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> > +
> > +	__guc_context_destroy(ce);
> > +}
> > +
> > +static const struct intel_context_ops virtual_parent_context_ops = {
> > +	.alloc = guc_virtual_context_alloc,
> > +
> > +	.pre_pin = guc_context_pre_pin,
> > +	.pin = guc_parent_context_pin,
> > +	.unpin = guc_parent_context_unpin,
> > +	.post_unpin = guc_context_post_unpin,
> > +
> > +	.ban = guc_context_ban,
> > +
> > +	.cancel_request = guc_context_cancel_request,
> > +
> > +	.enter = guc_virtual_context_enter,
> > +	.exit = guc_virtual_context_exit,
> > +
> > +	.sched_disable = guc_context_sched_disable,
> > +
> > +	.destroy = guc_context_destroy,
> > +
> > +	.get_sibling = guc_virtual_get_sibling,
> > +};
> > +
> > +static const struct intel_context_ops virtual_child_context_ops = {
> > +	.alloc = guc_virtual_context_alloc,
> > +
> > +	.pre_pin = guc_context_pre_pin,
> > +	.pin = guc_child_context_pin,
> > +	.unpin = guc_child_context_unpin,
> > +	.post_unpin = guc_child_context_post_unpin,
> > +
> > +	.cancel_request = guc_context_cancel_request,
> > +
> > +	.enter = guc_virtual_context_enter,
> > +	.exit = guc_virtual_context_exit,
> > +
> > +	.destroy = guc_child_context_destroy,
> > +
> > +	.get_sibling = guc_virtual_get_sibling,
> > +};
> > +
> > +static struct intel_context *
> > +guc_create_parallel(struct intel_engine_cs **engines,
> > +		    unsigned int num_siblings,
> > +		    unsigned int width)
> > +{
> > +	struct intel_engine_cs **siblings = NULL;
> > +	struct intel_context *parent = NULL, *ce, *err;
> > +	int i, j;
> > +
> > +	siblings = kmalloc_array(num_siblings,
> > +				 sizeof(*siblings),
> > +				 GFP_KERNEL);
> > +	if (!siblings)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	for (i = 0; i < width; ++i) {
> > +		for (j = 0; j < num_siblings; ++j)
> > +			siblings[j] = engines[i * num_siblings + j];
> > +
> > +		ce = intel_engine_create_virtual(siblings, num_siblings,
> > +						 FORCE_VIRTUAL);
> > +		if (!ce) {
> > +			err = ERR_PTR(-ENOMEM);
> > +			goto unwind;
> > +		}
> > +
> > +		if (i == 0) {
> > +			parent = ce;
> > +			parent->ops = &virtual_parent_context_ops;
> > +		} else {
> > +			ce->ops = &virtual_child_context_ops;
> > +			intel_context_bind_parent_child(parent, ce);
> > +		}
> > +	}
> > +
> > +	kfree(siblings);
> > +	return parent;
> > +
> > +unwind:
> > +	if (parent)
> > +		intel_context_put(parent);
> > +	kfree(siblings);
> > +	return err;
> > +}
> > +
> >   static bool
> >   guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
> >   {
> > @@ -3759,7 +3848,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   }
> >   static struct intel_context *
> > -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> > +		   unsigned long flags)
> >   {
> >   	struct guc_virtual_engine *ve;
> >   	struct intel_guc *guc;
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index b1248a67b4f8..f7c19e5464ae 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -1824,6 +1824,7 @@ struct drm_i915_gem_context_param {
> >    * Extensions:
> >    *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> >    *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> >    */
> >   #define I915_CONTEXT_PARAM_ENGINES	0xa
> > @@ -2049,6 +2050,135 @@ struct i915_context_engines_bond {
> >   	struct i915_engine_class_instance engines[N__]; \
> >   } __attribute__((packed)) name__
> > +/**
> > + * struct i915_context_engines_parallel_submit - Configure engine for
> > + * parallel submission.
> > + *
> > + * Setup a slot in the context engine map to allow multiple BBs to be submitted
> > + * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
> > + * in parallel. Multiple hardware contexts are created internally in the i915 to
> > + * run these BBs. Once a slot is configured for N BBs only N BBs can be
> > + * submitted in each execbuf IOCTL, and this is implicit behavior, i.e. the user
> > + * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
> > + * many BBs there are based on the slot's configuration. The N BBs are the last
> > + * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
> > + *
> > + * The default placement behavior is to create implicit bonds between each
> > + * context if each context maps to more than 1 physical engine (e.g. context is
> > + * a virtual engine). Also we only allow contexts of same engine class and these
> > + * contexts must be in logically contiguous order. Examples of the placement
> > + * behavior are described below. Lastly, the default is to not allow BBs to be
> > + * preempted mid-batch; rather, coordinated preemption points are inserted on
> > + * all hardware contexts between each set of BBs. Flags could be added in the
> > + * future to change both of these default behaviors.
> > + *
> > + * Returns -EINVAL if hardware context placement configuration is invalid or if
> > + * the placement configuration isn't supported on the platform / submission
> > + * interface.
> > + * Returns -ENODEV if extension isn't supported on the platform / submission
> > + * interface.
> > + *
> > + * .. code-block:: none
> > + *
> > + *	Examples syntax:
> > + *	CS[X] = generic engine of same class, logical instance X
> > + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > + *
> > + *	Example 1 pseudo code:
> > + *	set_engines(INVALID)
> > + *	set_parallel(engine_index=0, width=2, num_siblings=1,
> > + *		     engines=CS[0],CS[1])
> > + *
> > + *	Results in the following valid placement:
> > + *	CS[0], CS[1]
> > + *
> > + *	Example 2 pseudo code:
> > + *	set_engines(INVALID)
> > + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> > + *		     engines=CS[0],CS[2],CS[1],CS[3])
> > + *
> > + *	Results in the following valid placements:
> > + *	CS[0], CS[1]
> > + *	CS[2], CS[3]
> > + *
> > + *	This can be thought of as two virtual engines, each containing two
> > + *	engines thereby making a 2D array. However, there are bonds tying the
> > + *	entries together and placing restrictions on how they can be scheduled.
> > + *	Specifically, the scheduler can choose only vertical columns from the 2D
> > + *	array. That is, CS[0] is bonded to CS[1] and CS[2] to CS[3]. So if the
> > + *	scheduler wants to submit to CS[0], it must also choose CS[1] and vice
> > + *	versa. Same for CS[2] requires also using CS[3].
> > + *	VE[0] = CS[0], CS[2]
> > + *	VE[1] = CS[1], CS[3]
> > + *
> > + *	Example 3 pseudo code:
> > + *	set_engines(INVALID)
> > + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> > + *		     engines=CS[0],CS[1],CS[1],CS[3])
> > + *
> > + *	Results in the following valid and invalid placements:
> > + *	CS[0], CS[1]
> > + *	CS[1], CS[3] - Not logically contiguous, return -EINVAL
> > + */
> > +struct i915_context_engines_parallel_submit {
> > +	/**
> > +	 * @base: base user extension.
> > +	 */
> > +	struct i915_user_extension base;
> > +
> > +	/**
> > +	 * @engine_index: slot for parallel engine
> > +	 */
> > +	__u16 engine_index;
> > +
> > +	/**
> > +	 * @width: number of contexts per parallel engine or in other words the
> > +	 * number of batches in each submission
> > +	 */
> > +	__u16 width;
> > +
> > +	/**
> > +	 * @num_siblings: number of siblings per context or in other words the
> > +	 * number of possible placements for each submission
> > +	 */
> > +	__u16 num_siblings;
> > +
> > +	/**
> > +	 * @mbz16: reserved for future use; must be zero
> > +	 */
> > +	__u16 mbz16;
> > +
> > +	/**
> > +	 * @flags: all undefined flags must be zero, currently not defined flags
> > +	 */
> > +	__u64 flags;
> > +
> > +	/**
> > +	 * @mbz64: reserved for future use; must be zero
> > +	 */
> > +	__u64 mbz64[3];
> > +
> > +	/**
> > +	 * @engines: 2-d array of engine instances to configure parallel engine
> > +	 *
> > +	 * length = width (i) * num_siblings (j)
> > +	 * index = j + i * num_siblings
> > +	 */
> > +	struct i915_engine_class_instance engines[0];
> > +
> > +} __packed;
> > +
> > +#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
> > +	struct i915_user_extension base; \
> > +	__u16 engine_index; \
> > +	__u16 width; \
> > +	__u16 num_siblings; \
> > +	__u16 mbz16; \
> > +	__u64 flags; \
> > +	__u64 mbz64[3]; \
> > +	struct i915_engine_class_instance engines[N__]; \
> > +} __attribute__((packed)) name__
> > +
> >   /**
> >    * DOC: Context Engine Map uAPI
> >    *
> > @@ -2108,6 +2238,7 @@ struct i915_context_param_engines {
> >   	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> >   #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> >   #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> >   	struct i915_engine_class_instance engines[0];
> >   } __attribute__((packed));
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread
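
To make the uAPI above concrete, here is a hypothetical userspace
sketch (the engine class and logical instances are illustrative, error
handling is elided) that configures slot 0 as a parallel engine with
width=2 and num_siblings=1, i.e. Example 1 from the kernel doc:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int setup_parallel_slot(int drm_fd, uint32_t ctx_id)
{
	/* width (2) * num_siblings (1) = 2 engine entries */
	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 2) = {};
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_ENGINES,
		.size = sizeof(engines),
		.value = (uintptr_t)&engines,
	};

	parallel.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT;
	parallel.engine_index = 0;	/* slot 0 */
	parallel.width = 2;		/* two BBs per execbuf */
	parallel.num_siblings = 1;	/* one placement per BB */
	/* engines[j + i * num_siblings], logically contiguous order */
	parallel.engines[0].engine_class = I915_ENGINE_CLASS_VIDEO;
	parallel.engines[0].engine_instance = 0;
	parallel.engines[1].engine_class = I915_ENGINE_CLASS_VIDEO;
	parallel.engines[1].engine_instance = 1;

	/* the slot starts INVALID; the extension populates it */
	engines.engines[0].engine_class = I915_ENGINE_CLASS_INVALID;
	engines.engines[0].engine_instance = I915_ENGINE_CLASS_INVALID_NONE;
	engines.extensions = (uintptr_t)&parallel;

	return ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);
}

An execbuf IOCTL against this slot then supplies exactly two BBs: the
last two buffer objects in the exec list by default, or the first two
if I915_EXEC_BATCH_FIRST is set.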

> > +	.post_unpin = guc_child_context_post_unpin,
> > +
> > +	.cancel_request = guc_context_cancel_request,
> > +
> > +	.enter = guc_virtual_context_enter,
> > +	.exit = guc_virtual_context_exit,
> > +
> > +	.destroy = guc_child_context_destroy,
> > +
> > +	.get_sibling = guc_virtual_get_sibling,
> > +};
> > +
> > +static struct intel_context *
> > +guc_create_parallel(struct intel_engine_cs **engines,
> > +		    unsigned int num_siblings,
> > +		    unsigned int width)
> > +{
> > +	struct intel_engine_cs **siblings = NULL;
> > +	struct intel_context *parent = NULL, *ce, *err;
> > +	int i, j;
> > +
> > +	siblings = kmalloc_array(num_siblings,
> > +				 sizeof(*siblings),
> > +				 GFP_KERNEL);
> > +	if (!siblings)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	for (i = 0; i < width; ++i) {
> > +		for (j = 0; j < num_siblings; ++j)
> > +			siblings[j] = engines[i * num_siblings + j];
> > +
> > +		ce = intel_engine_create_virtual(siblings, num_siblings,
> > +						 FORCE_VIRTUAL);
> > +		if (IS_ERR(ce)) {
> > +			err = ce;
> > +			goto unwind;
> > +		}
> > +
> > +		if (i == 0) {
> > +			parent = ce;
> > +			parent->ops = &virtual_parent_context_ops;
> > +		} else {
> > +			ce->ops = &virtual_child_context_ops;
> > +			intel_context_bind_parent_child(parent, ce);
> > +		}
> > +	}
> > +
> > +	kfree(siblings);
> > +	return parent;
> > +
> > +unwind:
> > +	if (parent)
> > +		intel_context_put(parent);
> > +	kfree(siblings);
> > +	return err;
> > +}
> > +
> >   static bool
> >   guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
> >   {
> > @@ -3759,7 +3848,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   }
> >   static struct intel_context *
> > -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> > +		   unsigned long flags)
> >   {
> >   	struct guc_virtual_engine *ve;
> >   	struct intel_guc *guc;
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index b1248a67b4f8..f7c19e5464ae 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -1824,6 +1824,7 @@ struct drm_i915_gem_context_param {
> >    * Extensions:
> >    *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> >    *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> >    */
> >   #define I915_CONTEXT_PARAM_ENGINES	0xa
> > @@ -2049,6 +2050,135 @@ struct i915_context_engines_bond {
> >   	struct i915_engine_class_instance engines[N__]; \
> >   } __attribute__((packed)) name__
> > +/**
> > + * struct i915_context_engines_parallel_submit - Configure engine for
> > + * parallel submission.
> > + *
> > + * Setup a slot in the context engine map to allow multiple BBs to be submitted
> > + * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
> > + * in parallel. Multiple hardware contexts are created internally in the i915 to
> > + * run these BBs. Once a slot is configured for N BBs only N BBs can be
> > + * submitted in each execbuf IOCTL and this is implicit behavior, i.e. the user
> > + * doesn't tell the execbuf IOCTL there are N BBs; the execbuf IOCTL knows how
> > + * many BBs there are based on the slot's configuration. The N BBs are the last
> > + * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
> > + *
> > + * The default placement behavior is to create implicit bonds between each
> > + * context if each context maps to more than 1 physical engine (e.g. context is
> > + * a virtual engine). Also, we only allow contexts of the same engine class, and
> > + * these contexts must be in logically contiguous order. Examples of the placement
> > + * behavior are described below. Lastly, the default is to not allow BBs to be
> > + * preempted mid-batch. Rather, insert coordinated preemption points on all
> > + * hardware contexts between each set of BBs. Flags could be added in the future
> > + * to change both of these default behaviors.
> > + *
> > + * Returns -EINVAL if hardware context placement configuration is invalid or if
> > + * the placement configuration isn't supported on the platform / submission
> > + * interface.
> > + * Returns -ENODEV if extension isn't supported on the platform / submission
> > + * interface.
> > + *
> > + * .. code-block:: none
> > + *
> > + *	Examples syntax:
> > + *	CS[X] = generic engine of same class, logical instance X
> > + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > + *
> > + *	Example 1 pseudo code:
> > + *	set_engines(INVALID)
> > + *	set_parallel(engine_index=0, width=2, num_siblings=1,
> > + *		     engines=CS[0],CS[1])
> > + *
> > + *	Results in the following valid placement:
> > + *	CS[0], CS[1]
> > + *
> > + *	Example 2 pseudo code:
> > + *	set_engines(INVALID)
> > + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> > + *		     engines=CS[0],CS[2],CS[1],CS[3])
> > + *
> > + *	Results in the following valid placements:
> > + *	CS[0], CS[1]
> > + *	CS[2], CS[3]
> > + *
> > + *	This can be thought of as two virtual engines, each containing two
> > + *	engines thereby making a 2D array. However, there are bonds tying the
> > + *	entries together and placing restrictions on how they can be scheduled.
> > + *	Specifically, the scheduler can choose only vertical columns from the 2D
> > + *	array. That is, CS[0] is bonded to CS[1] and CS[2] to CS[3]. So if the
> > + *	scheduler wants to submit to CS[0], it must also choose CS[1] and vice
> > + *	versa. Similarly, choosing CS[2] also requires using CS[3].
> > + *	VE[0] = CS[0], CS[2]
> > + *	VE[1] = CS[1], CS[3]
> > + *
> > + *	Example 3 pseudo code:
> > + *	set_engines(INVALID)
> > + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> > + *		     engines=CS[0],CS[1],CS[1],CS[3])
> > + *
> > + *	Results in the following valid and invalid placements:
> > + *	CS[0], CS[1]
> > + *	CS[1], CS[3] - Not logically contiguous, return -EINVAL
> > + */
> > +struct i915_context_engines_parallel_submit {
> > +	/**
> > +	 * @base: base user extension.
> > +	 */
> > +	struct i915_user_extension base;
> > +
> > +	/**
> > +	 * @engine_index: slot for parallel engine
> > +	 */
> > +	__u16 engine_index;
> > +
> > +	/**
> > +	 * @width: number of contexts per parallel engine or in other words the
> > +	 * number of batches in each submission
> > +	 */
> > +	__u16 width;
> > +
> > +	/**
> > +	 * @num_siblings: number of siblings per context or in other words the
> > +	 * number of possible placements for each submission
> > +	 */
> > +	__u16 num_siblings;
> > +
> > +	/**
> > +	 * @mbz16: reserved for future use; must be zero
> > +	 */
> > +	__u16 mbz16;
> > +
> > +	/**
> > +	 * @flags: all undefined flags must be zero; currently no flags are defined
> > +	 */
> > +	__u64 flags;
> > +
> > +	/**
> > +	 * @mbz64: reserved for future use; must be zero
> > +	 */
> > +	__u64 mbz64[3];
> > +
> > +	/**
> > +	 * @engines: 2-d array of engine instances to configure parallel engine
> > +	 *
> > +	 * length = width (i) * num_siblings (j)
> > +	 * index = j + i * num_siblings
> > +	 */
> > +	struct i915_engine_class_instance engines[0];
> > +
> > +} __packed;
> > +
> > +#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
> > +	struct i915_user_extension base; \
> > +	__u16 engine_index; \
> > +	__u16 width; \
> > +	__u16 num_siblings; \
> > +	__u16 mbz16; \
> > +	__u64 flags; \
> > +	__u64 mbz64[3]; \
> > +	struct i915_engine_class_instance engines[N__]; \
> > +} __attribute__((packed)) name__
> > +
> >   /**
> >    * DOC: Context Engine Map uAPI
> >    *
> > @@ -2108,6 +2238,7 @@ struct i915_context_param_engines {
> >   	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> >   #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> >   #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> >   	struct i915_engine_class_instance engines[0];
> >   } __attribute__((packed));
> 
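
For reference, configuring such a slot from userspace would look roughly
like the below (hypothetical sketch following example 2 in the kdoc
above; the engine class / instance values are made up):

	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 4) = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.engine_index = 0,
		.width = 2,
		.num_siblings = 2,
		.engines = {
			{ I915_ENGINE_CLASS_VIDEO, 0 },	/* CS[0] */
			{ I915_ENGINE_CLASS_VIDEO, 2 },	/* CS[2] */
			{ I915_ENGINE_CLASS_VIDEO, 1 },	/* CS[1] */
			{ I915_ENGINE_CLASS_VIDEO, 3 },	/* CS[3] */
		},
	};
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
		.extensions = (uintptr_t)&parallel,
		.engines = {
			{ I915_ENGINE_CLASS_INVALID,
			  I915_ENGINE_CLASS_INVALID_NONE },
		},
	};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id,	/* previously created context */
		.param = I915_CONTEXT_PARAM_ENGINES,
		.size = sizeof(engines),
		.value = (uintptr_t)&engines,
	};

	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);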

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 20/26] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-11 23:32     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-11 23:32 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
> mid BB. To safely enable preemption at the BB boundary, a handshake
> between to parent and child is needed. This is implemented via custom
between to parent -> between parent
> emit_bb_start & emit_fini_breadcrumb functions and enabled via by
via by -> by

I'm also not seeing any mention of the forced re-group behavioural 
change in either the comments or commit description.

> default if a context is configured by set parallel extension.
>
> v2:
>   (John Harrison)
>    - Fix a few comments wording
>    - Add structure for parent page layout
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   2 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 330 +++++++++++++++++-
>   4 files changed, 324 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 3b340eb59ada..ee84259959d0 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -569,7 +569,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
>   	GEM_BUG_ON(intel_context_is_child(child));
>   	GEM_BUG_ON(intel_context_is_parent(child));
>   
> -	parent->parallel.number_children++;
> +	child->parallel.child_index = parent->parallel.number_children++;
>   	list_add_tail(&child->parallel.child_link,
>   		      &parent->parallel.child_list);
>   	child->parallel.parent = parent;
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 1d880303a7e4..95a5b94b4ece 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -250,6 +250,8 @@ struct intel_context {
>   		struct i915_request *last_rq;
>   		/** @number_children: number of children if parent */
>   		u8 number_children;
> +		/** @child_index: index into child_list if child */
> +		u8 child_index;
>   		/** @guc: GuC specific members for parallel submission */
>   		struct {
>   			/** @wqi_head: head pointer in work queue */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index a00eeddc1449..663950d3badc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -181,7 +181,7 @@ struct guc_process_desc {
>   	u32 wq_status;
>   	u32 engine_presence;
>   	u32 priority;
> -	u32 reserved[30];
> +	u32 reserved[36];
Not seeing the promised explanation of this bug fix.

>   } __packed;
>   
>   #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 12ee8ca76249..f28e36aa77c2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -11,6 +11,7 @@
>   #include "gt/intel_context.h"
>   #include "gt/intel_engine_pm.h"
>   #include "gt/intel_engine_heartbeat.h"
> +#include "gt/intel_gpu_commands.h"
>   #include "gt/intel_gt.h"
>   #include "gt/intel_gt_irq.h"
>   #include "gt/intel_gt_pm.h"
> @@ -368,10 +369,16 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
>   
>   /*
>    * When using multi-lrc submission an extra page in the context state is
> - * reserved for the process descriptor and work queue.
> + * reserved for the process descriptor, work queue, and handshake between the
> + * parent + children contexts to insert safe preemption points between each set
> + * of BBs.
>    *
>    * The layout of this page is below:
>    * 0						guc_process_desc
> + * + sizeof(struct guc_process_desc)		child go
> + * + CACHELINE_BYTES				child join[0]
> + * ...
> + * + CACHELINE_BYTES				child join[n - 1]
>    * ...						unused
>    * PAGE_SIZE / 2				work queue start
>    * ...						work queue
> @@ -379,7 +386,25 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
>    */
>   #define WQ_SIZE			(PAGE_SIZE / 2)
>   #define WQ_OFFSET		(PAGE_SIZE - WQ_SIZE)
> -static u32 __get_process_desc_offset(struct intel_context *ce)
> +
> +struct parent_page {
> +	struct guc_process_desc pdesc;
> +
> +	u32 child_go_memory;
> +	u8 unused0[CACHELINE_BYTES - sizeof(u32)];
> +
> +	struct {
> +		u32 child_join_memory;
> +		u8 unused1[CACHELINE_BYTES - sizeof(u32)];
> +	} join[MAX_ENGINE_INSTANCE + 1];
Could have a common structure for these. Call the u32 'semaphore_memory' 
or something then just have:
   struct sync_semaphore go;
   struct sync_semaphore join[MAX + 1];

> +
> +	u8 unused2[(WQ_OFFSET - sizeof(struct guc_process_desc) -
> +		    CACHELINE_BYTES * (MAX_ENGINE_INSTANCE + 2))];
And this bit could be 'sizeof(struct sync_semaphore) * (MAX + 2)' to be 
clearer what it refers to.

And to be totally paranoid about it, could also add 
'BUILD_BUG_ON(sizeof(struct sync_semaphore) != CACHELINE_BYTES)'.

And 'BUILD_BUG_ON(sizeof(struct parent_page) != PAGE_SIZE)'.
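
I.e. something like this (untested sketch, names illustrative):

   struct sync_semaphore {
	u32 semaphore_memory;
	u8 unused[CACHELINE_BYTES - sizeof(u32)];
   };

   struct parent_page {
	struct guc_process_desc pdesc;
	struct sync_semaphore go;
	struct sync_semaphore join[MAX_ENGINE_INSTANCE + 1];
	u8 unused[WQ_OFFSET - sizeof(struct guc_process_desc) -
		  sizeof(struct sync_semaphore) *
		  (MAX_ENGINE_INSTANCE + 2)];
	u32 wq[WQ_SIZE / sizeof(u32)];
   };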

> +
> +	u32 wq[WQ_SIZE / sizeof(u32)];
> +};
> +
> +static u32 __get_parent_page_offset(struct intel_context *ce)
>   {
>   	GEM_BUG_ON(!ce->parallel.guc.parent_page);
>   
> @@ -388,23 +413,35 @@ static u32 __get_process_desc_offset(struct intel_context *ce)
>   
>   static u32 __get_wq_offset(struct intel_context *ce)
>   {
> -	return __get_process_desc_offset(ce) + WQ_OFFSET;
> +	BUILD_BUG_ON(offsetof(struct parent_page, wq) != WQ_OFFSET);
> +
> +	return __get_parent_page_offset(ce) + WQ_OFFSET;
>   }
>   
> -static struct guc_process_desc *
> -__get_process_desc(struct intel_context *ce)
> +static struct parent_page *
> +__get_parent_page(struct intel_context *ce)
>   {
> +	BUILD_BUG_ON(sizeof(struct parent_page) != PAGE_SIZE);
> +
>   	/*
>   	 * Need to subtract LRC_STATE_OFFSET here as the
>   	 * parallel.guc.parent_page is the offset into ce->state while
>   	 * ce->lrc_reg_reg is ce->state + LRC_STATE_OFFSET.
>   	 */
> -	return (struct guc_process_desc *)
> +	return (struct parent_page *)
>   		(ce->lrc_reg_state +
> -		 ((__get_process_desc_offset(ce) -
> +		 ((__get_parent_page_offset(ce) -
>   		   LRC_STATE_OFFSET) / sizeof(u32)));
>   }
>   
> +static struct guc_process_desc *
> +__get_process_desc(struct intel_context *ce)
> +{
> +	struct parent_page *pp = __get_parent_page(ce);
> +
> +	return &pp->pdesc;
> +}
> +
>   static u32 *get_wq_pointer(struct guc_process_desc *desc,
>   			   struct intel_context *ce,
>   			   u32 wqi_size)
> @@ -424,8 +461,7 @@ static u32 *get_wq_pointer(struct guc_process_desc *desc,
>   	}
>   #undef AVAILABLE_SPACE
>   
> -	return ((u32 *)__get_process_desc(ce)) +
> -		((WQ_OFFSET + ce->parallel.guc.wqi_tail) / sizeof(u32));
> +	return &__get_parent_page(ce)->wq[ce->parallel.guc.wqi_tail / sizeof(u32)];
>   }
>   
>   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> @@ -1829,6 +1865,26 @@ static int deregister_context(struct intel_context *ce, u32 guc_id)
>   	return __guc_action_deregister_context(guc, guc_id);
>   }
>   
> +static inline void clear_children_join_go_memory(struct intel_context *ce)
> +{
> +	u32 *mem = (u32 *)(&__get_parent_page(ce)->child_go_memory);
> +	u8 i;
> +
> +	for (i = 0; i < ce->parallel.number_children + 1; ++i)
> +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
Can't this be written as:
   pp->child_go_memory = 0;
   for (i = 0; i < ce->parallel.number_children; ++i)
     pp->join[i].child_join_memory = 0;

Seems like that would be much clearer than this magic casting and 
offsetting. I mean, that was the whole point of creating the parent_page 
structure.


> +}
> +
> +static inline u32 get_children_go_value(struct intel_context *ce)
> +{
> +	return __get_parent_page(ce)->child_go_memory;
> +}
> +
> +static inline u32 get_children_join_value(struct intel_context *ce,
> +					  u8 child_index)
> +{
> +	return __get_parent_page(ce)->join[child_index].child_join_memory;
> +}
> +
>   static void guc_context_policy_init(struct intel_engine_cs *engine,
>   				    struct guc_lrc_desc *desc)
>   {
> @@ -1888,7 +1944,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   		ce->parallel.guc.wqi_head = 0;
>   
>   		desc->process_desc = i915_ggtt_offset(ce->state) +
> -			__get_process_desc_offset(ce);
> +			__get_parent_page_offset(ce);
>   		desc->wq_addr = i915_ggtt_offset(ce->state) +
>   			__get_wq_offset(ce);
>   		desc->wq_size = WQ_SIZE;
> @@ -1910,6 +1966,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>   			guc_context_policy_init(engine, desc);
>   		}
> +
> +		clear_children_join_go_memory(ce);
>   	}
>   
>   	/*
> @@ -2976,6 +3034,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
>   	.get_sibling = guc_virtual_get_sibling,
>   };
>   
> +/*
> + * The below override of the breadcrumbs is enabled when the user configures a
> + * context for parallel submission (multi-lrc, parent-child).
> + *
> + * The overridden breadcrumbs implement an algorithm which allows the GuC to
> + * safely preempt all the hw contexts configured for parallel submission
> + * between each BB. The contract between the i915 and GuC is that if the parent
> + * context can be preempted, all the children can be preempted, and the GuC will
> + * always try to preempt the parent before the children. A handshake between the
> + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
> + * creating a window to preempt between each set of BBs.
> + */
> +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						     u64 offset, u32 len,
> +						     const unsigned int flags);
> +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> +						    u64 offset, u32 len,
> +						    const unsigned int flags);
> +static u32 *
> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						 u32 *cs);
> +static u32 *
> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> +						u32 *cs);
> +
>   static struct intel_context *
>   guc_create_parallel(struct intel_engine_cs **engines,
>   		    unsigned int num_siblings,
> @@ -3011,6 +3094,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
>   		}
>   	}
>   
> +	parent->engine->emit_bb_start =
> +		emit_bb_start_parent_no_preempt_mid_batch;
> +	parent->engine->emit_fini_breadcrumb =
> +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
> +	parent->engine->emit_fini_breadcrumb_dw =
> +		12 + 4 * parent->parallel.number_children;
> +	for_each_child(parent, ce) {
> +		ce->engine->emit_bb_start =
> +			emit_bb_start_child_no_preempt_mid_batch;
> +		ce->engine->emit_fini_breadcrumb =
> +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
> +		ce->engine->emit_fini_breadcrumb_dw = 16;
> +	}
> +
>   	kfree(siblings);
>   	return parent;
>   
> @@ -3840,6 +3937,17 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   			drm_printf(p, "\t\tWQI Status: %u\n\n",
>   				   READ_ONCE(desc->wq_status));
>   
> +			if (ce->engine->emit_bb_start ==
> +			    emit_bb_start_parent_no_preempt_mid_batch) {
> +				u8 i;
> +
> +				drm_printf(p, "\t\tChildren Go: %u\n\n",
> +					   get_children_go_value(ce));
> +				for (i = 0; i < ce->parallel.number_children; ++i)
> +					drm_printf(p, "\t\tChildren Join: %u\n",
> +						   get_children_join_value(ce, i));
> +			}
> +
>   			for_each_child(ce, child)
>   				guc_log_context(p, child);
>   		}
> @@ -3847,6 +3955,208 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   	xa_unlock_irqrestore(&guc->context_lookup, flags);
>   }
>   
> +static inline u32 get_children_go_addr(struct intel_context *ce)
> +{
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +	BUILD_BUG_ON(offsetof(struct parent_page, child_go_memory) !=
> +		     sizeof(struct guc_process_desc));
> +
> +	return i915_ggtt_offset(ce->state) +
> +		__get_parent_page_offset(ce) +
> +		sizeof(struct guc_process_desc);
Rather than relying on the BUILD_BUG to make sure that the magic 
calculation matches the structure definition, can't this just say 
"ggtt_offset + pp_offset + offsetof(pp, child_go)"?

> +}
> +
> +static inline u32 get_children_join_addr(struct intel_context *ce,
> +					 u8 child_index)
> +{
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
"ggtt_offset + pp_offset + offsetof(pp, child_join[i])"?


> +}
> +
> +#define PARENT_GO_BB			1
> +#define PARENT_GO_FINI_BREADCRUMB	0
> +#define CHILD_GO_BB			1
> +#define CHILD_GO_FINI_BREADCRUMB	0
> +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						     u64 offset, u32 len,
> +						     const unsigned int flags)
> +{
> +	struct intel_context *ce = rq->context;
> +	u32 *cs;
> +	u8 i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	cs = intel_ring_begin(rq, 10 + 4 * ce->parallel.number_children);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	/* Wait on children */
> +	for (i = 0; i < ce->parallel.number_children; ++i) {
> +		*cs++ = (MI_SEMAPHORE_WAIT |
> +			 MI_SEMAPHORE_GLOBAL_GTT |
> +			 MI_SEMAPHORE_POLL |
> +			 MI_SEMAPHORE_SAD_EQ_SDD);
> +		*cs++ = PARENT_GO_BB;
> +		*cs++ = get_children_join_addr(ce, i);
> +		*cs++ = 0;
> +	}
> +
> +	/* Turn off preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> +	*cs++ = MI_NOOP;
> +
> +	/* Tell children go */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  CHILD_GO_BB,
> +				  get_children_go_addr(ce),
> +				  0);
> +
> +	/* Jump to batch */
> +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> +	*cs++ = lower_32_bits(offset);
> +	*cs++ = upper_32_bits(offset);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> +						    u64 offset, u32 len,
> +						    const unsigned int flags)
> +{
> +	struct intel_context *ce = rq->context;
> +	struct intel_context *parent = intel_context_to_parent(ce);
> +	u32 *cs;
> +
> +	GEM_BUG_ON(!intel_context_is_child(ce));
> +
> +	cs = intel_ring_begin(rq, 12);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	/* Signal parent */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  PARENT_GO_BB,
> +				  get_children_join_addr(parent,
> +							 ce->parallel.child_index),
> +				  0);
> +
> +	/* Wait on parent for go */
> +	*cs++ = (MI_SEMAPHORE_WAIT |
> +		 MI_SEMAPHORE_GLOBAL_GTT |
> +		 MI_SEMAPHORE_POLL |
> +		 MI_SEMAPHORE_SAD_EQ_SDD);
> +	*cs++ = CHILD_GO_BB;
> +	*cs++ = get_children_go_addr(parent);
> +	*cs++ = 0;
> +
> +	/* Turn off preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> +
> +	/* Jump to batch */
> +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> +	*cs++ = lower_32_bits(offset);
> +	*cs++ = upper_32_bits(offset);
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						 u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +	u8 i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	/* Wait on children */
> +	for (i = 0; i < ce->parallel.number_children; ++i) {
> +		*cs++ = (MI_SEMAPHORE_WAIT |
> +			 MI_SEMAPHORE_GLOBAL_GTT |
> +			 MI_SEMAPHORE_POLL |
> +			 MI_SEMAPHORE_SAD_EQ_SDD);
> +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
> +		*cs++ = get_children_join_addr(ce, i);
> +		*cs++ = 0;
> +	}
> +
> +	/* Turn on preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> +	*cs++ = MI_NOOP;
> +
You mentioned possibly needing to add an MI_ARB_CHECK in here but I'm 
not seeing it. Did the testing happen? I don't see that it should be 
necessary. Once you execute the MI_ARB_ENABLE, the CS can preempt 
anywhere, I thought? Even if it can't there should be an MI_ARB_CHECK 
added at the next level up after the breadcrumb code. Or do we not have 
those in between batches any more?

John.


> +	/* Tell children go */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  CHILD_GO_FINI_BREADCRUMB,
> +				  get_children_go_addr(ce),
> +				  0);
> +
> +	/* Emit fini breadcrumb */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  rq->fence.seqno,
> +				  i915_request_active_timeline(rq)->hwsp_offset,
> +				  0);
> +
> +	/* User interrupt */
> +	*cs++ = MI_USER_INTERRUPT;
> +	*cs++ = MI_NOOP;
> +
> +	rq->tail = intel_ring_offset(rq, cs);
> +
> +	return cs;
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +	struct intel_context *parent = intel_context_to_parent(ce);
> +
> +	GEM_BUG_ON(!intel_context_is_child(ce));
> +
> +	/* Turn on preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> +	*cs++ = MI_NOOP;
> +
> +	/* Signal parent */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  PARENT_GO_FINI_BREADCRUMB,
> +				  get_children_join_addr(parent,
> +							 ce->parallel.child_index),
> +				  0);
> +
> +	/* Wait on parent for go */
> +	*cs++ = (MI_SEMAPHORE_WAIT |
> +		 MI_SEMAPHORE_GLOBAL_GTT |
> +		 MI_SEMAPHORE_POLL |
> +		 MI_SEMAPHORE_SAD_EQ_SDD);
> +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
> +	*cs++ = get_children_go_addr(parent);
> +	*cs++ = 0;
> +
> +	/* Emit fini breadcrumb */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  rq->fence.seqno,
> +				  i915_request_active_timeline(rq)->hwsp_offset,
> +				  0);
> +
> +	/* User interrupt */
> +	*cs++ = MI_USER_INTERRUPT;
> +	*cs++ = MI_NOOP;
> +
> +	rq->tail = intel_ring_offset(rq, cs);
> +
> +	return cs;
> +}
> +
>   static struct intel_context *
>   guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
>   		   unsigned long flags)


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 24/26] drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-12  7:53   ` Tvrtko Ursulin
  2021-10-12 18:31     ` Matthew Brost
  -1 siblings, 1 reply; 165+ messages in thread
From: Tvrtko Ursulin @ 2021-10-12  7:53 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: john.c.harrison, daniele.ceraolospurio


On 04/10/2021 23:06, Matthew Brost wrote:
> Parallel submission creates composite fences (dma_fence_array) for excl /
> shared slots in objects. The I915_GEM_BUSY IOCTL checks these slots to
> determine the busyness of the object. Prior to this patch it only checked
> if the fence in the slot was an i915_request. Update the check to
> understand composite fences and correctly report the busyness.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_busy.c      | 60 +++++++++++++++----
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  5 +-
>   drivers/gpu/drm/i915/i915_request.h           |  6 ++
>   3 files changed, 58 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_busy.c b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
> index 6234e17259c1..b89d173c62eb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_busy.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
> @@ -4,6 +4,8 @@
>    * Copyright © 2014-2016 Intel Corporation
>    */
>   
> +#include <linux/dma-fence-array.h>
> +
>   #include "gt/intel_engine.h"
>   
>   #include "i915_gem_ioctls.h"
> @@ -36,7 +38,7 @@ static __always_inline u32 __busy_write_id(u16 id)
>   }
>   
>   static __always_inline unsigned int
> -__busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
> +__busy_set_if_active(struct dma_fence *fence, u32 (*flag)(u16 id))
>   {
>   	const struct i915_request *rq;
>   
> @@ -46,29 +48,63 @@ __busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
>   	 * to eventually flush us, but to minimise latency just ask the
>   	 * hardware.
>   	 *
> -	 * Note we only report on the status of native fences.
> +	 * Note we only report on the status of native fences and we currently
> +	 * have two native fences:
> +	 *
> +	 * 1. A composite fence (dma_fence_array) constructed of i915 requests
> +	 * created during a parallel submission. In this case we deconstruct the
> +	 * composite fence into individual i915 requests and check the status of
> +	 * each request.
> +	 *
> +	 * 2. A single i915 request.
>   	 */
> -	if (!dma_fence_is_i915(fence))
> +	if (dma_fence_is_array(fence)) {
> +		struct dma_fence_array *array = to_dma_fence_array(fence);
> +		struct dma_fence **child = array->fences;
> +		unsigned int nchild = array->num_fences;
> +
> +		do {
> +			struct dma_fence *current_fence = *child++;
> +
> +			/* Not an i915 fence, can't be busy per above */
> +			if (!dma_fence_is_i915(current_fence) ||
> +			    !test_bit(I915_FENCE_FLAG_COMPOSITE,
> +				      &current_fence->flags)) {
> +				return 0;
> +			}
> +
> +			rq = to_request(current_fence);
> +			if (!i915_request_completed(rq)) {
> +				BUILD_BUG_ON(!typecheck(u16,
> +							rq->engine->uabi_class));
> +				return flag(rq->engine->uabi_class);
> +			}
> +		} while (--nchild);

Do you even need to introduce I915_FENCE_FLAG_COMPOSITE? If parallel 
submit is the only possible creator of array fences then possibly not. 
That would probably even result in less code, and code which keeps 
working in a hypothetical future. Otherwise you could add a debug 
BUG_ON if an array fence contains a fence without 
I915_FENCE_FLAG_COMPOSITE set.

Secondly, I'd also run the whole loop and not return on first busy or 
incompatible for simplicity.

And finally, with all of the above in place, I think you could have a 
common function for the below (checking one fence) and call that both 
for a single fence and from the array loop above, for less duplication. 
(Even the BUILD_BUG_ON is duplicated, which makes no sense!)

End result would be a simpler patch like:

__busy_set_if_active_one(...)
{
    .. existing __busy_set_if_active ..
}

__busy_set_if_active(..)
{
   ...
   if (dma_fence_is_array(fence)) {
	...
	for (i = 0; i < array->num_fences; i++)
		flags |= __busy_set_if_active_one(...);
   } else {
	flags = __busy_set_if_active_one(...);
   }

   return flags;
}

Regards,

Tvrtko

> +
> +		/* All requests in array complete, not busy */
>   		return 0;
> +	} else {
> +		if (!dma_fence_is_i915(fence))
> +			return 0;
>   
> -	/* opencode to_request() in order to avoid const warnings */
> -	rq = container_of(fence, const struct i915_request, fence);
> -	if (i915_request_completed(rq))
> -		return 0;
> +		rq = to_request(fence);
> +		if (i915_request_completed(rq))
> +			return 0;
>   
> -	/* Beware type-expansion follies! */
> -	BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
> -	return flag(rq->engine->uabi_class);
> +		/* Beware type-expansion follies! */
> +		BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
> +		return flag(rq->engine->uabi_class);
> +	}
>   }
>   
>   static __always_inline unsigned int
> -busy_check_reader(const struct dma_fence *fence)
> +busy_check_reader(struct dma_fence *fence)
>   {
>   	return __busy_set_if_active(fence, __busy_read_flag);
>   }
>   
>   static __always_inline unsigned int
> -busy_check_writer(const struct dma_fence *fence)
> +busy_check_writer(struct dma_fence *fence)
>   {
>   	if (!fence)
>   		return 0;
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 5c7fb6f68bbb..16276f406fd6 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -2988,8 +2988,11 @@ eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
>   	if (!fences)
>   		return ERR_PTR(-ENOMEM);
>   
> -	for_each_batch_create_order(eb, i)
> +	for_each_batch_create_order(eb, i) {
>   		fences[i] = &eb->requests[i]->fence;
> +		__set_bit(I915_FENCE_FLAG_COMPOSITE,
> +			  &eb->requests[i]->fence.flags);
> +	}
>   
>   	fence_array = dma_fence_array_create(eb->num_batches,
>   					     fences,
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 24db8459376b..dc359242d1ae 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -156,6 +156,12 @@ enum {
>   	 * submission / relationship encoutered an error.
>   	 */
>   	I915_FENCE_FLAG_SKIP_PARALLEL,
> +
> +	/*
> +	 * I915_FENCE_FLAG_COMPOSITE - Indicates fence is part of a composite
> +	 * fence (dma_fence_array) and i915 generated for parallel submission.
> +	 */
> +	I915_FENCE_FLAG_COMPOSITE,
>   };
>   
>   /**
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* [PATCH 02/26] drm/i915/guc: Take GT PM ref when deregistering context
  2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
@ 2021-10-12 18:11   ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-12 18:11 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison

Take a GT PM reference to prevent intel_gt_wait_for_idle from short
circuiting while a context deregister H2G is in flight. To do this we
must issue the deregister H2G from a worker, as a context can be
destroyed from an atomic context and taking a GT PM reference there
blows up. Previously we took a runtime PM reference from this atomic
context, which worked, but will stop working once runtime PM
autosuspend is enabled.

So this patch is twofold: stop intel_gt_wait_for_idle from short
circuiting and fix runtime PM autosuspend.
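
Roughly, the new flow is (simplified sketch, not the literal diff below):

  static void guc_context_destroy(struct kref *kref)
  {
	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
	struct intel_guc *guc = ce_to_guc(ce);
	unsigned long flags;

	/* Can be called from atomic context - only queue, never sleep */
	spin_lock_irqsave(&guc->submission_state.lock, flags);
	list_add_tail(&ce->destroyed_link,
		      &guc->submission_state.destroyed_contexts);
	spin_unlock_irqrestore(&guc->submission_state.lock, flags);

	queue_work(system_unbound_wq,
		   &guc->submission_state.destroyed_worker);
  }

  static void destroyed_worker_func(struct work_struct *w)
  {
	struct intel_guc *guc = container_of(w, struct intel_guc,
					     submission_state.destroyed_worker);
	int tmp;

	/* Worker can sleep, so taking a GT PM ref here is safe */
	with_intel_gt_pm(guc_to_gt(guc), tmp)
		deregister_destroyed_contexts(guc);
  }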

v2:
 (John Harrison)
  - Split structure changes out in different patch
 (Tvrtko)
  - Don't drop lock in deregister_destroyed_contexts
v3:
 (John Harrison)
  - Flush destroyed contexts before destroying context reg pool

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |   4 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  11 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 146 +++++++++++-------
 6 files changed, 121 insertions(+), 54 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e9a0cad5c34d..1076066f41e0 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 	ce->guc_id.id = GUC_INVALID_LRC_ID;
 	INIT_LIST_HEAD(&ce->guc_id.link);
 
+	INIT_LIST_HEAD(&ce->destroyed_link);
+
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index e7e3984aab78..4613d027cbc3 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -213,6 +213,13 @@ struct intel_context {
 		struct list_head link;
 	} guc_id;
 
+	/**
+	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
+	 * list when context is pending to be destroyed (deregistered with the
+	 * GuC), protected by guc->submission_state.lock
+	 */
+	struct list_head destroyed_link;
+
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
 	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
index 8520c595f5e1..6fdeae668e6e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
@@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
 	return intel_wakeref_is_active(&engine->wakeref);
 }
 
+static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
+{
+	__intel_wakeref_get(&engine->wakeref);
+}
+
 static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
 {
 	intel_wakeref_get(&engine->wakeref);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index d0588d8aaa44..05de6c1af25b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -41,6 +41,10 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
 	intel_wakeref_put_async(&gt->wakeref);
 }
 
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)
+
 static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
 {
 	return intel_wakeref_wait_for_idle(&gt->wakeref);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 82e248c2290c..74f071a0b6d5 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -90,6 +90,17 @@ struct intel_guc {
 		 * refs
 		 */
 		struct list_head guc_id_list;
+		/**
+		 * @destroyed_contexts: list of contexts waiting to be destroyed
+		 * (deregistered with the GuC)
+		 */
+		struct list_head destroyed_contexts;
+		/**
+		 * @destroyed_worker: worker to deregister contexts, need as we
+		 * need to take a GT PM reference and can't from destroy
+		 * function as it might be in an atomic context (no sleeping)
+		 */
+		struct work_struct destroyed_worker;
 	} submission_state;
 
 	/**
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b2646b088c7f..d2ce47b5541e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -90,8 +90,8 @@
  * used for all of GuC submission but that could change in the future.
  *
  * guc->submission_state.lock
- * Protects guc_id allocation for the given GuC, i.e. only one context can be
- * doing guc_id allocation operations at a time for each GuC in the system.
+ * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
+ * list.
  *
  * ce->guc_state.lock
  * Protects everything under ce->guc_state. Ensures that a context is in the
@@ -719,6 +719,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 			if (deregister)
 				guc_signal_context_fence(ce);
 			if (destroyed) {
+				intel_gt_pm_put_async(guc_to_gt(guc));
 				release_guc_id(guc, ce);
 				__guc_context_destroy(ce);
 			}
@@ -797,6 +798,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
+static void guc_flush_destroyed_contexts(struct intel_guc *guc);
+
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
 	int i;
@@ -815,6 +818,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
 
 	guc_flush_submissions(guc);
+	guc_flush_destroyed_contexts(guc);
 
 	/*
 	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
@@ -1126,6 +1130,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 	intel_gt_unpark_heartbeats(guc_to_gt(guc));
 }
 
+static void destroyed_worker_func(struct work_struct *w);
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
@@ -1151,6 +1157,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	spin_lock_init(&guc->submission_state.lock);
 	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
 	ida_init(&guc->submission_state.guc_ids);
+	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
+	INIT_WORK(&guc->submission_state.destroyed_worker,
+		  destroyed_worker_func);
 
 	return 0;
 }
@@ -1160,6 +1169,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 	if (!guc->lrc_desc_pool)
 		return;
 
+	guc_flush_destroyed_contexts(guc);
 	guc_lrc_desc_pool_destroy(guc);
 	i915_sched_engine_put(guc->sched_engine);
 }
@@ -1859,11 +1869,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
 static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
+	struct intel_gt *gt = guc_to_gt(guc);
+	unsigned long flags;
+	bool disabled;
 
+	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
 	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
 	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
 	GEM_BUG_ON(context_enabled(ce));
 
+	/* Seal race with Reset */
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	disabled = submission_disabled(guc);
+	if (likely(!disabled)) {
+		__intel_gt_pm_get(gt);
+		set_context_destroyed(ce);
+		clr_context_registered(ce);
+	}
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	if (unlikely(disabled)) {
+		release_guc_id(guc, ce);
+		__guc_context_destroy(ce);
+		return;
+	}
+
 	deregister_context(ce, ce->guc_id.id);
 }
 
@@ -1891,78 +1920,86 @@ static void __guc_context_destroy(struct intel_context *ce)
 	}
 }
 
+static void guc_flush_destroyed_contexts(struct intel_guc *guc)
+{
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+
+	GEM_BUG_ON(!submission_disabled(guc) &&
+		   guc_submission_initialized(guc));
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	list_for_each_entry_safe(ce, cn,
+				 &guc->submission_state.destroyed_contexts,
+				 destroyed_link) {
+		list_del_init(&ce->destroyed_link);
+		__release_guc_id(guc, ce);
+		__guc_context_destroy(ce);
+	}
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+}
+
+static void deregister_destroyed_contexts(struct intel_guc *guc)
+{
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	list_for_each_entry_safe(ce, cn,
+				 &guc->submission_state.destroyed_contexts,
+				 destroyed_link) {
+		list_del_init(&ce->destroyed_link);
+		guc_lrc_desc_unpin(ce);
+	}
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+}
+
+static void destroyed_worker_func(struct work_struct *w)
+{
+	struct intel_guc *guc = container_of(w, struct intel_guc,
+					     submission_state.destroyed_worker);
+	struct intel_gt *gt = guc_to_gt(guc);
+	int tmp;
+
+	with_intel_gt_pm(gt, tmp)
+		deregister_destroyed_contexts(guc);
+}
+
 static void guc_context_destroy(struct kref *kref)
 {
 	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
-	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	struct intel_guc *guc = ce_to_guc(ce);
-	intel_wakeref_t wakeref;
 	unsigned long flags;
-	bool disabled;
+	bool destroy;
 
 	/*
 	 * If the guc_id is invalid this context has been stolen and we can free
 	 * it immediately. Also can be freed immediately if the context is not
 	 * registered with the GuC or the GuC is in the middle of a reset.
 	 */
-	if (context_guc_id_invalid(ce)) {
-		__guc_context_destroy(ce);
-		return;
-	} else if (submission_disabled(guc) ||
-		   !lrc_desc_registered(guc, ce->guc_id.id)) {
-		release_guc_id(guc, ce);
-		__guc_context_destroy(ce);
-		return;
-	}
-
-	/*
-	 * We have to acquire the context spinlock and check guc_id again, if it
-	 * is valid it hasn't been stolen and needs to be deregistered. We
-	 * delete this context from the list of unpinned guc_id available to
-	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
-	 * returns indicating this context has been deregistered the guc_id is
-	 * returned to the pool of available guc_id.
-	 */
 	spin_lock_irqsave(&guc->submission_state.lock, flags);
-	if (context_guc_id_invalid(ce)) {
-		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
-		__guc_context_destroy(ce);
-		return;
+	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
+		!lrc_desc_registered(guc, ce->guc_id.id);
+	if (likely(!destroy)) {
+		if (!list_empty(&ce->guc_id.link))
+			list_del_init(&ce->guc_id.link);
+		list_add_tail(&ce->destroyed_link,
+			      &guc->submission_state.destroyed_contexts);
+	} else {
+		__release_guc_id(guc, ce);
 	}
-
-	if (!list_empty(&ce->guc_id.link))
-		list_del_init(&ce->guc_id.link);
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
-
-	/* Seal race with Reset */
-	spin_lock_irqsave(&ce->guc_state.lock, flags);
-	disabled = submission_disabled(guc);
-	if (likely(!disabled)) {
-		set_context_destroyed(ce);
-		clr_context_registered(ce);
-	}
-	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-	if (unlikely(disabled)) {
-		release_guc_id(guc, ce);
+	if (unlikely(destroy)) {
 		__guc_context_destroy(ce);
 		return;
 	}
 
 	/*
-	 * We defer GuC context deregistration until the context is destroyed
-	 * in order to save on CTBs. With this optimization ideally we only need
-	 * 1 CTB to register the context during the first pin and 1 CTB to
-	 * deregister the context when the context is destroyed. Without this
-	 * optimization, a CTB would be needed every pin & unpin.
-	 *
-	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
-	 * from context_free_worker when runtime wakeref is not held.
-	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
-	 * in H2G CTB to deregister the context. A future patch may defer this
-	 * H2G CTB if the runtime wakeref is zero.
+	 * We use a worker to issue the H2G to deregister the context as we can
+	 * take the GT PM for the first time which isn't allowed from an atomic
+	 * context.
 	 */
-	with_intel_runtime_pm(runtime_pm, wakeref)
-		guc_lrc_desc_unpin(ce);
+	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
 }
 
 static int guc_context_alloc(struct intel_context *ce)
@@ -2798,6 +2835,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 		intel_context_put(ce);
 	} else if (context_destroyed(ce)) {
 		/* Context has been destroyed */
+		intel_gt_pm_put_async(guc_to_gt(guc));
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 	}
-- 
2.32.0
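
The with_intel_gt_pm() helper added in intel_gt_pm.h above is the kernel's
scope-guard for-loop idiom: the body runs exactly once with the GT PM
reference held. A usage sketch (do_work is a placeholder):

	int tmp;

	with_intel_gt_pm(gt, tmp)
		do_work(gt);	/* between intel_gt_pm_get() and intel_gt_pm_put() */

The tmp flag only sequences the single iteration; destroyed_worker_func()
above uses exactly this shape.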


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 24/26] drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
  2021-10-12  7:53   ` Tvrtko Ursulin
@ 2021-10-12 18:31     ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-12 18:31 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: intel-gfx, dri-devel, john.c.harrison, daniele.ceraolospurio

On Tue, Oct 12, 2021 at 08:53:25AM +0100, Tvrtko Ursulin wrote:
> 
> On 04/10/2021 23:06, Matthew Brost wrote:
> > Parallel submission create composite fences (dma_fence_array) for excl /
> > shared slots in objects. The I915_GEM_BUSY IOCTL checks these slots to
> > determine the busyness of the object. Prior to patch it only check if
> > the fence in the slot was a i915_request. Update the check to understand
> > composite fences and correctly report the busyness.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_busy.c      | 60 +++++++++++++++----
> >   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  5 +-
> >   drivers/gpu/drm/i915/i915_request.h           |  6 ++
> >   3 files changed, 58 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_busy.c b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
> > index 6234e17259c1..b89d173c62eb 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_busy.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_busy.c
> > @@ -4,6 +4,8 @@
> >    * Copyright © 2014-2016 Intel Corporation
> >    */
> > +#include <linux/dma-fence-array.h>
> > +
> >   #include "gt/intel_engine.h"
> >   #include "i915_gem_ioctls.h"
> > @@ -36,7 +38,7 @@ static __always_inline u32 __busy_write_id(u16 id)
> >   }
> >   static __always_inline unsigned int
> > -__busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
> > +__busy_set_if_active(struct dma_fence *fence, u32 (*flag)(u16 id))
> >   {
> >   	const struct i915_request *rq;
> > @@ -46,29 +48,63 @@ __busy_set_if_active(const struct dma_fence *fence, u32 (*flag)(u16 id))
> >   	 * to eventually flush us, but to minimise latency just ask the
> >   	 * hardware.
> >   	 *
> > -	 * Note we only report on the status of native fences.
> > +	 * Note we only report on the status of native fences and we currently
> > +	 * have two native fences:
> > +	 *
> > +	 * 1. A composite fence (dma_fence_array) constructed of i915 requests
> > +	 * created during a parallel submission. In this case we deconstruct the
> > +	 * composite fence into individual i915 requests and check the status of
> > +	 * each request.
> > +	 *
> > +	 * 2. A single i915 request.
> >   	 */
> > -	if (!dma_fence_is_i915(fence))
> > +	if (dma_fence_is_array(fence)) {
> > +		struct dma_fence_array *array = to_dma_fence_array(fence);
> > +		struct dma_fence **child = array->fences;
> > +		unsigned int nchild = array->num_fences;
> > +
> > +		do {
> > +			struct dma_fence *current_fence = *child++;
> > +
> > +			/* Not an i915 fence, can't be busy per above */
> > +			if (!dma_fence_is_i915(current_fence) ||
> > +			    !test_bit(I915_FENCE_FLAG_COMPOSITE,
> > +				      &current_fence->flags)) {
> > +				return 0;
> > +			}
> > +
> > +			rq = to_request(current_fence);
> > +			if (!i915_request_completed(rq)) {
> > +				BUILD_BUG_ON(!typecheck(u16,
> > +							rq->engine->uabi_class));
> > +				return flag(rq->engine->uabi_class);
> > +			}
> > +		} while (--nchild);
> 
> Do you even need to introduce I915_FENCE_FLAG_COMPOSITE? If parallel submit
> is the only possible creator of array fences then possibly not. It would
> probably even result in less code, which keeps working in a hypothetical
> future. Otherwise you could add a debug BUG_ON if the array fence contains
> a fence without I915_FENCE_FLAG_COMPOSITE set.
> 

Certainly other drivers can create a dma_fence_array and in theory could
include an i915_request in that array. Adding this flag makes it clear
that this fence was created by i915 for parallel submission and
future-proofs this code.

> Secondly, I'd also run the whole loop and not return on first busy or
> incompatible for simplicity.
> 

I disagree. Short-circuiting when a condition is found is pretty
standard and not hard to understand.

> And finally, with all of the above in place, I think you could have a common
> function for the below (checking one fence) and call it both for a single
> fence and from the array loop above, for less duplication. (Even the
> duplicated BUILD_BUG_ON, which makes no sense!)
>

Yeah, duplicating the BUILD_BUG_ON doesn't make a ton of sense. Will
remove.

Disagree on the helper; the code paths are different enough to just
open-code this.

Matt

> End result would be a simpler patch like:
> 
> __busy_set_if_active_one(...)
> {
>    .. existing __busy_set_if_active ..
> }
> 
> __busy_set_if_active(..)
> {
>   ...
>   if (dma_fence_is_array(fence)) {
> 	...
> 	for (i = 0; i < array->num_fences; i++)
> 		flags |= __busy_set_if_active_one(...);
>   } else {
> 	flags = __busy_set_if_active_one(...);
>   }
> 
>   return flags;
> }
> 
> Regards,
> 
> Tvrtko
> 
> > +
> > +		/* All requests in array complete, not busy */
> >   		return 0;
> > +	} else {
> > +		if (!dma_fence_is_i915(fence))
> > +			return 0;
> > -	/* opencode to_request() in order to avoid const warnings */
> > -	rq = container_of(fence, const struct i915_request, fence);
> > -	if (i915_request_completed(rq))
> > -		return 0;
> > +		rq = to_request(fence);
> > +		if (i915_request_completed(rq))
> > +			return 0;
> > -	/* Beware type-expansion follies! */
> > -	BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
> > -	return flag(rq->engine->uabi_class);
> > +		/* Beware type-expansion follies! */
> > +		BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
> > +		return flag(rq->engine->uabi_class);
> > +	}
> >   }
> >   static __always_inline unsigned int
> > -busy_check_reader(const struct dma_fence *fence)
> > +busy_check_reader(struct dma_fence *fence)
> >   {
> >   	return __busy_set_if_active(fence, __busy_read_flag);
> >   }
> >   static __always_inline unsigned int
> > -busy_check_writer(const struct dma_fence *fence)
> > +busy_check_writer(struct dma_fence *fence)
> >   {
> >   	if (!fence)
> >   		return 0;
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > index 5c7fb6f68bbb..16276f406fd6 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > @@ -2988,8 +2988,11 @@ eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
> >   	if (!fences)
> >   		return ERR_PTR(-ENOMEM);
> > -	for_each_batch_create_order(eb, i)
> > +	for_each_batch_create_order(eb, i) {
> >   		fences[i] = &eb->requests[i]->fence;
> > +		__set_bit(I915_FENCE_FLAG_COMPOSITE,
> > +			  &eb->requests[i]->fence.flags);
> > +	}
> >   	fence_array = dma_fence_array_create(eb->num_batches,
> >   					     fences,
> > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > index 24db8459376b..dc359242d1ae 100644
> > --- a/drivers/gpu/drm/i915/i915_request.h
> > +++ b/drivers/gpu/drm/i915/i915_request.h
> > @@ -156,6 +156,12 @@ enum {
> >   	 * submission / relationship encoutered an error.
> >   	 */
> >   	I915_FENCE_FLAG_SKIP_PARALLEL,
> > +
> > +	/*
> > +	 * I915_FENCE_FLAG_COMPOSITE - Indicates fence is part of a composite
> > +	 * fence (dma_fence_array) and i915 generated for parallel submission.
> > +	 */
> > +	I915_FENCE_FLAG_COMPOSITE,
> >   };
> >   /**
> > 

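For reference, a compilable sketch of the helper split Tvrtko suggests,
OR-ing across all children instead of short-circuiting (names and the
dropped I915_FENCE_FLAG_COMPOSITE check follow his proposal; this is not
what the series merged):

	static __always_inline unsigned int
	__busy_set_if_active_one(const struct dma_fence *fence, u32 (*flag)(u16 id))
	{
		const struct i915_request *rq;

		if (!dma_fence_is_i915(fence))
			return 0;

		/* opencode to_request() in order to avoid const warnings */
		rq = container_of(fence, const struct i915_request, fence);
		if (i915_request_completed(rq))
			return 0;

		/* Beware type-expansion follies! */
		BUILD_BUG_ON(!typecheck(u16, rq->engine->uabi_class));
		return flag(rq->engine->uabi_class);
	}

	static __always_inline unsigned int
	__busy_set_if_active(struct dma_fence *fence, u32 (*flag)(u16 id))
	{
		if (dma_fence_is_array(fence)) {
			struct dma_fence_array *array = to_dma_fence_array(fence);
			unsigned int flags = 0;
			unsigned int i;

			/* Accumulate busyness across every child request */
			for (i = 0; i < array->num_fences; i++)
				flags |= __busy_set_if_active_one(array->fences[i],
								  flag);

			return flags;
		}

		return __busy_set_if_active_one(fence, flag);
	}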
^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 21/26] drm/i915: Multi-BB execbuf
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-12 21:22     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-12 21:22 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Allow multiple batch buffers to be submitted in a single execbuf IOCTL
> after a context has been configured with the 'set_parallel' extension.
> The number of batches is implicit based on the context's configuration.
>
> This is implemented with a series of loops. First a loop is used to find
> all the batches, a loop to pin all the HW contexts, a loop to create all
> the requests, a loop to submit (emit BB start, etc...) all the requests,
> a loop to tie the requests to the VMAs they touch, and finally a loop to
> commit the requests to the backend.
>
> A composite fence is also created for the generated requests to return
> to the user and to stick in dma resv slots.
>
> No behavior from the existing IOCTL should be changed aside from when
> throttling because the ring for a context is full, wait on the request
throttling because the ring for -> throttling the ring because

full, wait -> full. In this situation, i915 will now wait

> while holding the object locks.
, previously it would have dropped the locks for the wait.

And maybe explain why this change is necessary?


>
> IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> media UMD: https://github.com/intel/media-driver/pull/1252
>
> v2:
>   (Matthew Brost)
>    - Return proper error value if i915_request_create fails
> v3:
>   (John Harrison)
>    - Add comment explaining create / add order loops + locking
>    - Update commit message explaining different in IOCTL behavior
>    - Line wrap some comments
>    - eb_add_request returns void
>    - Return -EINVAL rather triggering BUG_ON if cmd parser used
>   (Checkpatch)
>    - Check eb->batch_len[*current_batch]
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 793 ++++++++++++------
>   drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  10 +
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
>   drivers/gpu/drm/i915/i915_request.h           |   9 +
>   drivers/gpu/drm/i915/i915_vma.c               |  21 +-
>   drivers/gpu/drm/i915/i915_vma.h               |  13 +-
>   7 files changed, 599 insertions(+), 257 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 2f2434b52317..5c7fb6f68bbb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -244,17 +244,25 @@ struct i915_execbuffer {
>   	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
>   	struct eb_vma *vma;
>   
> -	struct intel_engine_cs *engine; /** engine to queue the request to */
> +	struct intel_gt *gt; /* gt for the execbuf */
>   	struct intel_context *context; /* logical state for the request */
>   	struct i915_gem_context *gem_context; /** caller's context */
>   
> -	struct i915_request *request; /** our request to build */
> -	struct eb_vma *batch; /** identity of the batch obj/vma */
> +	/** our requests to build */
> +	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
> +	/** identity of the batch obj/vma */
> +	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1];
>   	struct i915_vma *trampoline; /** trampoline used for chaining */
>   
> +	/** used for excl fence in dma_resv objects when > 1 BB submitted */
> +	struct dma_fence *composite_fence;
> +
>   	/** actual size of execobj[] as we may extend it for the cmdparser */
>   	unsigned int buffer_count;
>   
> +	/* number of batches in execbuf IOCTL */
> +	unsigned int num_batches;
> +
>   	/** list of vma not yet bound during reservation phase */
>   	struct list_head unbound;
>   
> @@ -281,7 +289,8 @@ struct i915_execbuffer {
>   
>   	u64 invalid_flags; /** Set of execobj.flags that are invalid */
>   
> -	u64 batch_len; /** Length of batch within object */
> +	/** Length of batch within object */
> +	u64 batch_len[MAX_ENGINE_INSTANCE + 1];
>   	u32 batch_start_offset; /** Location within object of batch */
>   	u32 batch_flags; /** Flags composed for emit_bb_start() */
>   	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
> @@ -299,14 +308,13 @@ struct i915_execbuffer {
>   };
>   
>   static int eb_parse(struct i915_execbuffer *eb);
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
> -					  bool throttle);
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
>   static void eb_unpin_engine(struct i915_execbuffer *eb);
>   
>   static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
>   {
> -	return intel_engine_requires_cmd_parser(eb->engine) ||
> -		(intel_engine_using_cmd_parser(eb->engine) &&
> +	return intel_engine_requires_cmd_parser(eb->context->engine) ||
> +		(intel_engine_using_cmd_parser(eb->context->engine) &&
>   		 eb->args->batch_len);
>   }
>   
> @@ -544,11 +552,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
>   	return 0;
>   }
>   
> -static void
> +static inline bool
> +is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
> +{
> +	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
> +		buffer_idx < eb->num_batches :
> +		buffer_idx >= eb->args->buffer_count - eb->num_batches;
> +}
> +
> +static int
>   eb_add_vma(struct i915_execbuffer *eb,
> -	   unsigned int i, unsigned batch_idx,
> +	   unsigned int *current_batch,
> +	   unsigned int i,
>   	   struct i915_vma *vma)
>   {
> +	struct drm_i915_private *i915 = eb->i915;
>   	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
>   	struct eb_vma *ev = &eb->vma[i];
>   
> @@ -575,15 +593,41 @@ eb_add_vma(struct i915_execbuffer *eb,
>   	 * Note that actual hangs have only been observed on gen7, but for
>   	 * paranoia do it everywhere.
>   	 */
> -	if (i == batch_idx) {
> +	if (is_batch_buffer(eb, i)) {
>   		if (entry->relocation_count &&
>   		    !(ev->flags & EXEC_OBJECT_PINNED))
>   			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
>   		if (eb->reloc_cache.has_fence)
>   			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
>   
> -		eb->batch = ev;
> +		eb->batches[*current_batch] = ev;
> +
> +		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
> +			drm_dbg(&i915->drm,
> +				"Attempting to use self-modifying batch buffer\n");
> +			return -EINVAL;
> +		}
> +
> +		if (range_overflows_t(u64,
> +				      eb->batch_start_offset,
> +				      eb->args->batch_len,
> +				      ev->vma->size)) {
> +			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> +			return -EINVAL;
> +		}
> +
> +		if (eb->args->batch_len == 0)
> +			eb->batch_len[*current_batch] = ev->vma->size -
> +				eb->batch_start_offset;
> +		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
> +			drm_dbg(&i915->drm, "Invalid batch length\n");
> +			return -EINVAL;
> +		}
> +
> +		++*current_batch;
>   	}
> +
> +	return 0;
>   }
>   
>   static inline int use_cpu_reloc(const struct reloc_cache *cache,
> @@ -727,14 +771,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
>   	} while (1);
>   }
>   
> -static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
> -{
> -	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
> -		return 0;
> -	else
> -		return eb->buffer_count - 1;
> -}
> -
>   static int eb_select_context(struct i915_execbuffer *eb)
>   {
>   	struct i915_gem_context *ctx;
> @@ -839,9 +875,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
>   
>   static int eb_lookup_vmas(struct i915_execbuffer *eb)
>   {
> -	struct drm_i915_private *i915 = eb->i915;
> -	unsigned int batch = eb_batch_index(eb);
> -	unsigned int i;
> +	unsigned int i, current_batch = 0;
>   	int err = 0;
>   
>   	INIT_LIST_HEAD(&eb->relocs);
> @@ -861,7 +895,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>   			goto err;
>   		}
>   
> -		eb_add_vma(eb, i, batch, vma);
> +		err = eb_add_vma(eb, &current_batch, i, vma);
> +		if (err)
> +			return err;
>   
>   		if (i915_gem_object_is_userptr(vma->obj)) {
>   			err = i915_gem_object_userptr_submit_init(vma->obj);
> @@ -884,26 +920,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>   		}
>   	}
>   
> -	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
> -		drm_dbg(&i915->drm,
> -			"Attempting to use self-modifying batch buffer\n");
> -		return -EINVAL;
> -	}
> -
> -	if (range_overflows_t(u64,
> -			      eb->batch_start_offset, eb->batch_len,
> -			      eb->batch->vma->size)) {
> -		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> -		return -EINVAL;
> -	}
> -
> -	if (eb->batch_len == 0)
> -		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
> -	if (unlikely(eb->batch_len == 0)) { /* impossible! */
> -		drm_dbg(&i915->drm, "Invalid batch length\n");
> -		return -EINVAL;
> -	}
> -
>   	return 0;
>   
>   err:
> @@ -1636,8 +1652,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
>   	return 0;
>   }
>   
> -static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> -					   struct i915_request *rq)
> +static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
>   {
>   	bool have_copy = false;
>   	struct eb_vma *ev;
> @@ -1653,21 +1668,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>   	eb_release_vmas(eb, false);
>   	i915_gem_ww_ctx_fini(&eb->ww);
>   
> -	if (rq) {
> -		/* nonblocking is always false */
> -		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> -				      MAX_SCHEDULE_TIMEOUT) < 0) {
> -			i915_request_put(rq);
> -			rq = NULL;
> -
> -			err = -EINTR;
> -			goto err_relock;
> -		}
> -
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>   	/*
>   	 * We take 3 passes through the slowpatch.
>   	 *
> @@ -1694,28 +1694,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>   	if (!err)
>   		err = eb_reinit_userptr(eb);
>   
> -err_relock:
>   	i915_gem_ww_ctx_init(&eb->ww, true);
>   	if (err)
>   		goto out;
>   
>   	/* reacquire the objects */
>   repeat_validate:
> -	rq = eb_pin_engine(eb, false);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, false);
> +	if (err)
>   		goto err;
> -	}
> -
> -	/* We didn't throttle, should be NULL */
> -	GEM_WARN_ON(rq);
>   
>   	err = eb_validate_vmas(eb);
>   	if (err)
>   		goto err;
>   
> -	GEM_BUG_ON(!eb->batch);
> +	GEM_BUG_ON(!eb->batches[0]);
>   
>   	list_for_each_entry(ev, &eb->relocs, reloc_link) {
>   		if (!have_copy) {
> @@ -1779,46 +1772,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>   		}
>   	}
>   
> -	if (rq)
> -		i915_request_put(rq);
> -
>   	return err;
>   }
>   
>   static int eb_relocate_parse(struct i915_execbuffer *eb)
>   {
>   	int err;
> -	struct i915_request *rq = NULL;
>   	bool throttle = true;
>   
>   retry:
> -	rq = eb_pin_engine(eb, throttle);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, throttle);
> +	if (err) {
>   		if (err != -EDEADLK)
>   			return err;
>   
>   		goto err;
>   	}
>   
> -	if (rq) {
> -		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> -
> -		/* Need to drop all locks now for throttling, take slowpath */
> -		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
> -		if (err == -ETIME) {
> -			if (nonblock) {
> -				err = -EWOULDBLOCK;
> -				i915_request_put(rq);
> -				goto err;
> -			}
> -			goto slow;
> -		}
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>   	/* only throttle once, even if we didn't need to throttle */
>   	throttle = false;
>   
> @@ -1858,7 +1828,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>   	return err;
>   
>   slow:
> -	err = eb_relocate_parse_slow(eb, rq);
> +	err = eb_relocate_parse_slow(eb);
>   	if (err)
>   		/*
>   		 * If the user expects the execobject.offset and
> @@ -1872,11 +1842,40 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>   	return err;
>   }
>   
> +/*
> + * Using two helper loops for the order of which requests / batches are created
> + * and added the to backend. Requests are created in order from the parent to
> + * the last child. Requests are add in the reverse order, from the last child to
> + * parent. This is down from locking reasons as the timeline lock is acquired
down from -> done for

John.

> + * during request creation and released when the request is added to the
> + * backend. To make lockdep happy (see intel_context_timeline_lock) this must be
> + * the ordering.
> + */
> +#define for_each_batch_create_order(_eb, _i) \
> +	for (_i = 0; _i < (_eb)->num_batches; ++_i)
> +#define for_each_batch_add_order(_eb, _i) \
> +	BUILD_BUG_ON(!typecheck(int, _i)); \
> +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
> +
> +static struct i915_request *
> +eb_find_first_request_added(struct i915_execbuffer *eb)
> +{
> +	int i;
> +
> +	for_each_batch_add_order(eb, i)
> +		if (eb->requests[i])
> +			return eb->requests[i];
> +
> +	GEM_BUG_ON("Request not found");
> +
> +	return NULL;
> +}
> +
>   static int eb_move_to_gpu(struct i915_execbuffer *eb)
>   {
>   	const unsigned int count = eb->buffer_count;
>   	unsigned int i = count;
> -	int err = 0;
> +	int err = 0, j;
>   
>   	while (i--) {
>   		struct eb_vma *ev = &eb->vma[i];
> @@ -1889,11 +1888,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>   		if (flags & EXEC_OBJECT_CAPTURE) {
>   			struct i915_capture_list *capture;
>   
> -			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> -			if (capture) {
> -				capture->next = eb->request->capture_list;
> -				capture->vma = vma;
> -				eb->request->capture_list = capture;
> +			for_each_batch_create_order(eb, j) {
> +				if (!eb->requests[j])
> +					break;
> +
> +				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> +				if (capture) {
> +					capture->next =
> +						eb->requests[j]->capture_list;
> +					capture->vma = vma;
> +					eb->requests[j]->capture_list = capture;
> +				}
>   			}
>   		}
>   
> @@ -1914,14 +1919,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>   				flags &= ~EXEC_OBJECT_ASYNC;
>   		}
>   
> +		/* We only need to await on the first request */
>   		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
>   			err = i915_request_await_object
> -				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
> +				(eb_find_first_request_added(eb), obj,
> +				 flags & EXEC_OBJECT_WRITE);
>   		}
>   
> -		if (err == 0)
> -			err = i915_vma_move_to_active(vma, eb->request,
> -						      flags | __EXEC_OBJECT_NO_RESERVE);
> +		for_each_batch_add_order(eb, j) {
> +			if (err)
> +				break;
> +			if (!eb->requests[j])
> +				continue;
> +
> +			err = _i915_vma_move_to_active(vma, eb->requests[j],
> +						       j ? NULL :
> +						       eb->composite_fence ?
> +						       eb->composite_fence :
> +						       &eb->requests[j]->fence,
> +						       flags | __EXEC_OBJECT_NO_RESERVE);
> +		}
>   	}
>   
>   #ifdef CONFIG_MMU_NOTIFIER
> @@ -1952,11 +1969,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>   		goto err_skip;
>   
>   	/* Unconditionally flush any chipset caches (for streaming writes). */
> -	intel_gt_chipset_flush(eb->engine->gt);
> +	intel_gt_chipset_flush(eb->gt);
>   	return 0;
>   
>   err_skip:
> -	i915_request_set_error_once(eb->request, err);
> +	for_each_batch_create_order(eb, j) {
> +		if (!eb->requests[j])
> +			break;
> +
> +		i915_request_set_error_once(eb->requests[j], err);
> +	}
>   	return err;
>   }
>   
> @@ -2051,14 +2073,17 @@ static int eb_parse(struct i915_execbuffer *eb)
>   	int err;
>   
>   	if (!eb_use_cmdparser(eb)) {
> -		batch = eb_dispatch_secure(eb, eb->batch->vma);
> +		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
>   		if (IS_ERR(batch))
>   			return PTR_ERR(batch);
>   
>   		goto secure_batch;
>   	}
>   
> -	len = eb->batch_len;
> +	if (intel_context_is_parallel(eb->context))
> +		return -EINVAL;
> +
> +	len = eb->batch_len[0];
>   	if (!CMDPARSER_USES_GGTT(eb->i915)) {
>   		/*
>   		 * ppGTT backed shadow buffers must be mapped RO, to prevent
> @@ -2072,11 +2097,11 @@ static int eb_parse(struct i915_execbuffer *eb)
>   	} else {
>   		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
>   	}
> -	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
> +	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
>   		return -EINVAL;
>   
>   	if (!pool) {
> -		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
> +		pool = intel_gt_get_buffer_pool(eb->gt, len,
>   						I915_MAP_WB);
>   		if (IS_ERR(pool))
>   			return PTR_ERR(pool);
> @@ -2101,7 +2126,7 @@ static int eb_parse(struct i915_execbuffer *eb)
>   		trampoline = shadow;
>   
>   		shadow = shadow_batch_pin(eb, pool->obj,
> -					  &eb->engine->gt->ggtt->vm,
> +					  &eb->gt->ggtt->vm,
>   					  PIN_GLOBAL);
>   		if (IS_ERR(shadow)) {
>   			err = PTR_ERR(shadow);
> @@ -2123,26 +2148,29 @@ static int eb_parse(struct i915_execbuffer *eb)
>   	if (err)
>   		goto err_trampoline;
>   
> -	err = intel_engine_cmd_parser(eb->engine,
> -				      eb->batch->vma,
> +	err = intel_engine_cmd_parser(eb->context->engine,
> +				      eb->batches[0]->vma,
>   				      eb->batch_start_offset,
> -				      eb->batch_len,
> +				      eb->batch_len[0],
>   				      shadow, trampoline);
>   	if (err)
>   		goto err_unpin_batch;
>   
> -	eb->batch = &eb->vma[eb->buffer_count++];
> -	eb->batch->vma = i915_vma_get(shadow);
> -	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> +	eb->batches[0] = &eb->vma[eb->buffer_count++];
> +	eb->batches[0]->vma = i915_vma_get(shadow);
> +	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
>   
>   	eb->trampoline = trampoline;
>   	eb->batch_start_offset = 0;
>   
>   secure_batch:
>   	if (batch) {
> -		eb->batch = &eb->vma[eb->buffer_count++];
> -		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> -		eb->batch->vma = i915_vma_get(batch);
> +		if (intel_context_is_parallel(eb->context))
> +			return -EINVAL;
> +
> +		eb->batches[0] = &eb->vma[eb->buffer_count++];
> +		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
> +		eb->batches[0]->vma = i915_vma_get(batch);
>   	}
>   	return 0;
>   
> @@ -2158,19 +2186,18 @@ static int eb_parse(struct i915_execbuffer *eb)
>   	return err;
>   }
>   
> -static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> +static int eb_request_submit(struct i915_execbuffer *eb,
> +			     struct i915_request *rq,
> +			     struct i915_vma *batch,
> +			     u64 batch_len)
>   {
>   	int err;
>   
> -	if (intel_context_nopreempt(eb->context))
> -		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
> -
> -	err = eb_move_to_gpu(eb);
> -	if (err)
> -		return err;
> +	if (intel_context_nopreempt(rq->context))
> +		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
>   
>   	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
> -		err = i915_reset_gen7_sol_offsets(eb->request);
> +		err = i915_reset_gen7_sol_offsets(rq);
>   		if (err)
>   			return err;
>   	}
> @@ -2181,26 +2208,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>   	 * allows us to determine if the batch is still waiting on the GPU
>   	 * or actually running by checking the breadcrumb.
>   	 */
> -	if (eb->engine->emit_init_breadcrumb) {
> -		err = eb->engine->emit_init_breadcrumb(eb->request);
> +	if (rq->context->engine->emit_init_breadcrumb) {
> +		err = rq->context->engine->emit_init_breadcrumb(rq);
>   		if (err)
>   			return err;
>   	}
>   
> -	err = eb->engine->emit_bb_start(eb->request,
> -					batch->node.start +
> -					eb->batch_start_offset,
> -					eb->batch_len,
> -					eb->batch_flags);
> +	err = rq->context->engine->emit_bb_start(rq,
> +						 batch->node.start +
> +						 eb->batch_start_offset,
> +						 batch_len,
> +						 eb->batch_flags);
>   	if (err)
>   		return err;
>   
>   	if (eb->trampoline) {
> +		GEM_BUG_ON(intel_context_is_parallel(rq->context));
>   		GEM_BUG_ON(eb->batch_start_offset);
> -		err = eb->engine->emit_bb_start(eb->request,
> -						eb->trampoline->node.start +
> -						eb->batch_len,
> -						0, 0);
> +		err = rq->context->engine->emit_bb_start(rq,
> +							 eb->trampoline->node.start +
> +							 batch_len, 0, 0);
>   		if (err)
>   			return err;
>   	}
> @@ -2208,6 +2235,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>   	return 0;
>   }
>   
> +static int eb_submit(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +	int err;
> +
> +	err = eb_move_to_gpu(eb);
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
> +		if (!err)
> +			err = eb_request_submit(eb, eb->requests[i],
> +						eb->batches[i]->vma,
> +						eb->batch_len[i]);
> +	}
> +
> +	return err;
> +}
> +
>   static int num_vcs_engines(const struct drm_i915_private *i915)
>   {
>   	return hweight_long(VDBOX_MASK(&i915->gt));
> @@ -2273,26 +2321,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
>   	return i915_request_get(rq);
>   }
>   
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
> +			   bool throttle)
>   {
> -	struct intel_context *ce = eb->context;
>   	struct intel_timeline *tl;
> -	struct i915_request *rq = NULL;
> -	int err;
> -
> -	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> -
> -	if (unlikely(intel_context_is_banned(ce)))
> -		return ERR_PTR(-EIO);
> -
> -	/*
> -	 * Pinning the contexts may generate requests in order to acquire
> -	 * GGTT space, so do this first before we reserve a seqno for
> -	 * ourselves.
> -	 */
> -	err = intel_context_pin_ww(ce, &eb->ww);
> -	if (err)
> -		return ERR_PTR(err);
> +	struct i915_request *rq;
>   
>   	/*
>   	 * Take a local wakeref for preparing to dispatch the execbuf as
> @@ -2303,33 +2336,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
>   	 * taken on the engine, and the parent device.
>   	 */
>   	tl = intel_context_timeline_lock(ce);
> -	if (IS_ERR(tl)) {
> -		intel_context_unpin(ce);
> -		return ERR_CAST(tl);
> -	}
> +	if (IS_ERR(tl))
> +		return PTR_ERR(tl);
>   
>   	intel_context_enter(ce);
>   	if (throttle)
>   		rq = eb_throttle(eb, ce);
>   	intel_context_timeline_unlock(tl);
>   
> +	if (rq) {
> +		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> +		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
> +
> +		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> +				      timeout) < 0) {
> +			i915_request_put(rq);
> +
> +			tl = intel_context_timeline_lock(ce);
> +			intel_context_exit(ce);
> +			intel_context_timeline_unlock(tl);
> +
> +			if (nonblock)
> +				return -EWOULDBLOCK;
> +			else
> +				return -EINTR;
> +		}
> +		i915_request_put(rq);
> +	}
> +
> +	return 0;
> +}
> +
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +{
> +	struct intel_context *ce = eb->context, *child;
> +	int err;
> +	int i = 0, j = 0;
> +
> +	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> +
> +	if (unlikely(intel_context_is_banned(ce)))
> +		return -EIO;
> +
> +	/*
> +	 * Pinning the contexts may generate requests in order to acquire
> +	 * GGTT space, so do this first before we reserve a seqno for
> +	 * ourselves.
> +	 */
> +	err = intel_context_pin_ww(ce, &eb->ww);
> +	if (err)
> +		return err;
> +	for_each_child(ce, child) {
> +		err = intel_context_pin_ww(child, &eb->ww);
> +		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
> +	}
> +
> +	for_each_child(ce, child) {
> +		err = eb_pin_timeline(eb, child, throttle);
> +		if (err)
> +			goto unwind;
> +		++i;
> +	}
> +	err = eb_pin_timeline(eb, ce, throttle);
> +	if (err)
> +		goto unwind;
> +
>   	eb->args->flags |= __EXEC_ENGINE_PINNED;
> -	return rq;
> +	return 0;
> +
> +unwind:
> +	for_each_child(ce, child) {
> +		if (j++ < i) {
> +			mutex_lock(&child->timeline->mutex);
> +			intel_context_exit(child);
> +			mutex_unlock(&child->timeline->mutex);
> +		}
> +	}
> +	for_each_child(ce, child)
> +		intel_context_unpin(child);
> +	intel_context_unpin(ce);
> +	return err;
>   }
>   
>   static void eb_unpin_engine(struct i915_execbuffer *eb)
>   {
> -	struct intel_context *ce = eb->context;
> -	struct intel_timeline *tl = ce->timeline;
> +	struct intel_context *ce = eb->context, *child;
>   
>   	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
>   		return;
>   
>   	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
>   
> -	mutex_lock(&tl->mutex);
> +	for_each_child(ce, child) {
> +		mutex_lock(&child->timeline->mutex);
> +		intel_context_exit(child);
> +		mutex_unlock(&child->timeline->mutex);
> +
> +		intel_context_unpin(child);
> +	}
> +
> +	mutex_lock(&ce->timeline->mutex);
>   	intel_context_exit(ce);
> -	mutex_unlock(&tl->mutex);
> +	mutex_unlock(&ce->timeline->mutex);
>   
>   	intel_context_unpin(ce);
>   }
> @@ -2380,7 +2488,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
>   static int
>   eb_select_engine(struct i915_execbuffer *eb)
>   {
> -	struct intel_context *ce;
> +	struct intel_context *ce, *child;
>   	unsigned int idx;
>   	int err;
>   
> @@ -2393,6 +2501,20 @@ eb_select_engine(struct i915_execbuffer *eb)
>   	if (IS_ERR(ce))
>   		return PTR_ERR(ce);
>   
> +	if (intel_context_is_parallel(ce)) {
> +		if (eb->buffer_count < ce->parallel.number_children + 1) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +		if (eb->batch_start_offset || eb->args->batch_len) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +	}
> +	eb->num_batches = ce->parallel.number_children + 1;
> +
> +	for_each_child(ce, child)
> +		intel_context_get(child);
>   	intel_gt_pm_get(ce->engine->gt);
>   
>   	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
> @@ -2400,6 +2522,13 @@ eb_select_engine(struct i915_execbuffer *eb)
>   		if (err)
>   			goto err;
>   	}
> +	for_each_child(ce, child) {
> +		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
> +			err = intel_context_alloc_state(child);
> +			if (err)
> +				goto err;
> +		}
> +	}
>   
>   	/*
>   	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
> @@ -2410,7 +2539,7 @@ eb_select_engine(struct i915_execbuffer *eb)
>   		goto err;
>   
>   	eb->context = ce;
> -	eb->engine = ce->engine;
> +	eb->gt = ce->engine->gt;
>   
>   	/*
>   	 * Make sure engine pool stays alive even if we call intel_context_put
> @@ -2421,6 +2550,8 @@ eb_select_engine(struct i915_execbuffer *eb)
>   
>   err:
>   	intel_gt_pm_put(ce->engine->gt);
> +	for_each_child(ce, child)
> +		intel_context_put(child);
>   	intel_context_put(ce);
>   	return err;
>   }
> @@ -2428,7 +2559,11 @@ eb_select_engine(struct i915_execbuffer *eb)
>   static void
>   eb_put_engine(struct i915_execbuffer *eb)
>   {
> -	intel_gt_pm_put(eb->engine->gt);
> +	struct intel_context *child;
> +
> +	intel_gt_pm_put(eb->gt);
> +	for_each_child(eb->context, child)
> +		intel_context_put(child);
>   	intel_context_put(eb->context);
>   }
>   
> @@ -2651,7 +2786,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
>   }
>   
>   static int
> -await_fence_array(struct i915_execbuffer *eb)
> +await_fence_array(struct i915_execbuffer *eb,
> +		  struct i915_request *rq)
>   {
>   	unsigned int n;
>   	int err;
> @@ -2665,8 +2801,7 @@ await_fence_array(struct i915_execbuffer *eb)
>   		if (!eb->fences[n].dma_fence)
>   			continue;
>   
> -		err = i915_request_await_dma_fence(eb->request,
> -						   eb->fences[n].dma_fence);
> +		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
>   		if (err < 0)
>   			return err;
>   	}
> @@ -2674,9 +2809,9 @@ await_fence_array(struct i915_execbuffer *eb)
>   	return 0;
>   }
>   
> -static void signal_fence_array(const struct i915_execbuffer *eb)
> +static void signal_fence_array(const struct i915_execbuffer *eb,
> +			       struct dma_fence * const fence)
>   {
> -	struct dma_fence * const fence = &eb->request->fence;
>   	unsigned int n;
>   
>   	for (n = 0; n < eb->num_fences; n++) {
> @@ -2724,9 +2859,8 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
>   			break;
>   }
>   
> -static int eb_request_add(struct i915_execbuffer *eb, int err)
> +static void eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
>   {
> -	struct i915_request *rq = eb->request;
>   	struct intel_timeline * const tl = i915_request_timeline(rq);
>   	struct i915_sched_attr attr = {};
>   	struct i915_request *prev;
> @@ -2741,11 +2875,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>   	/* Check that the context wasn't destroyed before submission */
>   	if (likely(!intel_context_is_closed(eb->context))) {
>   		attr = eb->gem_context->sched;
> -	} else {
> -		/* Serialise with context_close via the add_to_timeline */
> -		i915_request_set_error_once(rq, -ENOENT);
> -		__i915_request_skip(rq);
> -		err = -ENOENT; /* override any transient errors */
>   	}
>   
>   	__i915_request_queue(rq, &attr);
> @@ -2755,6 +2884,42 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>   		retire_requests(tl, prev);
>   
>   	mutex_unlock(&tl->mutex);
> +}
> +
> +static int eb_requests_add(struct i915_execbuffer *eb, int err)
> +{
> +	int i;
> +
> +	/*
> +	 * We iterate in reverse order of creation so that the timeline mutexes
> +	 * are released in the reverse order of acquisition.
> +	 */
> +	for_each_batch_add_order(eb, i) {
> +		struct i915_request *rq = eb->requests[i];
> +
> +		if (!rq)
> +			continue;
> +
> +		if (unlikely(intel_context_is_closed(eb->context))) {
> +			/* Serialise with context_close via the add_to_timeline */
> +			i915_request_set_error_once(rq, -ENOENT);
> +			__i915_request_skip(rq);
> +			err = -ENOENT; /* override any transient errors */
> +		}
> +
> +		if (intel_context_is_parallel(eb->context)) {
> +			if (err) {
> +				__i915_request_skip(rq);
> +				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
> +					&rq->fence.flags);
> +			}
> +			if (i == 0)
> +				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +					&rq->fence.flags);
> +		}
> +
> +		eb_request_add(eb, rq);
> +	}
>   
>   	return err;
>   }
> @@ -2785,6 +2950,182 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
>   				    eb);
>   }
>   
> +static void eb_requests_get(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_get(eb->requests[i]);
> +	}
> +}
> +
> +static void eb_requests_put(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_put(eb->requests[i]);
> +	}
> +}
> +
> +static struct sync_file *
> +eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	struct dma_fence_array *fence_array;
> +	struct dma_fence **fences;
> +	unsigned int i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(eb->context));
> +
> +	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
> +	if (!fences)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for_each_batch_create_order(eb, i)
> +		fences[i] = &eb->requests[i]->fence;
> +
> +	fence_array = dma_fence_array_create(eb->num_batches,
> +					     fences,
> +					     eb->context->parallel.fence_context,
> +					     eb->context->parallel.seqno,
> +					     false);
> +	if (!fence_array) {
> +		kfree(fences);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* Move ownership to the dma_fence_array created above */
> +	for_each_batch_create_order(eb, i)
> +		dma_fence_get(fences[i]);
> +
> +	if (out_fence_fd != -1) {
> +		out_fence = sync_file_create(&fence_array->base);
> +		/* sync_file now owns fence_array, drop creation ref */
> +		dma_fence_put(&fence_array->base);
> +		if (!out_fence)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
> +	eb->composite_fence = &fence_array->base;
> +
> +	return out_fence;
> +}
> +
> +static struct sync_file *
> +eb_fences_add(struct i915_execbuffer *eb, struct i915_request *rq,
> +	      struct dma_fence *in_fence, int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	int err;
> +
> +	if (unlikely(eb->gem_context->syncobj)) {
> +		struct dma_fence *fence;
> +
> +		fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
> +		err = i915_request_await_dma_fence(rq, fence);
> +		dma_fence_put(fence);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (in_fence) {
> +		if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
> +			err = i915_request_await_execution(rq, in_fence);
> +		else
> +			err = i915_request_await_dma_fence(rq, in_fence);
> +		if (err < 0)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (eb->fences) {
> +		err = await_fence_array(eb, rq);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (intel_context_is_parallel(eb->context)) {
> +		out_fence = eb_composite_fence_create(eb, out_fence_fd);
> +		if (IS_ERR(out_fence))
> +			return ERR_PTR(-ENOMEM);
> +	} else if (out_fence_fd != -1) {
> +		out_fence = sync_file_create(&rq->fence);
> +		if (!out_fence)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
> +	return out_fence;
> +}
> +
> +static struct intel_context *
> +eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
> +{
> +	struct intel_context *child;
> +
> +	if (likely(context_number == 0))
> +		return eb->context;
> +
> +	for_each_child(eb->context, child)
> +		if (!--context_number)
> +			return child;
> +
> +	GEM_BUG_ON("Context not found");
> +
> +	return NULL;
> +}
> +
> +static struct sync_file *
> +eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
> +		   int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		/* Allocate a request for this batch buffer nice and early. */
> +		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
> +		if (IS_ERR(eb->requests[i])) {
> +			out_fence = ERR_PTR(PTR_ERR(eb->requests[i]));
> +			eb->requests[i] = NULL;
> +			return out_fence;
> +		}
> +
> +		/*
> +		 * Only the first request added (committed to backend) has to
> +		 * take the in fences into account as all subsequent requests
> +		 * will have fences inserted in between them.
> +		 */
> +		if (i + 1 == eb->num_batches) {
> +			out_fence = eb_fences_add(eb, eb->requests[i],
> +						  in_fence, out_fence_fd);
> +			if (IS_ERR(out_fence))
> +				return out_fence;
> +		}
> +
> +		/*
> +		 * Whilst this request exists, batch_obj will be on the
> +		 * active_list, and so will hold the active reference. Only when
> +		 * this request is retired will the batch_obj be moved onto
> +		 * the inactive_list and lose its active reference. Hence we do
> +		 * not need to explicitly hold another reference here.
> +		 */
> +		eb->requests[i]->batch = eb->batches[i]->vma;
> +		if (eb->batch_pool) {
> +			GEM_BUG_ON(intel_context_is_parallel(eb->context));
> +			intel_gt_buffer_pool_mark_active(eb->batch_pool,
> +							 eb->requests[i]);
> +		}
> +	}
> +
> +	return out_fence;
> +}
> +
>   static int
>   i915_gem_do_execbuffer(struct drm_device *dev,
>   		       struct drm_file *file,
> @@ -2795,7 +3136,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>   	struct i915_execbuffer eb;
>   	struct dma_fence *in_fence = NULL;
>   	struct sync_file *out_fence = NULL;
> -	struct i915_vma *batch;
>   	int out_fence_fd = -1;
>   	int err;
>   
> @@ -2819,12 +3159,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>   
>   	eb.buffer_count = args->buffer_count;
>   	eb.batch_start_offset = args->batch_start_offset;
> -	eb.batch_len = args->batch_len;
>   	eb.trampoline = NULL;
>   
>   	eb.fences = NULL;
>   	eb.num_fences = 0;
>   
> +	memset(eb.requests, 0, sizeof(struct i915_request *) *
> +	       ARRAY_SIZE(eb.requests));
> +	eb.composite_fence = NULL;
> +
>   	eb.batch_flags = 0;
>   	if (args->flags & I915_EXEC_SECURE) {
>   		if (GRAPHICS_VER(i915) >= 11)
> @@ -2908,70 +3251,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>   
>   	ww_acquire_done(&eb.ww.ctx);
>   
> -	batch = eb.batch->vma;
> -
> -	/* Allocate a request for this batch buffer nice and early. */
> -	eb.request = i915_request_create(eb.context);
> -	if (IS_ERR(eb.request)) {
> -		err = PTR_ERR(eb.request);
> -		goto err_vma;
> -	}
> -
> -	if (unlikely(eb.gem_context->syncobj)) {
> -		struct dma_fence *fence;
> -
> -		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
> -		err = i915_request_await_dma_fence(eb.request, fence);
> -		dma_fence_put(fence);
> -		if (err)
> -			goto err_ext;
> -	}
> -
> -	if (in_fence) {
> -		if (args->flags & I915_EXEC_FENCE_SUBMIT)
> -			err = i915_request_await_execution(eb.request,
> -							   in_fence);
> -		else
> -			err = i915_request_await_dma_fence(eb.request,
> -							   in_fence);
> -		if (err < 0)
> -			goto err_request;
> -	}
> -
> -	if (eb.fences) {
> -		err = await_fence_array(&eb);
> -		if (err)
> +	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
> +	if (IS_ERR(out_fence)) {
> +		err = PTR_ERR(out_fence);
> +		if (eb.requests[0])
>   			goto err_request;
> +		else
> +			goto err_vma;
>   	}
>   
> -	if (out_fence_fd != -1) {
> -		out_fence = sync_file_create(&eb.request->fence);
> -		if (!out_fence) {
> -			err = -ENOMEM;
> -			goto err_request;
> -		}
> -	}
> -
> -	/*
> -	 * Whilst this request exists, batch_obj will be on the
> -	 * active_list, and so will hold the active reference. Only when this
> -	 * request is retired will the the batch_obj be moved onto the
> -	 * inactive_list and lose its active reference. Hence we do not need
> -	 * to explicitly hold another reference here.
> -	 */
> -	eb.request->batch = batch;
> -	if (eb.batch_pool)
> -		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
> -
> -	trace_i915_request_queue(eb.request, eb.batch_flags);
> -	err = eb_submit(&eb, batch);
> +	err = eb_submit(&eb);
>   
>   err_request:
> -	i915_request_get(eb.request);
> -	err = eb_request_add(&eb, err);
> +	eb_requests_get(&eb);
> +	err = eb_requests_add(&eb, err);
>   
>   	if (eb.fences)
> -		signal_fence_array(&eb);
> +		signal_fence_array(&eb, eb.composite_fence ?
> +				   eb.composite_fence :
> +				   &eb.requests[0]->fence);
>   
>   	if (out_fence) {
>   		if (err == 0) {
> @@ -2986,10 +3284,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>   
>   	if (unlikely(eb.gem_context->syncobj)) {
>   		drm_syncobj_replace_fence(eb.gem_context->syncobj,
> -					  &eb.request->fence);
> +					  eb.composite_fence ?
> +					  eb.composite_fence :
> +					  &eb.requests[0]->fence);
>   	}
>   
> -	i915_request_put(eb.request);
> +	if (!out_fence && eb.composite_fence)
> +		dma_fence_put(eb.composite_fence);
> +
> +	eb_requests_put(&eb);
>   
>   err_vma:
>   	eb_release_vmas(&eb, true);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index 1bc705f98e2a..1781419fa105 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -239,7 +239,13 @@ intel_context_timeline_lock(struct intel_context *ce)
>   	struct intel_timeline *tl = ce->timeline;
>   	int err;
>   
> -	err = mutex_lock_interruptible(&tl->mutex);
> +	if (intel_context_is_parent(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
> +	else if (intel_context_is_child(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex,
> +						      ce->parallel.child_index + 1);
> +	else
> +		err = mutex_lock_interruptible(&tl->mutex);
>   	if (err)
>   		return ERR_PTR(err);
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 95a5b94b4ece..9e0177dc5484 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -248,6 +248,16 @@ struct intel_context {
>   		 * context
>   		 */
>   		struct i915_request *last_rq;
> +		/**
> +		 * @fence_context: fence context for the composite fence when doing
> +		 * parallel submission
> +		 */
> +		u64 fence_context;
> +		/**
> +		 * @seqno: seqno for composite fence when doing parallel
> +		 * submission
> +		 */
> +		u32 seqno;
>   		/** @number_children: number of children if parent */
>   		u8 number_children;
>   		/** @child_index: index into child_list if child */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f28e36aa77c2..83b0d2a114af 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3094,6 +3094,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
>   		}
>   	}
>   
> +	parent->parallel.fence_context = dma_fence_context_alloc(1);
> +
>   	parent->engine->emit_bb_start =
>   		emit_bb_start_parent_no_preempt_mid_batch;
>   	parent->engine->emit_fini_breadcrumb =
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 8950785e55d6..24db8459376b 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -147,6 +147,15 @@ enum {
>   	 * tail.
>   	 */
>   	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +
> +	/*
> +	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
> +	 * parent-child relationship (parallel submission, multi-lrc) that
> +	 * hit an error while generating requests in the execbuf IOCTL.
> +	 * Indicates this request should be skipped as another request in
> +	 * submission / relationship encountered an error.
> +	 */
> +	I915_FENCE_FLAG_SKIP_PARALLEL,
>   };
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index 4b7fc4647e46..90546fa58fc1 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>   	return i915_active_add_request(&vma->active, rq);
>   }
>   
> -int i915_vma_move_to_active(struct i915_vma *vma,
> -			    struct i915_request *rq,
> -			    unsigned int flags)
> +int _i915_vma_move_to_active(struct i915_vma *vma,
> +			     struct i915_request *rq,
> +			     struct dma_fence *fence,
> +			     unsigned int flags)
>   {
>   	struct drm_i915_gem_object *obj = vma->obj;
>   	int err;
> @@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>   			intel_frontbuffer_put(front);
>   		}
>   
> -		dma_resv_add_excl_fence(vma->resv, &rq->fence);
> -		obj->write_domain = I915_GEM_DOMAIN_RENDER;
> -		obj->read_domains = 0;
> +		if (fence) {
> +			dma_resv_add_excl_fence(vma->resv, fence);
> +			obj->write_domain = I915_GEM_DOMAIN_RENDER;
> +			obj->read_domains = 0;
> +		}
>   	} else {
>   		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
>   			err = dma_resv_reserve_shared(vma->resv, 1);
> @@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>   				return err;
>   		}
>   
> -		dma_resv_add_shared_fence(vma->resv, &rq->fence);
> -		obj->write_domain = 0;
> +		if (fence) {
> +			dma_resv_add_shared_fence(vma->resv, fence);
> +			obj->write_domain = 0;
> +		}
>   	}
>   
>   	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
> diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> index ed69f66c7ab0..648dbe744c96 100644
> --- a/drivers/gpu/drm/i915/i915_vma.h
> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
>   
>   int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
>   					   struct i915_request *rq);
> -int __must_check i915_vma_move_to_active(struct i915_vma *vma,
> -					 struct i915_request *rq,
> -					 unsigned int flags);
> +int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
> +					  struct i915_request *rq,
> +					  struct dma_fence *fence,
> +					  unsigned int flags);
> +static inline int __must_check
> +i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
> +			unsigned int flags)
> +{
> +	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
> +}
>   
>   #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
>   


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 21/26] drm/i915: Multi-BB execbuf
@ 2021-10-12 21:22     ` John Harrison
  0 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-12 21:22 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> Allow multiple batch buffers to be submitted in a single execbuf IOCTL
> after a context has been configured with the 'set_parallel' extension.
> The number of batches is implicit based on the context's configuration.
>
> This is implemented with a series of loops. First a loop is used to find
> all the batches, a loop to pin all the HW contexts, a loop to create all
> the requests, a loop to submit (emit BB start, etc...) all the requests,
> a loop to tie the requests to the VMAs they touch, and finally a loop to
> commit the requests to the backend.
>
> A composite fence is also created for the generated requests to return
> to the user and to stick in dma resv slots.
>
> No behavior from the existing IOCTL should be changed aside from when
> throttling because the ring for a context is full, wait on the request
throttling because the ring for -> throttling the ring because

full, wait -> full. In this situation, i915 will now wait

> while holding the object locks.
, previously it would have dropped the locks for the wait.

And maybe explain why this change is necessary?
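
Also, to sanity check my reading of the behavioural claim: from the
eb_pin_timeline() hunk below, the errors userspace can observe from the
throttle path look unchanged. A rough caller-side sketch of my understanding
(illustrative only, not from the patch; assumes the DRM fd was opened with
O_NONBLOCK for the -EWOULDBLOCK case, headers per the installed uapi):

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    static int submit(int drm_fd, struct drm_i915_gem_execbuffer2 *execbuf)
    {
            if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, execbuf) == 0)
                    return 0;
            if (errno == EWOULDBLOCK)
                    return -EWOULDBLOCK; /* O_NONBLOCK fd, ring full: retry later */
            if (errno == EINTR)
                    return -EINTR;       /* blocking throttle wait interrupted */
            return -errno;
    }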


>
> IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> media UMD: https://github.com/intel/media-driver/pull/1252
>
> v2:
>   (Matthew Brost)
>    - Return proper error value if i915_request_create fails
> v3:
>   (John Harrison)
>    - Add comment explaining create / add order loops + locking
>    - Update commit message explaining difference in IOCTL behavior
>    - Line wrap some comments
>    - eb_add_request returns void
>    - Return -EINVAL rather than triggering BUG_ON if cmd parser used
>   (Checkpatch)
>    - Check eb->batch_len[*current_batch]
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 793 ++++++++++++------
>   drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  10 +
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
>   drivers/gpu/drm/i915/i915_request.h           |   9 +
>   drivers/gpu/drm/i915/i915_vma.c               |  21 +-
>   drivers/gpu/drm/i915/i915_vma.h               |  13 +-
>   7 files changed, 599 insertions(+), 257 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 2f2434b52317..5c7fb6f68bbb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -244,17 +244,25 @@ struct i915_execbuffer {
>   	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
>   	struct eb_vma *vma;
>   
> -	struct intel_engine_cs *engine; /** engine to queue the request to */
> +	struct intel_gt *gt; /* gt for the execbuf */
>   	struct intel_context *context; /* logical state for the request */
>   	struct i915_gem_context *gem_context; /** caller's context */
>   
> -	struct i915_request *request; /** our request to build */
> -	struct eb_vma *batch; /** identity of the batch obj/vma */
> +	/** our requests to build */
> +	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
> +	/** identity of the batch obj/vma */
> +	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1];
>   	struct i915_vma *trampoline; /** trampoline used for chaining */
>   
> +	/** used for excl fence in dma_resv objects when > 1 BB submitted */
> +	struct dma_fence *composite_fence;
> +
>   	/** actual size of execobj[] as we may extend it for the cmdparser */
>   	unsigned int buffer_count;
>   
> +	/* number of batches in execbuf IOCTL */
> +	unsigned int num_batches;
> +
>   	/** list of vma not yet bound during reservation phase */
>   	struct list_head unbound;
>   
> @@ -281,7 +289,8 @@ struct i915_execbuffer {
>   
>   	u64 invalid_flags; /** Set of execobj.flags that are invalid */
>   
> -	u64 batch_len; /** Length of batch within object */
> +	/** Length of batch within object */
> +	u64 batch_len[MAX_ENGINE_INSTANCE + 1];
>   	u32 batch_start_offset; /** Location within object of batch */
>   	u32 batch_flags; /** Flags composed for emit_bb_start() */
>   	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
> @@ -299,14 +308,13 @@ struct i915_execbuffer {
>   };
>   
>   static int eb_parse(struct i915_execbuffer *eb);
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
> -					  bool throttle);
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
>   static void eb_unpin_engine(struct i915_execbuffer *eb);
>   
>   static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
>   {
> -	return intel_engine_requires_cmd_parser(eb->engine) ||
> -		(intel_engine_using_cmd_parser(eb->engine) &&
> +	return intel_engine_requires_cmd_parser(eb->context->engine) ||
> +		(intel_engine_using_cmd_parser(eb->context->engine) &&
>   		 eb->args->batch_len);
>   }
>   
> @@ -544,11 +552,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
>   	return 0;
>   }
>   
> -static void
> +static inline bool
> +is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
> +{
> +	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
> +		buffer_idx < eb->num_batches :
> +		buffer_idx >= eb->args->buffer_count - eb->num_batches;
> +}
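The index math here took me a second to parse, so for my own sanity, a
standalone reconstruction of which slots get treated as batches (mine, not
part of the patch):

    #include <stdbool.h>

    /* Mirrors is_batch_buffer() above for count = 5, num_batches = 2 */
    static bool is_batch(unsigned int idx, unsigned int count,
                         unsigned int num_batches, bool batch_first)
    {
            return batch_first ? idx < num_batches :
                                 idx >= count - num_batches;
    }
    /* I915_EXEC_BATCH_FIRST set:   indices 0 and 1 are the batches */
    /* I915_EXEC_BATCH_FIRST clear: indices 3 and 4 are the batches */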
> +
> +static int
>   eb_add_vma(struct i915_execbuffer *eb,
> -	   unsigned int i, unsigned batch_idx,
> +	   unsigned int *current_batch,
> +	   unsigned int i,
>   	   struct i915_vma *vma)
>   {
> +	struct drm_i915_private *i915 = eb->i915;
>   	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
>   	struct eb_vma *ev = &eb->vma[i];
>   
> @@ -575,15 +593,41 @@ eb_add_vma(struct i915_execbuffer *eb,
>   	 * Note that actual hangs have only been observed on gen7, but for
>   	 * paranoia do it everywhere.
>   	 */
> -	if (i == batch_idx) {
> +	if (is_batch_buffer(eb, i)) {
>   		if (entry->relocation_count &&
>   		    !(ev->flags & EXEC_OBJECT_PINNED))
>   			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
>   		if (eb->reloc_cache.has_fence)
>   			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
>   
> -		eb->batch = ev;
> +		eb->batches[*current_batch] = ev;
> +
> +		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
> +			drm_dbg(&i915->drm,
> +				"Attempting to use self-modifying batch buffer\n");
> +			return -EINVAL;
> +		}
> +
> +		if (range_overflows_t(u64,
> +				      eb->batch_start_offset,
> +				      eb->args->batch_len,
> +				      ev->vma->size)) {
> +			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> +			return -EINVAL;
> +		}
> +
> +		if (eb->args->batch_len == 0)
> +			eb->batch_len[*current_batch] = ev->vma->size -
> +				eb->batch_start_offset;
> +		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
> +			drm_dbg(&i915->drm, "Invalid batch length\n");
> +			return -EINVAL;
> +		}
> +
> +		++*current_batch;
>   	}
> +
> +	return 0;
>   }
>   
>   static inline int use_cpu_reloc(const struct reloc_cache *cache,
> @@ -727,14 +771,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
>   	} while (1);
>   }
>   
> -static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
> -{
> -	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
> -		return 0;
> -	else
> -		return eb->buffer_count - 1;
> -}
> -
>   static int eb_select_context(struct i915_execbuffer *eb)
>   {
>   	struct i915_gem_context *ctx;
> @@ -839,9 +875,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
>   
>   static int eb_lookup_vmas(struct i915_execbuffer *eb)
>   {
> -	struct drm_i915_private *i915 = eb->i915;
> -	unsigned int batch = eb_batch_index(eb);
> -	unsigned int i;
> +	unsigned int i, current_batch = 0;
>   	int err = 0;
>   
>   	INIT_LIST_HEAD(&eb->relocs);
> @@ -861,7 +895,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>   			goto err;
>   		}
>   
> -		eb_add_vma(eb, i, batch, vma);
> +		err = eb_add_vma(eb, &current_batch, i, vma);
> +		if (err)
> +			return err;
>   
>   		if (i915_gem_object_is_userptr(vma->obj)) {
>   			err = i915_gem_object_userptr_submit_init(vma->obj);
> @@ -884,26 +920,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>   		}
>   	}
>   
> -	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
> -		drm_dbg(&i915->drm,
> -			"Attempting to use self-modifying batch buffer\n");
> -		return -EINVAL;
> -	}
> -
> -	if (range_overflows_t(u64,
> -			      eb->batch_start_offset, eb->batch_len,
> -			      eb->batch->vma->size)) {
> -		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> -		return -EINVAL;
> -	}
> -
> -	if (eb->batch_len == 0)
> -		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
> -	if (unlikely(eb->batch_len == 0)) { /* impossible! */
> -		drm_dbg(&i915->drm, "Invalid batch length\n");
> -		return -EINVAL;
> -	}
> -
>   	return 0;
>   
>   err:
> @@ -1636,8 +1652,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
>   	return 0;
>   }
>   
> -static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> -					   struct i915_request *rq)
> +static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
>   {
>   	bool have_copy = false;
>   	struct eb_vma *ev;
> @@ -1653,21 +1668,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>   	eb_release_vmas(eb, false);
>   	i915_gem_ww_ctx_fini(&eb->ww);
>   
> -	if (rq) {
> -		/* nonblocking is always false */
> -		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> -				      MAX_SCHEDULE_TIMEOUT) < 0) {
> -			i915_request_put(rq);
> -			rq = NULL;
> -
> -			err = -EINTR;
> -			goto err_relock;
> -		}
> -
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>   	/*
>   	 * We take 3 passes through the slowpatch.
>   	 *
> @@ -1694,28 +1694,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>   	if (!err)
>   		err = eb_reinit_userptr(eb);
>   
> -err_relock:
>   	i915_gem_ww_ctx_init(&eb->ww, true);
>   	if (err)
>   		goto out;
>   
>   	/* reacquire the objects */
>   repeat_validate:
> -	rq = eb_pin_engine(eb, false);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, false);
> +	if (err)
>   		goto err;
> -	}
> -
> -	/* We didn't throttle, should be NULL */
> -	GEM_WARN_ON(rq);
>   
>   	err = eb_validate_vmas(eb);
>   	if (err)
>   		goto err;
>   
> -	GEM_BUG_ON(!eb->batch);
> +	GEM_BUG_ON(!eb->batches[0]);
>   
>   	list_for_each_entry(ev, &eb->relocs, reloc_link) {
>   		if (!have_copy) {
> @@ -1779,46 +1772,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>   		}
>   	}
>   
> -	if (rq)
> -		i915_request_put(rq);
> -
>   	return err;
>   }
>   
>   static int eb_relocate_parse(struct i915_execbuffer *eb)
>   {
>   	int err;
> -	struct i915_request *rq = NULL;
>   	bool throttle = true;
>   
>   retry:
> -	rq = eb_pin_engine(eb, throttle);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, throttle);
> +	if (err) {
>   		if (err != -EDEADLK)
>   			return err;
>   
>   		goto err;
>   	}
>   
> -	if (rq) {
> -		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> -
> -		/* Need to drop all locks now for throttling, take slowpath */
> -		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
> -		if (err == -ETIME) {
> -			if (nonblock) {
> -				err = -EWOULDBLOCK;
> -				i915_request_put(rq);
> -				goto err;
> -			}
> -			goto slow;
> -		}
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>   	/* only throttle once, even if we didn't need to throttle */
>   	throttle = false;
>   
> @@ -1858,7 +1828,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>   	return err;
>   
>   slow:
> -	err = eb_relocate_parse_slow(eb, rq);
> +	err = eb_relocate_parse_slow(eb);
>   	if (err)
>   		/*
>   		 * If the user expects the execobject.offset and
> @@ -1872,11 +1842,40 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>   	return err;
>   }
>   
> +/*
> + * Using two helper loops for the order in which requests / batches are created
> + * and added to the backend. Requests are created in order from the parent to
> + * the last child. Requests are added in the reverse order, from the last child to
> + * parent. This is down from locking reasons as the timeline lock is acquired
down from -> done for

John.

> + * during request creation and released when the request is added to the
> + * backend. To make lockdep happy (see intel_context_timeline_lock) this must be
> + * the ordering.
> + */
> +#define for_each_batch_create_order(_eb, _i) \
> +	for (_i = 0; _i < (_eb)->num_batches; ++_i)
> +#define for_each_batch_add_order(_eb, _i) \
> +	BUILD_BUG_ON(!typecheck(int, _i)); \
> +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
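For anyone else reviewing: as I read it, combined with the
intel_context_timeline_lock() change further down, a parent with two children
pairs up as below (my sketch, not from the patch):

    /*
     * create order:
     *   mutex_lock_interruptible_nested(&parent->timeline->mutex, 0);
     *   mutex_lock_interruptible_nested(&child0->timeline->mutex, 1);
     *   mutex_lock_interruptible_nested(&child1->timeline->mutex, 2);
     *
     * add order:
     *   mutex_unlock(&child1->timeline->mutex);
     *   mutex_unlock(&child0->timeline->mutex);
     *   mutex_unlock(&parent->timeline->mutex);
     */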
> +
> +static struct i915_request *
> +eb_find_first_request_added(struct i915_execbuffer *eb)
> +{
> +	int i;
> +
> +	for_each_batch_add_order(eb, i)
> +		if (eb->requests[i])
> +			return eb->requests[i];
> +
> +	GEM_BUG_ON("Request not found");
> +
> +	return NULL;
> +}
> +
>   static int eb_move_to_gpu(struct i915_execbuffer *eb)
>   {
>   	const unsigned int count = eb->buffer_count;
>   	unsigned int i = count;
> -	int err = 0;
> +	int err = 0, j;
>   
>   	while (i--) {
>   		struct eb_vma *ev = &eb->vma[i];
> @@ -1889,11 +1888,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>   		if (flags & EXEC_OBJECT_CAPTURE) {
>   			struct i915_capture_list *capture;
>   
> -			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> -			if (capture) {
> -				capture->next = eb->request->capture_list;
> -				capture->vma = vma;
> -				eb->request->capture_list = capture;
> +			for_each_batch_create_order(eb, j) {
> +				if (!eb->requests[j])
> +					break;
> +
> +				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> +				if (capture) {
> +					capture->next =
> +						eb->requests[j]->capture_list;
> +					capture->vma = vma;
> +					eb->requests[j]->capture_list = capture;
> +				}
>   			}
>   		}
>   
> @@ -1914,14 +1919,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>   				flags &= ~EXEC_OBJECT_ASYNC;
>   		}
>   
> +		/* We only need to await on the first request */
>   		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
>   			err = i915_request_await_object
> -				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
> +				(eb_find_first_request_added(eb), obj,
> +				 flags & EXEC_OBJECT_WRITE);
>   		}
>   
> -		if (err == 0)
> -			err = i915_vma_move_to_active(vma, eb->request,
> -						      flags | __EXEC_OBJECT_NO_RESERVE);
> +		for_each_batch_add_order(eb, j) {
> +			if (err)
> +				break;
> +			if (!eb->requests[j])
> +				continue;
> +
> +			err = _i915_vma_move_to_active(vma, eb->requests[j],
> +						       j ? NULL :
> +						       eb->composite_fence ?
> +						       eb->composite_fence :
> +						       &eb->requests[j]->fence,
> +						       flags | __EXEC_OBJECT_NO_RESERVE);
> +		}
>   	}
>   
>   #ifdef CONFIG_MMU_NOTIFIER
> @@ -1952,11 +1969,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>   		goto err_skip;
>   
>   	/* Unconditionally flush any chipset caches (for streaming writes). */
> -	intel_gt_chipset_flush(eb->engine->gt);
> +	intel_gt_chipset_flush(eb->gt);
>   	return 0;
>   
>   err_skip:
> -	i915_request_set_error_once(eb->request, err);
> +	for_each_batch_create_order(eb, j) {
> +		if (!eb->requests[j])
> +			break;
> +
> +		i915_request_set_error_once(eb->requests[j], err);
> +	}
>   	return err;
>   }
>   
> @@ -2051,14 +2073,17 @@ static int eb_parse(struct i915_execbuffer *eb)
>   	int err;
>   
>   	if (!eb_use_cmdparser(eb)) {
> -		batch = eb_dispatch_secure(eb, eb->batch->vma);
> +		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
>   		if (IS_ERR(batch))
>   			return PTR_ERR(batch);
>   
>   		goto secure_batch;
>   	}
>   
> -	len = eb->batch_len;
> +	if (intel_context_is_parallel(eb->context))
> +		return -EINVAL;
> +
> +	len = eb->batch_len[0];
>   	if (!CMDPARSER_USES_GGTT(eb->i915)) {
>   		/*
>   		 * ppGTT backed shadow buffers must be mapped RO, to prevent
> @@ -2072,11 +2097,11 @@ static int eb_parse(struct i915_execbuffer *eb)
>   	} else {
>   		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
>   	}
> -	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
> +	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
>   		return -EINVAL;
>   
>   	if (!pool) {
> -		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
> +		pool = intel_gt_get_buffer_pool(eb->gt, len,
>   						I915_MAP_WB);
>   		if (IS_ERR(pool))
>   			return PTR_ERR(pool);
> @@ -2101,7 +2126,7 @@ static int eb_parse(struct i915_execbuffer *eb)
>   		trampoline = shadow;
>   
>   		shadow = shadow_batch_pin(eb, pool->obj,
> -					  &eb->engine->gt->ggtt->vm,
> +					  &eb->gt->ggtt->vm,
>   					  PIN_GLOBAL);
>   		if (IS_ERR(shadow)) {
>   			err = PTR_ERR(shadow);
> @@ -2123,26 +2148,29 @@ static int eb_parse(struct i915_execbuffer *eb)
>   	if (err)
>   		goto err_trampoline;
>   
> -	err = intel_engine_cmd_parser(eb->engine,
> -				      eb->batch->vma,
> +	err = intel_engine_cmd_parser(eb->context->engine,
> +				      eb->batches[0]->vma,
>   				      eb->batch_start_offset,
> -				      eb->batch_len,
> +				      eb->batch_len[0],
>   				      shadow, trampoline);
>   	if (err)
>   		goto err_unpin_batch;
>   
> -	eb->batch = &eb->vma[eb->buffer_count++];
> -	eb->batch->vma = i915_vma_get(shadow);
> -	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> +	eb->batches[0] = &eb->vma[eb->buffer_count++];
> +	eb->batches[0]->vma = i915_vma_get(shadow);
> +	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
>   
>   	eb->trampoline = trampoline;
>   	eb->batch_start_offset = 0;
>   
>   secure_batch:
>   	if (batch) {
> -		eb->batch = &eb->vma[eb->buffer_count++];
> -		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> -		eb->batch->vma = i915_vma_get(batch);
> +		if (intel_context_is_parallel(eb->context))
> +			return -EINVAL;
> +
> +		eb->batches[0] = &eb->vma[eb->buffer_count++];
> +		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
> +		eb->batches[0]->vma = i915_vma_get(batch);
>   	}
>   	return 0;
>   
> @@ -2158,19 +2186,18 @@ static int eb_parse(struct i915_execbuffer *eb)
>   	return err;
>   }
>   
> -static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> +static int eb_request_submit(struct i915_execbuffer *eb,
> +			     struct i915_request *rq,
> +			     struct i915_vma *batch,
> +			     u64 batch_len)
>   {
>   	int err;
>   
> -	if (intel_context_nopreempt(eb->context))
> -		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
> -
> -	err = eb_move_to_gpu(eb);
> -	if (err)
> -		return err;
> +	if (intel_context_nopreempt(rq->context))
> +		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
>   
>   	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
> -		err = i915_reset_gen7_sol_offsets(eb->request);
> +		err = i915_reset_gen7_sol_offsets(rq);
>   		if (err)
>   			return err;
>   	}
> @@ -2181,26 +2208,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>   	 * allows us to determine if the batch is still waiting on the GPU
>   	 * or actually running by checking the breadcrumb.
>   	 */
> -	if (eb->engine->emit_init_breadcrumb) {
> -		err = eb->engine->emit_init_breadcrumb(eb->request);
> +	if (rq->context->engine->emit_init_breadcrumb) {
> +		err = rq->context->engine->emit_init_breadcrumb(rq);
>   		if (err)
>   			return err;
>   	}
>   
> -	err = eb->engine->emit_bb_start(eb->request,
> -					batch->node.start +
> -					eb->batch_start_offset,
> -					eb->batch_len,
> -					eb->batch_flags);
> +	err = rq->context->engine->emit_bb_start(rq,
> +						 batch->node.start +
> +						 eb->batch_start_offset,
> +						 batch_len,
> +						 eb->batch_flags);
>   	if (err)
>   		return err;
>   
>   	if (eb->trampoline) {
> +		GEM_BUG_ON(intel_context_is_parallel(rq->context));
>   		GEM_BUG_ON(eb->batch_start_offset);
> -		err = eb->engine->emit_bb_start(eb->request,
> -						eb->trampoline->node.start +
> -						eb->batch_len,
> -						0, 0);
> +		err = rq->context->engine->emit_bb_start(rq,
> +							 eb->trampoline->node.start +
> +							 batch_len, 0, 0);
>   		if (err)
>   			return err;
>   	}
> @@ -2208,6 +2235,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>   	return 0;
>   }
>   
> +static int eb_submit(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +	int err;
> +
> +	err = eb_move_to_gpu(eb);
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
> +		if (!err)
> +			err = eb_request_submit(eb, eb->requests[i],
> +						eb->batches[i]->vma,
> +						eb->batch_len[i]);
> +	}
> +
> +	return err;
> +}
> +
>   static int num_vcs_engines(const struct drm_i915_private *i915)
>   {
>   	return hweight_long(VDBOX_MASK(&i915->gt));
> @@ -2273,26 +2321,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
>   	return i915_request_get(rq);
>   }
>   
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
> +			   bool throttle)
>   {
> -	struct intel_context *ce = eb->context;
>   	struct intel_timeline *tl;
> -	struct i915_request *rq = NULL;
> -	int err;
> -
> -	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> -
> -	if (unlikely(intel_context_is_banned(ce)))
> -		return ERR_PTR(-EIO);
> -
> -	/*
> -	 * Pinning the contexts may generate requests in order to acquire
> -	 * GGTT space, so do this first before we reserve a seqno for
> -	 * ourselves.
> -	 */
> -	err = intel_context_pin_ww(ce, &eb->ww);
> -	if (err)
> -		return ERR_PTR(err);
> +	struct i915_request *rq = NULL;
>   
>   	/*
>   	 * Take a local wakeref for preparing to dispatch the execbuf as
> @@ -2303,33 +2336,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
>   	 * taken on the engine, and the parent device.
>   	 */
>   	tl = intel_context_timeline_lock(ce);
> -	if (IS_ERR(tl)) {
> -		intel_context_unpin(ce);
> -		return ERR_CAST(tl);
> -	}
> +	if (IS_ERR(tl))
> +		return PTR_ERR(tl);
>   
>   	intel_context_enter(ce);
>   	if (throttle)
>   		rq = eb_throttle(eb, ce);
>   	intel_context_timeline_unlock(tl);
>   
> +	if (rq) {
> +		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> +		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
> +
> +		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> +				      timeout) < 0) {
> +			i915_request_put(rq);
> +
> +			tl = intel_context_timeline_lock(ce);
> +			intel_context_exit(ce);
> +			intel_context_timeline_unlock(tl);
> +
> +			if (nonblock)
> +				return -EWOULDBLOCK;
> +			else
> +				return -EINTR;
> +		}
> +		i915_request_put(rq);
> +	}
> +
> +	return 0;
> +}
> +
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +{
> +	struct intel_context *ce = eb->context, *child;
> +	int err;
> +	int i = 0, j = 0;
> +
> +	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> +
> +	if (unlikely(intel_context_is_banned(ce)))
> +		return -EIO;
> +
> +	/*
> +	 * Pinning the contexts may generate requests in order to acquire
> +	 * GGTT space, so do this first before we reserve a seqno for
> +	 * ourselves.
> +	 */
> +	err = intel_context_pin_ww(ce, &eb->ww);
> +	if (err)
> +		return err;
> +	for_each_child(ce, child) {
> +		err = intel_context_pin_ww(child, &eb->ww);
> +		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
> +	}
> +
> +	for_each_child(ce, child) {
> +		err = eb_pin_timeline(eb, child, throttle);
> +		if (err)
> +			goto unwind;
> +		++i;
> +	}
> +	err = eb_pin_timeline(eb, ce, throttle);
> +	if (err)
> +		goto unwind;
> +
>   	eb->args->flags |= __EXEC_ENGINE_PINNED;
> -	return rq;
> +	return 0;
> +
> +unwind:
> +	for_each_child(ce, child) {
> +		if (j++ < i) {
> +			mutex_lock(&child->timeline->mutex);
> +			intel_context_exit(child);
> +			mutex_unlock(&child->timeline->mutex);
> +		}
> +	}
> +	for_each_child(ce, child)
> +		intel_context_unpin(child);
> +	intel_context_unpin(ce);
> +	return err;
>   }
>   
>   static void eb_unpin_engine(struct i915_execbuffer *eb)
>   {
> -	struct intel_context *ce = eb->context;
> -	struct intel_timeline *tl = ce->timeline;
> +	struct intel_context *ce = eb->context, *child;
>   
>   	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
>   		return;
>   
>   	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
>   
> -	mutex_lock(&tl->mutex);
> +	for_each_child(ce, child) {
> +		mutex_lock(&child->timeline->mutex);
> +		intel_context_exit(child);
> +		mutex_unlock(&child->timeline->mutex);
> +
> +		intel_context_unpin(child);
> +	}
> +
> +	mutex_lock(&ce->timeline->mutex);
>   	intel_context_exit(ce);
> -	mutex_unlock(&tl->mutex);
> +	mutex_unlock(&ce->timeline->mutex);
>   
>   	intel_context_unpin(ce);
>   }
> @@ -2380,7 +2488,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
>   static int
>   eb_select_engine(struct i915_execbuffer *eb)
>   {
> -	struct intel_context *ce;
> +	struct intel_context *ce, *child;
>   	unsigned int idx;
>   	int err;
>   
> @@ -2393,6 +2501,20 @@ eb_select_engine(struct i915_execbuffer *eb)
>   	if (IS_ERR(ce))
>   		return PTR_ERR(ce);
>   
> +	if (intel_context_is_parallel(ce)) {
> +		if (eb->buffer_count < ce->parallel.number_children + 1) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +		if (eb->batch_start_offset || eb->args->batch_len) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +	}
> +	eb->num_batches = ce->parallel.number_children + 1;
> +
> +	for_each_child(ce, child)
> +		intel_context_get(child);
>   	intel_gt_pm_get(ce->engine->gt);
>   
>   	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
> @@ -2400,6 +2522,13 @@ eb_select_engine(struct i915_execbuffer *eb)
>   		if (err)
>   			goto err;
>   	}
> +	for_each_child(ce, child) {
> +		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
> +			err = intel_context_alloc_state(child);
> +			if (err)
> +				goto err;
> +		}
> +	}
>   
>   	/*
>   	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
> @@ -2410,7 +2539,7 @@ eb_select_engine(struct i915_execbuffer *eb)
>   		goto err;
>   
>   	eb->context = ce;
> -	eb->engine = ce->engine;
> +	eb->gt = ce->engine->gt;
>   
>   	/*
>   	 * Make sure engine pool stays alive even if we call intel_context_put
> @@ -2421,6 +2550,8 @@ eb_select_engine(struct i915_execbuffer *eb)
>   
>   err:
>   	intel_gt_pm_put(ce->engine->gt);
> +	for_each_child(ce, child)
> +		intel_context_put(child);
>   	intel_context_put(ce);
>   	return err;
>   }
> @@ -2428,7 +2559,11 @@ eb_select_engine(struct i915_execbuffer *eb)
>   static void
>   eb_put_engine(struct i915_execbuffer *eb)
>   {
> -	intel_gt_pm_put(eb->engine->gt);
> +	struct intel_context *child;
> +
> +	intel_gt_pm_put(eb->gt);
> +	for_each_child(eb->context, child)
> +		intel_context_put(child);
>   	intel_context_put(eb->context);
>   }
>   
> @@ -2651,7 +2786,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
>   }
>   
>   static int
> -await_fence_array(struct i915_execbuffer *eb)
> +await_fence_array(struct i915_execbuffer *eb,
> +		  struct i915_request *rq)
>   {
>   	unsigned int n;
>   	int err;
> @@ -2665,8 +2801,7 @@ await_fence_array(struct i915_execbuffer *eb)
>   		if (!eb->fences[n].dma_fence)
>   			continue;
>   
> -		err = i915_request_await_dma_fence(eb->request,
> -						   eb->fences[n].dma_fence);
> +		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
>   		if (err < 0)
>   			return err;
>   	}
> @@ -2674,9 +2809,9 @@ await_fence_array(struct i915_execbuffer *eb)
>   	return 0;
>   }
>   
> -static void signal_fence_array(const struct i915_execbuffer *eb)
> +static void signal_fence_array(const struct i915_execbuffer *eb,
> +			       struct dma_fence * const fence)
>   {
> -	struct dma_fence * const fence = &eb->request->fence;
>   	unsigned int n;
>   
>   	for (n = 0; n < eb->num_fences; n++) {
> @@ -2724,9 +2859,8 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
>   			break;
>   }
>   
> -static int eb_request_add(struct i915_execbuffer *eb, int err)
> +static void eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
>   {
> -	struct i915_request *rq = eb->request;
>   	struct intel_timeline * const tl = i915_request_timeline(rq);
>   	struct i915_sched_attr attr = {};
>   	struct i915_request *prev;
> @@ -2741,11 +2875,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>   	/* Check that the context wasn't destroyed before submission */
>   	if (likely(!intel_context_is_closed(eb->context))) {
>   		attr = eb->gem_context->sched;
> -	} else {
> -		/* Serialise with context_close via the add_to_timeline */
> -		i915_request_set_error_once(rq, -ENOENT);
> -		__i915_request_skip(rq);
> -		err = -ENOENT; /* override any transient errors */
>   	}
>   
>   	__i915_request_queue(rq, &attr);
> @@ -2755,6 +2884,42 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>   		retire_requests(tl, prev);
>   
>   	mutex_unlock(&tl->mutex);
> +}
> +
> +static int eb_requests_add(struct i915_execbuffer *eb, int err)
> +{
> +	int i;
> +
> +	/*
> +	 * We iterate in reverse order of creation so that the timeline mutexes
> +	 * are released in the reverse order of acquisition.
> +	 */
> +	for_each_batch_add_order(eb, i) {
> +		struct i915_request *rq = eb->requests[i];
> +
> +		if (!rq)
> +			continue;
> +
> +		if (unlikely(intel_context_is_closed(eb->context))) {
> +			/* Serialise with context_close via the add_to_timeline */
> +			i915_request_set_error_once(rq, -ENOENT);
> +			__i915_request_skip(rq);
> +			err = -ENOENT; /* override any transient errors */
> +		}
> +
> +		if (intel_context_is_parallel(eb->context)) {
> +			if (err) {
> +				__i915_request_skip(rq);
> +				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
> +					&rq->fence.flags);
> +			}
> +			if (i == 0)
> +				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +					&rq->fence.flags);
> +		}
> +
> +		eb_request_add(eb, rq);
> +	}
>   
>   	return err;
>   }
> @@ -2785,6 +2950,182 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
>   				    eb);
>   }
>   
> +static void eb_requests_get(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_get(eb->requests[i]);
> +	}
> +}
> +
> +static void eb_requests_put(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_put(eb->requests[i]);
> +	}
> +}
> +
> +static struct sync_file *
> +eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	struct dma_fence_array *fence_array;
> +	struct dma_fence **fences;
> +	unsigned int i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(eb->context));
> +
> +	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
> +	if (!fences)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for_each_batch_create_order(eb, i)
> +		fences[i] = &eb->requests[i]->fence;
> +
> +	fence_array = dma_fence_array_create(eb->num_batches,
> +					     fences,
> +					     eb->context->parallel.fence_context,
> +					     eb->context->parallel.seqno,
> +					     false);
> +	if (!fence_array) {
> +		kfree(fences);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* Move ownership to the dma_fence_array created above */
> +	for_each_batch_create_order(eb, i)
> +		dma_fence_get(fences[i]);
> +
> +	if (out_fence_fd != -1) {
> +		out_fence = sync_file_create(&fence_array->base);
> +		/* sync_file now owns fence_array, drop creation ref */
> +		dma_fence_put(&fence_array->base);
> +		if (!out_fence)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
> +	eb->composite_fence = &fence_array->base;
> +
> +	return out_fence;
> +}
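Side note for UMD folks following along: the fd returned here is a regular
sync_file, so waiting on the whole parallel submission should be as simple as
the below (illustrative, not from the patch; out_fence_fd is the fd the IOCTL
hands back):

    #include <poll.h>

    struct pollfd pfd = { .fd = out_fence_fd, .events = POLLIN };
    /* wakes once every fence in the composite array has signalled */
    poll(&pfd, 1, -1);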
> +
> +static struct sync_file *
> +eb_fences_add(struct i915_execbuffer *eb, struct i915_request *rq,
> +	      struct dma_fence *in_fence, int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	int err;
> +
> +	if (unlikely(eb->gem_context->syncobj)) {
> +		struct dma_fence *fence;
> +
> +		fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
> +		err = i915_request_await_dma_fence(rq, fence);
> +		dma_fence_put(fence);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (in_fence) {
> +		if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
> +			err = i915_request_await_execution(rq, in_fence);
> +		else
> +			err = i915_request_await_dma_fence(rq, in_fence);
> +		if (err < 0)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (eb->fences) {
> +		err = await_fence_array(eb, rq);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
> +	if (intel_context_is_parallel(eb->context)) {
> +		out_fence = eb_composite_fence_create(eb, out_fence_fd);
> +		if (IS_ERR(out_fence))
> +			return ERR_PTR(-ENOMEM);
> +	} else if (out_fence_fd != -1) {
> +		out_fence = sync_file_create(&rq->fence);
> +		if (!out_fence)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
> +	return out_fence;
> +}
> +
> +static struct intel_context *
> +eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
> +{
> +	struct intel_context *child;
> +
> +	if (likely(context_number == 0))
> +		return eb->context;
> +
> +	for_each_child(eb->context, child)
> +		if (!--context_number)
> +			return child;
> +
> +	GEM_BUG_ON("Context not found");
> +
> +	return NULL;
> +}
> +
> +static struct sync_file *
> +eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
> +		   int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		/* Allocate a request for this batch buffer nice and early. */
> +		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
> +		if (IS_ERR(eb->requests[i])) {
> +			out_fence = ERR_PTR(PTR_ERR(eb->requests[i]));
> +			eb->requests[i] = NULL;
> +			return out_fence;
> +		}
> +
> +		/*
> +		 * Only the first request added (committed to backend) has to
> +		 * take the in fences into account as all subsequent requests
> +		 * will have fences inserted in between them.
> +		 */
> +		if (i + 1 == eb->num_batches) {
> +			out_fence = eb_fences_add(eb, eb->requests[i],
> +						  in_fence, out_fence_fd);
> +			if (IS_ERR(out_fence))
> +				return out_fence;
> +		}
> +
> +		/*
> +		 * Whilst this request exists, batch_obj will be on the
> +		 * active_list, and so will hold the active reference. Only when
> +		 * this request is retired will the batch_obj be moved onto
> +		 * the inactive_list and lose its active reference. Hence we do
> +		 * not need to explicitly hold another reference here.
> +		 */
> +		eb->requests[i]->batch = eb->batches[i]->vma;
> +		if (eb->batch_pool) {
> +			GEM_BUG_ON(intel_context_is_parallel(eb->context));
> +			intel_gt_buffer_pool_mark_active(eb->batch_pool,
> +							 eb->requests[i]);
> +		}
> +	}
> +
> +	return out_fence;
> +}
> +
>   static int
>   i915_gem_do_execbuffer(struct drm_device *dev,
>   		       struct drm_file *file,
> @@ -2795,7 +3136,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>   	struct i915_execbuffer eb;
>   	struct dma_fence *in_fence = NULL;
>   	struct sync_file *out_fence = NULL;
> -	struct i915_vma *batch;
>   	int out_fence_fd = -1;
>   	int err;
>   
> @@ -2819,12 +3159,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>   
>   	eb.buffer_count = args->buffer_count;
>   	eb.batch_start_offset = args->batch_start_offset;
> -	eb.batch_len = args->batch_len;
>   	eb.trampoline = NULL;
>   
>   	eb.fences = NULL;
>   	eb.num_fences = 0;
>   
> +	memset(eb.requests, 0, sizeof(struct i915_request *) *
> +	       ARRAY_SIZE(eb.requests));
> +	eb.composite_fence = NULL;
> +
>   	eb.batch_flags = 0;
>   	if (args->flags & I915_EXEC_SECURE) {
>   		if (GRAPHICS_VER(i915) >= 11)
> @@ -2908,70 +3251,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>   
>   	ww_acquire_done(&eb.ww.ctx);
>   
> -	batch = eb.batch->vma;
> -
> -	/* Allocate a request for this batch buffer nice and early. */
> -	eb.request = i915_request_create(eb.context);
> -	if (IS_ERR(eb.request)) {
> -		err = PTR_ERR(eb.request);
> -		goto err_vma;
> -	}
> -
> -	if (unlikely(eb.gem_context->syncobj)) {
> -		struct dma_fence *fence;
> -
> -		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
> -		err = i915_request_await_dma_fence(eb.request, fence);
> -		dma_fence_put(fence);
> -		if (err)
> -			goto err_ext;
> -	}
> -
> -	if (in_fence) {
> -		if (args->flags & I915_EXEC_FENCE_SUBMIT)
> -			err = i915_request_await_execution(eb.request,
> -							   in_fence);
> -		else
> -			err = i915_request_await_dma_fence(eb.request,
> -							   in_fence);
> -		if (err < 0)
> -			goto err_request;
> -	}
> -
> -	if (eb.fences) {
> -		err = await_fence_array(&eb);
> -		if (err)
> +	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
> +	if (IS_ERR(out_fence)) {
> +		err = PTR_ERR(out_fence);
> +		if (eb.requests[0])
>   			goto err_request;
> +		else
> +			goto err_vma;
>   	}
>   
> -	if (out_fence_fd != -1) {
> -		out_fence = sync_file_create(&eb.request->fence);
> -		if (!out_fence) {
> -			err = -ENOMEM;
> -			goto err_request;
> -		}
> -	}
> -
> -	/*
> -	 * Whilst this request exists, batch_obj will be on the
> -	 * active_list, and so will hold the active reference. Only when this
> -	 * request is retired will the the batch_obj be moved onto the
> -	 * inactive_list and lose its active reference. Hence we do not need
> -	 * to explicitly hold another reference here.
> -	 */
> -	eb.request->batch = batch;
> -	if (eb.batch_pool)
> -		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
> -
> -	trace_i915_request_queue(eb.request, eb.batch_flags);
> -	err = eb_submit(&eb, batch);
> +	err = eb_submit(&eb);
>   
>   err_request:
> -	i915_request_get(eb.request);
> -	err = eb_request_add(&eb, err);
> +	eb_requests_get(&eb);
> +	err = eb_requests_add(&eb, err);
>   
>   	if (eb.fences)
> -		signal_fence_array(&eb);
> +		signal_fence_array(&eb, eb.composite_fence ?
> +				   eb.composite_fence :
> +				   &eb.requests[0]->fence);
>   
>   	if (out_fence) {
>   		if (err == 0) {
> @@ -2986,10 +3284,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>   
>   	if (unlikely(eb.gem_context->syncobj)) {
>   		drm_syncobj_replace_fence(eb.gem_context->syncobj,
> -					  &eb.request->fence);
> +					  eb.composite_fence ?
> +					  eb.composite_fence :
> +					  &eb.requests[0]->fence);
>   	}
>   
> -	i915_request_put(eb.request);
> +	if (!out_fence && eb.composite_fence)
> +		dma_fence_put(eb.composite_fence);
> +
> +	eb_requests_put(&eb);
>   
>   err_vma:
>   	eb_release_vmas(&eb, true);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index 1bc705f98e2a..1781419fa105 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -239,7 +239,13 @@ intel_context_timeline_lock(struct intel_context *ce)
>   	struct intel_timeline *tl = ce->timeline;
>   	int err;
>   
> -	err = mutex_lock_interruptible(&tl->mutex);
> +	if (intel_context_is_parent(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
> +	else if (intel_context_is_child(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex,
> +						      ce->parallel.child_index + 1);
> +	else
> +		err = mutex_lock_interruptible(&tl->mutex);
>   	if (err)
>   		return ERR_PTR(err);
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 95a5b94b4ece..9e0177dc5484 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -248,6 +248,16 @@ struct intel_context {
>   		 * context
>   		 */
>   		struct i915_request *last_rq;
> +		/**
> +		 * @fence_context: fence context composite fence when doing
> +		 * parallel submission
> +		 */
> +		u64 fence_context;
> +		/**
> +		 * @seqno: seqno for composite fence when doing parallel
> +		 * submission
> +		 */
> +		u32 seqno;
>   		/** @number_children: number of children if parent */
>   		u8 number_children;
>   		/** @child_index: index into child_list if child */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f28e36aa77c2..83b0d2a114af 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3094,6 +3094,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
>   		}
>   	}
>   
> +	parent->parallel.fence_context = dma_fence_context_alloc(1);
> +
>   	parent->engine->emit_bb_start =
>   		emit_bb_start_parent_no_preempt_mid_batch;
>   	parent->engine->emit_fini_breadcrumb =
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 8950785e55d6..24db8459376b 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -147,6 +147,15 @@ enum {
>   	 * tail.
>   	 */
>   	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +
> +	/*
> +	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
> +	 * parent-child relationship (parallel submission, multi-lrc) that
> +	 * hit an error while generating requests in the execbuf IOCTL.
> +	 * Indicates this request should be skipped as another request in
> +	 * the submission / relationship encountered an error.
> +	 */
> +	I915_FENCE_FLAG_SKIP_PARALLEL,
>   };
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index 4b7fc4647e46..90546fa58fc1 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>   	return i915_active_add_request(&vma->active, rq);
>   }
>   
> -int i915_vma_move_to_active(struct i915_vma *vma,
> -			    struct i915_request *rq,
> -			    unsigned int flags)
> +int _i915_vma_move_to_active(struct i915_vma *vma,
> +			     struct i915_request *rq,
> +			     struct dma_fence *fence,
> +			     unsigned int flags)
>   {
>   	struct drm_i915_gem_object *obj = vma->obj;
>   	int err;
> @@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>   			intel_frontbuffer_put(front);
>   		}
>   
> -		dma_resv_add_excl_fence(vma->resv, &rq->fence);
> -		obj->write_domain = I915_GEM_DOMAIN_RENDER;
> -		obj->read_domains = 0;
> +		if (fence) {
> +			dma_resv_add_excl_fence(vma->resv, fence);
> +			obj->write_domain = I915_GEM_DOMAIN_RENDER;
> +			obj->read_domains = 0;
> +		}
>   	} else {
>   		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
>   			err = dma_resv_reserve_shared(vma->resv, 1);
> @@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>   				return err;
>   		}
>   
> -		dma_resv_add_shared_fence(vma->resv, &rq->fence);
> -		obj->write_domain = 0;
> +		if (fence) {
> +			dma_resv_add_shared_fence(vma->resv, fence);
> +			obj->write_domain = 0;
> +		}
>   	}
>   
>   	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
> diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> index ed69f66c7ab0..648dbe744c96 100644
> --- a/drivers/gpu/drm/i915/i915_vma.h
> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
>   
>   int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
>   					   struct i915_request *rq);
> -int __must_check i915_vma_move_to_active(struct i915_vma *vma,
> -					 struct i915_request *rq,
> -					 unsigned int flags);
> +int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
> +					  struct i915_request *rq,
> +					  struct dma_fence *fence,
> +					  unsigned int flags);
> +static inline int __must_check
> +i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
> +			unsigned int flags)
> +{
> +	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
> +}
>   
>   #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
>   
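
To illustrate the split above: the old entry point survives as an inline
wrapper, while the execbuf path can substitute a composite fence. A rough
sketch of the two call patterns (the composite-fence selection mirrors
what eb_move_to_gpu() is expected to do in this series; this is an
illustration, not copied from the patch):

	/* Unchanged callers: the implicit fence is the request's own. */
	err = i915_vma_move_to_active(vma, rq, flags);

	/*
	 * Parallel submits: publish one composite fence covering all
	 * requests in the submission, instead of each child's fence.
	 */
	err = _i915_vma_move_to_active(vma, rq,
				       eb->composite_fence ?
				       eb->composite_fence : &rq->fence,
				       flags);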


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 22/26] drm/i915/guc: Handle errors in multi-lrc requests
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-12 21:56     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-12 21:56 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> If an error occurs in the front end when multi-lrc requests are getting
> generated, we need to skip these in the backend but we still need to
> emit the breadcrumbs seqno. An issue arises because with multi-lrc
> breadcrumbs there is a handshake between the parent and children to make
> forward progress. If all the requests are not present, this handshake
> doesn't work. To work around this, if a multi-lrc request has an error we
> skip the handshake but still emit the breadcrumbs seqno.
>
> v2:
>   (John Harrison)
>    - Add comment explaining the skipping of the handshake logic
>    - Fix typos in the commit message
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 71 ++++++++++++++++++-
>   1 file changed, 68 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 83b0d2a114af..05e8b199e4ce 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4072,8 +4072,8 @@ static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
>   }
>   
>   static u32 *
> -emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> -						 u32 *cs)
> +__emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						   u32 *cs)
>   {
>   	struct intel_context *ce = rq->context;
>   	u8 i;
> @@ -4101,6 +4101,46 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
>   				  get_children_go_addr(ce),
>   				  0);
>   
> +	return cs;
> +}
> +
> +/*
> + * If this is true, a submission of multi-lrc requests had an error and the
> + * requests need to be skipped. The front end (execbuf IOCTL) should've called
> + * i915_request_skip which squashes the BB but we still need to emit the fini
> + * breadcrumb seqno write. At this point we don't know how many of the
> + * requests in the multi-lrc submission were generated so we can't do the
> + * handshake between the parent and children (e.g. if 4 requests should be
> + * generated but the 2nd hit an error, only 1 would be seen by the GuC backend).
> + * Simply skip the handshake, but still emit the breadcrumb seqno, if an error
> + * has occurred on any of the requests in the submission / relationship.
> + */
> +static inline bool skip_handshake(struct i915_request *rq)
> +{
> +	return test_bit(I915_FENCE_FLAG_SKIP_PARALLEL, &rq->fence.flags);
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						 u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	if (unlikely(skip_handshake(rq))) {
> +		/*
> +		 * NOP everything in
> +		 * __emit_fini_breadcrumb_parent_no_preempt_mid_batch, the -6
The line wrapping makes this look confusing. It seems like the function 
name should fit on the line before. Even if it is a few characters over 
(although the limit is now 100, not 80, I think), accepting the 
checkpatch warning is worth it for the readability of the code.

> +		 * comes of the length emission below.
-> comes from the length of the emits below.

John.

> +		 */
> +		memset(cs, 0, sizeof(u32) *
> +		       (ce->engine->emit_fini_breadcrumb_dw - 6));
> +		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
> +	} else {
> +		cs = __emit_fini_breadcrumb_parent_no_preempt_mid_batch(rq, cs);
> +	}
> +
>   	/* Emit fini breadcrumb */
>   	cs = gen8_emit_ggtt_write(cs,
>   				  rq->fence.seqno,
> @@ -4117,7 +4157,8 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
>   }
>   
>   static u32 *
> -emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> +__emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> +						  u32 *cs)
>   {
>   	struct intel_context *ce = rq->context;
>   	struct intel_context *parent = intel_context_to_parent(ce);
> @@ -4144,6 +4185,30 @@ emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs
>   	*cs++ = get_children_go_addr(parent);
>   	*cs++ = 0;
>   
> +	return cs;
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> +						u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +
> +	GEM_BUG_ON(!intel_context_is_child(ce));
> +
> +	if (unlikely(skip_handshake(rq))) {
> +		/*
> +		 * NOP everything in
> +		 * __emit_fini_breadcrumb_child_no_preempt_mid_batch, the -6
> +		 * comes from the length of the emission below.
> +		 */
> +		memset(cs, 0, sizeof(u32) *
> +		       (ce->engine->emit_fini_breadcrumb_dw - 6));
> +		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
> +	} else {
> +		cs = __emit_fini_breadcrumb_child_no_preempt_mid_batch(rq, cs);
> +	}
> +
>   	/* Emit fini breadcrumb */
>   	cs = gen8_emit_ggtt_write(cs,
>   				  rq->fence.seqno,


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 23/26] drm/i915: Make request conflict tracking understand parallel submits
  2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
@ 2021-10-12 22:08     ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-12 22:08 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniele.ceraolospurio

On 10/4/2021 15:06, Matthew Brost wrote:
> If an object in the excl or shared slot is a composite fence from a
> parallel submit and the current request in the conflict tracking is from
> the same parallel context there is no need to enforce ordering as the
> ordering already implicit. Make the request conflict tracking understand
ordering already -> ordering is already

> this by comparing the parents parallel fence values and skipping the
parents -> parent's

> conflict insertion if the values match.
Presumably, this is to cope with the fact that the parallel submit 
fences do not look like regular submission fences. And hence the 
existing code that says 'new fence belongs to same context as old fence, 
so safe to ignore' does not work with parallel submission. However, this 
change does not appear to be adding parallel submit support to an 
existing 'same context' check. It seems to be a brand new check that 
does not exist for single submission. What makes parallel submit 
different? If we aren't skipping same context fences for single submits, 
why do we need it for parallel? Conversely, if we need it for parallel 
then why don't we need it for single?

And if the single submission version is simply somewhere else in the 
code, why is the parallel version done here instead of at the same place?

John.

>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_request.c | 43 +++++++++++++++++++----------
>   1 file changed, 29 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index e9bfa32f9270..cf89624020ad 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1325,6 +1325,25 @@ i915_request_await_external(struct i915_request *rq, struct dma_fence *fence)
>   	return err;
>   }
>   
> +static inline bool is_parallel_rq(struct i915_request *rq)
> +{
> +	return intel_context_is_parallel(rq->context);
> +}
> +
> +static inline struct intel_context *request_to_parent(struct i915_request *rq)
> +{
> +	return intel_context_to_parent(rq->context);
> +}
> +
> +static bool is_same_parallel_context(struct i915_request *to,
> +				     struct i915_request *from)
> +{
> +	if (is_parallel_rq(to))
Should this not say '&& is_parallel_rq(from)'?

> +		return request_to_parent(to) == request_to_parent(from);
> +
> +	return false;
> +}
> +
>   int
>   i915_request_await_execution(struct i915_request *rq,
>   			     struct dma_fence *fence)
> @@ -1356,11 +1375,14 @@ i915_request_await_execution(struct i915_request *rq,
>   		 * want to run our callback in all cases.
>   		 */
>   
> -		if (dma_fence_is_i915(fence))
> +		if (dma_fence_is_i915(fence)) {
> +			if (is_same_parallel_context(rq, to_request(fence)))
> +				continue;
>   			ret = __i915_request_await_execution(rq,
>   							     to_request(fence));
> -		else
> +		} else {
>   			ret = i915_request_await_external(rq, fence);
> +		}
>   		if (ret < 0)
>   			return ret;
>   	} while (--nchild);
> @@ -1461,10 +1483,13 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence)
>   						 fence))
>   			continue;
>   
> -		if (dma_fence_is_i915(fence))
> +		if (dma_fence_is_i915(fence)) {
> +			if (is_same_parallel_context(rq, to_request(fence)))
> +				continue;
>   			ret = i915_request_await_request(rq, to_request(fence));
> -		else
> +		} else {
>   			ret = i915_request_await_external(rq, fence);
> +		}
>   		if (ret < 0)
>   			return ret;
>   
> @@ -1539,16 +1564,6 @@ i915_request_await_object(struct i915_request *to,
>   	return ret;
>   }
>   
> -static inline bool is_parallel_rq(struct i915_request *rq)
> -{
> -	return intel_context_is_parallel(rq->context);
> -}
> -
> -static inline struct intel_context *request_to_parent(struct i915_request *rq)
> -{
> -	return intel_context_to_parent(rq->context);
> -}
> -
>   static struct i915_request *
>   __i915_request_ensure_parallel_ordering(struct i915_request *rq,
>   					struct intel_timeline *timeline)


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx]  ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev4)
  2021-10-04 22:21 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev4) Patchwork
@ 2021-10-12 22:15   ` John Harrison
  2021-10-13  0:15     ` Matthew Brost
  0 siblings, 1 reply; 165+ messages in thread
From: John Harrison @ 2021-10-12 22:15 UTC (permalink / raw)
  To: intel-gfx, Patchwork, Matthew Brost

On 10/4/2021 15:21, Patchwork wrote:
> == Series Details ==
>
> Series: Parallel submission aka multi-bb execbuf (rev4)
> URL   : https://patchwork.freedesktop.org/series/92789/
> State : warning
>
> == Summary ==
>
> $ dim checkpatch origin/drm-tip
> e2a47a99bf9d drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
> f83d8f1539fa drm/i915/guc: Take GT PM ref when deregistering context
> -:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
> #79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
> +#define with_intel_gt_pm(gt, tmp) \
> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> +	     intel_gt_pm_put(gt), tmp = 0)
>
> -:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
> #79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
> +#define with_intel_gt_pm(gt, tmp) \
> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> +	     intel_gt_pm_put(gt), tmp = 0)
Not sure what these two are complaining about? But 'gt' and 'tmp' should 
be wrapped with parentheses when used?
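
Best guess: MACRO_ARG_REUSE is not about parentheses at all - it warns
that the argument is expanded more than once, so an argument with side
effects would be evaluated repeatedly. That is inherent to a
with_*()-style scoped-resource macro. A sketch of the hypothetical
misuse it guards against (gts[] and do_work() are stand-ins, not code
from the series):

	struct intel_gt *gts[2];
	int i = 0, tmp;

	/* Fine: plain variables, expanded twice with no side effects. */
	with_intel_gt_pm(gts[0], tmp)
		do_work(gts[0]);

	/*
	 * The warning's concern: 'gts[i++]' is expanded twice, once for
	 * intel_gt_pm_get() and once for intel_gt_pm_put(), so 'i' is
	 * bumped twice and the get and put would hit different GTs.
	 */
	with_intel_gt_pm(gts[i++], tmp)
		do_work(gts[0]);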

>
> total: 0 errors, 0 warnings, 2 checks, 290 lines checked
> 93e5284929b3 drm/i915/guc: Take engine PM when a context is pinned with GuC submission
> 4dd6554d994d drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
> 8629b55f536c drm/i915: Add logical engine mapping
> 8117ec0a1ca7 drm/i915: Expose logical engine instance to user
> aa8e1eb4dd4e drm/i915/guc: Introduce context parent-child relationship
> aaf50eacc2fd drm/i915/guc: Add multi-lrc context registration
> e5f6f50e66d1 drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
> adf21ba138f3 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
> 40ef33318b81 drm/i915/guc: Implement parallel context pin / unpin functions
> 1ad560c70346 drm/i915/guc: Implement multi-lrc submission
> -:364: CHECK:SPACING: spaces preferred around that '*' (ctx:ExV)
> #364: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:771:
> +		*wqi++ = child->ring->tail / sizeof(u64);
>   		^
This seems like a bogus warning.

>
> total: 0 errors, 0 warnings, 1 checks, 570 lines checked
> 466c01457dec drm/i915/guc: Insert submit fences between requests in parent-child relationship
> 2ece815c1f18 drm/i915/guc: Implement multi-lrc reset
> 7add5784199f drm/i915/guc: Update debugfs for GuC multi-lrc
> -:23: CHECK:LINE_SPACING: Please don't use multiple blank lines
> #23: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3707:
>   
> +
This should be fixed.

>
> total: 0 errors, 0 warnings, 1 checks, 67 lines checked
> 966991d7bbed drm/i915: Fix bug in user proto-context creation that leaked contexts
> 0eb3d3bf0c84 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
> 68c6596b649a drm/i915/doc: Update parallel submit doc to point to i915_drm.h
> -:13: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
> #13:
> deleted file mode 100644
>
> total: 0 errors, 1 warnings, 0 checks, 10 lines checked
> 8290f5d15ca2 drm/i915/guc: Add basic GuC multi-lrc selftest
> -:22: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
> #22:
> new file mode 100644
These two can be ignored.

> total: 0 errors, 1 warnings, 0 checks, 190 lines checked
> ade3768c42d5 drm/i915/guc: Implement no mid batch preemption for multi-lrc
> 57882939d788 drm/i915: Multi-BB execbuf
> -:369: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
> #369: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1854:
> +#define for_each_batch_create_order(_eb, _i) \
> +	for (_i = 0; _i < (_eb)->num_batches; ++_i)
Again, not sure what the 'reuse' comment means, but should it also use '(_i)'?

>
> -:371: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
> #371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
> +#define for_each_batch_add_order(_eb, _i) \
> +	BUILD_BUG_ON(!typecheck(int, _i)); \
> +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
This seems bogus. Wrapping it in a do/while will break the purpose!
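
For completeness, what checkpatch is asking for, and why it cannot apply
to a loop-header macro:

	/* The suggested do/while wrapping... */
	#define for_each_batch_add_order(_eb, _i) \
		do { \
			BUILD_BUG_ON(!typecheck(int, _i)); \
			for (_i = (_eb)->num_batches - 1; _i >= 0; --_i) \
		} while (0)

	/*
	 * ...does not even compile: the for statement is left with no
	 * body before the closing brace, and the caller's block
	 *
	 *	for_each_batch_add_order(eb, i) {
	 *		...
	 *	}
	 *
	 * would end up outside the do/while scope.
	 */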

>
> -:371: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
> #371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
> +#define for_each_batch_add_order(_eb, _i) \
> +	BUILD_BUG_ON(!typecheck(int, _i)); \
> +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
As above.

>
> total: 1 errors, 0 warnings, 2 checks, 1298 lines checked
> 28b699ece289 drm/i915/guc: Handle errors in multi-lrc requests
> 962e6b3dce59 drm/i915: Make request conflict tracking understand parallel submits
> 368ab12f5205 drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
> b52570f01859 drm/i915: Enable multi-bb execbuf
> 8766155832d7 drm/i915/execlists: Weak parallel submission support for execlists
>
>


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx]  ✗ Fi.CI.DOCS: warning for Parallel submission aka multi-bb execbuf (rev4)
  2021-10-04 22:26 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
@ 2021-10-12 22:15   ` John Harrison
  2021-10-13  0:12     ` Matthew Brost
  0 siblings, 1 reply; 165+ messages in thread
From: John Harrison @ 2021-10-12 22:15 UTC (permalink / raw)
  To: intel-gfx, Patchwork, Matthew Brost

On 10/4/2021 15:26, Patchwork wrote:
> == Series Details ==
>
> Series: Parallel submission aka multi-bb execbuf (rev4)
> URL   : https://patchwork.freedesktop.org/series/92789/
> State : warning
>
> == Summary ==
>
> $ make htmldocs 2>&1 > /dev/null | grep i915
> ./drivers/gpu/drm/i915/gt/uc/intel_guc.h:166: warning: Function parameter or member 'submission_stall_reason' not described in 'intel_guc'
> ./drivers/gpu/drm/i915/gt/uc/intel_guc.h:166: warning: Function parameter or member 'submission_state' not described in 'intel_guc'
>
>
These seem like valid things that need to be fixed.

John.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx]  ✗ Fi.CI.DOCS: warning for Parallel submission aka multi-bb execbuf (rev4)
  2021-10-12 22:15   ` John Harrison
@ 2021-10-13  0:12     ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13  0:12 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, Patchwork

On Tue, Oct 12, 2021 at 03:15:37PM -0700, John Harrison wrote:
> On 10/4/2021 15:26, Patchwork wrote:
> > == Series Details ==
> > 
> > Series: Parallel submission aka multi-bb execbuf (rev4)
> > URL   : https://patchwork.freedesktop.org/series/92789/
> > State : warning
> > 
> > == Summary ==
> > 
> > $ make htmldocs 2>&1 > /dev/null | grep i915
> > ./drivers/gpu/drm/i915/gt/uc/intel_guc.h:166: warning: Function parameter or member 'submission_stall_reason' not described in 'intel_guc'
> > ./drivers/gpu/drm/i915/gt/uc/intel_guc.h:166: warning: Function parameter or member 'submission_state' not described in 'intel_guc'
> > 
> > 
> These seem like valid things that need to be fixed.
> 

Yep, already done.
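
For reference, the fix is just kernel-doc for the new members. A minimal,
hypothetical illustration of the shape (the member types and descriptions
here are placeholders, not the real intel_guc layout):

	struct intel_guc {
		/**
		 * @submission_state: sub-structure of state related to
		 * submission, protected by submission_state.lock
		 */
		struct {
			/** @lock: protects everything in submission_state */
			spinlock_t lock;
		} submission_state;
		/**
		 * @submission_stall_reason: reason why submission is
		 * stalled (placeholder description)
		 */
		const char *submission_stall_reason;
	};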

Matt

> John.
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx]  ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev4)
  2021-10-12 22:15   ` John Harrison
@ 2021-10-13  0:15     ` Matthew Brost
  2021-10-13 19:24       ` John Harrison
  0 siblings, 1 reply; 165+ messages in thread
From: Matthew Brost @ 2021-10-13  0:15 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, Patchwork

On Tue, Oct 12, 2021 at 03:15:00PM -0700, John Harrison wrote:
> On 10/4/2021 15:21, Patchwork wrote:
> > == Series Details ==
> > 
> > Series: Parallel submission aka multi-bb execbuf (rev4)
> > URL   : https://patchwork.freedesktop.org/series/92789/
> > State : warning
> > 
> > == Summary ==
> > 
> > $ dim checkpatch origin/drm-tip
> > e2a47a99bf9d drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
> > f83d8f1539fa drm/i915/guc: Take GT PM ref when deregistering context
> > -:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
> > #79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
> > +#define with_intel_gt_pm(gt, tmp) \
> > +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > +	     intel_gt_pm_put(gt), tmp = 0)
> > 
> > -:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
> > #79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
> > +#define with_intel_gt_pm(gt, tmp) \
> > +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > +	     intel_gt_pm_put(gt), tmp = 0)
> Not sure what these two are complaining about? But 'gt' and 'tmp' should be
> wrapped with parentheses when used?
> 

Not sure, but I think this one is fine.

> > 
> > total: 0 errors, 0 warnings, 2 checks, 290 lines checked
> > 93e5284929b3 drm/i915/guc: Take engine PM when a context is pinned with GuC submission
> > 4dd6554d994d drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
> > 8629b55f536c drm/i915: Add logical engine mapping
> > 8117ec0a1ca7 drm/i915: Expose logical engine instance to user
> > aa8e1eb4dd4e drm/i915/guc: Introduce context parent-child relationship
> > aaf50eacc2fd drm/i915/guc: Add multi-lrc context registration
> > e5f6f50e66d1 drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
> > adf21ba138f3 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
> > 40ef33318b81 drm/i915/guc: Implement parallel context pin / unpin functions
> > 1ad560c70346 drm/i915/guc: Implement multi-lrc submission
> > -:364: CHECK:SPACING: spaces preferred around that '*' (ctx:ExV)
> > #364: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:771:
> > +		*wqi++ = child->ring->tail / sizeof(u64);
> >   		^
> This seems like a bogus warning.
> 

Agree.

> > 
> > total: 0 errors, 0 warnings, 1 checks, 570 lines checked
> > 466c01457dec drm/i915/guc: Insert submit fences between requests in parent-child relationship
> > 2ece815c1f18 drm/i915/guc: Implement multi-lrc reset
> > 7add5784199f drm/i915/guc: Update debugfs for GuC multi-lrc
> > -:23: CHECK:LINE_SPACING: Please don't use multiple blank lines
> > #23: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3707:
> > +
> This should be fixed.
>

Done.
 
> > 
> > total: 0 errors, 0 warnings, 1 checks, 67 lines checked
> > 966991d7bbed drm/i915: Fix bug in user proto-context creation that leaked contexts
> > 0eb3d3bf0c84 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
> > 68c6596b649a drm/i915/doc: Update parallel submit doc to point to i915_drm.h
> > -:13: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
> > #13:
> > deleted file mode 100644
> > 
> > total: 0 errors, 1 warnings, 0 checks, 10 lines checked
> > 8290f5d15ca2 drm/i915/guc: Add basic GuC multi-lrc selftest
> > -:22: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
> > #22:
> > new file mode 100644
> These two can be ignored.

Agree.

> 
> > total: 0 errors, 1 warnings, 0 checks, 190 lines checked
> > ade3768c42d5 drm/i915/guc: Implement no mid batch preemption for multi-lrc
> > 57882939d788 drm/i915: Multi-BB execbuf
> > -:369: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
> > #369: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1854:
> > +#define for_each_batch_create_order(_eb, _i) \
> > +	for (_i = 0; _i < (_eb)->num_batches; ++_i)
> Again, not sure the 'reuse' comment means but should also use '(_i)'?
>

I haven't been able to figure out how to fix these ones. I think you
only need () if you deref the variable.
 
> > 
> > -:371: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
> > #371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
> > +#define for_each_batch_add_order(_eb, _i) \
> > +	BUILD_BUG_ON(!typecheck(int, _i)); \
> > +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
> This seems bogus. Wrapping it in a do/while will break the purpose!
> 

Right. Added the BUILD_BUG_ON here because I did have a bug where I used
an unsigned type with this macro, and that breaks the macro - sketched below.
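
With an unsigned index the termination test is a tautology
(process_batch() here is a stand-in, not a real function):

	unsigned int i;	/* the bug: should be int */

	/*
	 * 'i >= 0' is always true for an unsigned type: --i wraps to
	 * UINT_MAX at zero, so this loop never terminates.
	 */
	for (i = eb->num_batches - 1; i >= 0; --i)
		process_batch(eb, i);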

Matt

> > 
> > -:371: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
> > #371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
> > +#define for_each_batch_add_order(_eb, _i) \
> > +	BUILD_BUG_ON(!typecheck(int, _i)); \
> > +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
> As above.
> 
> > 
> > total: 1 errors, 0 warnings, 2 checks, 1298 lines checked
> > 28b699ece289 drm/i915/guc: Handle errors in multi-lrc requests
> > 962e6b3dce59 drm/i915: Make request conflict tracking understand parallel submits
> > 368ab12f5205 drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
> > b52570f01859 drm/i915: Enable multi-bb execbuf
> > 8766155832d7 drm/i915/execlists: Weak parallel submission support for execlists
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 22/26] drm/i915/guc: Handle errors in multi-lrc requests
  2021-10-12 21:56     ` [Intel-gfx] " John Harrison
@ 2021-10-13  0:18       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13  0:18 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Tue, Oct 12, 2021 at 02:56:36PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > If an error occurs in the front end when multi-lrc requests are getting
> > generated, we need to skip these in the backend but we still need to
> > emit the breadcrumbs seqno. An issue arises because with multi-lrc
> > breadcrumbs there is a handshake between the parent and children to make
> > forward progress. If all the requests are not present, this handshake
> > doesn't work. To work around this, if a multi-lrc request has an error we
> > skip the handshake but still emit the breadcrumbs seqno.
> > 
> > v2:
> >   (John Harrison)
> >    - Add comment explaining the skipping of the handshake logic
> >    - Fix typos in the commit message
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 71 ++++++++++++++++++-
> >   1 file changed, 68 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 83b0d2a114af..05e8b199e4ce 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -4072,8 +4072,8 @@ static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> >   }
> >   static u32 *
> > -emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > -						 u32 *cs)
> > +__emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						   u32 *cs)
> >   {
> >   	struct intel_context *ce = rq->context;
> >   	u8 i;
> > @@ -4101,6 +4101,46 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> >   				  get_children_go_addr(ce),
> >   				  0);
> > +	return cs;
> > +}
> > +
> > +/*
> > + * If this is true, a submission of multi-lrc requests had an error and the
> > + * requests need to be skipped. The front end (execbuf IOCTL) should've called
> > + * i915_request_skip which squashes the BB but we still need to emit the fini
> > + * breadcrumb seqno write. At this point we don't know how many of the
> > + * requests in the multi-lrc submission were generated so we can't do the
> > + * handshake between the parent and children (e.g. if 4 requests should be
> > + * generated but the 2nd hit an error, only 1 would be seen by the GuC backend).
> > + * Simply skip the handshake, but still emit the breadcrumb seqno, if an error
> > + * has occurred on any of the requests in the submission / relationship.
> > + */
> > +static inline bool skip_handshake(struct i915_request *rq)
> > +{
> > +	return test_bit(I915_FENCE_FLAG_SKIP_PARALLEL, &rq->fence.flags);
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	if (unlikely(skip_handshake(rq))) {
> > +		/*
> > +		 * NOP everything in
> > +		 * __emit_fini_breadcrumb_parent_no_preempt_mid_batch, the -6
> The line wrapping makes this look confusing. It seems like the function name
> should fit on the line before. Even if it is a few characters over (although
> the limit is now 100, not 80, I think), accepting the checkpatch warning is
> worth it for the readability of the code.
> 

My vi settings wrap everything at 80, but I agree it would be more
readable if __emit_fini_breadcrumb_parent_no_preempt_mid_batch were on
the previous line.

> > +		 * comes of the length emission below.
> -> comes from the length of the emits below.
>

Sure. Will fix.

Matt

> John.
> 
> > +		 */
> > +		memset(cs, 0, sizeof(u32) *
> > +		       (ce->engine->emit_fini_breadcrumb_dw - 6));
> > +		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
> > +	} else {
> > +		cs = __emit_fini_breadcrumb_parent_no_preempt_mid_batch(rq, cs);
> > +	}
> > +
> >   	/* Emit fini breadcrumb */
> >   	cs = gen8_emit_ggtt_write(cs,
> >   				  rq->fence.seqno,
> > @@ -4117,7 +4157,8 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> >   }
> >   static u32 *
> > -emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> > +__emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						  u32 *cs)
> >   {
> >   	struct intel_context *ce = rq->context;
> >   	struct intel_context *parent = intel_context_to_parent(ce);
> > @@ -4144,6 +4185,30 @@ emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs
> >   	*cs++ = get_children_go_addr(parent);
> >   	*cs++ = 0;
> > +	return cs;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	if (unlikely(skip_handshake(rq))) {
> > +		/*
> > +		 * NOP everything in
> > +		 * __emit_fini_breadcrumb_child_no_preempt_mid_batch, the -6
> > +		 * comes from the length of the emission below.
> > +		 */
> > +		memset(cs, 0, sizeof(u32) *
> > +		       (ce->engine->emit_fini_breadcrumb_dw - 6));
> > +		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
> > +	} else {
> > +		cs = __emit_fini_breadcrumb_child_no_preempt_mid_batch(rq, cs);
> > +	}
> > +
> >   	/* Emit fini breadcrumb */
> >   	cs = gen8_emit_ggtt_write(cs,
> >   				  rq->fence.seqno,
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 23/26] drm/i915: Make request conflict tracking understand parallel submits
  2021-10-12 22:08     ` [Intel-gfx] " John Harrison
@ 2021-10-13  0:32       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13  0:32 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Tue, Oct 12, 2021 at 03:08:05PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > If an object in the excl or shared slot is a composite fence from a
> > parallel submit and the current request in the conflict tracking is from
> > the same parallel context there is no need to enforce ordering as the
> > ordering already implicit. Make the request conflict tracking understand
> ordering already -> ordering is already
> 

Yep.

> > this by comparing the parents parallel fence values and skipping the
> parents -> parent's
>

Yep.

> > conflict insertion if the values match.
> Presumably, this is to cope with the fact that the parallel submit fences do
> not look like regular submission fences. And hence the existing code that
> says 'new fence belongs to same context as old fence, so safe to ignore'
> does not work with parallel submission. However, this change does not appear

Yes. The check for 'if (fence->context == rq->fence.context)' doesn't
work with parallel submission, as each rq->fence.context corresponds to
a timeline. With parallel submission, each intel_context in the parallel
submit has its own timeline (seqno), so the compare fails for different
intel_contexts within the same parallel submit. This is the reason for
the additional compare on the parallel submits' parents: if they have
the same parent, it is the same parallel submission and there is no need
to enforce additional ordering.
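
Roughly, the two checks side by side (the first paraphrases the existing
short-circuit in i915_request_await_dma_fence(); the second is the new
one from this patch):

	/*
	 * Same timeline => ordering is implicit. This fails across a
	 * parallel submit, since each child context has its own
	 * fence.context.
	 */
	if (fence->context == rq->fence.context)
		continue;

	/*
	 * Same parent => same parallel submit => ordering is already
	 * implicit via the submit fences inserted between the requests.
	 */
	if (is_same_parallel_context(rq, to_request(fence)))
		continue;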

> to be adding parallel submit support to an existing 'same context' check. It
> seems to be a brand new check that does not exist for single submission.
> What makes parallel submit different? If we aren't skipping same context
> fences for single submits, why do we need it for parallel? Conversely, if we
> need it for parallel then why don't we need it for single?
>

I'm confused by what you are asking here. The existing same context
check is fine for parallel submits - it will just return true when we
compare requests with the same intel_context, and the new additional
check is only true for parallel submissions with the same parent.

> And if the single submission version is simply somewhere else in the code,
> why do the parallel version here instead of at the same place?
>

Again I'm confused by what you are asking. We might just need to sync on
a quick call.

Matt
 
> John.
> 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_request.c | 43 +++++++++++++++++++----------
> >   1 file changed, 29 insertions(+), 14 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index e9bfa32f9270..cf89624020ad 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -1325,6 +1325,25 @@ i915_request_await_external(struct i915_request *rq, struct dma_fence *fence)
> >   	return err;
> >   }
> > +static inline bool is_parallel_rq(struct i915_request *rq)
> > +{
> > +	return intel_context_is_parallel(rq->context);
> > +}
> > +
> > +static inline struct intel_context *request_to_parent(struct i915_request *rq)
> > +{
> > +	return intel_context_to_parent(rq->context);
> > +}
> > +
> > +static bool is_same_parallel_context(struct i915_request *to,
> > +				     struct i915_request *from)
> > +{
> > +	if (is_parallel_rq(to))
> Should this not say '&& is_parallel_rq(from)'?
> 
> > +		return request_to_parent(to) == request_to_parent(from);
> > +
> > +	return false;
> > +}
> > +
> >   int
> >   i915_request_await_execution(struct i915_request *rq,
> >   			     struct dma_fence *fence)
> > @@ -1356,11 +1375,14 @@ i915_request_await_execution(struct i915_request *rq,
> >   		 * want to run our callback in all cases.
> >   		 */
> > -		if (dma_fence_is_i915(fence))
> > +		if (dma_fence_is_i915(fence)) {
> > +			if (is_same_parallel_context(rq, to_request(fence)))
> > +				continue;
> >   			ret = __i915_request_await_execution(rq,
> >   							     to_request(fence));
> > -		else
> > +		} else {
> >   			ret = i915_request_await_external(rq, fence);
> > +		}
> >   		if (ret < 0)
> >   			return ret;
> >   	} while (--nchild);
> > @@ -1461,10 +1483,13 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence)
> >   						 fence))
> >   			continue;
> > -		if (dma_fence_is_i915(fence))
> > +		if (dma_fence_is_i915(fence)) {
> > +			if (is_same_parallel_context(rq, to_request(fence)))
> > +				continue;
> >   			ret = i915_request_await_request(rq, to_request(fence));
> > -		else
> > +		} else {
> >   			ret = i915_request_await_external(rq, fence);
> > +		}
> >   		if (ret < 0)
> >   			return ret;
> > @@ -1539,16 +1564,6 @@ i915_request_await_object(struct i915_request *to,
> >   	return ret;
> >   }
> > -static inline bool is_parallel_rq(struct i915_request *rq)
> > -{
> > -	return intel_context_is_parallel(rq->context);
> > -}
> > -
> > -static inline struct intel_context *request_to_parent(struct i915_request *rq)
> > -{
> > -	return intel_context_to_parent(rq->context);
> > -}
> > -
> >   static struct i915_request *
> >   __i915_request_ensure_parallel_ordering(struct i915_request *rq,
> >   					struct intel_timeline *timeline)
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 21/26] drm/i915: Multi-BB execbuf
  2021-10-12 21:22     ` [Intel-gfx] " John Harrison
@ 2021-10-13  0:37       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13  0:37 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Tue, Oct 12, 2021 at 02:22:41PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Allow multiple batch buffers to be submitted in a single execbuf IOCTL
> > after a context has been configured with the 'set_parallel' extension.
> > The number of batches is implicit based on the context's configuration.
> > 
> > This is implemented with a series of loops. First a loop is used to find
> > all the batches, a loop to pin all the HW contexts, a loop to create all
> > the requests, a loop to submit (emit BB start, etc...) all the requests,
> > a loop to tie the requests to the VMAs they touch, and finally a loop to
> > commit the requests to the backend.
> > 
> > A composite fence is also created for the generated requests to return
> > to the user and to stick in dma resv slots.
> > 
> > No behavior from the existing IOCTL should be changed aside from when
> > throttling because the ring for a context is full, wait on the request
> throttling because the ring for -> throttling the ring because
> 
> full, wait -> full. In this situation, i915 will now wait
> 

Yep.

> > while holding the object locks.
> , previously it would have dropped the locks for the wait.
> 
> And maybe explain why this change is necessary?
>

We could drop the lock, but it would make the code way more complicated,
and IMO simpler code far outweighs the potential benefit of dropping the
lock. Dropping the lock was probably a premature optimization that
landed in the code without any data backing up the claim that it helped
in any meaningful way. I can add a comment stating this.
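
For reference, the new throttle path roughly does the following (a
condensed sketch of eb_pin_timeline() from the hunk below; the timeline
lock handling and the unwind on error are omitted, and 'throttle_sketch'
is an invented name):

	static int throttle_sketch(struct i915_execbuffer *eb,
				   struct intel_context *ce)
	{
		struct i915_request *rq;

		/* ww object locks are held by the caller and stay held */
		rq = eb_throttle(eb, ce); /* oldest rq still holding ring space */
		if (rq) {
			bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
			long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;

			if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
					      timeout) < 0) {
				i915_request_put(rq);
				return nonblock ? -EWOULDBLOCK : -EINTR;
			}
			i915_request_put(rq);
		}

		return 0;
	}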

> 
> > 
> > IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> > media UMD: https://github.com/intel/media-driver/pull/1252
> > 
> > v2:
> >   (Matthew Brost)
> >    - Return proper error value if i915_request_create fails
> > v3:
> >   (John Harrison)
> >    - Add comment explaining create / add order loops + locking
> >    - Update commit message explaining difference in IOCTL behavior
> >    - Line wrap some comments
> >    - eb_add_request returns void
> >    - Return -EINVAL rather than triggering BUG_ON if cmd parser used
> >   (Checkpatch)
> >    - Check eb->batch_len[*current_batch]
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 793 ++++++++++++------
> >   drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |  10 +
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
> >   drivers/gpu/drm/i915/i915_request.h           |   9 +
> >   drivers/gpu/drm/i915/i915_vma.c               |  21 +-
> >   drivers/gpu/drm/i915/i915_vma.h               |  13 +-
> >   7 files changed, 599 insertions(+), 257 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > index 2f2434b52317..5c7fb6f68bbb 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > @@ -244,17 +244,25 @@ struct i915_execbuffer {
> >   	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
> >   	struct eb_vma *vma;
> > -	struct intel_engine_cs *engine; /** engine to queue the request to */
> > +	struct intel_gt *gt; /* gt for the execbuf */
> >   	struct intel_context *context; /* logical state for the request */
> >   	struct i915_gem_context *gem_context; /** caller's context */
> > -	struct i915_request *request; /** our request to build */
> > -	struct eb_vma *batch; /** identity of the batch obj/vma */
> > +	/** our requests to build */
> > +	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
> > +	/** identity of the batch obj/vma */
> > +	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1];
> >   	struct i915_vma *trampoline; /** trampoline used for chaining */
> > +	/** used for excl fence in dma_resv objects when > 1 BB submitted */
> > +	struct dma_fence *composite_fence;
> > +
> >   	/** actual size of execobj[] as we may extend it for the cmdparser */
> >   	unsigned int buffer_count;
> > +	/* number of batches in execbuf IOCTL */
> > +	unsigned int num_batches;
> > +
> >   	/** list of vma not yet bound during reservation phase */
> >   	struct list_head unbound;
> > @@ -281,7 +289,8 @@ struct i915_execbuffer {
> >   	u64 invalid_flags; /** Set of execobj.flags that are invalid */
> > -	u64 batch_len; /** Length of batch within object */
> > +	/** Length of batch within object */
> > +	u64 batch_len[MAX_ENGINE_INSTANCE + 1];
> >   	u32 batch_start_offset; /** Location within object of batch */
> >   	u32 batch_flags; /** Flags composed for emit_bb_start() */
> >   	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
> > @@ -299,14 +308,13 @@ struct i915_execbuffer {
> >   };
> >   static int eb_parse(struct i915_execbuffer *eb);
> > -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
> > -					  bool throttle);
> > +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
> >   static void eb_unpin_engine(struct i915_execbuffer *eb);
> >   static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
> >   {
> > -	return intel_engine_requires_cmd_parser(eb->engine) ||
> > -		(intel_engine_using_cmd_parser(eb->engine) &&
> > +	return intel_engine_requires_cmd_parser(eb->context->engine) ||
> > +		(intel_engine_using_cmd_parser(eb->context->engine) &&
> >   		 eb->args->batch_len);
> >   }
> > @@ -544,11 +552,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
> >   	return 0;
> >   }
> > -static void
> > +static inline bool
> > +is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
> > +{
> > +	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
> > +		buffer_idx < eb->num_batches :
> > +		buffer_idx >= eb->args->buffer_count - eb->num_batches;
> > +}
> > +
> > +static int
> >   eb_add_vma(struct i915_execbuffer *eb,
> > -	   unsigned int i, unsigned batch_idx,
> > +	   unsigned int *current_batch,
> > +	   unsigned int i,
> >   	   struct i915_vma *vma)
> >   {
> > +	struct drm_i915_private *i915 = eb->i915;
> >   	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
> >   	struct eb_vma *ev = &eb->vma[i];
> > @@ -575,15 +593,41 @@ eb_add_vma(struct i915_execbuffer *eb,
> >   	 * Note that actual hangs have only been observed on gen7, but for
> >   	 * paranoia do it everywhere.
> >   	 */
> > -	if (i == batch_idx) {
> > +	if (is_batch_buffer(eb, i)) {
> >   		if (entry->relocation_count &&
> >   		    !(ev->flags & EXEC_OBJECT_PINNED))
> >   			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
> >   		if (eb->reloc_cache.has_fence)
> >   			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
> > -		eb->batch = ev;
> > +		eb->batches[*current_batch] = ev;
> > +
> > +		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
> > +			drm_dbg(&i915->drm,
> > +				"Attempting to use self-modifying batch buffer\n");
> > +			return -EINVAL;
> > +		}
> > +
> > +		if (range_overflows_t(u64,
> > +				      eb->batch_start_offset,
> > +				      eb->args->batch_len,
> > +				      ev->vma->size)) {
> > +			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> > +			return -EINVAL;
> > +		}
> > +
> > +		if (eb->args->batch_len == 0)
> > +			eb->batch_len[*current_batch] = ev->vma->size -
> > +				eb->batch_start_offset;
> > +		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
> > +			drm_dbg(&i915->drm, "Invalid batch length\n");
> > +			return -EINVAL;
> > +		}
> > +
> > +		++*current_batch;
> >   	}
> > +
> > +	return 0;
> >   }
> >   static inline int use_cpu_reloc(const struct reloc_cache *cache,
> > @@ -727,14 +771,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
> >   	} while (1);
> >   }
> > -static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
> > -{
> > -	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
> > -		return 0;
> > -	else
> > -		return eb->buffer_count - 1;
> > -}
> > -
> >   static int eb_select_context(struct i915_execbuffer *eb)
> >   {
> >   	struct i915_gem_context *ctx;
> > @@ -839,9 +875,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
> >   static int eb_lookup_vmas(struct i915_execbuffer *eb)
> >   {
> > -	struct drm_i915_private *i915 = eb->i915;
> > -	unsigned int batch = eb_batch_index(eb);
> > -	unsigned int i;
> > +	unsigned int i, current_batch = 0;
> >   	int err = 0;
> >   	INIT_LIST_HEAD(&eb->relocs);
> > @@ -861,7 +895,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
> >   			goto err;
> >   		}
> > -		eb_add_vma(eb, i, batch, vma);
> > +		err = eb_add_vma(eb, &current_batch, i, vma);
> > +		if (err)
> > +			return err;
> >   		if (i915_gem_object_is_userptr(vma->obj)) {
> >   			err = i915_gem_object_userptr_submit_init(vma->obj);
> > @@ -884,26 +920,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
> >   		}
> >   	}
> > -	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
> > -		drm_dbg(&i915->drm,
> > -			"Attempting to use self-modifying batch buffer\n");
> > -		return -EINVAL;
> > -	}
> > -
> > -	if (range_overflows_t(u64,
> > -			      eb->batch_start_offset, eb->batch_len,
> > -			      eb->batch->vma->size)) {
> > -		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> > -		return -EINVAL;
> > -	}
> > -
> > -	if (eb->batch_len == 0)
> > -		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
> > -	if (unlikely(eb->batch_len == 0)) { /* impossible! */
> > -		drm_dbg(&i915->drm, "Invalid batch length\n");
> > -		return -EINVAL;
> > -	}
> > -
> >   	return 0;
> >   err:
> > @@ -1636,8 +1652,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
> >   	return 0;
> >   }
> > -static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> > -					   struct i915_request *rq)
> > +static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
> >   {
> >   	bool have_copy = false;
> >   	struct eb_vma *ev;
> > @@ -1653,21 +1668,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> >   	eb_release_vmas(eb, false);
> >   	i915_gem_ww_ctx_fini(&eb->ww);
> > -	if (rq) {
> > -		/* nonblocking is always false */
> > -		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> > -				      MAX_SCHEDULE_TIMEOUT) < 0) {
> > -			i915_request_put(rq);
> > -			rq = NULL;
> > -
> > -			err = -EINTR;
> > -			goto err_relock;
> > -		}
> > -
> > -		i915_request_put(rq);
> > -		rq = NULL;
> > -	}
> > -
> >   	/*
> >   	 * We take 3 passes through the slowpatch.
> >   	 *
> > @@ -1694,28 +1694,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> >   	if (!err)
> >   		err = eb_reinit_userptr(eb);
> > -err_relock:
> >   	i915_gem_ww_ctx_init(&eb->ww, true);
> >   	if (err)
> >   		goto out;
> >   	/* reacquire the objects */
> >   repeat_validate:
> > -	rq = eb_pin_engine(eb, false);
> > -	if (IS_ERR(rq)) {
> > -		err = PTR_ERR(rq);
> > -		rq = NULL;
> > +	err = eb_pin_engine(eb, false);
> > +	if (err)
> >   		goto err;
> > -	}
> > -
> > -	/* We didn't throttle, should be NULL */
> > -	GEM_WARN_ON(rq);
> >   	err = eb_validate_vmas(eb);
> >   	if (err)
> >   		goto err;
> > -	GEM_BUG_ON(!eb->batch);
> > +	GEM_BUG_ON(!eb->batches[0]);
> >   	list_for_each_entry(ev, &eb->relocs, reloc_link) {
> >   		if (!have_copy) {
> > @@ -1779,46 +1772,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> >   		}
> >   	}
> > -	if (rq)
> > -		i915_request_put(rq);
> > -
> >   	return err;
> >   }
> >   static int eb_relocate_parse(struct i915_execbuffer *eb)
> >   {
> >   	int err;
> > -	struct i915_request *rq = NULL;
> >   	bool throttle = true;
> >   retry:
> > -	rq = eb_pin_engine(eb, throttle);
> > -	if (IS_ERR(rq)) {
> > -		err = PTR_ERR(rq);
> > -		rq = NULL;
> > +	err = eb_pin_engine(eb, throttle);
> > +	if (err) {
> >   		if (err != -EDEADLK)
> >   			return err;
> >   		goto err;
> >   	}
> > -	if (rq) {
> > -		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> > -
> > -		/* Need to drop all locks now for throttling, take slowpath */
> > -		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
> > -		if (err == -ETIME) {
> > -			if (nonblock) {
> > -				err = -EWOULDBLOCK;
> > -				i915_request_put(rq);
> > -				goto err;
> > -			}
> > -			goto slow;
> > -		}
> > -		i915_request_put(rq);
> > -		rq = NULL;
> > -	}
> > -
> >   	/* only throttle once, even if we didn't need to throttle */
> >   	throttle = false;
> > @@ -1858,7 +1828,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
> >   	return err;
> >   slow:
> > -	err = eb_relocate_parse_slow(eb, rq);
> > +	err = eb_relocate_parse_slow(eb);
> >   	if (err)
> >   		/*
> >   		 * If the user expects the execobject.offset and
> > @@ -1872,11 +1842,40 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
> >   	return err;
> >   }
> > +/*
> > + * Using two helper loops for the order of which requests / batches are created
> > + * and added the to backend. Requests are created in order from the parent to
> > + * the last child. Requests are add in the reverse order, from the last child to
> > + * parent. This is down from locking reasons as the timeline lock is acquired
> down from -> done for
> 

Yep.

Matt

> John.
> 
> > + * during request creation and released when the request is added to the
> > + * backend. To make lockdep happy (see intel_context_timeline_lock) this must be
> > + * the ordering.
> > + */
> > +#define for_each_batch_create_order(_eb, _i) \
> > +	for (_i = 0; _i < (_eb)->num_batches; ++_i)
> > +#define for_each_batch_add_order(_eb, _i) \
> > +	BUILD_BUG_ON(!typecheck(int, _i)); \
> > +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
> > +
> > +static struct i915_request *
> > +eb_find_first_request_added(struct i915_execbuffer *eb)
> > +{
> > +	int i;
> > +
> > +	for_each_batch_add_order(eb, i)
> > +		if (eb->requests[i])
> > +			return eb->requests[i];
> > +
> > +	GEM_BUG_ON("Request not found");
> > +
> > +	return NULL;
> > +}
> > +
> >   static int eb_move_to_gpu(struct i915_execbuffer *eb)
> >   {
> >   	const unsigned int count = eb->buffer_count;
> >   	unsigned int i = count;
> > -	int err = 0;
> > +	int err = 0, j;
> >   	while (i--) {
> >   		struct eb_vma *ev = &eb->vma[i];
> > @@ -1889,11 +1888,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
> >   		if (flags & EXEC_OBJECT_CAPTURE) {
> >   			struct i915_capture_list *capture;
> > -			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> > -			if (capture) {
> > -				capture->next = eb->request->capture_list;
> > -				capture->vma = vma;
> > -				eb->request->capture_list = capture;
> > +			for_each_batch_create_order(eb, j) {
> > +				if (!eb->requests[j])
> > +					break;
> > +
> > +				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> > +				if (capture) {
> > +					capture->next =
> > +						eb->requests[j]->capture_list;
> > +					capture->vma = vma;
> > +					eb->requests[j]->capture_list = capture;
> > +				}
> >   			}
> >   		}
> > @@ -1914,14 +1919,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
> >   				flags &= ~EXEC_OBJECT_ASYNC;
> >   		}
> > +		/* We only need to await on the first request */
> >   		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
> >   			err = i915_request_await_object
> > -				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
> > +				(eb_find_first_request_added(eb), obj,
> > +				 flags & EXEC_OBJECT_WRITE);
> >   		}
> > -		if (err == 0)
> > -			err = i915_vma_move_to_active(vma, eb->request,
> > -						      flags | __EXEC_OBJECT_NO_RESERVE);
> > +		for_each_batch_add_order(eb, j) {
> > +			if (err)
> > +				break;
> > +			if (!eb->requests[j])
> > +				continue;
> > +
> > +			err = _i915_vma_move_to_active(vma, eb->requests[j],
> > +						       j ? NULL :
> > +						       eb->composite_fence ?
> > +						       eb->composite_fence :
> > +						       &eb->requests[j]->fence,
> > +						       flags | __EXEC_OBJECT_NO_RESERVE);
> > +		}
> >   	}
> >   #ifdef CONFIG_MMU_NOTIFIER
> > @@ -1952,11 +1969,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
> >   		goto err_skip;
> >   	/* Unconditionally flush any chipset caches (for streaming writes). */
> > -	intel_gt_chipset_flush(eb->engine->gt);
> > +	intel_gt_chipset_flush(eb->gt);
> >   	return 0;
> >   err_skip:
> > -	i915_request_set_error_once(eb->request, err);
> > +	for_each_batch_create_order(eb, j) {
> > +		if (!eb->requests[j])
> > +			break;
> > +
> > +		i915_request_set_error_once(eb->requests[j], err);
> > +	}
> >   	return err;
> >   }
> > @@ -2051,14 +2073,17 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   	int err;
> >   	if (!eb_use_cmdparser(eb)) {
> > -		batch = eb_dispatch_secure(eb, eb->batch->vma);
> > +		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
> >   		if (IS_ERR(batch))
> >   			return PTR_ERR(batch);
> >   		goto secure_batch;
> >   	}
> > -	len = eb->batch_len;
> > +	if (intel_context_is_parallel(eb->context))
> > +		return -EINVAL;
> > +
> > +	len = eb->batch_len[0];
> >   	if (!CMDPARSER_USES_GGTT(eb->i915)) {
> >   		/*
> >   		 * ppGTT backed shadow buffers must be mapped RO, to prevent
> > @@ -2072,11 +2097,11 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   	} else {
> >   		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
> >   	}
> > -	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
> > +	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
> >   		return -EINVAL;
> >   	if (!pool) {
> > -		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
> > +		pool = intel_gt_get_buffer_pool(eb->gt, len,
> >   						I915_MAP_WB);
> >   		if (IS_ERR(pool))
> >   			return PTR_ERR(pool);
> > @@ -2101,7 +2126,7 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   		trampoline = shadow;
> >   		shadow = shadow_batch_pin(eb, pool->obj,
> > -					  &eb->engine->gt->ggtt->vm,
> > +					  &eb->gt->ggtt->vm,
> >   					  PIN_GLOBAL);
> >   		if (IS_ERR(shadow)) {
> >   			err = PTR_ERR(shadow);
> > @@ -2123,26 +2148,29 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   	if (err)
> >   		goto err_trampoline;
> > -	err = intel_engine_cmd_parser(eb->engine,
> > -				      eb->batch->vma,
> > +	err = intel_engine_cmd_parser(eb->context->engine,
> > +				      eb->batches[0]->vma,
> >   				      eb->batch_start_offset,
> > -				      eb->batch_len,
> > +				      eb->batch_len[0],
> >   				      shadow, trampoline);
> >   	if (err)
> >   		goto err_unpin_batch;
> > -	eb->batch = &eb->vma[eb->buffer_count++];
> > -	eb->batch->vma = i915_vma_get(shadow);
> > -	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> > +	eb->batches[0] = &eb->vma[eb->buffer_count++];
> > +	eb->batches[0]->vma = i915_vma_get(shadow);
> > +	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
> >   	eb->trampoline = trampoline;
> >   	eb->batch_start_offset = 0;
> >   secure_batch:
> >   	if (batch) {
> > -		eb->batch = &eb->vma[eb->buffer_count++];
> > -		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> > -		eb->batch->vma = i915_vma_get(batch);
> > +		if (intel_context_is_parallel(eb->context))
> > +			return -EINVAL;
> > +
> > +		eb->batches[0] = &eb->vma[eb->buffer_count++];
> > +		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
> > +		eb->batches[0]->vma = i915_vma_get(batch);
> >   	}
> >   	return 0;
> > @@ -2158,19 +2186,18 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   	return err;
> >   }
> > -static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> > +static int eb_request_submit(struct i915_execbuffer *eb,
> > +			     struct i915_request *rq,
> > +			     struct i915_vma *batch,
> > +			     u64 batch_len)
> >   {
> >   	int err;
> > -	if (intel_context_nopreempt(eb->context))
> > -		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
> > -
> > -	err = eb_move_to_gpu(eb);
> > -	if (err)
> > -		return err;
> > +	if (intel_context_nopreempt(rq->context))
> > +		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
> >   	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
> > -		err = i915_reset_gen7_sol_offsets(eb->request);
> > +		err = i915_reset_gen7_sol_offsets(rq);
> >   		if (err)
> >   			return err;
> >   	}
> > @@ -2181,26 +2208,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> >   	 * allows us to determine if the batch is still waiting on the GPU
> >   	 * or actually running by checking the breadcrumb.
> >   	 */
> > -	if (eb->engine->emit_init_breadcrumb) {
> > -		err = eb->engine->emit_init_breadcrumb(eb->request);
> > +	if (rq->context->engine->emit_init_breadcrumb) {
> > +		err = rq->context->engine->emit_init_breadcrumb(rq);
> >   		if (err)
> >   			return err;
> >   	}
> > -	err = eb->engine->emit_bb_start(eb->request,
> > -					batch->node.start +
> > -					eb->batch_start_offset,
> > -					eb->batch_len,
> > -					eb->batch_flags);
> > +	err = rq->context->engine->emit_bb_start(rq,
> > +						 batch->node.start +
> > +						 eb->batch_start_offset,
> > +						 batch_len,
> > +						 eb->batch_flags);
> >   	if (err)
> >   		return err;
> >   	if (eb->trampoline) {
> > +		GEM_BUG_ON(intel_context_is_parallel(rq->context));
> >   		GEM_BUG_ON(eb->batch_start_offset);
> > -		err = eb->engine->emit_bb_start(eb->request,
> > -						eb->trampoline->node.start +
> > -						eb->batch_len,
> > -						0, 0);
> > +		err = rq->context->engine->emit_bb_start(rq,
> > +							 eb->trampoline->node.start +
> > +							 batch_len, 0, 0);
> >   		if (err)
> >   			return err;
> >   	}
> > @@ -2208,6 +2235,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> >   	return 0;
> >   }
> > +static int eb_submit(struct i915_execbuffer *eb)
> > +{
> > +	unsigned int i;
> > +	int err;
> > +
> > +	err = eb_move_to_gpu(eb);
> > +
> > +	for_each_batch_create_order(eb, i) {
> > +		if (!eb->requests[i])
> > +			break;
> > +
> > +		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
> > +		if (!err)
> > +			err = eb_request_submit(eb, eb->requests[i],
> > +						eb->batches[i]->vma,
> > +						eb->batch_len[i]);
> > +	}
> > +
> > +	return err;
> > +}
> > +
> >   static int num_vcs_engines(const struct drm_i915_private *i915)
> >   {
> >   	return hweight_long(VDBOX_MASK(&i915->gt));
> > @@ -2273,26 +2321,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
> >   	return i915_request_get(rq);
> >   }
> > -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> > +static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
> > +			   bool throttle)
> >   {
> > -	struct intel_context *ce = eb->context;
> >   	struct intel_timeline *tl;
> > -	struct i915_request *rq = NULL;
> > -	int err;
> > -
> > -	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> > -
> > -	if (unlikely(intel_context_is_banned(ce)))
> > -		return ERR_PTR(-EIO);
> > -
> > -	/*
> > -	 * Pinning the contexts may generate requests in order to acquire
> > -	 * GGTT space, so do this first before we reserve a seqno for
> > -	 * ourselves.
> > -	 */
> > -	err = intel_context_pin_ww(ce, &eb->ww);
> > -	if (err)
> > -		return ERR_PTR(err);
> > +	struct i915_request *rq;
> >   	/*
> >   	 * Take a local wakeref for preparing to dispatch the execbuf as
> > @@ -2303,33 +2336,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
> >   	 * taken on the engine, and the parent device.
> >   	 */
> >   	tl = intel_context_timeline_lock(ce);
> > -	if (IS_ERR(tl)) {
> > -		intel_context_unpin(ce);
> > -		return ERR_CAST(tl);
> > -	}
> > +	if (IS_ERR(tl))
> > +		return PTR_ERR(tl);
> >   	intel_context_enter(ce);
> >   	if (throttle)
> >   		rq = eb_throttle(eb, ce);
> >   	intel_context_timeline_unlock(tl);
> > +	if (rq) {
> > +		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> > +		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
> > +
> > +		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> > +				      timeout) < 0) {
> > +			i915_request_put(rq);
> > +
> > +			tl = intel_context_timeline_lock(ce);
> > +			intel_context_exit(ce);
> > +			intel_context_timeline_unlock(tl);
> > +
> > +			if (nonblock)
> > +				return -EWOULDBLOCK;
> > +			else
> > +				return -EINTR;
> > +		}
> > +		i915_request_put(rq);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> > +{
> > +	struct intel_context *ce = eb->context, *child;
> > +	int err;
> > +	int i = 0, j = 0;
> > +
> > +	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> > +
> > +	if (unlikely(intel_context_is_banned(ce)))
> > +		return -EIO;
> > +
> > +	/*
> > +	 * Pinning the contexts may generate requests in order to acquire
> > +	 * GGTT space, so do this first before we reserve a seqno for
> > +	 * ourselves.
> > +	 */
> > +	err = intel_context_pin_ww(ce, &eb->ww);
> > +	if (err)
> > +		return err;
> > +	for_each_child(ce, child) {
> > +		err = intel_context_pin_ww(child, &eb->ww);
> > +		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
> > +	}
> > +
> > +	for_each_child(ce, child) {
> > +		err = eb_pin_timeline(eb, child, throttle);
> > +		if (err)
> > +			goto unwind;
> > +		++i;
> > +	}
> > +	err = eb_pin_timeline(eb, ce, throttle);
> > +	if (err)
> > +		goto unwind;
> > +
> >   	eb->args->flags |= __EXEC_ENGINE_PINNED;
> > -	return rq;
> > +	return 0;
> > +
> > +unwind:
> > +	for_each_child(ce, child) {
> > +		if (j++ < i) {
> > +			mutex_lock(&child->timeline->mutex);
> > +			intel_context_exit(child);
> > +			mutex_unlock(&child->timeline->mutex);
> > +		}
> > +	}
> > +	for_each_child(ce, child)
> > +		intel_context_unpin(child);
> > +	intel_context_unpin(ce);
> > +	return err;
> >   }
> >   static void eb_unpin_engine(struct i915_execbuffer *eb)
> >   {
> > -	struct intel_context *ce = eb->context;
> > -	struct intel_timeline *tl = ce->timeline;
> > +	struct intel_context *ce = eb->context, *child;
> >   	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
> >   		return;
> >   	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
> > -	mutex_lock(&tl->mutex);
> > +	for_each_child(ce, child) {
> > +		mutex_lock(&child->timeline->mutex);
> > +		intel_context_exit(child);
> > +		mutex_unlock(&child->timeline->mutex);
> > +
> > +		intel_context_unpin(child);
> > +	}
> > +
> > +	mutex_lock(&ce->timeline->mutex);
> >   	intel_context_exit(ce);
> > -	mutex_unlock(&tl->mutex);
> > +	mutex_unlock(&ce->timeline->mutex);
> >   	intel_context_unpin(ce);
> >   }
> > @@ -2380,7 +2488,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
> >   static int
> >   eb_select_engine(struct i915_execbuffer *eb)
> >   {
> > -	struct intel_context *ce;
> > +	struct intel_context *ce, *child;
> >   	unsigned int idx;
> >   	int err;
> > @@ -2393,6 +2501,20 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   	if (IS_ERR(ce))
> >   		return PTR_ERR(ce);
> > +	if (intel_context_is_parallel(ce)) {
> > +		if (eb->buffer_count < ce->parallel.number_children + 1) {
> > +			intel_context_put(ce);
> > +			return -EINVAL;
> > +		}
> > +		if (eb->batch_start_offset || eb->args->batch_len) {
> > +			intel_context_put(ce);
> > +			return -EINVAL;
> > +		}
> > +	}
> > +	eb->num_batches = ce->parallel.number_children + 1;
> > +
> > +	for_each_child(ce, child)
> > +		intel_context_get(child);
> >   	intel_gt_pm_get(ce->engine->gt);
> >   	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
> > @@ -2400,6 +2522,13 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   		if (err)
> >   			goto err;
> >   	}
> > +	for_each_child(ce, child) {
> > +		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
> > +			err = intel_context_alloc_state(child);
> > +			if (err)
> > +				goto err;
> > +		}
> > +	}
> >   	/*
> >   	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
> > @@ -2410,7 +2539,7 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   		goto err;
> >   	eb->context = ce;
> > -	eb->engine = ce->engine;
> > +	eb->gt = ce->engine->gt;
> >   	/*
> >   	 * Make sure engine pool stays alive even if we call intel_context_put
> > @@ -2421,6 +2550,8 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   err:
> >   	intel_gt_pm_put(ce->engine->gt);
> > +	for_each_child(ce, child)
> > +		intel_context_put(child);
> >   	intel_context_put(ce);
> >   	return err;
> >   }
> > @@ -2428,7 +2559,11 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   static void
> >   eb_put_engine(struct i915_execbuffer *eb)
> >   {
> > -	intel_gt_pm_put(eb->engine->gt);
> > +	struct intel_context *child;
> > +
> > +	intel_gt_pm_put(eb->gt);
> > +	for_each_child(eb->context, child)
> > +		intel_context_put(child);
> >   	intel_context_put(eb->context);
> >   }
> > @@ -2651,7 +2786,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
> >   }
> >   static int
> > -await_fence_array(struct i915_execbuffer *eb)
> > +await_fence_array(struct i915_execbuffer *eb,
> > +		  struct i915_request *rq)
> >   {
> >   	unsigned int n;
> >   	int err;
> > @@ -2665,8 +2801,7 @@ await_fence_array(struct i915_execbuffer *eb)
> >   		if (!eb->fences[n].dma_fence)
> >   			continue;
> > -		err = i915_request_await_dma_fence(eb->request,
> > -						   eb->fences[n].dma_fence);
> > +		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
> >   		if (err < 0)
> >   			return err;
> >   	}
> > @@ -2674,9 +2809,9 @@ await_fence_array(struct i915_execbuffer *eb)
> >   	return 0;
> >   }
> > -static void signal_fence_array(const struct i915_execbuffer *eb)
> > +static void signal_fence_array(const struct i915_execbuffer *eb,
> > +			       struct dma_fence * const fence)
> >   {
> > -	struct dma_fence * const fence = &eb->request->fence;
> >   	unsigned int n;
> >   	for (n = 0; n < eb->num_fences; n++) {
> > @@ -2724,9 +2859,8 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
> >   			break;
> >   }
> > -static int eb_request_add(struct i915_execbuffer *eb, int err)
> > +static void eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
> >   {
> > -	struct i915_request *rq = eb->request;
> >   	struct intel_timeline * const tl = i915_request_timeline(rq);
> >   	struct i915_sched_attr attr = {};
> >   	struct i915_request *prev;
> > @@ -2741,11 +2875,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
> >   	/* Check that the context wasn't destroyed before submission */
> >   	if (likely(!intel_context_is_closed(eb->context))) {
> >   		attr = eb->gem_context->sched;
> > -	} else {
> > -		/* Serialise with context_close via the add_to_timeline */
> > -		i915_request_set_error_once(rq, -ENOENT);
> > -		__i915_request_skip(rq);
> > -		err = -ENOENT; /* override any transient errors */
> >   	}
> >   	__i915_request_queue(rq, &attr);
> > @@ -2755,6 +2884,42 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
> >   		retire_requests(tl, prev);
> >   	mutex_unlock(&tl->mutex);
> > +}
> > +
> > +static int eb_requests_add(struct i915_execbuffer *eb, int err)
> > +{
> > +	int i;
> > +
> > +	/*
> > +	 * We iterate in reverse order of creation to release timeline mutexes in
> > +	 * same order.
> > +	 */
> > +	for_each_batch_add_order(eb, i) {
> > +		struct i915_request *rq = eb->requests[i];
> > +
> > +		if (!rq)
> > +			continue;
> > +
> > +		if (unlikely(intel_context_is_closed(eb->context))) {
> > +			/* Serialise with context_close via the add_to_timeline */
> > +			i915_request_set_error_once(rq, -ENOENT);
> > +			__i915_request_skip(rq);
> > +			err = -ENOENT; /* override any transient errors */
> > +		}
> > +
> > +		if (intel_context_is_parallel(eb->context)) {
> > +			if (err) {
> > +				__i915_request_skip(rq);
> > +				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
> > +					&rq->fence.flags);
> > +			}
> > +			if (i == 0)
> > +				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
> > +					&rq->fence.flags);
> > +		}
> > +
> > +		eb_request_add(eb, rq);
> > +	}
> >   	return err;
> >   }
> > @@ -2785,6 +2950,182 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
> >   				    eb);
> >   }
> > +static void eb_requests_get(struct i915_execbuffer *eb)
> > +{
> > +	unsigned int i;
> > +
> > +	for_each_batch_create_order(eb, i) {
> > +		if (!eb->requests[i])
> > +			break;
> > +
> > +		i915_request_get(eb->requests[i]);
> > +	}
> > +}
> > +
> > +static void eb_requests_put(struct i915_execbuffer *eb)
> > +{
> > +	unsigned int i;
> > +
> > +	for_each_batch_create_order(eb, i) {
> > +		if (!eb->requests[i])
> > +			break;
> > +
> > +		i915_request_put(eb->requests[i]);
> > +	}
> > +}
> > +
> > +static struct sync_file *
> > +eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
> > +{
> > +	struct sync_file *out_fence = NULL;
> > +	struct dma_fence_array *fence_array;
> > +	struct dma_fence **fences;
> > +	unsigned int i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(eb->context));
> > +
> > +	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
> > +	if (!fences)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	for_each_batch_create_order(eb, i)
> > +		fences[i] = &eb->requests[i]->fence;
> > +
> > +	fence_array = dma_fence_array_create(eb->num_batches,
> > +					     fences,
> > +					     eb->context->parallel.fence_context,
> > +					     eb->context->parallel.seqno,
> > +					     false);
> > +	if (!fence_array) {
> > +		kfree(fences);
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	/* Move ownership to the dma_fence_array created above */
> > +	for_each_batch_create_order(eb, i)
> > +		dma_fence_get(fences[i]);
> > +
> > +	if (out_fence_fd != -1) {
> > +		out_fence = sync_file_create(&fence_array->base);
> > +		/* sync_file now owns fence_array, drop creation ref */
> > +		dma_fence_put(&fence_array->base);
> > +		if (!out_fence)
> > +			return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	eb->composite_fence = &fence_array->base;
> > +
> > +	return out_fence;
> > +}
> > +
> > +static struct sync_file *
> > +eb_fences_add(struct i915_execbuffer *eb, struct i915_request *rq,
> > +	      struct dma_fence *in_fence, int out_fence_fd)
> > +{
> > +	struct sync_file *out_fence = NULL;
> > +	int err;
> > +
> > +	if (unlikely(eb->gem_context->syncobj)) {
> > +		struct dma_fence *fence;
> > +
> > +		fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
> > +		err = i915_request_await_dma_fence(rq, fence);
> > +		dma_fence_put(fence);
> > +		if (err)
> > +			return ERR_PTR(err);
> > +	}
> > +
> > +	if (in_fence) {
> > +		if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
> > +			err = i915_request_await_execution(rq, in_fence);
> > +		else
> > +			err = i915_request_await_dma_fence(rq, in_fence);
> > +		if (err < 0)
> > +			return ERR_PTR(err);
> > +	}
> > +
> > +	if (eb->fences) {
> > +		err = await_fence_array(eb, rq);
> > +		if (err)
> > +			return ERR_PTR(err);
> > +	}
> > +
> > +	if (intel_context_is_parallel(eb->context)) {
> > +		out_fence = eb_composite_fence_create(eb, out_fence_fd);
> > +		if (IS_ERR(out_fence))
> > +			return ERR_PTR(-ENOMEM);
> > +	} else if (out_fence_fd != -1) {
> > +		out_fence = sync_file_create(&rq->fence);
> > +		if (!out_fence)
> > +			return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	return out_fence;
> > +}
> > +
> > +static struct intel_context *
> > +eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
> > +{
> > +	struct intel_context *child;
> > +
> > +	if (likely(context_number == 0))
> > +		return eb->context;
> > +
> > +	for_each_child(eb->context, child)
> > +		if (!--context_number)
> > +			return child;
> > +
> > +	GEM_BUG_ON("Context not found");
> > +
> > +	return NULL;
> > +}
> > +
> > +static struct sync_file *
> > +eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
> > +		   int out_fence_fd)
> > +{
> > +	struct sync_file *out_fence = NULL;
> > +	unsigned int i;
> > +
> > +	for_each_batch_create_order(eb, i) {
> > +		/* Allocate a request for this batch buffer nice and early. */
> > +		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
> > +		if (IS_ERR(eb->requests[i])) {
> > +			out_fence = ERR_PTR(PTR_ERR(eb->requests[i]));
> > +			eb->requests[i] = NULL;
> > +			return out_fence;
> > +		}
> > +
> > +		/*
> > +		 * Only the first request added (committed to backend) has to
> > +		 * take the in fences into account as all subsequent requests
> > +		 * will have fences inserted in between them.
> > +		 */
> > +		if (i + 1 == eb->num_batches) {
> > +			out_fence = eb_fences_add(eb, eb->requests[i],
> > +						  in_fence, out_fence_fd);
> > +			if (IS_ERR(out_fence))
> > +				return out_fence;
> > +		}
> > +
> > +		/*
> > +		 * Whilst this request exists, batch_obj will be on the
> > +		 * active_list, and so will hold the active reference. Only when
> > +		 * this request is retired will the batch_obj be moved onto
> > +		 * the inactive_list and lose its active reference. Hence we do
> > +		 * not need to explicitly hold another reference here.
> > +		 */
> > +		eb->requests[i]->batch = eb->batches[i]->vma;
> > +		if (eb->batch_pool) {
> > +			GEM_BUG_ON(intel_context_is_parallel(eb->context));
> > +			intel_gt_buffer_pool_mark_active(eb->batch_pool,
> > +							 eb->requests[i]);
> > +		}
> > +	}
> > +
> > +	return out_fence;
> > +}
> > +
> >   static int
> >   i915_gem_do_execbuffer(struct drm_device *dev,
> >   		       struct drm_file *file,
> > @@ -2795,7 +3136,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
> >   	struct i915_execbuffer eb;
> >   	struct dma_fence *in_fence = NULL;
> >   	struct sync_file *out_fence = NULL;
> > -	struct i915_vma *batch;
> >   	int out_fence_fd = -1;
> >   	int err;
> > @@ -2819,12 +3159,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
> >   	eb.buffer_count = args->buffer_count;
> >   	eb.batch_start_offset = args->batch_start_offset;
> > -	eb.batch_len = args->batch_len;
> >   	eb.trampoline = NULL;
> >   	eb.fences = NULL;
> >   	eb.num_fences = 0;
> > +	memset(eb.requests, 0, sizeof(struct i915_request *) *
> > +	       ARRAY_SIZE(eb.requests));
> > +	eb.composite_fence = NULL;
> > +
> >   	eb.batch_flags = 0;
> >   	if (args->flags & I915_EXEC_SECURE) {
> >   		if (GRAPHICS_VER(i915) >= 11)
> > @@ -2908,70 +3251,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
> >   	ww_acquire_done(&eb.ww.ctx);
> > -	batch = eb.batch->vma;
> > -
> > -	/* Allocate a request for this batch buffer nice and early. */
> > -	eb.request = i915_request_create(eb.context);
> > -	if (IS_ERR(eb.request)) {
> > -		err = PTR_ERR(eb.request);
> > -		goto err_vma;
> > -	}
> > -
> > -	if (unlikely(eb.gem_context->syncobj)) {
> > -		struct dma_fence *fence;
> > -
> > -		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
> > -		err = i915_request_await_dma_fence(eb.request, fence);
> > -		dma_fence_put(fence);
> > -		if (err)
> > -			goto err_ext;
> > -	}
> > -
> > -	if (in_fence) {
> > -		if (args->flags & I915_EXEC_FENCE_SUBMIT)
> > -			err = i915_request_await_execution(eb.request,
> > -							   in_fence);
> > -		else
> > -			err = i915_request_await_dma_fence(eb.request,
> > -							   in_fence);
> > -		if (err < 0)
> > -			goto err_request;
> > -	}
> > -
> > -	if (eb.fences) {
> > -		err = await_fence_array(&eb);
> > -		if (err)
> > +	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
> > +	if (IS_ERR(out_fence)) {
> > +		err = PTR_ERR(out_fence);
> > +		if (eb.requests[0])
> >   			goto err_request;
> > +		else
> > +			goto err_vma;
> >   	}
> > -	if (out_fence_fd != -1) {
> > -		out_fence = sync_file_create(&eb.request->fence);
> > -		if (!out_fence) {
> > -			err = -ENOMEM;
> > -			goto err_request;
> > -		}
> > -	}
> > -
> > -	/*
> > -	 * Whilst this request exists, batch_obj will be on the
> > -	 * active_list, and so will hold the active reference. Only when this
> > -	 * request is retired will the the batch_obj be moved onto the
> > -	 * inactive_list and lose its active reference. Hence we do not need
> > -	 * to explicitly hold another reference here.
> > -	 */
> > -	eb.request->batch = batch;
> > -	if (eb.batch_pool)
> > -		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
> > -
> > -	trace_i915_request_queue(eb.request, eb.batch_flags);
> > -	err = eb_submit(&eb, batch);
> > +	err = eb_submit(&eb);
> >   err_request:
> > -	i915_request_get(eb.request);
> > -	err = eb_request_add(&eb, err);
> > +	eb_requests_get(&eb);
> > +	err = eb_requests_add(&eb, err);
> >   	if (eb.fences)
> > -		signal_fence_array(&eb);
> > +		signal_fence_array(&eb, eb.composite_fence ?
> > +				   eb.composite_fence :
> > +				   &eb.requests[0]->fence);
> >   	if (out_fence) {
> >   		if (err == 0) {
> > @@ -2986,10 +3284,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
> >   	if (unlikely(eb.gem_context->syncobj)) {
> >   		drm_syncobj_replace_fence(eb.gem_context->syncobj,
> > -					  &eb.request->fence);
> > +					  eb.composite_fence ?
> > +					  eb.composite_fence :
> > +					  &eb.requests[0]->fence);
> >   	}
> > -	i915_request_put(eb.request);
> > +	if (!out_fence && eb.composite_fence)
> > +		dma_fence_put(eb.composite_fence);
> > +
> > +	eb_requests_put(&eb);
> >   err_vma:
> >   	eb_release_vmas(&eb, true);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index 1bc705f98e2a..1781419fa105 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -239,7 +239,13 @@ intel_context_timeline_lock(struct intel_context *ce)
> >   	struct intel_timeline *tl = ce->timeline;
> >   	int err;
> > -	err = mutex_lock_interruptible(&tl->mutex);
> > +	if (intel_context_is_parent(ce))
> > +		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
> > +	else if (intel_context_is_child(ce))
> > +		err = mutex_lock_interruptible_nested(&tl->mutex,
> > +						      ce->parallel.child_index + 1);
> > +	else
> > +		err = mutex_lock_interruptible(&tl->mutex);
> >   	if (err)
> >   		return ERR_PTR(err);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 95a5b94b4ece..9e0177dc5484 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -248,6 +248,16 @@ struct intel_context {
> >   		 * context
> >   		 */
> >   		struct i915_request *last_rq;
> > +		/**
> > +		 * @fence_context: fence context composite fence when doing
> > +		 * parallel submission
> > +		 */
> > +		u64 fence_context;
> > +		/**
> > +		 * @seqno: seqno for composite fence when doing parallel
> > +		 * submission
> > +		 */
> > +		u32 seqno;
> >   		/** @number_children: number of children if parent */
> >   		u8 number_children;
> >   		/** @child_index: index into child_list if child */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index f28e36aa77c2..83b0d2a114af 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -3094,6 +3094,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
> >   		}
> >   	}
> > +	parent->parallel.fence_context = dma_fence_context_alloc(1);
> > +
> >   	parent->engine->emit_bb_start =
> >   		emit_bb_start_parent_no_preempt_mid_batch;
> >   	parent->engine->emit_fini_breadcrumb =
> > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > index 8950785e55d6..24db8459376b 100644
> > --- a/drivers/gpu/drm/i915/i915_request.h
> > +++ b/drivers/gpu/drm/i915/i915_request.h
> > @@ -147,6 +147,15 @@ enum {
> >   	 * tail.
> >   	 */
> >   	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> > +
> > +	/*
> > +	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
> > +	 * parent-child relationship (parallel submission, multi-lrc) that
> > +	 * hit an error while generating requests in the execbuf IOCTL.
> > +	 * Indicates this request should be skipped as another request in
> > +	 * submission / relationship encountered an error.
> > +	 */
> > +	I915_FENCE_FLAG_SKIP_PARALLEL,
> >   };
> >   /**
> > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> > index 4b7fc4647e46..90546fa58fc1 100644
> > --- a/drivers/gpu/drm/i915/i915_vma.c
> > +++ b/drivers/gpu/drm/i915/i915_vma.c
> > @@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
> >   	return i915_active_add_request(&vma->active, rq);
> >   }
> > -int i915_vma_move_to_active(struct i915_vma *vma,
> > -			    struct i915_request *rq,
> > -			    unsigned int flags)
> > +int _i915_vma_move_to_active(struct i915_vma *vma,
> > +			     struct i915_request *rq,
> > +			     struct dma_fence *fence,
> > +			     unsigned int flags)
> >   {
> >   	struct drm_i915_gem_object *obj = vma->obj;
> >   	int err;
> > @@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
> >   			intel_frontbuffer_put(front);
> >   		}
> > -		dma_resv_add_excl_fence(vma->resv, &rq->fence);
> > -		obj->write_domain = I915_GEM_DOMAIN_RENDER;
> > -		obj->read_domains = 0;
> > +		if (fence) {
> > +			dma_resv_add_excl_fence(vma->resv, fence);
> > +			obj->write_domain = I915_GEM_DOMAIN_RENDER;
> > +			obj->read_domains = 0;
> > +		}
> >   	} else {
> >   		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
> >   			err = dma_resv_reserve_shared(vma->resv, 1);
> > @@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
> >   				return err;
> >   		}
> > -		dma_resv_add_shared_fence(vma->resv, &rq->fence);
> > -		obj->write_domain = 0;
> > +		if (fence) {
> > +			dma_resv_add_shared_fence(vma->resv, fence);
> > +			obj->write_domain = 0;
> > +		}
> >   	}
> >   	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
> > diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> > index ed69f66c7ab0..648dbe744c96 100644
> > --- a/drivers/gpu/drm/i915/i915_vma.h
> > +++ b/drivers/gpu/drm/i915/i915_vma.h
> > @@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
> >   int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
> >   					   struct i915_request *rq);
> > -int __must_check i915_vma_move_to_active(struct i915_vma *vma,
> > -					 struct i915_request *rq,
> > -					 unsigned int flags);
> > +int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
> > +					  struct i915_request *rq,
> > +					  struct dma_fence *fence,
> > +					  unsigned int flags);
> > +static inline int __must_check
> > +i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
> > +			unsigned int flags)
> > +{
> > +	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
> > +}
> >   #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 21/26] drm/i915: Multi-BB execbuf
@ 2021-10-13  0:37       ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13  0:37 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Tue, Oct 12, 2021 at 02:22:41PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Allow multiple batch buffers to be submitted in a single execbuf IOCTL
> > after a context has been configured with the 'set_parallel' extension.
> > The number of batches is implicit based on the context's configuration.
> > 
> > This is implemented with a series of loops. First a loop is used to find
> > all the batches, a loop to pin all the HW contexts, a loop to create all
> > the requests, a loop to submit (emit BB start, etc...) all the requests,
> > a loop to tie the requests to the VMAs they touch, and finally a loop to
> > commit the requests to the backend.
> > 
> > A composite fence is also created for the generated requests to return
> > to the user and to stick in dma resv slots.
> > 
> > No behavior from the existing IOCTL should be changed aside from when
> > throttling because the ring for a context is full, wait on the request
> throttling because the ring for -> throttling the ring because
> 
> full, wait -> full. In this situation, i915 will now wait
> 

Yep.

> > while holding the object locks.
> , previously it would have dropped the locks for the wait.
> 
> And maybe explain why this change is necessary?
>

We could drop the lock, but it would make the code way more complicated,
and IMO simpler code far outweighs the potential benefit of dropping the
lock. Dropping the lock was probably a premature optimization that
landed in the code without any data backing up the claim that it helped
in any meaningful way. I can add a comment stating this.

> 
> > 
> > IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> > media UMD: https://github.com/intel/media-driver/pull/1252
> > 
> > v2:
> >   (Matthew Brost)
> >    - Return proper error value if i915_request_create fails
> > v3:
> >   (John Harrison)
> >    - Add comment explaining create / add order loops + locking
> >    - Update commit message explaining difference in IOCTL behavior
> >    - Line wrap some comments
> >    - eb_add_request returns void
> >    - Return -EINVAL rather than triggering BUG_ON if cmd parser used
> >   (Checkpatch)
> >    - Check eb->batch_len[*current_batch]
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 793 ++++++++++++------
> >   drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |  10 +
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
> >   drivers/gpu/drm/i915/i915_request.h           |   9 +
> >   drivers/gpu/drm/i915/i915_vma.c               |  21 +-
> >   drivers/gpu/drm/i915/i915_vma.h               |  13 +-
> >   7 files changed, 599 insertions(+), 257 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > index 2f2434b52317..5c7fb6f68bbb 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > @@ -244,17 +244,25 @@ struct i915_execbuffer {
> >   	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
> >   	struct eb_vma *vma;
> > -	struct intel_engine_cs *engine; /** engine to queue the request to */
> > +	struct intel_gt *gt; /* gt for the execbuf */
> >   	struct intel_context *context; /* logical state for the request */
> >   	struct i915_gem_context *gem_context; /** caller's context */
> > -	struct i915_request *request; /** our request to build */
> > -	struct eb_vma *batch; /** identity of the batch obj/vma */
> > +	/** our requests to build */
> > +	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
> > +	/** identity of the batch obj/vma */
> > +	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1];
> >   	struct i915_vma *trampoline; /** trampoline used for chaining */
> > +	/** used for excl fence in dma_resv objects when > 1 BB submitted */
> > +	struct dma_fence *composite_fence;
> > +
> >   	/** actual size of execobj[] as we may extend it for the cmdparser */
> >   	unsigned int buffer_count;
> > +	/* number of batches in execbuf IOCTL */
> > +	unsigned int num_batches;
> > +
> >   	/** list of vma not yet bound during reservation phase */
> >   	struct list_head unbound;
> > @@ -281,7 +289,8 @@ struct i915_execbuffer {
> >   	u64 invalid_flags; /** Set of execobj.flags that are invalid */
> > -	u64 batch_len; /** Length of batch within object */
> > +	/** Length of batch within object */
> > +	u64 batch_len[MAX_ENGINE_INSTANCE + 1];
> >   	u32 batch_start_offset; /** Location within object of batch */
> >   	u32 batch_flags; /** Flags composed for emit_bb_start() */
> >   	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
> > @@ -299,14 +308,13 @@ struct i915_execbuffer {
> >   };
> >   static int eb_parse(struct i915_execbuffer *eb);
> > -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
> > -					  bool throttle);
> > +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
> >   static void eb_unpin_engine(struct i915_execbuffer *eb);
> >   static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
> >   {
> > -	return intel_engine_requires_cmd_parser(eb->engine) ||
> > -		(intel_engine_using_cmd_parser(eb->engine) &&
> > +	return intel_engine_requires_cmd_parser(eb->context->engine) ||
> > +		(intel_engine_using_cmd_parser(eb->context->engine) &&
> >   		 eb->args->batch_len);
> >   }
> > @@ -544,11 +552,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
> >   	return 0;
> >   }
> > -static void
> > +static inline bool
> > +is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
> > +{
> > +	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
> > +		buffer_idx < eb->num_batches :
> > +		buffer_idx >= eb->args->buffer_count - eb->num_batches;
> > +}
> > +
> > +static int
> >   eb_add_vma(struct i915_execbuffer *eb,
> > -	   unsigned int i, unsigned batch_idx,
> > +	   unsigned int *current_batch,
> > +	   unsigned int i,
> >   	   struct i915_vma *vma)
> >   {
> > +	struct drm_i915_private *i915 = eb->i915;
> >   	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
> >   	struct eb_vma *ev = &eb->vma[i];
> > @@ -575,15 +593,41 @@ eb_add_vma(struct i915_execbuffer *eb,
> >   	 * Note that actual hangs have only been observed on gen7, but for
> >   	 * paranoia do it everywhere.
> >   	 */
> > -	if (i == batch_idx) {
> > +	if (is_batch_buffer(eb, i)) {
> >   		if (entry->relocation_count &&
> >   		    !(ev->flags & EXEC_OBJECT_PINNED))
> >   			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
> >   		if (eb->reloc_cache.has_fence)
> >   			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
> > -		eb->batch = ev;
> > +		eb->batches[*current_batch] = ev;
> > +
> > +		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
> > +			drm_dbg(&i915->drm,
> > +				"Attempting to use self-modifying batch buffer\n");
> > +			return -EINVAL;
> > +		}
> > +
> > +		if (range_overflows_t(u64,
> > +				      eb->batch_start_offset,
> > +				      eb->args->batch_len,
> > +				      ev->vma->size)) {
> > +			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> > +			return -EINVAL;
> > +		}
> > +
> > +		if (eb->args->batch_len == 0)
> > +			eb->batch_len[*current_batch] = ev->vma->size -
> > +				eb->batch_start_offset;
> > +		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
> > +			drm_dbg(&i915->drm, "Invalid batch length\n");
> > +			return -EINVAL;
> > +		}
> > +
> > +		++*current_batch;
> >   	}
> > +
> > +	return 0;
> >   }
> >   static inline int use_cpu_reloc(const struct reloc_cache *cache,
> > @@ -727,14 +771,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
> >   	} while (1);
> >   }
> > -static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
> > -{
> > -	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
> > -		return 0;
> > -	else
> > -		return eb->buffer_count - 1;
> > -}
> > -
> >   static int eb_select_context(struct i915_execbuffer *eb)
> >   {
> >   	struct i915_gem_context *ctx;
> > @@ -839,9 +875,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
> >   static int eb_lookup_vmas(struct i915_execbuffer *eb)
> >   {
> > -	struct drm_i915_private *i915 = eb->i915;
> > -	unsigned int batch = eb_batch_index(eb);
> > -	unsigned int i;
> > +	unsigned int i, current_batch = 0;
> >   	int err = 0;
> >   	INIT_LIST_HEAD(&eb->relocs);
> > @@ -861,7 +895,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
> >   			goto err;
> >   		}
> > -		eb_add_vma(eb, i, batch, vma);
> > +		err = eb_add_vma(eb, &current_batch, i, vma);
> > +		if (err)
> > +			return err;
> >   		if (i915_gem_object_is_userptr(vma->obj)) {
> >   			err = i915_gem_object_userptr_submit_init(vma->obj);
> > @@ -884,26 +920,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
> >   		}
> >   	}
> > -	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
> > -		drm_dbg(&i915->drm,
> > -			"Attempting to use self-modifying batch buffer\n");
> > -		return -EINVAL;
> > -	}
> > -
> > -	if (range_overflows_t(u64,
> > -			      eb->batch_start_offset, eb->batch_len,
> > -			      eb->batch->vma->size)) {
> > -		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> > -		return -EINVAL;
> > -	}
> > -
> > -	if (eb->batch_len == 0)
> > -		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
> > -	if (unlikely(eb->batch_len == 0)) { /* impossible! */
> > -		drm_dbg(&i915->drm, "Invalid batch length\n");
> > -		return -EINVAL;
> > -	}
> > -
> >   	return 0;
> >   err:
> > @@ -1636,8 +1652,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
> >   	return 0;
> >   }
> > -static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> > -					   struct i915_request *rq)
> > +static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
> >   {
> >   	bool have_copy = false;
> >   	struct eb_vma *ev;
> > @@ -1653,21 +1668,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> >   	eb_release_vmas(eb, false);
> >   	i915_gem_ww_ctx_fini(&eb->ww);
> > -	if (rq) {
> > -		/* nonblocking is always false */
> > -		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> > -				      MAX_SCHEDULE_TIMEOUT) < 0) {
> > -			i915_request_put(rq);
> > -			rq = NULL;
> > -
> > -			err = -EINTR;
> > -			goto err_relock;
> > -		}
> > -
> > -		i915_request_put(rq);
> > -		rq = NULL;
> > -	}
> > -
> >   	/*
> >   	 * We take 3 passes through the slowpatch.
> >   	 *
> > @@ -1694,28 +1694,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> >   	if (!err)
> >   		err = eb_reinit_userptr(eb);
> > -err_relock:
> >   	i915_gem_ww_ctx_init(&eb->ww, true);
> >   	if (err)
> >   		goto out;
> >   	/* reacquire the objects */
> >   repeat_validate:
> > -	rq = eb_pin_engine(eb, false);
> > -	if (IS_ERR(rq)) {
> > -		err = PTR_ERR(rq);
> > -		rq = NULL;
> > +	err = eb_pin_engine(eb, false);
> > +	if (err)
> >   		goto err;
> > -	}
> > -
> > -	/* We didn't throttle, should be NULL */
> > -	GEM_WARN_ON(rq);
> >   	err = eb_validate_vmas(eb);
> >   	if (err)
> >   		goto err;
> > -	GEM_BUG_ON(!eb->batch);
> > +	GEM_BUG_ON(!eb->batches[0]);
> >   	list_for_each_entry(ev, &eb->relocs, reloc_link) {
> >   		if (!have_copy) {
> > @@ -1779,46 +1772,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> >   		}
> >   	}
> > -	if (rq)
> > -		i915_request_put(rq);
> > -
> >   	return err;
> >   }
> >   static int eb_relocate_parse(struct i915_execbuffer *eb)
> >   {
> >   	int err;
> > -	struct i915_request *rq = NULL;
> >   	bool throttle = true;
> >   retry:
> > -	rq = eb_pin_engine(eb, throttle);
> > -	if (IS_ERR(rq)) {
> > -		err = PTR_ERR(rq);
> > -		rq = NULL;
> > +	err = eb_pin_engine(eb, throttle);
> > +	if (err) {
> >   		if (err != -EDEADLK)
> >   			return err;
> >   		goto err;
> >   	}
> > -	if (rq) {
> > -		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> > -
> > -		/* Need to drop all locks now for throttling, take slowpath */
> > -		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
> > -		if (err == -ETIME) {
> > -			if (nonblock) {
> > -				err = -EWOULDBLOCK;
> > -				i915_request_put(rq);
> > -				goto err;
> > -			}
> > -			goto slow;
> > -		}
> > -		i915_request_put(rq);
> > -		rq = NULL;
> > -	}
> > -
> >   	/* only throttle once, even if we didn't need to throttle */
> >   	throttle = false;
> > @@ -1858,7 +1828,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
> >   	return err;
> >   slow:
> > -	err = eb_relocate_parse_slow(eb, rq);
> > +	err = eb_relocate_parse_slow(eb);
> >   	if (err)
> >   		/*
> >   		 * If the user expects the execobject.offset and
> > @@ -1872,11 +1842,40 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
> >   	return err;
> >   }
> > +/*
> > + * Using two helper loops for the order of which requests / batches are created
> > + * and added to the backend. Requests are created in order from the parent to
> > + * the last child. Requests are added in the reverse order, from the last child to
> > + * parent. This is down from locking reasons as the timeline lock is acquired
> down from -> done for
> 

Yep.
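
For anyone skimming the thread, the ordering described there boils down
to this (illustrative sketch only, not the exact code):

	/* create in parent -> child order, each create takes a timeline lock */
	for_each_batch_create_order(eb, i)
		eb->requests[i] = i915_request_create(eb_find_context(eb, i));

	/* add in child -> parent order, each add releases a timeline lock */
	for_each_batch_add_order(eb, i)
		eb_request_add(eb, eb->requests[i]);

Creating the parent first and adding the last child first releases the
nested timeline mutexes in the reverse order they were taken, which is
what the lockdep annotations in intel_context_timeline_lock expect.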

Matt

> John.
> 
> > + * during request creation and released when the request is added to the
> > + * backend. To make lockdep happy (see intel_context_timeline_lock) this must be
> > + * the ordering.
> > + */
> > +#define for_each_batch_create_order(_eb, _i) \
> > +	for (_i = 0; _i < (_eb)->num_batches; ++_i)
> > +#define for_each_batch_add_order(_eb, _i) \
> > +	BUILD_BUG_ON(!typecheck(int, _i)); \
> > +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
> > +
> > +static struct i915_request *
> > +eb_find_first_request_added(struct i915_execbuffer *eb)
> > +{
> > +	int i;
> > +
> > +	for_each_batch_add_order(eb, i)
> > +		if (eb->requests[i])
> > +			return eb->requests[i];
> > +
> > +	GEM_BUG_ON("Request not found");
> > +
> > +	return NULL;
> > +}
> > +
> >   static int eb_move_to_gpu(struct i915_execbuffer *eb)
> >   {
> >   	const unsigned int count = eb->buffer_count;
> >   	unsigned int i = count;
> > -	int err = 0;
> > +	int err = 0, j;
> >   	while (i--) {
> >   		struct eb_vma *ev = &eb->vma[i];
> > @@ -1889,11 +1888,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
> >   		if (flags & EXEC_OBJECT_CAPTURE) {
> >   			struct i915_capture_list *capture;
> > -			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> > -			if (capture) {
> > -				capture->next = eb->request->capture_list;
> > -				capture->vma = vma;
> > -				eb->request->capture_list = capture;
> > +			for_each_batch_create_order(eb, j) {
> > +				if (!eb->requests[j])
> > +					break;
> > +
> > +				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> > +				if (capture) {
> > +					capture->next =
> > +						eb->requests[j]->capture_list;
> > +					capture->vma = vma;
> > +					eb->requests[j]->capture_list = capture;
> > +				}
> >   			}
> >   		}
> > @@ -1914,14 +1919,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
> >   				flags &= ~EXEC_OBJECT_ASYNC;
> >   		}
> > +		/* We only need to await on the first request */
> >   		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
> >   			err = i915_request_await_object
> > -				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
> > +				(eb_find_first_request_added(eb), obj,
> > +				 flags & EXEC_OBJECT_WRITE);
> >   		}
> > -		if (err == 0)
> > -			err = i915_vma_move_to_active(vma, eb->request,
> > -						      flags | __EXEC_OBJECT_NO_RESERVE);
> > +		for_each_batch_add_order(eb, j) {
> > +			if (err)
> > +				break;
> > +			if (!eb->requests[j])
> > +				continue;
> > +
> > +			err = _i915_vma_move_to_active(vma, eb->requests[j],
> > +						       j ? NULL :
> > +						       eb->composite_fence ?
> > +						       eb->composite_fence :
> > +						       &eb->requests[j]->fence,
> > +						       flags | __EXEC_OBJECT_NO_RESERVE);
> > +		}
> >   	}
> >   #ifdef CONFIG_MMU_NOTIFIER
> > @@ -1952,11 +1969,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
> >   		goto err_skip;
> >   	/* Unconditionally flush any chipset caches (for streaming writes). */
> > -	intel_gt_chipset_flush(eb->engine->gt);
> > +	intel_gt_chipset_flush(eb->gt);
> >   	return 0;
> >   err_skip:
> > -	i915_request_set_error_once(eb->request, err);
> > +	for_each_batch_create_order(eb, j) {
> > +		if (!eb->requests[j])
> > +			break;
> > +
> > +		i915_request_set_error_once(eb->requests[j], err);
> > +	}
> >   	return err;
> >   }
> > @@ -2051,14 +2073,17 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   	int err;
> >   	if (!eb_use_cmdparser(eb)) {
> > -		batch = eb_dispatch_secure(eb, eb->batch->vma);
> > +		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
> >   		if (IS_ERR(batch))
> >   			return PTR_ERR(batch);
> >   		goto secure_batch;
> >   	}
> > -	len = eb->batch_len;
> > +	if (intel_context_is_parallel(eb->context))
> > +		return -EINVAL;
> > +
> > +	len = eb->batch_len[0];
> >   	if (!CMDPARSER_USES_GGTT(eb->i915)) {
> >   		/*
> >   		 * ppGTT backed shadow buffers must be mapped RO, to prevent
> > @@ -2072,11 +2097,11 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   	} else {
> >   		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
> >   	}
> > -	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
> > +	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
> >   		return -EINVAL;
> >   	if (!pool) {
> > -		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
> > +		pool = intel_gt_get_buffer_pool(eb->gt, len,
> >   						I915_MAP_WB);
> >   		if (IS_ERR(pool))
> >   			return PTR_ERR(pool);
> > @@ -2101,7 +2126,7 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   		trampoline = shadow;
> >   		shadow = shadow_batch_pin(eb, pool->obj,
> > -					  &eb->engine->gt->ggtt->vm,
> > +					  &eb->gt->ggtt->vm,
> >   					  PIN_GLOBAL);
> >   		if (IS_ERR(shadow)) {
> >   			err = PTR_ERR(shadow);
> > @@ -2123,26 +2148,29 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   	if (err)
> >   		goto err_trampoline;
> > -	err = intel_engine_cmd_parser(eb->engine,
> > -				      eb->batch->vma,
> > +	err = intel_engine_cmd_parser(eb->context->engine,
> > +				      eb->batches[0]->vma,
> >   				      eb->batch_start_offset,
> > -				      eb->batch_len,
> > +				      eb->batch_len[0],
> >   				      shadow, trampoline);
> >   	if (err)
> >   		goto err_unpin_batch;
> > -	eb->batch = &eb->vma[eb->buffer_count++];
> > -	eb->batch->vma = i915_vma_get(shadow);
> > -	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> > +	eb->batches[0] = &eb->vma[eb->buffer_count++];
> > +	eb->batches[0]->vma = i915_vma_get(shadow);
> > +	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
> >   	eb->trampoline = trampoline;
> >   	eb->batch_start_offset = 0;
> >   secure_batch:
> >   	if (batch) {
> > -		eb->batch = &eb->vma[eb->buffer_count++];
> > -		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> > -		eb->batch->vma = i915_vma_get(batch);
> > +		if (intel_context_is_parallel(eb->context))
> > +			return -EINVAL;
> > +
> > +		eb->batches[0] = &eb->vma[eb->buffer_count++];
> > +		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
> > +		eb->batches[0]->vma = i915_vma_get(batch);
> >   	}
> >   	return 0;
> > @@ -2158,19 +2186,18 @@ static int eb_parse(struct i915_execbuffer *eb)
> >   	return err;
> >   }
> > -static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> > +static int eb_request_submit(struct i915_execbuffer *eb,
> > +			     struct i915_request *rq,
> > +			     struct i915_vma *batch,
> > +			     u64 batch_len)
> >   {
> >   	int err;
> > -	if (intel_context_nopreempt(eb->context))
> > -		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
> > -
> > -	err = eb_move_to_gpu(eb);
> > -	if (err)
> > -		return err;
> > +	if (intel_context_nopreempt(rq->context))
> > +		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
> >   	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
> > -		err = i915_reset_gen7_sol_offsets(eb->request);
> > +		err = i915_reset_gen7_sol_offsets(rq);
> >   		if (err)
> >   			return err;
> >   	}
> > @@ -2181,26 +2208,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> >   	 * allows us to determine if the batch is still waiting on the GPU
> >   	 * or actually running by checking the breadcrumb.
> >   	 */
> > -	if (eb->engine->emit_init_breadcrumb) {
> > -		err = eb->engine->emit_init_breadcrumb(eb->request);
> > +	if (rq->context->engine->emit_init_breadcrumb) {
> > +		err = rq->context->engine->emit_init_breadcrumb(rq);
> >   		if (err)
> >   			return err;
> >   	}
> > -	err = eb->engine->emit_bb_start(eb->request,
> > -					batch->node.start +
> > -					eb->batch_start_offset,
> > -					eb->batch_len,
> > -					eb->batch_flags);
> > +	err = rq->context->engine->emit_bb_start(rq,
> > +						 batch->node.start +
> > +						 eb->batch_start_offset,
> > +						 batch_len,
> > +						 eb->batch_flags);
> >   	if (err)
> >   		return err;
> >   	if (eb->trampoline) {
> > +		GEM_BUG_ON(intel_context_is_parallel(rq->context));
> >   		GEM_BUG_ON(eb->batch_start_offset);
> > -		err = eb->engine->emit_bb_start(eb->request,
> > -						eb->trampoline->node.start +
> > -						eb->batch_len,
> > -						0, 0);
> > +		err = rq->context->engine->emit_bb_start(rq,
> > +							 eb->trampoline->node.start +
> > +							 batch_len, 0, 0);
> >   		if (err)
> >   			return err;
> >   	}
> > @@ -2208,6 +2235,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> >   	return 0;
> >   }
> > +static int eb_submit(struct i915_execbuffer *eb)
> > +{
> > +	unsigned int i;
> > +	int err;
> > +
> > +	err = eb_move_to_gpu(eb);
> > +
> > +	for_each_batch_create_order(eb, i) {
> > +		if (!eb->requests[i])
> > +			break;
> > +
> > +		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
> > +		if (!err)
> > +			err = eb_request_submit(eb, eb->requests[i],
> > +						eb->batches[i]->vma,
> > +						eb->batch_len[i]);
> > +	}
> > +
> > +	return err;
> > +}
> > +
> >   static int num_vcs_engines(const struct drm_i915_private *i915)
> >   {
> >   	return hweight_long(VDBOX_MASK(&i915->gt));
> > @@ -2273,26 +2321,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
> >   	return i915_request_get(rq);
> >   }
> > -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> > +static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
> > +			   bool throttle)
> >   {
> > -	struct intel_context *ce = eb->context;
> >   	struct intel_timeline *tl;
> > -	struct i915_request *rq = NULL;
> > -	int err;
> > -
> > -	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> > -
> > -	if (unlikely(intel_context_is_banned(ce)))
> > -		return ERR_PTR(-EIO);
> > -
> > -	/*
> > -	 * Pinning the contexts may generate requests in order to acquire
> > -	 * GGTT space, so do this first before we reserve a seqno for
> > -	 * ourselves.
> > -	 */
> > -	err = intel_context_pin_ww(ce, &eb->ww);
> > -	if (err)
> > -		return ERR_PTR(err);
> > +	struct i915_request *rq;
> >   	/*
> >   	 * Take a local wakeref for preparing to dispatch the execbuf as
> > @@ -2303,33 +2336,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
> >   	 * taken on the engine, and the parent device.
> >   	 */
> >   	tl = intel_context_timeline_lock(ce);
> > -	if (IS_ERR(tl)) {
> > -		intel_context_unpin(ce);
> > -		return ERR_CAST(tl);
> > -	}
> > +	if (IS_ERR(tl))
> > +		return PTR_ERR(tl);
> >   	intel_context_enter(ce);
> >   	if (throttle)
> >   		rq = eb_throttle(eb, ce);
> >   	intel_context_timeline_unlock(tl);
> > +	if (rq) {
> > +		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> > +		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
> > +
> > +		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> > +				      timeout) < 0) {
> > +			i915_request_put(rq);
> > +
> > +			tl = intel_context_timeline_lock(ce);
> > +			intel_context_exit(ce);
> > +			intel_context_timeline_unlock(tl);
> > +
> > +			if (nonblock)
> > +				return -EWOULDBLOCK;
> > +			else
> > +				return -EINTR;
> > +		}
> > +		i915_request_put(rq);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> > +{
> > +	struct intel_context *ce = eb->context, *child;
> > +	int err;
> > +	int i = 0, j = 0;
> > +
> > +	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> > +
> > +	if (unlikely(intel_context_is_banned(ce)))
> > +		return -EIO;
> > +
> > +	/*
> > +	 * Pinning the contexts may generate requests in order to acquire
> > +	 * GGTT space, so do this first before we reserve a seqno for
> > +	 * ourselves.
> > +	 */
> > +	err = intel_context_pin_ww(ce, &eb->ww);
> > +	if (err)
> > +		return err;
> > +	for_each_child(ce, child) {
> > +		err = intel_context_pin_ww(child, &eb->ww);
> > +		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
> > +	}
> > +
> > +	for_each_child(ce, child) {
> > +		err = eb_pin_timeline(eb, child, throttle);
> > +		if (err)
> > +			goto unwind;
> > +		++i;
> > +	}
> > +	err = eb_pin_timeline(eb, ce, throttle);
> > +	if (err)
> > +		goto unwind;
> > +
> >   	eb->args->flags |= __EXEC_ENGINE_PINNED;
> > -	return rq;
> > +	return 0;
> > +
> > +unwind:
> > +	for_each_child(ce, child) {
> > +		if (j++ < i) {
> > +			mutex_lock(&child->timeline->mutex);
> > +			intel_context_exit(child);
> > +			mutex_unlock(&child->timeline->mutex);
> > +		}
> > +	}
> > +	for_each_child(ce, child)
> > +		intel_context_unpin(child);
> > +	intel_context_unpin(ce);
> > +	return err;
> >   }
> >   static void eb_unpin_engine(struct i915_execbuffer *eb)
> >   {
> > -	struct intel_context *ce = eb->context;
> > -	struct intel_timeline *tl = ce->timeline;
> > +	struct intel_context *ce = eb->context, *child;
> >   	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
> >   		return;
> >   	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
> > -	mutex_lock(&tl->mutex);
> > +	for_each_child(ce, child) {
> > +		mutex_lock(&child->timeline->mutex);
> > +		intel_context_exit(child);
> > +		mutex_unlock(&child->timeline->mutex);
> > +
> > +		intel_context_unpin(child);
> > +	}
> > +
> > +	mutex_lock(&ce->timeline->mutex);
> >   	intel_context_exit(ce);
> > -	mutex_unlock(&tl->mutex);
> > +	mutex_unlock(&ce->timeline->mutex);
> >   	intel_context_unpin(ce);
> >   }
> > @@ -2380,7 +2488,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
> >   static int
> >   eb_select_engine(struct i915_execbuffer *eb)
> >   {
> > -	struct intel_context *ce;
> > +	struct intel_context *ce, *child;
> >   	unsigned int idx;
> >   	int err;
> > @@ -2393,6 +2501,20 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   	if (IS_ERR(ce))
> >   		return PTR_ERR(ce);
> > +	if (intel_context_is_parallel(ce)) {
> > +		if (eb->buffer_count < ce->parallel.number_children + 1) {
> > +			intel_context_put(ce);
> > +			return -EINVAL;
> > +		}
> > +		if (eb->batch_start_offset || eb->args->batch_len) {
> > +			intel_context_put(ce);
> > +			return -EINVAL;
> > +		}
> > +	}
> > +	eb->num_batches = ce->parallel.number_children + 1;
> > +
> > +	for_each_child(ce, child)
> > +		intel_context_get(child);
> >   	intel_gt_pm_get(ce->engine->gt);
> >   	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
> > @@ -2400,6 +2522,13 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   		if (err)
> >   			goto err;
> >   	}
> > +	for_each_child(ce, child) {
> > +		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
> > +			err = intel_context_alloc_state(child);
> > +			if (err)
> > +				goto err;
> > +		}
> > +	}
> >   	/*
> >   	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
> > @@ -2410,7 +2539,7 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   		goto err;
> >   	eb->context = ce;
> > -	eb->engine = ce->engine;
> > +	eb->gt = ce->engine->gt;
> >   	/*
> >   	 * Make sure engine pool stays alive even if we call intel_context_put
> > @@ -2421,6 +2550,8 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   err:
> >   	intel_gt_pm_put(ce->engine->gt);
> > +	for_each_child(ce, child)
> > +		intel_context_put(child);
> >   	intel_context_put(ce);
> >   	return err;
> >   }
> > @@ -2428,7 +2559,11 @@ eb_select_engine(struct i915_execbuffer *eb)
> >   static void
> >   eb_put_engine(struct i915_execbuffer *eb)
> >   {
> > -	intel_gt_pm_put(eb->engine->gt);
> > +	struct intel_context *child;
> > +
> > +	intel_gt_pm_put(eb->gt);
> > +	for_each_child(eb->context, child)
> > +		intel_context_put(child);
> >   	intel_context_put(eb->context);
> >   }
> > @@ -2651,7 +2786,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
> >   }
> >   static int
> > -await_fence_array(struct i915_execbuffer *eb)
> > +await_fence_array(struct i915_execbuffer *eb,
> > +		  struct i915_request *rq)
> >   {
> >   	unsigned int n;
> >   	int err;
> > @@ -2665,8 +2801,7 @@ await_fence_array(struct i915_execbuffer *eb)
> >   		if (!eb->fences[n].dma_fence)
> >   			continue;
> > -		err = i915_request_await_dma_fence(eb->request,
> > -						   eb->fences[n].dma_fence);
> > +		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
> >   		if (err < 0)
> >   			return err;
> >   	}
> > @@ -2674,9 +2809,9 @@ await_fence_array(struct i915_execbuffer *eb)
> >   	return 0;
> >   }
> > -static void signal_fence_array(const struct i915_execbuffer *eb)
> > +static void signal_fence_array(const struct i915_execbuffer *eb,
> > +			       struct dma_fence * const fence)
> >   {
> > -	struct dma_fence * const fence = &eb->request->fence;
> >   	unsigned int n;
> >   	for (n = 0; n < eb->num_fences; n++) {
> > @@ -2724,9 +2859,8 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
> >   			break;
> >   }
> > -static int eb_request_add(struct i915_execbuffer *eb, int err)
> > +static void eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
> >   {
> > -	struct i915_request *rq = eb->request;
> >   	struct intel_timeline * const tl = i915_request_timeline(rq);
> >   	struct i915_sched_attr attr = {};
> >   	struct i915_request *prev;
> > @@ -2741,11 +2875,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
> >   	/* Check that the context wasn't destroyed before submission */
> >   	if (likely(!intel_context_is_closed(eb->context))) {
> >   		attr = eb->gem_context->sched;
> > -	} else {
> > -		/* Serialise with context_close via the add_to_timeline */
> > -		i915_request_set_error_once(rq, -ENOENT);
> > -		__i915_request_skip(rq);
> > -		err = -ENOENT; /* override any transient errors */
> >   	}
> >   	__i915_request_queue(rq, &attr);
> > @@ -2755,6 +2884,42 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
> >   		retire_requests(tl, prev);
> >   	mutex_unlock(&tl->mutex);
> > +}
> > +
> > +static int eb_requests_add(struct i915_execbuffer *eb, int err)
> > +{
> > +	int i;
> > +
> > +	/*
> > +	 * We iterate in reverse order of creation to release timeline mutexes in the
> > +	 * same order.
> > +	 */
> > +	for_each_batch_add_order(eb, i) {
> > +		struct i915_request *rq = eb->requests[i];
> > +
> > +		if (!rq)
> > +			continue;
> > +
> > +		if (unlikely(intel_context_is_closed(eb->context))) {
> > +			/* Serialise with context_close via the add_to_timeline */
> > +			i915_request_set_error_once(rq, -ENOENT);
> > +			__i915_request_skip(rq);
> > +			err = -ENOENT; /* override any transient errors */
> > +		}
> > +
> > +		if (intel_context_is_parallel(eb->context)) {
> > +			if (err) {
> > +				__i915_request_skip(rq);
> > +				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
> > +					&rq->fence.flags);
> > +			}
> > +			if (i == 0)
> > +				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
> > +					&rq->fence.flags);
> > +		}
> > +
> > +		eb_request_add(eb, rq);
> > +	}
> >   	return err;
> >   }
> > @@ -2785,6 +2950,182 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
> >   				    eb);
> >   }
> > +static void eb_requests_get(struct i915_execbuffer *eb)
> > +{
> > +	unsigned int i;
> > +
> > +	for_each_batch_create_order(eb, i) {
> > +		if (!eb->requests[i])
> > +			break;
> > +
> > +		i915_request_get(eb->requests[i]);
> > +	}
> > +}
> > +
> > +static void eb_requests_put(struct i915_execbuffer *eb)
> > +{
> > +	unsigned int i;
> > +
> > +	for_each_batch_create_order(eb, i) {
> > +		if (!eb->requests[i])
> > +			break;
> > +
> > +		i915_request_put(eb->requests[i]);
> > +	}
> > +}
> > +
> > +static struct sync_file *
> > +eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
> > +{
> > +	struct sync_file *out_fence = NULL;
> > +	struct dma_fence_array *fence_array;
> > +	struct dma_fence **fences;
> > +	unsigned int i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(eb->context));
> > +
> > +	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
> > +	if (!fences)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	for_each_batch_create_order(eb, i)
> > +		fences[i] = &eb->requests[i]->fence;
> > +
> > +	fence_array = dma_fence_array_create(eb->num_batches,
> > +					     fences,
> > +					     eb->context->parallel.fence_context,
> > +					     eb->context->parallel.seqno,
> > +					     false);
> > +	if (!fence_array) {
> > +		kfree(fences);
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	/* Move ownership to the dma_fence_array created above */
> > +	for_each_batch_create_order(eb, i)
> > +		dma_fence_get(fences[i]);
> > +
> > +	if (out_fence_fd != -1) {
> > +		out_fence = sync_file_create(&fence_array->base);
> > +		/* sync_file now owns fence_array, drop creation ref */
> > +		dma_fence_put(&fence_array->base);
> > +		if (!out_fence)
> > +			return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	eb->composite_fence = &fence_array->base;
> > +
> > +	return out_fence;
> > +}
> > +
> > +static struct sync_file *
> > +eb_fences_add(struct i915_execbuffer *eb, struct i915_request *rq,
> > +	      struct dma_fence *in_fence, int out_fence_fd)
> > +{
> > +	struct sync_file *out_fence = NULL;
> > +	int err;
> > +
> > +	if (unlikely(eb->gem_context->syncobj)) {
> > +		struct dma_fence *fence;
> > +
> > +		fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
> > +		err = i915_request_await_dma_fence(rq, fence);
> > +		dma_fence_put(fence);
> > +		if (err)
> > +			return ERR_PTR(err);
> > +	}
> > +
> > +	if (in_fence) {
> > +		if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
> > +			err = i915_request_await_execution(rq, in_fence);
> > +		else
> > +			err = i915_request_await_dma_fence(rq, in_fence);
> > +		if (err < 0)
> > +			return ERR_PTR(err);
> > +	}
> > +
> > +	if (eb->fences) {
> > +		err = await_fence_array(eb, rq);
> > +		if (err)
> > +			return ERR_PTR(err);
> > +	}
> > +
> > +	if (intel_context_is_parallel(eb->context)) {
> > +		out_fence = eb_composite_fence_create(eb, out_fence_fd);
> > +		if (IS_ERR(out_fence))
> > +			return ERR_PTR(-ENOMEM);
> > +	} else if (out_fence_fd != -1) {
> > +		out_fence = sync_file_create(&rq->fence);
> > +		if (!out_fence)
> > +			return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	return out_fence;
> > +}
> > +
> > +static struct intel_context *
> > +eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
> > +{
> > +	struct intel_context *child;
> > +
> > +	if (likely(context_number == 0))
> > +		return eb->context;
> > +
> > +	for_each_child(eb->context, child)
> > +		if (!--context_number)
> > +			return child;
> > +
> > +	GEM_BUG_ON("Context not found");
> > +
> > +	return NULL;
> > +}
> > +
> > +static struct sync_file *
> > +eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
> > +		   int out_fence_fd)
> > +{
> > +	struct sync_file *out_fence = NULL;
> > +	unsigned int i;
> > +
> > +	for_each_batch_create_order(eb, i) {
> > +		/* Allocate a request for this batch buffer nice and early. */
> > +		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
> > +		if (IS_ERR(eb->requests[i])) {
> > +			out_fence = ERR_PTR(PTR_ERR(eb->requests[i]));
> > +			eb->requests[i] = NULL;
> > +			return out_fence;
> > +		}
> > +
> > +		/*
> > +		 * Only the first request added (committed to backend) has to
> > +		 * take the in-fences into account, as all subsequent requests
> > +		 * will have fences inserted in between them.
> > +		 */
> > +		if (i + 1 == eb->num_batches) {
> > +			out_fence = eb_fences_add(eb, eb->requests[i],
> > +						  in_fence, out_fence_fd);
> > +			if (IS_ERR(out_fence))
> > +				return out_fence;
> > +		}
> > +
> > +		/*
> > +		 * Whilst this request exists, batch_obj will be on the
> > +		 * active_list, and so will hold the active reference. Only when
> > +		 * this request is retired will the batch_obj be moved onto
> > +		 * the inactive_list and lose its active reference. Hence we do
> > +		 * not need to explicitly hold another reference here.
> > +		 */
> > +		eb->requests[i]->batch = eb->batches[i]->vma;
> > +		if (eb->batch_pool) {
> > +			GEM_BUG_ON(intel_context_is_parallel(eb->context));
> > +			intel_gt_buffer_pool_mark_active(eb->batch_pool,
> > +							 eb->requests[i]);
> > +		}
> > +	}
> > +
> > +	return out_fence;
> > +}
> > +
> >   static int
> >   i915_gem_do_execbuffer(struct drm_device *dev,
> >   		       struct drm_file *file,
> > @@ -2795,7 +3136,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
> >   	struct i915_execbuffer eb;
> >   	struct dma_fence *in_fence = NULL;
> >   	struct sync_file *out_fence = NULL;
> > -	struct i915_vma *batch;
> >   	int out_fence_fd = -1;
> >   	int err;
> > @@ -2819,12 +3159,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
> >   	eb.buffer_count = args->buffer_count;
> >   	eb.batch_start_offset = args->batch_start_offset;
> > -	eb.batch_len = args->batch_len;
> >   	eb.trampoline = NULL;
> >   	eb.fences = NULL;
> >   	eb.num_fences = 0;
> > +	memset(eb.requests, 0, sizeof(struct i915_request *) *
> > +	       ARRAY_SIZE(eb.requests));
> > +	eb.composite_fence = NULL;
> > +
> >   	eb.batch_flags = 0;
> >   	if (args->flags & I915_EXEC_SECURE) {
> >   		if (GRAPHICS_VER(i915) >= 11)
> > @@ -2908,70 +3251,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
> >   	ww_acquire_done(&eb.ww.ctx);
> > -	batch = eb.batch->vma;
> > -
> > -	/* Allocate a request for this batch buffer nice and early. */
> > -	eb.request = i915_request_create(eb.context);
> > -	if (IS_ERR(eb.request)) {
> > -		err = PTR_ERR(eb.request);
> > -		goto err_vma;
> > -	}
> > -
> > -	if (unlikely(eb.gem_context->syncobj)) {
> > -		struct dma_fence *fence;
> > -
> > -		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
> > -		err = i915_request_await_dma_fence(eb.request, fence);
> > -		dma_fence_put(fence);
> > -		if (err)
> > -			goto err_ext;
> > -	}
> > -
> > -	if (in_fence) {
> > -		if (args->flags & I915_EXEC_FENCE_SUBMIT)
> > -			err = i915_request_await_execution(eb.request,
> > -							   in_fence);
> > -		else
> > -			err = i915_request_await_dma_fence(eb.request,
> > -							   in_fence);
> > -		if (err < 0)
> > -			goto err_request;
> > -	}
> > -
> > -	if (eb.fences) {
> > -		err = await_fence_array(&eb);
> > -		if (err)
> > +	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
> > +	if (IS_ERR(out_fence)) {
> > +		err = PTR_ERR(out_fence);
> > +		if (eb.requests[0])
> >   			goto err_request;
> > +		else
> > +			goto err_vma;
> >   	}
> > -	if (out_fence_fd != -1) {
> > -		out_fence = sync_file_create(&eb.request->fence);
> > -		if (!out_fence) {
> > -			err = -ENOMEM;
> > -			goto err_request;
> > -		}
> > -	}
> > -
> > -	/*
> > -	 * Whilst this request exists, batch_obj will be on the
> > -	 * active_list, and so will hold the active reference. Only when this
> > -	 * request is retired will the the batch_obj be moved onto the
> > -	 * inactive_list and lose its active reference. Hence we do not need
> > -	 * to explicitly hold another reference here.
> > -	 */
> > -	eb.request->batch = batch;
> > -	if (eb.batch_pool)
> > -		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
> > -
> > -	trace_i915_request_queue(eb.request, eb.batch_flags);
> > -	err = eb_submit(&eb, batch);
> > +	err = eb_submit(&eb);
> >   err_request:
> > -	i915_request_get(eb.request);
> > -	err = eb_request_add(&eb, err);
> > +	eb_requests_get(&eb);
> > +	err = eb_requests_add(&eb, err);
> >   	if (eb.fences)
> > -		signal_fence_array(&eb);
> > +		signal_fence_array(&eb, eb.composite_fence ?
> > +				   eb.composite_fence :
> > +				   &eb.requests[0]->fence);
> >   	if (out_fence) {
> >   		if (err == 0) {
> > @@ -2986,10 +3284,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
> >   	if (unlikely(eb.gem_context->syncobj)) {
> >   		drm_syncobj_replace_fence(eb.gem_context->syncobj,
> > -					  &eb.request->fence);
> > +					  eb.composite_fence ?
> > +					  eb.composite_fence :
> > +					  &eb.requests[0]->fence);
> >   	}
> > -	i915_request_put(eb.request);
> > +	if (!out_fence && eb.composite_fence)
> > +		dma_fence_put(eb.composite_fence);
> > +
> > +	eb_requests_put(&eb);
> >   err_vma:
> >   	eb_release_vmas(&eb, true);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index 1bc705f98e2a..1781419fa105 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -239,7 +239,13 @@ intel_context_timeline_lock(struct intel_context *ce)
> >   	struct intel_timeline *tl = ce->timeline;
> >   	int err;
> > -	err = mutex_lock_interruptible(&tl->mutex);
> > +	if (intel_context_is_parent(ce))
> > +		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
> > +	else if (intel_context_is_child(ce))
> > +		err = mutex_lock_interruptible_nested(&tl->mutex,
> > +						      ce->parallel.child_index + 1);
> > +	else
> > +		err = mutex_lock_interruptible(&tl->mutex);
> >   	if (err)
> >   		return ERR_PTR(err);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 95a5b94b4ece..9e0177dc5484 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -248,6 +248,16 @@ struct intel_context {
> >   		 * context
> >   		 */
> >   		struct i915_request *last_rq;
> > +		/**
> > +		 * @fence_context: fence context of the composite fence when
> > +		 * doing parallel submission
> > +		 */
> > +		u64 fence_context;
> > +		/**
> > +		 * @seqno: seqno for composite fence when doing parallel
> > +		 * submission
> > +		 */
> > +		u32 seqno;
> >   		/** @number_children: number of children if parent */
> >   		u8 number_children;
> >   		/** @child_index: index into child_list if child */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index f28e36aa77c2..83b0d2a114af 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -3094,6 +3094,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
> >   		}
> >   	}
> > +	parent->parallel.fence_context = dma_fence_context_alloc(1);
> > +
> >   	parent->engine->emit_bb_start =
> >   		emit_bb_start_parent_no_preempt_mid_batch;
> >   	parent->engine->emit_fini_breadcrumb =
> > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > index 8950785e55d6..24db8459376b 100644
> > --- a/drivers/gpu/drm/i915/i915_request.h
> > +++ b/drivers/gpu/drm/i915/i915_request.h
> > @@ -147,6 +147,15 @@ enum {
> >   	 * tail.
> >   	 */
> >   	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> > +
> > +	/*
> > +	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
> > +	 * parent-child relationship (parallel submission, multi-lrc) that
> > +	 * hit an error while generating requests in the execbuf IOCTL.
> > +	 * Indicates this request should be skipped as another request in
> > +	 * submission / relationship encountered an error.
> > +	 */
> > +	I915_FENCE_FLAG_SKIP_PARALLEL,
> >   };
> >   /**
> > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> > index 4b7fc4647e46..90546fa58fc1 100644
> > --- a/drivers/gpu/drm/i915/i915_vma.c
> > +++ b/drivers/gpu/drm/i915/i915_vma.c
> > @@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
> >   	return i915_active_add_request(&vma->active, rq);
> >   }
> > -int i915_vma_move_to_active(struct i915_vma *vma,
> > -			    struct i915_request *rq,
> > -			    unsigned int flags)
> > +int _i915_vma_move_to_active(struct i915_vma *vma,
> > +			     struct i915_request *rq,
> > +			     struct dma_fence *fence,
> > +			     unsigned int flags)
> >   {
> >   	struct drm_i915_gem_object *obj = vma->obj;
> >   	int err;
> > @@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
> >   			intel_frontbuffer_put(front);
> >   		}
> > -		dma_resv_add_excl_fence(vma->resv, &rq->fence);
> > -		obj->write_domain = I915_GEM_DOMAIN_RENDER;
> > -		obj->read_domains = 0;
> > +		if (fence) {
> > +			dma_resv_add_excl_fence(vma->resv, fence);
> > +			obj->write_domain = I915_GEM_DOMAIN_RENDER;
> > +			obj->read_domains = 0;
> > +		}
> >   	} else {
> >   		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
> >   			err = dma_resv_reserve_shared(vma->resv, 1);
> > @@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
> >   				return err;
> >   		}
> > -		dma_resv_add_shared_fence(vma->resv, &rq->fence);
> > -		obj->write_domain = 0;
> > +		if (fence) {
> > +			dma_resv_add_shared_fence(vma->resv, fence);
> > +			obj->write_domain = 0;
> > +		}
> >   	}
> >   	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
> > diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> > index ed69f66c7ab0..648dbe744c96 100644
> > --- a/drivers/gpu/drm/i915/i915_vma.h
> > +++ b/drivers/gpu/drm/i915/i915_vma.h
> > @@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
> >   int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
> >   					   struct i915_request *rq);
> > -int __must_check i915_vma_move_to_active(struct i915_vma *vma,
> > -					 struct i915_request *rq,
> > -					 unsigned int flags);
> > +int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
> > +					  struct i915_request *rq,
> > +					  struct dma_fence *fence,
> > +					  unsigned int flags);
> > +static inline int __must_check
> > +i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
> > +			unsigned int flags)
> > +{
> > +	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
> > +}
> >   #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
> 


* Re: [PATCH 20/26] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-10-11 23:32     ` [Intel-gfx] " John Harrison
@ 2021-10-13  1:52       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13  1:52 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Mon, Oct 11, 2021 at 04:32:03PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
> > mid BB. To safely enable preemption at the BB boundary, a handshake
> > between to parent and child is needed. This is implemented via custom
> between to parent -> between parent
> > emit_bb_start & emit_fini_breadcrumb functions and enabled via by
> via by -> by
> 
> I'm also not seeing any mention of the forced re-group behavioural change in
> either the comments or commit description.
> 

Will do all of the above + mention fixing the PD size.
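
For the record, the handshake those functions implement is roughly the
following (pseudocode summary, not the actual emitted dwords):

	parent BB start:                     child BB start:
	  for each child i:                    signal join[child_index]
	    wait until join[i] signalled       wait until go signalled
	  disable preemption                   disable preemption
	  signal go                            jump to child batch
	  jump to parent batch

The fini breadcrumbs run the mirror-image handshake, which is what
opens the window to preempt between each set of BBs.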

> > default if a context is configured by set parallel extension.
> > 
> > v2:
> >   (John Harrison)
> >    - Fix a few comments wording
> >    - Add structure for parent page layout
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |   2 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 330 +++++++++++++++++-
> >   4 files changed, 324 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 3b340eb59ada..ee84259959d0 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -569,7 +569,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> >   	GEM_BUG_ON(intel_context_is_child(child));
> >   	GEM_BUG_ON(intel_context_is_parent(child));
> > -	parent->parallel.number_children++;
> > +	parent->parallel.child_index = parent->parallel.number_children++;
> >   	list_add_tail(&child->parallel.child_link,
> >   		      &parent->parallel.child_list);
> >   	child->parallel.parent = parent;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 1d880303a7e4..95a5b94b4ece 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -250,6 +250,8 @@ struct intel_context {
> >   		struct i915_request *last_rq;
> >   		/** @number_children: number of children if parent */
> >   		u8 number_children;
> > +		/** @child_index: index into child_list if child */
> > +		u8 child_index;
> >   		/** @guc: GuC specific members for parallel submission */
> >   		struct {
> >   			/** @wqi_head: head pointer in work queue */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > index a00eeddc1449..663950d3badc 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > @@ -181,7 +181,7 @@ struct guc_process_desc {
> >   	u32 wq_status;
> >   	u32 engine_presence;
> >   	u32 priority;
> > -	u32 reserved[30];
> > +	u32 reserved[36];
> Not seeing the promised explanation of this bug fix.
> 

Will add in commit message.

> >   } __packed;
> >   #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 12ee8ca76249..f28e36aa77c2 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -11,6 +11,7 @@
> >   #include "gt/intel_context.h"
> >   #include "gt/intel_engine_pm.h"
> >   #include "gt/intel_engine_heartbeat.h"
> > +#include "gt/intel_gpu_commands.h"
> >   #include "gt/intel_gt.h"
> >   #include "gt/intel_gt_irq.h"
> >   #include "gt/intel_gt_pm.h"
> > @@ -368,10 +369,16 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
> >   /*
> >    * When using multi-lrc submission an extra page in the context state is
> > - * reserved for the process descriptor and work queue.
> > + * reserved for the process descriptor, work queue, and handshake between the
> > + * parent + children contexts to insert safe preemption points between each set
> > + * of BBs.
> >    *
> >    * The layout of this page is below:
> >    * 0						guc_process_desc
> > + * + sizeof(struct guc_process_desc)		child go
> > + * + CACHELINE_BYTES				child join[0]
> > + * ...
> > + * + CACHELINE_BYTES				child join[n - 1]
> >    * ...						unused
> >    * PAGE_SIZE / 2				work queue start
> >    * ...						work queue
> > @@ -379,7 +386,25 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
> >    */
> >   #define WQ_SIZE			(PAGE_SIZE / 2)
> >   #define WQ_OFFSET		(PAGE_SIZE - WQ_SIZE)
> > -static u32 __get_process_desc_offset(struct intel_context *ce)
> > +
> > +struct parent_page {
> > +	struct guc_process_desc pdesc;
> > +
> > +	u32 child_go_memory;
> > +	u8 unused0[CACHELINE_BYTES - sizeof(u32)];
> > +
> > +	struct {
> > +		u32 child_join_memory;
> > +		u8 unused1[CACHELINE_BYTES - sizeof(u32)];
> > +	} join[MAX_ENGINE_INSTANCE + 1];
> Could have a common structure for these. Call the u32 'semaphore_memory' or
> something then just have:
>   struct sync_semaphore go;
>   struct sync_semaphore join[MAX + 1];
> 

Sure.
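
Something along these lines (sketch, final field names may differ):

	struct sync_semaphore {
		u32 semaphore;
		u8 unused[CACHELINE_BYTES - sizeof(u32)];
	};

	struct parent_page {
		struct guc_process_desc pdesc;

		struct sync_semaphore go;
		struct sync_semaphore join[MAX_ENGINE_INSTANCE + 1];

		u8 unused[WQ_OFFSET - sizeof(struct guc_process_desc) -
			  sizeof(struct sync_semaphore) *
			  (MAX_ENGINE_INSTANCE + 2)];

		u32 wq[WQ_SIZE / sizeof(u32)];
	};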

> > +
> > +	u8 unused2[(WQ_OFFSET - sizeof(struct guc_process_desc) -
> > +		    CACHELINE_BYTES * (MAX_ENGINE_INSTANCE + 2))];
> And this bit could be 'sizeof(struct sync_semaphore) *
> (MAX_ENGINE_INSTANCE + 2)' to be clearer what it refers to.
> 
> And to be totally paranoid about it, could also add
> 'BUILD_BUG_ON(sizeof(struct sync_semaphore) != CACHELINE_BYTES)'.
> 
> And 'BUILD_BUG_ON(sizeof(struct parent_page) != PAGE_SIZE)'.
>

Sure.
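
With a struct like the above both checks are one-liners:

	BUILD_BUG_ON(sizeof(struct sync_semaphore) != CACHELINE_BYTES);
	BUILD_BUG_ON(sizeof(struct parent_page) != PAGE_SIZE);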
 
> > +
> > +	u32 wq[WQ_SIZE / sizeof(u32)];
> > +};
> > +
> > +static u32 __get_parent_page_offset(struct intel_context *ce)
> >   {
> >   	GEM_BUG_ON(!ce->parallel.guc.parent_page);
> > @@ -388,23 +413,35 @@ static u32 __get_process_desc_offset(struct intel_context *ce)
> >   static u32 __get_wq_offset(struct intel_context *ce)
> >   {
> > -	return __get_process_desc_offset(ce) + WQ_OFFSET;
> > +	BUILD_BUG_ON(offsetof(struct parent_page, wq) != WQ_OFFSET);
> > +
> > +	return __get_parent_page_offset(ce) + WQ_OFFSET;
> >   }
> > -static struct guc_process_desc *
> > -__get_process_desc(struct intel_context *ce)
> > +static struct parent_page *
> > +__get_parent_page(struct intel_context *ce)
> >   {
> > +	BUILD_BUG_ON(sizeof(struct parent_page) != PAGE_SIZE);
> > +
> >   	/*
> >   	 * Need to subtract LRC_STATE_OFFSET here as the
> >   	 * parallel.guc.parent_page is the offset into ce->state while
> >   	 * ce->lrc_reg_reg is ce->state + LRC_STATE_OFFSET.
> >   	 */
> > -	return (struct guc_process_desc *)
> > +	return (struct parent_page *)
> >   		(ce->lrc_reg_state +
> > -		 ((__get_process_desc_offset(ce) -
> > +		 ((__get_parent_page_offset(ce) -
> >   		   LRC_STATE_OFFSET) / sizeof(u32)));
> >   }
> > +static struct guc_process_desc *
> > +__get_process_desc(struct intel_context *ce)
> > +{
> > +	struct parent_page *pp = __get_parent_page(ce);
> > +
> > +	return &pp->pdesc;
> > +}
> > +
> >   static u32 *get_wq_pointer(struct guc_process_desc *desc,
> >   			   struct intel_context *ce,
> >   			   u32 wqi_size)
> > @@ -424,8 +461,7 @@ static u32 *get_wq_pointer(struct guc_process_desc *desc,
> >   	}
> >   #undef AVAILABLE_SPACE
> > -	return ((u32 *)__get_process_desc(ce)) +
> > -		((WQ_OFFSET + ce->parallel.guc.wqi_tail) / sizeof(u32));
> > +	return &__get_parent_page(ce)->wq[ce->parallel.guc.wqi_tail / sizeof(u32)];
> >   }
> >   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> > @@ -1829,6 +1865,26 @@ static int deregister_context(struct intel_context *ce, u32 guc_id)
> >   	return __guc_action_deregister_context(guc, guc_id);
> >   }
> > +static inline void clear_children_join_go_memory(struct intel_context *ce)
> > +{
> > +	u32 *mem = (u32 *)(&__get_parent_page(ce)->child_go_memory);
> > +	u8 i;
> > +
> > +	for (i = 0; i < ce->parallel.number_children + 1; ++i)
> > +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
> Can't this be written as:
>   pp->child_go_memory = 0;
>   for (i = 0 to number_children)
>     pp->join[i].child_join_memory = 0;
> 
> Seems like that would be much clearer than this magic casting and
> offsetting. I mean, that was the whole point of creating the parent_page
> structure.
> 

Will rewrite.
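
Likely ending up as something like this (sketch against the current
parent_page fields):

	static inline void clear_children_join_go_memory(struct intel_context *ce)
	{
		struct parent_page *pp = __get_parent_page(ce);
		u8 i;

		pp->child_go_memory = 0;
		for (i = 0; i < ce->parallel.number_children; ++i)
			pp->join[i].child_join_memory = 0;
	}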

> 
> > +}
> > +
> > +static inline u32 get_children_go_value(struct intel_context *ce)
> > +{
> > +	return __get_parent_page(ce)->child_go_memory;
> > +}
> > +
> > +static inline u32 get_children_join_value(struct intel_context *ce,
> > +					  u8 child_index)
> > +{
> > +	return __get_parent_page(ce)->join[child_index].child_join_memory;
> > +}
> > +
> >   static void guc_context_policy_init(struct intel_engine_cs *engine,
> >   				    struct guc_lrc_desc *desc)
> >   {
> > @@ -1888,7 +1944,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   		ce->parallel.guc.wqi_head = 0;
> >   		desc->process_desc = i915_ggtt_offset(ce->state) +
> > -			__get_process_desc_offset(ce);
> > +			__get_parent_page_offset(ce);
> >   		desc->wq_addr = i915_ggtt_offset(ce->state) +
> >   			__get_wq_offset(ce);
> >   		desc->wq_size = WQ_SIZE;
> > @@ -1910,6 +1966,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> >   			guc_context_policy_init(engine, desc);
> >   		}
> > +
> > +		clear_children_join_go_memory(ce);
> >   	}
> >   	/*
> > @@ -2976,6 +3034,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
> >   	.get_sibling = guc_virtual_get_sibling,
> >   };
> > +/*
> > + * The below override of the breadcrumbs is enabled when the user configures a
> > + * context for parallel submission (multi-lrc, parent-child).
> > + *
> > + * The overridden breadcrumbs implement an algorithm which allows the GuC to
> > + * safely preempt all the hw contexts configured for parallel submission
> > + * between each BB. The contract between the i915 and GuC is that if the parent
> > + * context can be preempted, all the children can be preempted, and the GuC will
> > + * always try to preempt the parent before the children. A handshake between the
> > + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
> > + * creating a window to preempt between each set of BBs.
> > + */
> > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						     u64 offset, u32 len,
> > +						     const unsigned int flags);
> > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						    u64 offset, u32 len,
> > +						    const unsigned int flags);
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs);
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						u32 *cs);
> > +
> >   static struct intel_context *
> >   guc_create_parallel(struct intel_engine_cs **engines,
> >   		    unsigned int num_siblings,
> > @@ -3011,6 +3094,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
> >   		}
> >   	}
> > +	parent->engine->emit_bb_start =
> > +		emit_bb_start_parent_no_preempt_mid_batch;
> > +	parent->engine->emit_fini_breadcrumb =
> > +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
> > +	parent->engine->emit_fini_breadcrumb_dw =
> > +		12 + 4 * parent->parallel.number_children;
> > +	for_each_child(parent, ce) {
> > +		ce->engine->emit_bb_start =
> > +			emit_bb_start_child_no_preempt_mid_batch;
> > +		ce->engine->emit_fini_breadcrumb =
> > +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
> > +		ce->engine->emit_fini_breadcrumb_dw = 16;
> > +	}
> > +
> >   	kfree(siblings);
> >   	return parent;
> > @@ -3840,6 +3937,17 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   			drm_printf(p, "\t\tWQI Status: %u\n\n",
> >   				   READ_ONCE(desc->wq_status));
> > +			if (ce->engine->emit_bb_start ==
> > +			    emit_bb_start_parent_no_preempt_mid_batch) {
> > +				u8 i;
> > +
> > +				drm_printf(p, "\t\tChildren Go: %u\n\n",
> > +					   get_children_go_value(ce));
> > +				for (i = 0; i < ce->parallel.number_children; ++i)
> > +					drm_printf(p, "\t\tChildren Join: %u\n",
> > +						   get_children_join_value(ce, i));
> > +			}
> > +
> >   			for_each_child(ce, child)
> >   				guc_log_context(p, child);
> >   		}
> > @@ -3847,6 +3955,208 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >   }
> > +static inline u32 get_children_go_addr(struct intel_context *ce)
> > +{
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +	BUILD_BUG_ON(offsetof(struct parent_page, child_go_memory) !=
> > +		     sizeof(struct guc_process_desc));
> > +
> > +	return i915_ggtt_offset(ce->state) +
> > +		__get_parent_page_offset(ce) +
> > +		sizeof(struct guc_process_desc);
> Rather than relying on the BUILD_BUG to make sure that the magic calculation
> matches the structure definition, can't this just say "ggtt_offset +
> pp_offset + offsetof(pp, child_go)"?
> 

Probably.

> > +}
> > +
> > +static inline u32 get_children_join_addr(struct intel_context *ce,
> > +					 u8 child_index)
> > +{
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
> "ggtt_offset + pp_offset + offsetof(pp, child_join[i])"?
> 

Probably.

> 
> > +}
> > +
> > +#define PARENT_GO_BB			1
> > +#define PARENT_GO_FINI_BREADCRUMB	0
> > +#define CHILD_GO_BB			1
> > +#define CHILD_GO_FINI_BREADCRUMB	0
> > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						     u64 offset, u32 len,
> > +						     const unsigned int flags)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u32 *cs;
> > +	u8 i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	cs = intel_ring_begin(rq, 10 + 4 * ce->parallel.number_children);
> > +	if (IS_ERR(cs))
> > +		return PTR_ERR(cs);
> > +
> > +	/* Wait on children */
> > +	for (i = 0; i < ce->parallel.number_children; ++i) {
> > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > +			 MI_SEMAPHORE_POLL |
> > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > +		*cs++ = PARENT_GO_BB;
> > +		*cs++ = get_children_join_addr(ce, i);
> > +		*cs++ = 0;
> > +	}
> > +
> > +	/* Turn off preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Tell children go */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  CHILD_GO_BB,
> > +				  get_children_go_addr(ce),
> > +				  0);
> > +
> > +	/* Jump to batch */
> > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > +	*cs++ = lower_32_bits(offset);
> > +	*cs++ = upper_32_bits(offset);
> > +	*cs++ = MI_NOOP;
> > +
> > +	intel_ring_advance(rq, cs);
> > +
> > +	return 0;
> > +}
> > +
> > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						    u64 offset, u32 len,
> > +						    const unsigned int flags)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	struct intel_context *parent = intel_context_to_parent(ce);
> > +	u32 *cs;
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	cs = intel_ring_begin(rq, 12);
> > +	if (IS_ERR(cs))
> > +		return PTR_ERR(cs);
> > +
> > +	/* Signal parent */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  PARENT_GO_BB,
> > +				  get_children_join_addr(parent,
> > +							 ce->parallel.child_index),
> > +				  0);
> > +
> > +	/* Wait on parent for go */
> > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > +		 MI_SEMAPHORE_POLL |
> > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > +	*cs++ = CHILD_GO_BB;
> > +	*cs++ = get_children_go_addr(parent);
> > +	*cs++ = 0;
> > +
> > +	/* Turn off preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > +
> > +	/* Jump to batch */
> > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > +	*cs++ = lower_32_bits(offset);
> > +	*cs++ = upper_32_bits(offset);
> > +
> > +	intel_ring_advance(rq, cs);
> > +
> > +	return 0;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u8 i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	/* Wait on children */
> > +	for (i = 0; i < ce->parallel.number_children; ++i) {
> > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > +			 MI_SEMAPHORE_POLL |
> > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
> > +		*cs++ = get_children_join_addr(ce, i);
> > +		*cs++ = 0;
> > +	}
> > +
> > +	/* Turn on preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +	*cs++ = MI_NOOP;
> > +
> You mentioned possibly needing to add an MI_ARB_CHECK in here but I'm not
> seeing it. Did the testing happen? I don't see that it should be necessary.
> Once you execute the MI_ARB_ENABLE, the CS can preempt anywhere, I thought?

No, it can only preempt on certain instructions - e.g. MI_ARB_CHECK or a
semaphore.

> Even if it can't there should be an MI_ARB_CHECK added at the next level up
> after the breadcrumb code. Or do we not have those in between batches any
> more?
>

Right. A MI_ARB_CHECK before writing the fini breadcrumbs would be wrong
as we could preeempt after the batch is complete but before the fini
breadcrumbs are written so the i915 still thinks the batch (request)
isn't done. The code is fine as is. The emit BB code for the parent has
semaphore instructions where is can be preempted.

Matt
 
> John.
> 
> 
> > +	/* Tell children go */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  CHILD_GO_FINI_BREADCRUMB,
> > +				  get_children_go_addr(ce),
> > +				  0);
> > +
> > +	/* Emit fini breadcrumb */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  rq->fence.seqno,
> > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > +				  0);
> > +
> > +	/* User interrupt */
> > +	*cs++ = MI_USER_INTERRUPT;
> > +	*cs++ = MI_NOOP;
> > +
> > +	rq->tail = intel_ring_offset(rq, cs);
> > +
> > +	return cs;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	struct intel_context *parent = intel_context_to_parent(ce);
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	/* Turn on preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Signal parent */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  PARENT_GO_FINI_BREADCRUMB,
> > +				  get_children_join_addr(parent,
> > +							 ce->parallel.child_index),
> > +				  0);
> > +
> > +	/* Wait parent on for go */
> > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > +		 MI_SEMAPHORE_POLL |
> > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
> > +	*cs++ = get_children_go_addr(parent);
> > +	*cs++ = 0;
> > +
> > +	/* Emit fini breadcrumb */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  rq->fence.seqno,
> > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > +				  0);
> > +
> > +	/* User interrupt */
> > +	*cs++ = MI_USER_INTERRUPT;
> > +	*cs++ = MI_NOOP;
> > +
> > +	rq->tail = intel_ring_offset(rq, cs);
> > +
> > +	return cs;
> > +}
> > +
> >   static struct intel_context *
> >   guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> >   		   unsigned long flags)
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [Intel-gfx] [PATCH 20/26] drm/i915/guc: Implement no mid batch preemption for multi-lrc
@ 2021-10-13  1:52       ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13  1:52 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Mon, Oct 11, 2021 at 04:32:03PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
> > mid BB. To safely enable preemption at the BB boundary, a handshake
> > between to parent and child is needed. This is implemented via custom
> between to parent -> between parent
> > emit_bb_start & emit_fini_breadcrumb functions and enabled via by
> via by -> by
> 
> I'm also not seeing any mention of the forced re-group behavioural change in
> either the comments or commit description.
> 

Will do all of the above + mention fixing the PD size.

> > default if a context is configured by set parallel extension.
> > 
> > v2:
> >   (John Harrison)
> >    - Fix a few comments wording
> >    - Add struture for parent page layout
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |   2 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 330 +++++++++++++++++-
> >   4 files changed, 324 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 3b340eb59ada..ee84259959d0 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -569,7 +569,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> >   	GEM_BUG_ON(intel_context_is_child(child));
> >   	GEM_BUG_ON(intel_context_is_parent(child));
> > -	parent->parallel.number_children++;
> > +	parent->parallel.child_index = parent->parallel.number_children++;
> >   	list_add_tail(&child->parallel.child_link,
> >   		      &parent->parallel.child_list);
> >   	child->parallel.parent = parent;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 1d880303a7e4..95a5b94b4ece 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -250,6 +250,8 @@ struct intel_context {
> >   		struct i915_request *last_rq;
> >   		/** @number_children: number of children if parent */
> >   		u8 number_children;
> > +		/** @child_index: index into child_list if child */
> > +		u8 child_index;
> >   		/** @guc: GuC specific members for parallel submission */
> >   		struct {
> >   			/** @wqi_head: head pointer in work queue */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > index a00eeddc1449..663950d3badc 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > @@ -181,7 +181,7 @@ struct guc_process_desc {
> >   	u32 wq_status;
> >   	u32 engine_presence;
> >   	u32 priority;
> > -	u32 reserved[30];
> > +	u32 reserved[36];
> Not seeing the promised explanation of this bug fix.
> 

Will add in commit message.

> >   } __packed;
> >   #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 12ee8ca76249..f28e36aa77c2 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -11,6 +11,7 @@
> >   #include "gt/intel_context.h"
> >   #include "gt/intel_engine_pm.h"
> >   #include "gt/intel_engine_heartbeat.h"
> > +#include "gt/intel_gpu_commands.h"
> >   #include "gt/intel_gt.h"
> >   #include "gt/intel_gt_irq.h"
> >   #include "gt/intel_gt_pm.h"
> > @@ -368,10 +369,16 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
> >   /*
> >    * When using multi-lrc submission an extra page in the context state is
> > - * reserved for the process descriptor and work queue.
> > + * reserved for the process descriptor, work queue, and handshake between the
> > + * parent + children contexts to insert safe preemption points between each set
> > + * of BBs.
> >    *
> >    * The layout of this page is below:
> >    * 0						guc_process_desc
> > + * + sizeof(struct guc_process_desc)		child go
> > + * + CACHELINE_BYTES				child join[0]
> > + * ...
> > + * + CACHELINE_BYTES				child join[n - 1]
> >    * ...						unused
> >    * PAGE_SIZE / 2				work queue start
> >    * ...						work queue
> > @@ -379,7 +386,25 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
> >    */
> >   #define WQ_SIZE			(PAGE_SIZE / 2)
> >   #define WQ_OFFSET		(PAGE_SIZE - WQ_SIZE)
> > -static u32 __get_process_desc_offset(struct intel_context *ce)
> > +
> > +struct parent_page {
> > +	struct guc_process_desc pdesc;
> > +
> > +	u32 child_go_memory;
> > +	u8 unused0[CACHELINE_BYTES - sizeof(u32)];
> > +
> > +	struct {
> > +		u32 child_join_memory;
> > +		u8 unused1[CACHELINE_BYTES - sizeof(u32)];
> > +	} join[MAX_ENGINE_INSTANCE + 1];
> Could have a common structure for these. Call the u32 'semaphore_memory' or
> something then just have:
>   struct sync_semaphore go;
>   struct sync_semaphore join[MAX + 1];
> 

Sure.

> > +
> > +	u8 unused2[(WQ_OFFSET - sizeof(struct guc_process_desc) -
> > +		    CACHELINE_BYTES * (MAX_ENGINE_INSTANCE + 2))];
> And this bit could be 'sizeof(struct sync_semaphore) * MAX + 2' to be
> clearer what it refers to.
> 
> And to be totally paranoid about it, could also add
> 'BUILD_BUG_ON(sizeof(struct sync_semaphore) != CACHELINE_BYTES)'.
> 
> And 'BUILD_BUG_ON(sizeof(parent_page) != PARENT_PAGE_SIZE)'.
>

Sure.
 
> > +
> > +	u32 wq[WQ_SIZE / sizeof(u32)];
> > +};
> > +
> > +static u32 __get_parent_page_offset(struct intel_context *ce)
> >   {
> >   	GEM_BUG_ON(!ce->parallel.guc.parent_page);
> > @@ -388,23 +413,35 @@ static u32 __get_process_desc_offset(struct intel_context *ce)
> >   static u32 __get_wq_offset(struct intel_context *ce)
> >   {
> > -	return __get_process_desc_offset(ce) + WQ_OFFSET;
> > +	BUILD_BUG_ON(offsetof(struct parent_page, wq) != WQ_OFFSET);
> > +
> > +	return __get_parent_page_offset(ce) + WQ_OFFSET;
> >   }
> > -static struct guc_process_desc *
> > -__get_process_desc(struct intel_context *ce)
> > +static struct parent_page *
> > +__get_parent_page(struct intel_context *ce)
> >   {
> > +	BUILD_BUG_ON(sizeof(struct parent_page) != PAGE_SIZE);
> > +
> >   	/*
> >   	 * Need to subtract LRC_STATE_OFFSET here as the
> >   	 * parallel.guc.parent_page is the offset into ce->state while
> >   	 * ce->lrc_reg_reg is ce->state + LRC_STATE_OFFSET.
> >   	 */
> > -	return (struct guc_process_desc *)
> > +	return (struct parent_page *)
> >   		(ce->lrc_reg_state +
> > -		 ((__get_process_desc_offset(ce) -
> > +		 ((__get_parent_page_offset(ce) -
> >   		   LRC_STATE_OFFSET) / sizeof(u32)));
> >   }
> > +static struct guc_process_desc *
> > +__get_process_desc(struct intel_context *ce)
> > +{
> > +	struct parent_page *pp = __get_parent_page(ce);
> > +
> > +	return &pp->pdesc;
> > +}
> > +
> >   static u32 *get_wq_pointer(struct guc_process_desc *desc,
> >   			   struct intel_context *ce,
> >   			   u32 wqi_size)
> > @@ -424,8 +461,7 @@ static u32 *get_wq_pointer(struct guc_process_desc *desc,
> >   	}
> >   #undef AVAILABLE_SPACE
> > -	return ((u32 *)__get_process_desc(ce)) +
> > -		((WQ_OFFSET + ce->parallel.guc.wqi_tail) / sizeof(u32));
> > +	return &__get_parent_page(ce)->wq[ce->parallel.guc.wqi_tail / sizeof(u32)];
> >   }
> >   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> > @@ -1829,6 +1865,26 @@ static int deregister_context(struct intel_context *ce, u32 guc_id)
> >   	return __guc_action_deregister_context(guc, guc_id);
> >   }
> > +static inline void clear_children_join_go_memory(struct intel_context *ce)
> > +{
> > +	u32 *mem = (u32 *)(&__get_parent_page(ce)->child_go_memory);
> > +	u8 i;
> > +
> > +	for (i = 0; i < ce->parallel.number_children + 1; ++i)
> > +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
> Can't this be written as:
>   pp->child_go_memory = 0;
>   for(i = 0 to number_children)
>     pp->child_join_memory = 0;
> 
> Seems like that would be much clearer than this magic casting and
> offsetting. I mean, that was the whole point of creating the parent_page
> structure.
> 

Will rewrite.

> 
> > +}
> > +
> > +static inline u32 get_children_go_value(struct intel_context *ce)
> > +{
> > +	return __get_parent_page(ce)->child_go_memory;
> > +}
> > +
> > +static inline u32 get_children_join_value(struct intel_context *ce,
> > +					  u8 child_index)
> > +{
> > +	return __get_parent_page(ce)->join[child_index].child_join_memory;
> > +}
> > +
> >   static void guc_context_policy_init(struct intel_engine_cs *engine,
> >   				    struct guc_lrc_desc *desc)
> >   {
> > @@ -1888,7 +1944,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   		ce->parallel.guc.wqi_head = 0;
> >   		desc->process_desc = i915_ggtt_offset(ce->state) +
> > -			__get_process_desc_offset(ce);
> > +			__get_parent_page_offset(ce);
> >   		desc->wq_addr = i915_ggtt_offset(ce->state) +
> >   			__get_wq_offset(ce);
> >   		desc->wq_size = WQ_SIZE;
> > @@ -1910,6 +1966,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> >   			guc_context_policy_init(engine, desc);
> >   		}
> > +
> > +		clear_children_join_go_memory(ce);
> >   	}
> >   	/*
> > @@ -2976,6 +3034,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
> >   	.get_sibling = guc_virtual_get_sibling,
> >   };
> > +/*
> > + * The below override of the breadcrumbs is enabled when the user configures a
> > + * context for parallel submission (multi-lrc, parent-child).
> > + *
> > + * The overridden breadcrumbs implements an algorithm which allows the GuC to
> > + * safely preempt all the hw contexts configured for parallel submission
> > + * between each BB. The contract between the i915 and GuC is if the parent
> > + * context can be preempted, all the children can be preempted, and the GuC will
> > + * always try to preempt the parent before the children. A handshake between the
> > + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
> > + * creating a window to preempt between each set of BBs.
> > + */
> > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						     u64 offset, u32 len,
> > +						     const unsigned int flags);
> > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						    u64 offset, u32 len,
> > +						    const unsigned int flags);
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs);
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						u32 *cs);
> > +
> >   static struct intel_context *
> >   guc_create_parallel(struct intel_engine_cs **engines,
> >   		    unsigned int num_siblings,
> > @@ -3011,6 +3094,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
> >   		}
> >   	}
> > +	parent->engine->emit_bb_start =
> > +		emit_bb_start_parent_no_preempt_mid_batch;
> > +	parent->engine->emit_fini_breadcrumb =
> > +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
> > +	parent->engine->emit_fini_breadcrumb_dw =
> > +		12 + 4 * parent->parallel.number_children;
> > +	for_each_child(parent, ce) {
> > +		ce->engine->emit_bb_start =
> > +			emit_bb_start_child_no_preempt_mid_batch;
> > +		ce->engine->emit_fini_breadcrumb =
> > +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
> > +		ce->engine->emit_fini_breadcrumb_dw = 16;
> > +	}
> > +
> >   	kfree(siblings);
> >   	return parent;
> > @@ -3840,6 +3937,17 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   			drm_printf(p, "\t\tWQI Status: %u\n\n",
> >   				   READ_ONCE(desc->wq_status));
> > +			if (ce->engine->emit_bb_start ==
> > +			    emit_bb_start_parent_no_preempt_mid_batch) {
> > +				u8 i;
> > +
> > +				drm_printf(p, "\t\tChildren Go: %u\n\n",
> > +					   get_children_go_value(ce));
> > +				for (i = 0; i < ce->parallel.number_children; ++i)
> > +					drm_printf(p, "\t\tChildren Join: %u\n",
> > +						   get_children_join_value(ce, i));
> > +			}
> > +
> >   			for_each_child(ce, child)
> >   				guc_log_context(p, child);
> >   		}
> > @@ -3847,6 +3955,208 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >   }
> > +static inline u32 get_children_go_addr(struct intel_context *ce)
> > +{
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +	BUILD_BUG_ON(offsetof(struct parent_page, child_go_memory) !=
> > +		     sizeof(struct guc_process_desc));
> > +
> > +	return i915_ggtt_offset(ce->state) +
> > +		__get_parent_page_offset(ce) +
> > +		sizeof(struct guc_process_desc);
> Rather than relying on the BUILD_BUG to make sure that the magic calculation
> matches the structure definition, can't this just say "ggtt_offset +
> pp_offset + offsetof(pp, child_go)"?
> 

Probably.

> > +}
> > +
> > +static inline u32 get_children_join_addr(struct intel_context *ce,
> > +					 u8 child_index)
> > +{
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
> "ggtt_offset + pp_offset + offsetof(pp, child_join[i])"?
> 

Probably.

> 
> > +}
> > +
> > +#define PARENT_GO_BB			1
> > +#define PARENT_GO_FINI_BREADCRUMB	0
> > +#define CHILD_GO_BB			1
> > +#define CHILD_GO_FINI_BREADCRUMB	0
> > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						     u64 offset, u32 len,
> > +						     const unsigned int flags)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u32 *cs;
> > +	u8 i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	cs = intel_ring_begin(rq, 10 + 4 * ce->parallel.number_children);
> > +	if (IS_ERR(cs))
> > +		return PTR_ERR(cs);
> > +
> > +	/* Wait on children */
> > +	for (i = 0; i < ce->parallel.number_children; ++i) {
> > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > +			 MI_SEMAPHORE_POLL |
> > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > +		*cs++ = PARENT_GO_BB;
> > +		*cs++ = get_children_join_addr(ce, i);
> > +		*cs++ = 0;
> > +	}
> > +
> > +	/* Turn off preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Tell children go */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  CHILD_GO_BB,
> > +				  get_children_go_addr(ce),
> > +				  0);
> > +
> > +	/* Jump to batch */
> > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > +	*cs++ = lower_32_bits(offset);
> > +	*cs++ = upper_32_bits(offset);
> > +	*cs++ = MI_NOOP;
> > +
> > +	intel_ring_advance(rq, cs);
> > +
> > +	return 0;
> > +}
> > +
> > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						    u64 offset, u32 len,
> > +						    const unsigned int flags)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	struct intel_context *parent = intel_context_to_parent(ce);
> > +	u32 *cs;
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	cs = intel_ring_begin(rq, 12);
> > +	if (IS_ERR(cs))
> > +		return PTR_ERR(cs);
> > +
> > +	/* Signal parent */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  PARENT_GO_BB,
> > +				  get_children_join_addr(parent,
> > +							 ce->parallel.child_index),
> > +				  0);
> > +
> > +	/* Wait on parent for go */
> > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > +		 MI_SEMAPHORE_POLL |
> > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > +	*cs++ = CHILD_GO_BB;
> > +	*cs++ = get_children_go_addr(parent);
> > +	*cs++ = 0;
> > +
> > +	/* Turn off preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > +
> > +	/* Jump to batch */
> > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > +	*cs++ = lower_32_bits(offset);
> > +	*cs++ = upper_32_bits(offset);
> > +
> > +	intel_ring_advance(rq, cs);
> > +
> > +	return 0;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u8 i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	/* Wait on children */
> > +	for (i = 0; i < ce->parallel.number_children; ++i) {
> > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > +			 MI_SEMAPHORE_POLL |
> > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
> > +		*cs++ = get_children_join_addr(ce, i);
> > +		*cs++ = 0;
> > +	}
> > +
> > +	/* Turn on preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +	*cs++ = MI_NOOP;
> > +
> You mentioned possibly needing to add an MI_ARB_CHECK in here but I'm not
> seeing it. Did the testing happen? I don't see that it should be necessary.
> Once you execute the MI_ARB_ENABLE, the CS can preempt anywhere, I thought?

No, it can only preempt on certain instructions - e.g. MI_ARB_CHECK or a
semaphore.

> Even if it can't there should be an MI_ARB_CHECK added at the next level up
> after the breadcrumb code. Or do we not have those in between batches any
> more?
>

Right. An MI_ARB_CHECK before writing the fini breadcrumbs would be wrong
as we could preempt after the batch is complete but before the fini
breadcrumbs are written, so the i915 would still think the batch (request)
isn't done. The code is fine as is. The emit BB code for the parent has
semaphore instructions where it can be preempted.
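
For illustration, a rough schematic of one parent submission cycle,
stitched together from the emit_bb_start / fini breadcrumb code above
(instruction ordering only, not the literal dwords):

	MI_SEMAPHORE_WAIT  join[i] == PARENT_GO_BB     /* preemptible, arb still on */
	MI_ARB_ON_OFF | MI_ARB_DISABLE                 /* arbitration off */
	GGTT write         go = CHILD_GO_BB
	MI_BATCH_BUFFER_START                          /* user BB, no mid-BB preemption */
	MI_SEMAPHORE_WAIT  join[i] == PARENT_GO_FINI_BREADCRUMB
	MI_ARB_ON_OFF | MI_ARB_ENABLE                  /* arbitration back on */
	GGTT write         go = CHILD_GO_FINI_BREADCRUMB
	GGTT write         hwsp = rq->fence.seqno
	MI_USER_INTERRUPT

The only points where the CS can actually switch out are the semaphore
waits at the top of the cycle, i.e. between sets of BBs, which matches
the contract described in the comment block above.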

Matt
 
> John.
> 
> 
> > +	/* Tell children go */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  CHILD_GO_FINI_BREADCRUMB,
> > +				  get_children_go_addr(ce),
> > +				  0);
> > +
> > +	/* Emit fini breadcrumb */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  rq->fence.seqno,
> > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > +				  0);
> > +
> > +	/* User interrupt */
> > +	*cs++ = MI_USER_INTERRUPT;
> > +	*cs++ = MI_NOOP;
> > +
> > +	rq->tail = intel_ring_offset(rq, cs);
> > +
> > +	return cs;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	struct intel_context *parent = intel_context_to_parent(ce);
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	/* Turn on preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Signal parent */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  PARENT_GO_FINI_BREADCRUMB,
> > +				  get_children_join_addr(parent,
> > +							 ce->parallel.child_index),
> > +				  0);
> > +
> > +	/* Wait on parent for go */
> > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > +		 MI_SEMAPHORE_POLL |
> > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
> > +	*cs++ = get_children_go_addr(parent);
> > +	*cs++ = 0;
> > +
> > +	/* Emit fini breadcrumb */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  rq->fence.seqno,
> > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > +				  0);
> > +
> > +	/* User interrupt */
> > +	*cs++ = MI_USER_INTERRUPT;
> > +	*cs++ = MI_NOOP;
> > +
> > +	rq->tail = intel_ring_offset(rq, cs);
> > +
> > +	return cs;
> > +}
> > +
> >   static struct intel_context *
> >   guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> >   		   unsigned long flags)
> 


* Re: [PATCH 23/26] drm/i915: Make request conflict tracking understand parallel submits
  2021-10-12 22:08     ` [Intel-gfx] " John Harrison
@ 2021-10-13 17:51       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13 17:51 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Tue, Oct 12, 2021 at 03:08:05PM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > If an object in the excl or shared slot is a composite fence from a
> > parallel submit and the current request in the conflict tracking is from
> > the same parallel context there is no need to enforce ordering as the
> > ordering already implicit. Make the request conflict tracking understand
> ordering already -> ordering is already
> 
> > this by comparing the parents parallel fence values and skipping the
> parents -> parent's
> 
> > conflict insertion if the values match.
> Presumably, this is to cope with the fact that the parallel submit fences do
> not look like regular submission fences. And hence the existing code that
> says 'new fence belongs to same context as old fence, so safe to ignore'
> does not work with parallel submission. However, this change does not appear
> to be adding parallel submit support to an existing 'same context' check. It
> seems to be a brand new check that does not exist for single submission.
> What makes parallel submit different? If we aren't skipping same context
> fences for single submits, why do we need it for parallel? Conversely, if we
> need it for parallel then why don't we need it for single?
> 
> And if the single submission version is simply somewhere else in the code,
> why do the parallel version here instead of at the same place?
> 
> John.
> 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_request.c | 43 +++++++++++++++++++----------
> >   1 file changed, 29 insertions(+), 14 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index e9bfa32f9270..cf89624020ad 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -1325,6 +1325,25 @@ i915_request_await_external(struct i915_request *rq, struct dma_fence *fence)
> >   	return err;
> >   }
> > +static inline bool is_parallel_rq(struct i915_request *rq)
> > +{
> > +	return intel_context_is_parallel(rq->context);
> > +}
> > +
> > +static inline struct intel_context *request_to_parent(struct i915_request *rq)
> > +{
> > +	return intel_context_to_parent(rq->context);
> > +}
> > +
> > +static bool is_same_parallel_context(struct i915_request *to,
> > +				     struct i915_request *from)
> > +{
> > +	if (is_parallel_rq(to))
> Should this not say '&& is_parallel_rq(from)'?
> 

Missed this one. That isn't necessary: if from is not a parallel
submit, the following compare of the parents will always return false. I
could add it if you insist, as either way works.
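
For reference, the stricter form would just be:

	static bool is_same_parallel_context(struct i915_request *to,
					     struct i915_request *from)
	{
		if (is_parallel_rq(to) && is_parallel_rq(from))
			return request_to_parent(to) == request_to_parent(from);

		return false;
	}

Functionally equivalent, because request_to_parent() of a non-parallel
request is just its own context, which can never match the parent
context of a parallel request — the extra check only makes the intent
explicit.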

Matt 

> > +		return request_to_parent(to) == request_to_parent(from);
> > +
> > +	return false;
> > +}
> > +
> >   int
> >   i915_request_await_execution(struct i915_request *rq,
> >   			     struct dma_fence *fence)
> > @@ -1356,11 +1375,14 @@ i915_request_await_execution(struct i915_request *rq,
> >   		 * want to run our callback in all cases.
> >   		 */
> > -		if (dma_fence_is_i915(fence))
> > +		if (dma_fence_is_i915(fence)) {
> > +			if (is_same_parallel_context(rq, to_request(fence)))
> > +				continue;
> >   			ret = __i915_request_await_execution(rq,
> >   							     to_request(fence));
> > -		else
> > +		} else {
> >   			ret = i915_request_await_external(rq, fence);
> > +		}
> >   		if (ret < 0)
> >   			return ret;
> >   	} while (--nchild);
> > @@ -1461,10 +1483,13 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence)
> >   						 fence))
> >   			continue;
> > -		if (dma_fence_is_i915(fence))
> > +		if (dma_fence_is_i915(fence)) {
> > +			if (is_same_parallel_context(rq, to_request(fence)))
> > +				continue;
> >   			ret = i915_request_await_request(rq, to_request(fence));
> > -		else
> > +		} else {
> >   			ret = i915_request_await_external(rq, fence);
> > +		}
> >   		if (ret < 0)
> >   			return ret;
> > @@ -1539,16 +1564,6 @@ i915_request_await_object(struct i915_request *to,
> >   	return ret;
> >   }
> > -static inline bool is_parallel_rq(struct i915_request *rq)
> > -{
> > -	return intel_context_is_parallel(rq->context);
> > -}
> > -
> > -static inline struct intel_context *request_to_parent(struct i915_request *rq)
> > -{
> > -	return intel_context_to_parent(rq->context);
> > -}
> > -
> >   static struct i915_request *
> >   __i915_request_ensure_parallel_ordering(struct i915_request *rq,
> >   					struct intel_timeline *timeline)
> 


* Re: [PATCH 10/26] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-10-08 16:40         ` [Intel-gfx] " John Harrison
@ 2021-10-13 18:03           ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13 18:03 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Fri, Oct 08, 2021 at 09:40:43AM -0700, John Harrison wrote:
> On 10/7/2021 18:21, Matthew Brost wrote:
> > On Thu, Oct 07, 2021 at 03:03:04PM -0700, John Harrison wrote:
> > > On 10/4/2021 15:06, Matthew Brost wrote:
> > > > Assign contexts in parent-child relationship consecutive guc_ids. This
> > > > is accomplished by partitioning guc_id space between ones that need to
> > > > be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
> > > > available guc_ids). The consecutive search is implemented via the bitmap
> > > > API.
> > > > 
> > > > This is a precursor to the full GuC multi-lrc implementation but aligns
> > > > to how the GuC multi-lrc interface is defined - guc_ids must be consecutive
> > > > when using the GuC multi-lrc interface.
> > > > 
> > > > v2:
> > > >    (Daniel Vetter)
> > > >     - Explicitly state why we assign consecutive guc_ids
> > > > v3:
> > > >    (John Harrison)
> > > >     - Bring back in spin lock
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 104 ++++++++++++++----
> > > >    2 files changed, 86 insertions(+), 24 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > index 25a598e2b6e8..a9f4ec972bfb 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > @@ -76,9 +76,13 @@ struct intel_guc {
> > > >    		 */
> > > >    		spinlock_t lock;
> > > >    		/**
> > > > -		 * @guc_ids: used to allocate new guc_ids
> > > > +		 * @guc_ids: used to allocate new guc_ids, single-lrc
> > > >    		 */
> > > >    		struct ida guc_ids;
> > > > +		/**
> > > > +		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
> > > > +		 */
> > > > +		unsigned long *guc_ids_bitmap;
> > > >    		/**
> > > >    		 * @guc_id_list: list of intel_context with valid guc_ids but no
> > > >    		 * refs
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 1f2809187513..79e7732e83b2 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -128,6 +128,16 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> > > >    #define GUC_REQUEST_SIZE 64 /* bytes */
> > > > +/*
> > > > + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
> > > > + * per the GuC submission interface. A different allocation algorithm is used
> > > > + * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
> > > > + * partition the guc_id space. We believe the number of multi-lrc contexts in
> > > > + * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
> > > > + * multi-lrc.
> > > > + */
> > > > +#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
> > > > +
> > > >    /*
> > > >     * Below is a set of functions which control the GuC scheduling state which
> > > >     * require a lock.
> > > > @@ -1206,6 +1216,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > > >    	INIT_WORK(&guc->submission_state.destroyed_worker,
> > > >    		  destroyed_worker_func);
> > > > +	guc->submission_state.guc_ids_bitmap =
> > > > +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
> > > > +	if (!guc->submission_state.guc_ids_bitmap)
> > > > +		return -ENOMEM;
> > > > +
> > > >    	return 0;
> > > >    }
> > > > @@ -1217,6 +1232,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> > > >    	guc_lrc_desc_pool_destroy(guc);
> > > >    	guc_flush_destroyed_contexts(guc);
> > > >    	i915_sched_engine_put(guc->sched_engine);
> > > > +	bitmap_free(guc->submission_state.guc_ids_bitmap);
> > > >    }
> > > >    static inline void queue_request(struct i915_sched_engine *sched_engine,
> > > > @@ -1268,18 +1284,43 @@ static void guc_submit_request(struct i915_request *rq)
> > > >    	spin_unlock_irqrestore(&sched_engine->lock, flags);
> > > >    }
> > > > -static int new_guc_id(struct intel_guc *guc)
> > > > +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    {
> > > > -	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> > > > -			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
> > > > -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > > +	int ret;
> > > > +
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	if (intel_context_is_parent(ce))
> > > > +		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
> > > > +					      NUMBER_MULTI_LRC_GUC_ID,
> > > > +					      order_base_2(ce->parallel.number_children
> > > > +							   + 1));
> > > > +	else
> > > > +		ret = ida_simple_get(&guc->submission_state.guc_ids,
> > > > +				     NUMBER_MULTI_LRC_GUC_ID,
> > > > +				     GUC_MAX_LRC_DESCRIPTORS,
> > > > +				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
> > > > +				     __GFP_NOWARN);
> > > > +	if (unlikely(ret < 0))
> > > > +		return ret;
> > > > +
> > > > +	ce->guc_id.id = ret;
> > > > +	return 0;
> > > >    }
> > > >    static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    {
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > >    	if (!context_guc_id_invalid(ce)) {
> > > > -		ida_simple_remove(&guc->submission_state.guc_ids,
> > > > -				  ce->guc_id.id);
> > > > +		if (intel_context_is_parent(ce))
> > > > +			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
> > > > +					      ce->guc_id.id,
> > > > +					      order_base_2(ce->parallel.number_children
> > > > +							   + 1));
> > > There was a discussion on the previous revision about adding a BUG_ON to
> > > ensure that number_children cannot change between the bitmap alloc and the
> > > bitmap release. I'm not seeing the new BUG_ON mentioned in this patch.
> > > 
> > I thought you meant to add a BUG_ON to ensure a region / id is occupied
> > before we release it? I looked in both the bitmap API and the ida API and
> > neither has a function that checks whether a region / id is occupied, so we
> > can't really add a BUG_ON for that.
> > 
> > How would you add a BUG_ON to ensure the number of children cannot change
> > between alloc and release? I don't follow how that would work.
> > 
> > Matt
> I was thinking that where number_children is modified, you have a
> BUG_ON(guc_id_is_valid). That would ensure that the release has to match the
> alloc. Hmm, you already have a BUG_ON about the parent/child not being
> pinned in intel_context_bind_parent_child(), which I guess covers it because
> you shouldn't have a guc_id if you aren't pinned, right? And that is the
> only function which can modify number_children, yes? So maybe it's all good?
> 

I think we are all good.
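
For the record, the guard discussed above would be something like the
below sketch (ignoring that context_guc_id_invalid() is currently local
to the GuC submission code):

	void intel_context_bind_parent_child(struct intel_context *parent,
					     struct intel_context *child)
	{
		...
		/*
		 * The guc_id region is sized from number_children at alloc
		 * time, so the topology must not change while a guc_id is
		 * live.
		 */
		GEM_BUG_ON(!context_guc_id_invalid(parent));

		parent->parallel.child_index = parent->parallel.number_children++;
		...
	}

but, as noted, the existing pinning BUG_ONs already make this
unreachable since a guc_id cannot be held unless the contexts are
pinned.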

Matt

> John.
> 
> > 
> > > John.
> > > 
> > > 
> > > > +		else
> > > > +			ida_simple_remove(&guc->submission_state.guc_ids,
> > > > +					  ce->guc_id.id);
> > > >    		reset_lrc_desc(guc, ce->guc_id.id);
> > > >    		set_context_guc_id_invalid(ce);
> > > >    	}
> > > > @@ -1296,49 +1337,64 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > > >    }
> > > > -static int steal_guc_id(struct intel_guc *guc)
> > > > +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    {
> > > > -	struct intel_context *ce;
> > > > -	int guc_id;
> > > > +	struct intel_context *cn;
> > > >    	lockdep_assert_held(&guc->submission_state.lock);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +	GEM_BUG_ON(intel_context_is_parent(ce));
> > > >    	if (!list_empty(&guc->submission_state.guc_id_list)) {
> > > > -		ce = list_first_entry(&guc->submission_state.guc_id_list,
> > > > +		cn = list_first_entry(&guc->submission_state.guc_id_list,
> > > >    				      struct intel_context,
> > > >    				      guc_id.link);
> > > > -		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> > > > -		GEM_BUG_ON(context_guc_id_invalid(ce));
> > > > +		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
> > > > +		GEM_BUG_ON(context_guc_id_invalid(cn));
> > > > +		GEM_BUG_ON(intel_context_is_child(cn));
> > > > +		GEM_BUG_ON(intel_context_is_parent(cn));
> > > > -		list_del_init(&ce->guc_id.link);
> > > > -		guc_id = ce->guc_id.id;
> > > > +		list_del_init(&cn->guc_id.link);
> > > > +		ce->guc_id = cn->guc_id;
> > > >    		spin_lock(&ce->guc_state.lock);
> > > > -		clr_context_registered(ce);
> > > > +		clr_context_registered(cn);
> > > >    		spin_unlock(&ce->guc_state.lock);
> > > > -		set_context_guc_id_invalid(ce);
> > > > -		return guc_id;
> > > > +		set_context_guc_id_invalid(cn);
> > > > +
> > > > +		return 0;
> > > >    	} else {
> > > >    		return -EAGAIN;
> > > >    	}
> > > >    }
> > > > -static int assign_guc_id(struct intel_guc *guc, u16 *out)
> > > > +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    {
> > > >    	int ret;
> > > >    	lockdep_assert_held(&guc->submission_state.lock);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > -	ret = new_guc_id(guc);
> > > > +	ret = new_guc_id(guc, ce);
> > > >    	if (unlikely(ret < 0)) {
> > > > -		ret = steal_guc_id(guc);
> > > > +		if (intel_context_is_parent(ce))
> > > > +			return -ENOSPC;
> > > > +
> > > > +		ret = steal_guc_id(guc, ce);
> > > >    		if (ret < 0)
> > > >    			return ret;
> > > >    	}
> > > > -	*out = ret;
> > > > +	if (intel_context_is_parent(ce)) {
> > > > +		struct intel_context *child;
> > > > +		int i = 1;
> > > > +
> > > > +		for_each_child(ce, child)
> > > > +			child->guc_id.id = ce->guc_id.id + i++;
> > > > +	}
> > > > +
> > > >    	return 0;
> > > >    }
> > > > @@ -1356,7 +1412,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    	might_lock(&ce->guc_state.lock);
> > > >    	if (context_guc_id_invalid(ce)) {
> > > > -		ret = assign_guc_id(guc, &ce->guc_id.id);
> > > > +		ret = assign_guc_id(guc, ce);
> > > >    		if (ret)
> > > >    			goto out_unlock;
> > > >    		ret = 1;	/* Indicates newly assigned guc_id */
> > > > @@ -1398,8 +1454,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    	unsigned long flags;
> > > >    	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > -	if (unlikely(context_guc_id_invalid(ce)))
> > > > +	if (unlikely(context_guc_id_invalid(ce) ||
> > > > +		     intel_context_is_parent(ce)))
> > > >    		return;
> > > >    	spin_lock_irqsave(&guc->submission_state.lock, flags);
> 


* Re: [Intel-gfx] [PATCH 10/26] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
@ 2021-10-13 18:03           ` Matthew Brost
  0 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13 18:03 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Fri, Oct 08, 2021 at 09:40:43AM -0700, John Harrison wrote:
> On 10/7/2021 18:21, Matthew Brost wrote:
> > On Thu, Oct 07, 2021 at 03:03:04PM -0700, John Harrison wrote:
> > > On 10/4/2021 15:06, Matthew Brost wrote:
> > > > Assign contexts in parent-child relationship consecutive guc_ids. This
> > > > is accomplished by partitioning guc_id space between ones that need to
> > > > be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
> > > > available guc_ids). The consecutive search is implemented via the bitmap
> > > > API.
> > > > 
> > > > This is a precursor to the full GuC multi-lrc implementation but aligns
> > > > to how GuC mutli-lrc interface is defined - guc_ids must be consecutive
> > > > when using the GuC multi-lrc interface.
> > > > 
> > > > v2:
> > > >    (Daniel Vetter)
> > > >     - Explicitly state why we assign consecutive guc_ids
> > > > v3:
> > > >    (John Harrison)
> > > >     - Bring back in spin lock
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 104 ++++++++++++++----
> > > >    2 files changed, 86 insertions(+), 24 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > index 25a598e2b6e8..a9f4ec972bfb 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > @@ -76,9 +76,13 @@ struct intel_guc {
> > > >    		 */
> > > >    		spinlock_t lock;
> > > >    		/**
> > > > -		 * @guc_ids: used to allocate new guc_ids
> > > > +		 * @guc_ids: used to allocate new guc_ids, single-lrc
> > > >    		 */
> > > >    		struct ida guc_ids;
> > > > +		/**
> > > > +		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
> > > > +		 */
> > > > +		unsigned long *guc_ids_bitmap;
> > > >    		/**
> > > >    		 * @guc_id_list: list of intel_context with valid guc_ids but no
> > > >    		 * refs
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 1f2809187513..79e7732e83b2 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -128,6 +128,16 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> > > >    #define GUC_REQUEST_SIZE 64 /* bytes */
> > > > +/*
> > > > + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
> > > > + * per the GuC submission interface. A different allocation algorithm is used
> > > > + * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
> > > > + * partition the guc_id space. We believe the number of multi-lrc contexts in
> > > > + * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
> > > > + * multi-lrc.
> > > > + */
> > > > +#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
> > > > +
> > > >    /*
> > > >     * Below is a set of functions which control the GuC scheduling state which
> > > >     * require a lock.
> > > > @@ -1206,6 +1216,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > > >    	INIT_WORK(&guc->submission_state.destroyed_worker,
> > > >    		  destroyed_worker_func);
> > > > +	guc->submission_state.guc_ids_bitmap =
> > > > +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
> > > > +	if (!guc->submission_state.guc_ids_bitmap)
> > > > +		return -ENOMEM;
> > > > +
> > > >    	return 0;
> > > >    }
> > > > @@ -1217,6 +1232,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> > > >    	guc_lrc_desc_pool_destroy(guc);
> > > >    	guc_flush_destroyed_contexts(guc);
> > > >    	i915_sched_engine_put(guc->sched_engine);
> > > > +	bitmap_free(guc->submission_state.guc_ids_bitmap);
> > > >    }
> > > >    static inline void queue_request(struct i915_sched_engine *sched_engine,
> > > > @@ -1268,18 +1284,43 @@ static void guc_submit_request(struct i915_request *rq)
> > > >    	spin_unlock_irqrestore(&sched_engine->lock, flags);
> > > >    }
> > > > -static int new_guc_id(struct intel_guc *guc)
> > > > +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    {
> > > > -	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> > > > -			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
> > > > -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > > +	int ret;
> > > > +
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	if (intel_context_is_parent(ce))
> > > > +		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
> > > > +					      NUMBER_MULTI_LRC_GUC_ID,
> > > > +					      order_base_2(ce->parallel.number_children
> > > > +							   + 1));
> > > > +	else
> > > > +		ret = ida_simple_get(&guc->submission_state.guc_ids,
> > > > +				     NUMBER_MULTI_LRC_GUC_ID,
> > > > +				     GUC_MAX_LRC_DESCRIPTORS,
> > > > +				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
> > > > +				     __GFP_NOWARN);
> > > > +	if (unlikely(ret < 0))
> > > > +		return ret;
> > > > +
> > > > +	ce->guc_id.id = ret;
> > > > +	return 0;
> > > >    }
> > > >    static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    {
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > >    	if (!context_guc_id_invalid(ce)) {
> > > > -		ida_simple_remove(&guc->submission_state.guc_ids,
> > > > -				  ce->guc_id.id);
> > > > +		if (intel_context_is_parent(ce))
> > > > +			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
> > > > +					      ce->guc_id.id,
> > > > +					      order_base_2(ce->parallel.number_children
> > > > +							   + 1));
> > > There was a discussion on the previous revision about adding a BUG_ON to
> > > ensure that number_children cannot change between the bitmap alloc and the
> > > bitmap release. I'm not seeing the new BUG_ON mentioned in this patch.
> > > 
> > I thought you meant to add a BUG_ON to ensure before we release a region
> > / id it is occupied? I looked in both the bitmap API and ida API and
> > neither have a function that checks if region / id is occupied so can't
> > really add a BUG_ON for that.
> > 
> > How much you add BUG_ON to ensure the number of children canoot change
> > between alloc and release? I don't follow how that would work.
> > 
> > Matt
> I was thinking that where number_children is modified, you have a
> BUG_ON(guc_id_is_valid). That would ensure that the release has to match the
> alloc. Hmm, you already have a BUG_ON about the parent/child not being
> pinned in intel_context_bind_parent_child(), which I guess covers it because
> you shouldn't have a guc_id if you aren't pinned, right? And that is the
> only function which can modify number_children, yes? So maybe it's all good?
> 

I think we are all good.

Matt

> John.
> 
> > 
> > > John.
> > > 
> > > 
> > > > +		else
> > > > +			ida_simple_remove(&guc->submission_state.guc_ids,
> > > > +					  ce->guc_id.id);
> > > >    		reset_lrc_desc(guc, ce->guc_id.id);
> > > >    		set_context_guc_id_invalid(ce);
> > > >    	}
> > > > @@ -1296,49 +1337,64 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > > >    }
> > > > -static int steal_guc_id(struct intel_guc *guc)
> > > > +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    {
> > > > -	struct intel_context *ce;
> > > > -	int guc_id;
> > > > +	struct intel_context *cn;
> > > >    	lockdep_assert_held(&guc->submission_state.lock);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +	GEM_BUG_ON(intel_context_is_parent(ce));
> > > >    	if (!list_empty(&guc->submission_state.guc_id_list)) {
> > > > -		ce = list_first_entry(&guc->submission_state.guc_id_list,
> > > > +		cn = list_first_entry(&guc->submission_state.guc_id_list,
> > > >    				      struct intel_context,
> > > >    				      guc_id.link);
> > > > -		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> > > > -		GEM_BUG_ON(context_guc_id_invalid(ce));
> > > > +		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
> > > > +		GEM_BUG_ON(context_guc_id_invalid(cn));
> > > > +		GEM_BUG_ON(intel_context_is_child(cn));
> > > > +		GEM_BUG_ON(intel_context_is_parent(cn));
> > > > -		list_del_init(&ce->guc_id.link);
> > > > -		guc_id = ce->guc_id.id;
> > > > +		list_del_init(&cn->guc_id.link);
> > > > +		ce->guc_id = cn->guc_id;
> > > >    		spin_lock(&ce->guc_state.lock);
> > > > -		clr_context_registered(ce);
> > > > +		clr_context_registered(cn);
> > > >    		spin_unlock(&ce->guc_state.lock);
> > > > -		set_context_guc_id_invalid(ce);
> > > > -		return guc_id;
> > > > +		set_context_guc_id_invalid(cn);
> > > > +
> > > > +		return 0;
> > > >    	} else {
> > > >    		return -EAGAIN;
> > > >    	}
> > > >    }
> > > > -static int assign_guc_id(struct intel_guc *guc, u16 *out)
> > > > +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    {
> > > >    	int ret;
> > > >    	lockdep_assert_held(&guc->submission_state.lock);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > -	ret = new_guc_id(guc);
> > > > +	ret = new_guc_id(guc, ce);
> > > >    	if (unlikely(ret < 0)) {
> > > > -		ret = steal_guc_id(guc);
> > > > +		if (intel_context_is_parent(ce))
> > > > +			return -ENOSPC;
> > > > +
> > > > +		ret = steal_guc_id(guc, ce);
> > > >    		if (ret < 0)
> > > >    			return ret;
> > > >    	}
> > > > -	*out = ret;
> > > > +	if (intel_context_is_parent(ce)) {
> > > > +		struct intel_context *child;
> > > > +		int i = 1;
> > > > +
> > > > +		for_each_child(ce, child)
> > > > +			child->guc_id.id = ce->guc_id.id + i++;
> > > > +	}
> > > > +
> > > >    	return 0;
> > > >    }
> > > > @@ -1356,7 +1412,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    	might_lock(&ce->guc_state.lock);
> > > >    	if (context_guc_id_invalid(ce)) {
> > > > -		ret = assign_guc_id(guc, &ce->guc_id.id);
> > > > +		ret = assign_guc_id(guc, ce);
> > > >    		if (ret)
> > > >    			goto out_unlock;
> > > >    		ret = 1;	/* Indicates newly assigned guc_id */
> > > > @@ -1398,8 +1454,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    	unsigned long flags;
> > > >    	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > -	if (unlikely(context_guc_id_invalid(ce)))
> > > > +	if (unlikely(context_guc_id_invalid(ce) ||
> > > > +		     intel_context_is_parent(ce)))
> > > >    		return;
> > > >    	spin_lock_irqsave(&guc->submission_state.lock, flags);
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 12/26] drm/i915/guc: Implement multi-lrc submission
  2021-10-08 17:20     ` [Intel-gfx] " John Harrison
@ 2021-10-13 18:24       ` Matthew Brost
  -1 siblings, 0 replies; 165+ messages in thread
From: Matthew Brost @ 2021-10-13 18:24 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On Fri, Oct 08, 2021 at 10:20:24AM -0700, John Harrison wrote:
> On 10/4/2021 15:06, Matthew Brost wrote:
> > Implement multi-lrc submission via a single workqueue entry and a single
> > H2G. The workqueue entry contains an updated tail value for each context
> > in the multi-lrc submission, and these values are updated simultaneously.
> > As such, the tasklet and bypass path have been updated to coalesce
> > requests into a single submission.
> > 
> > v2:
> >   (John Harrison)
> >    - s/wqe/wqi
> >    - Use FIELD_PREP macros
> >    - Add GEM_BUG_ONs to ensure length fits within field
> >    - Add comment / white space to intel_guc_write_barrier
> >   (Kernel test robot)
> >    - Make need_tasklet a static function
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  26 ++
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  23 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 319 ++++++++++++++++--
> >   drivers/gpu/drm/i915/i915_request.h           |   8 +
> >   6 files changed, 335 insertions(+), 73 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > index 8f8182bf7c11..7191e8439290 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > @@ -756,3 +756,29 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
> >   		}
> >   	}
> >   }
> > +
> > +void intel_guc_write_barrier(struct intel_guc *guc)
> > +{
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +
> > +	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> > +		/*
> > +		 * Ensure intel_uncore_write_fw can be used rather than
> > +		 * intel_uncore_write.
> > +		 */
> > +		GEM_BUG_ON(guc->send_regs.fw_domains);
> > +
> > +		/*
> > +		 * This register is used by the i915 and GuC for MMIO based
> > +		 * communication. Once we are in this code CTBs are the only
> > +		 * method the i915 uses to communicate with the GuC so it is
> > +		 * safe to write to this register (a value of 0 is NOP for MMIO
> > +		 * communication). If we ever start mixing CTBs and MMIOs a new
> > +		 * register will have to be chosen.
> > +		 */
> Hmm, missed it before but this comment is very CTB-centric and the barrier
> function is now being used for parallel submission work queues. Seems like
> an extra comment should be added to cover that case. Just something simple
> about how WQ usage is also guaranteed to be post the CTB switch-over.
> 

Sure.
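
Presumably just a line or two appended to the existing comment — a sketch
of possible wording, not the final patch:

	/*
	 * ...If we ever start mixing CTBs and MMIOs a new register will
	 * have to be chosen. This barrier is also used to order work
	 * queue item writes ahead of the process descriptor tail update,
	 * and WQ usage is likewise guaranteed to be post the CTB
	 * switch-over.
	 */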

> > +		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> > +	} else {
> > +		/* wmb() sufficient for a barrier if in smem */
> > +		wmb();
> > +	}
> > +}
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index a9f4ec972bfb..147f39cc0f2f 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -46,6 +46,12 @@ struct intel_guc {
> >   	 * submitted until the stalled request is processed.
> >   	 */
> >   	struct i915_request *stalled_request;
> > +	enum {
> > +		STALL_NONE,
> > +		STALL_REGISTER_CONTEXT,
> > +		STALL_MOVE_LRC_TAIL,
> > +		STALL_ADD_REQUEST,
> > +	} submission_stall_reason;
> >   	/* intel_guc_recv interrupt related state */
> >   	/** @irq_lock: protects GuC irq state */
> > @@ -361,4 +367,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
> >   void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
> > +void intel_guc_write_barrier(struct intel_guc *guc);
> > +
> >   #endif
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > index 20c710a74498..10d1878d2826 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > @@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
> >   	return ++ct->requests.last_fence;
> >   }
> > -static void write_barrier(struct intel_guc_ct *ct)
> > -{
> > -	struct intel_guc *guc = ct_to_guc(ct);
> > -	struct intel_gt *gt = guc_to_gt(guc);
> > -
> > -	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> > -		GEM_BUG_ON(guc->send_regs.fw_domains);
> > -		/*
> > -		 * This register is used by the i915 and GuC for MMIO based
> > -		 * communication. Once we are in this code CTBs are the only
> > -		 * method the i915 uses to communicate with the GuC so it is
> > -		 * safe to write to this register (a value of 0 is NOP for MMIO
> > -		 * communication). If we ever start mixing CTBs and MMIOs a new
> > -		 * register will have to be chosen.
> > -		 */
> > -		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> > -	} else {
> > -		/* wmb() sufficient for a barrier if in smem */
> > -		wmb();
> > -	}
> > -}
> > -
> >   static int ct_write(struct intel_guc_ct *ct,
> >   		    const u32 *action,
> >   		    u32 len /* in dwords */,
> > @@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
> >   	 * make sure H2G buffer update and LRC tail update (if this triggering a
> >   	 * submission) are visible before updating the descriptor tail
> >   	 */
> > -	write_barrier(ct);
> > +	intel_guc_write_barrier(ct_to_guc(ct));
> >   	/* update local copies */
> >   	ctb->tail = tail;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > index 0eeb2a9feeed..a00eeddc1449 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > @@ -58,19 +58,16 @@
> >   #define WQ_STATUS_CMD_ERROR		3
> >   #define WQ_STATUS_ENGINE_ID_NOT_USED	4
> >   #define WQ_STATUS_SUSPENDED_FROM_RESET	5
> > -#define WQ_TYPE_SHIFT			0
> > -#define   WQ_TYPE_BATCH_BUF		(0x1 << WQ_TYPE_SHIFT)
> > -#define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
> > -#define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
> > -#define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
> > -#define WQ_TARGET_SHIFT			10
> > -#define WQ_LEN_SHIFT			16
> > -#define WQ_NO_WCFLUSH_WAIT		(1 << 27)
> > -#define WQ_PRESENT_WORKLOAD		(1 << 28)
> > -
> > -#define WQ_RING_TAIL_SHIFT		20
> > -#define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
> > -#define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
> > +#define WQ_TYPE_BATCH_BUF		0x1
> > +#define WQ_TYPE_PSEUDO			0x2
> > +#define WQ_TYPE_INORDER			0x3
> > +#define WQ_TYPE_NOOP			0x4
> > +#define WQ_TYPE_MULTI_LRC		0x5
> > +#define WQ_TYPE_MASK			GENMASK(7, 0)
> > +#define WQ_LEN_MASK			GENMASK(26, 16)
> > +
> > +#define WQ_GUC_ID_MASK			GENMASK(15, 0)
> > +#define WQ_RING_TAIL_MASK		GENMASK(28, 18)
> Another option for documenting the WQ and WQI would be at the top of this
> block of definitions. I believe there is a one-line comment of 'work queue
> item header definitions', but none of these defines actually uses the WQI
> abbreviation. And some description of what the work queue is, how it is
> used, etc. would be good.
>

Will add something here, but again I plan on updating the GuC kernel doc
with all the multi-lrc details, including WQ / WQI.
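
E.g. a short block comment above the defines — a sketch only, with the
authoritative description to land in the GuC kernel doc:

	/*
	 * Work Queue (WQ) / Work Queue Item (WQI) definitions.
	 *
	 * The WQ is a ring buffer shared between the i915 and the GuC.
	 * For multi-lrc submission the i915 writes a WQI carrying the new
	 * ring tail for the parent and each child context; the GuC
	 * consumes the item and moves all the LRC tails in one go. The
	 * head / tail pointers live in the process descriptor.
	 */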
 
> >   #define GUC_STAGE_DESC_ATTR_ACTIVE	BIT(0)
> >   #define GUC_STAGE_DESC_ATTR_PENDING_DB	BIT(1)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 031b1bf5ba91..1610120e31a1 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -399,6 +399,29 @@ __get_process_desc(struct intel_context *ce)
> >   		   LRC_STATE_OFFSET) / sizeof(u32)));
> >   }
> > +static u32 *get_wq_pointer(struct guc_process_desc *desc,
> > +			   struct intel_context *ce,
> > +			   u32 wqi_size)
> > +{
> > +	/*
> > +	 * Check for space in the work queue. Cache the head pointer in the
> > +	 * intel_context structure in order to reduce the number of accesses
> > +	 * to shared GPU memory, which may be across a PCIe bus.
> > +	 */
> > +#define AVAILABLE_SPACE	\
> > +	CIRC_SPACE(ce->parallel.guc.wqi_tail, ce->parallel.guc.wqi_head, WQ_SIZE)
> > +	if (wqi_size > AVAILABLE_SPACE) {
> > +		ce->parallel.guc.wqi_head = READ_ONCE(desc->head);
> > +
> > +		if (wqi_size > AVAILABLE_SPACE)
> > +			return NULL;
> > +	}
> > +#undef AVAILABLE_SPACE
> > +
> > +	return ((u32 *)__get_process_desc(ce)) +
> > +		((WQ_OFFSET + ce->parallel.guc.wqi_tail) / sizeof(u32));
> > +}
> > +
> >   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> >   {
> >   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > @@ -558,10 +581,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
> >   static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
> > -static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > +static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   {
> >   	int err = 0;
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> >   	u32 action[3];
> >   	int len = 0;
> >   	u32 g2h_len_dw = 0;
> > @@ -582,26 +605,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
> >   	GEM_BUG_ON(context_guc_id_invalid(ce));
> > -	/*
> > -	 * Corner case where the GuC firmware was blown away and reloaded while
> > -	 * this context was pinned.
> > -	 */
> > -	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
> > -		err = guc_lrc_desc_pin(ce, false);
> > -		if (unlikely(err))
> > -			return err;
> > -	}
> > -
> >   	spin_lock(&ce->guc_state.lock);
> >   	/*
> >   	 * The request / context will be run on the hardware when scheduling
> > -	 * gets enabled in the unblock.
> > +	 * gets enabled in the unblock. For multi-lrc we still submit the
> > +	 * context to move the LRC tails.
> >   	 */
> > -	if (unlikely(context_blocked(ce)))
> > +	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
> >   		goto out;
> > -	enabled = context_enabled(ce);
> > +	enabled = context_enabled(ce) || context_blocked(ce);
> >   	if (!enabled) {
> >   		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
> > @@ -620,6 +634,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   		trace_intel_context_sched_enable(ce);
> >   		atomic_inc(&guc->outstanding_submission_g2h);
> >   		set_context_enabled(ce);
> > +
> > +		/*
> > +		 * Without multi-lrc KMD does the submission step (moving the
> > +		 * lrc tail) so enabling scheduling is sufficient to submit the
> > +		 * context. This isn't the case in multi-lrc submission as the
> > +		 * GuC needs to move the tails, hence the need for another H2G
> > +		 * to submit a multi-lrc context after enabling scheduling.
> > +		 */
> > +		if (intel_context_is_parent(ce)) {
> > +			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
> > +			err = intel_guc_send_nb(guc, action, len - 1, 0);
> > +		}
> >   	} else if (!enabled) {
> >   		clr_context_pending_enable(ce);
> >   		intel_context_put(ce);
> > @@ -632,6 +658,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   	return err;
> >   }
> > +static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > +{
> > +	int ret = __guc_add_request(guc, rq);
> > +
> > +	if (unlikely(ret == -EBUSY)) {
> > +		guc->stalled_request = rq;
> > +		guc->submission_stall_reason = STALL_ADD_REQUEST;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >   static inline void guc_set_lrc_tail(struct i915_request *rq)
> >   {
> >   	rq->context->lrc_reg_state[CTX_RING_TAIL] =
> > @@ -643,6 +681,134 @@ static inline int rq_prio(const struct i915_request *rq)
> >   	return rq->sched.attr.priority;
> >   }
> > +static bool is_multi_lrc_rq(struct i915_request *rq)
> > +{
> > +	return intel_context_is_child(rq->context) ||
> > +		intel_context_is_parent(rq->context);
> > +}
> > +
> > +static bool can_merge_rq(struct i915_request *rq,
> > +			 struct i915_request *last)
> > +{
> > +	return request_to_scheduling_context(rq) ==
> > +		request_to_scheduling_context(last);
> > +}
> > +
> > +static u32 wq_space_until_wrap(struct intel_context *ce)
> > +{
> > +	return (WQ_SIZE - ce->parallel.guc.wqi_tail);
> > +}
> > +
> > +static void write_wqi(struct guc_process_desc *desc,
> > +		      struct intel_context *ce,
> > +		      u32 wqi_size)
> > +{
> > +	/*
> > +	 * Ensure WQI are visible before updating tail
> > +	 */
> > +	intel_guc_write_barrier(ce_to_guc(ce));
> > +
> > +	ce->parallel.guc.wqi_tail = (ce->parallel.guc.wqi_tail + wqi_size) &
> > +		(WQ_SIZE - 1);
> This relies on WQ_SIZE being a power of two, right? Is it possible to add a
> BUILD_BUG_ON to ensure that?
> 

Yep.
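
i.e. roughly (sketch):

	/* the tail wrap below relies on WQ_SIZE being a power of two */
	BUILD_BUG_ON(!is_power_of_2(WQ_SIZE));

	ce->parallel.guc.wqi_tail = (ce->parallel.guc.wqi_tail + wqi_size) &
		(WQ_SIZE - 1);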

> > +	WRITE_ONCE(desc->tail, ce->parallel.guc.wqi_tail);
> > +}
> > +
> > +static int guc_wq_noop_append(struct intel_context *ce)
> > +{
> > +	struct guc_process_desc *desc = __get_process_desc(ce);
> > +	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
> > +	u32 len_dw = wq_space_until_wrap(ce) / sizeof(u32) - 1;
> > +
> > +	if (!wqi)
> > +		return -EBUSY;
> > +
> > +	GEM_BUG_ON(!FIELD_FIT(WQ_LEN_MASK, len_dw));
> > +
> > +	*wqi = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
> > +		FIELD_PREP(WQ_LEN_MASK, len_dw);
> > +	ce->parallel.guc.wqi_tail = 0;
> > +
> > +	return 0;
> > +}
> > +
> > +static int __guc_wq_item_append(struct i915_request *rq)
> > +{
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +	struct intel_context *child;
> > +	struct guc_process_desc *desc = __get_process_desc(ce);
> > +	unsigned int wqi_size = (ce->parallel.number_children + 4) *
> > +		sizeof(u32);
> > +	u32 *wqi;
> > +	u32 len_dw = (wqi_size / sizeof(u32)) - 1;
> > +	int ret;
> > +
> > +	/* Ensure context is in the correct state before updating the work queue */
> > +	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
> > +	GEM_BUG_ON(context_guc_id_invalid(ce));
> > +	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
> > +	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
> > +
> > +	/* Insert NOOP if this work queue item will wrap the tail pointer. */
> > +	if (wqi_size > wq_space_until_wrap(ce)) {
> > +		ret = guc_wq_noop_append(ce);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	wqi = get_wq_pointer(desc, ce, wqi_size);
> > +	if (!wqi)
> > +		return -EBUSY;
> > +
> > +	GEM_BUG_ON(!FIELD_FIT(WQ_LEN_MASK, len_dw));
> > +
> > +	*wqi++ = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_MULTI_LRC) |
> > +		FIELD_PREP(WQ_LEN_MASK, len_dw);
> > +	*wqi++ = ce->lrc.lrca;
> > +	*wqi++ = FIELD_PREP(WQ_GUC_ID_MASK, ce->guc_id.id) |
> > +	       FIELD_PREP(WQ_RING_TAIL_MASK, ce->ring->tail / sizeof(u64));
> > +	*wqi++ = 0;	/* fence_id */
> > +	for_each_child(ce, child)
> > +		*wqi++ = child->ring->tail / sizeof(u64);
> > +
> > +	write_wqi(desc, ce, wqi_size);
> > +
> > +	return 0;
> > +}
> > +
> > +static int guc_wq_item_append(struct intel_guc *guc,
> > +			      struct i915_request *rq)
> > +{
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +	int ret = 0;
> > +
> > +	if (likely(!intel_context_is_banned(ce))) {
> > +		ret = __guc_wq_item_append(rq);
> > +
> > +		if (unlikely(ret == -EBUSY)) {
> > +			guc->stalled_request = rq;
> > +			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
> > +		}
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static bool multi_lrc_submit(struct i915_request *rq)
> > +{
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +
> > +	intel_ring_set_tail(rq->ring, rq->tail);
> > +
> > +	/*
> > +	 * We expect the front end (execbuf IOCTL) to set this flag on the last
> > +	 * request generated from a multi-BB submission. This indicates to the
> > +	 * backend (GuC interface) that we should submit this context thus
> > +	 * submitting all the requests generated in parallel.
> > +	 */
> > +	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
> FYI: Apparently the test_bit/set_bit/etc helpers are intended for use on
> arbitrarily sized bitfields. As in, they do all sorts of complicated atomic
> operations to work on 164-bit words and the like. For single-word flags,
> the guidance is to just use 'if (word & BIT(bit))' instead.
> 

I get that, but currently everywhere in the code we use
set_bit/clear_bit/test_bit on rq->fence.flags. IMO it is better to stick
to that convention for now and rip out all of these helpers in a single
patch later. I'd rather not have a hodgepodge of styles in the code.

I can take an AR to clean up rq->fence.flags everywhere in the code in a
follow-up.

Matt
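
For the record, the open-coded form suggested above would read
(hypothetical follow-up, not part of this patch):

	return (rq->fence.flags & BIT(I915_FENCE_FLAG_SUBMIT_PARALLEL)) ||
		intel_context_is_banned(ce);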

> John.
> 
> > +		intel_context_is_banned(ce);
> > +}
> > +
> >   static int guc_dequeue_one_context(struct intel_guc *guc)
> >   {
> >   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> > @@ -656,7 +822,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
> >   	if (guc->stalled_request) {
> >   		submit = true;
> >   		last = guc->stalled_request;
> > -		goto resubmit;
> > +
> > +		switch (guc->submission_stall_reason) {
> > +		case STALL_REGISTER_CONTEXT:
> > +			goto register_context;
> > +		case STALL_MOVE_LRC_TAIL:
> > +			goto move_lrc_tail;
> > +		case STALL_ADD_REQUEST:
> > +			goto add_request;
> > +		default:
> > +			MISSING_CASE(guc->submission_stall_reason);
> > +		}
> >   	}
> >   	while ((rb = rb_first_cached(&sched_engine->queue))) {
> > @@ -664,8 +840,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
> >   		struct i915_request *rq, *rn;
> >   		priolist_for_each_request_consume(rq, rn, p) {
> > -			if (last && rq->context != last->context)
> > -				goto done;
> > +			if (last && !can_merge_rq(rq, last))
> > +				goto register_context;
> >   			list_del_init(&rq->sched.link);
> > @@ -673,33 +849,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
> >   			trace_i915_request_in(rq, 0);
> >   			last = rq;
> > -			submit = true;
> > +
> > +			if (is_multi_lrc_rq(rq)) {
> > +				/*
> > +				 * We need to coalesce all multi-lrc requests in
> > +				 * a relationship into a single H2G. We are
> > +				 * guaranteed that all of these requests will be
> > +				 * submitted sequentially.
> > +				 */
> > +				if (multi_lrc_submit(rq)) {
> > +					submit = true;
> > +					goto register_context;
> > +				}
> > +			} else {
> > +				submit = true;
> > +			}
> >   		}
> >   		rb_erase_cached(&p->node, &sched_engine->queue);
> >   		i915_priolist_free(p);
> >   	}
> > -done:
> > +
> > +register_context:
> >   	if (submit) {
> > -		guc_set_lrc_tail(last);
> > -resubmit:
> > +		struct intel_context *ce = request_to_scheduling_context(last);
> > +
> > +		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
> > +			     !intel_context_is_banned(ce))) {
> > +			ret = guc_lrc_desc_pin(ce, false);
> > +			if (unlikely(ret == -EPIPE)) {
> > +				goto deadlk;
> > +			} else if (ret == -EBUSY) {
> > +				guc->stalled_request = last;
> > +				guc->submission_stall_reason =
> > +					STALL_REGISTER_CONTEXT;
> > +				goto schedule_tasklet;
> > +			} else if (ret != 0) {
> > +				GEM_WARN_ON(ret);	/* Unexpected */
> > +				goto deadlk;
> > +			}
> > +		}
> > +
> > +move_lrc_tail:
> > +		if (is_multi_lrc_rq(last)) {
> > +			ret = guc_wq_item_append(guc, last);
> > +			if (ret == -EBUSY) {
> > +				goto schedule_tasklet;
> > +			} else if (ret != 0) {
> > +				GEM_WARN_ON(ret);	/* Unexpected */
> > +				goto deadlk;
> > +			}
> > +		} else {
> > +			guc_set_lrc_tail(last);
> > +		}
> > +
> > +add_request:
> >   		ret = guc_add_request(guc, last);
> > -		if (unlikely(ret == -EPIPE))
> > +		if (unlikely(ret == -EPIPE)) {
> > +			goto deadlk;
> > +		} else if (ret == -EBUSY) {
> > +			goto schedule_tasklet;
> > +		} else if (ret != 0) {
> > +			GEM_WARN_ON(ret);	/* Unexpected */
> >   			goto deadlk;
> > -		else if (ret == -EBUSY) {
> > -			tasklet_schedule(&sched_engine->tasklet);
> > -			guc->stalled_request = last;
> > -			return false;
> >   		}
> >   	}
> >   	guc->stalled_request = NULL;
> > +	guc->submission_stall_reason = STALL_NONE;
> >   	return submit;
> >   deadlk:
> >   	sched_engine->tasklet.callback = NULL;
> >   	tasklet_disable_nosync(&sched_engine->tasklet);
> >   	return false;
> > +
> > +schedule_tasklet:
> > +	tasklet_schedule(&sched_engine->tasklet);
> > +	return false;
> >   }
> >   static void guc_submission_tasklet(struct tasklet_struct *t)
> > @@ -1255,10 +1482,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
> >   	trace_i915_request_in(rq, 0);
> > -	guc_set_lrc_tail(rq);
> > -	ret = guc_add_request(guc, rq);
> > -	if (ret == -EBUSY)
> > -		guc->stalled_request = rq;
> > +	if (is_multi_lrc_rq(rq)) {
> > +		if (multi_lrc_submit(rq)) {
> > +			ret = guc_wq_item_append(guc, rq);
> > +			if (!ret)
> > +				ret = guc_add_request(guc, rq);
> > +		}
> > +	} else {
> > +		guc_set_lrc_tail(rq);
> > +		ret = guc_add_request(guc, rq);
> > +	}
> >   	if (unlikely(ret == -EPIPE))
> >   		disable_submission(guc);
> > @@ -1266,6 +1499,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
> >   	return ret;
> >   }
> > +static bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
> > +{
> > +	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +
> > +	return submission_disabled(guc) || guc->stalled_request ||
> > +		!i915_sched_engine_is_empty(sched_engine) ||
> > +		!lrc_desc_registered(guc, ce->guc_id.id);
> > +}
> > +
> >   static void guc_submit_request(struct i915_request *rq)
> >   {
> >   	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> > @@ -1275,8 +1518,7 @@ static void guc_submit_request(struct i915_request *rq)
> >   	/* Will be called from irq-context when using foreign fences. */
> >   	spin_lock_irqsave(&sched_engine->lock, flags);
> > -	if (submission_disabled(guc) || guc->stalled_request ||
> > -	    !i915_sched_engine_is_empty(sched_engine))
> > +	if (need_tasklet(guc, rq))
> >   		queue_request(sched_engine, rq, rq_prio(rq));
> >   	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
> >   		tasklet_hi_schedule(&sched_engine->tasklet);
> > @@ -2258,9 +2500,10 @@ static inline bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
> >   static void add_to_context(struct i915_request *rq)
> >   {
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> >   	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
> >   	spin_lock(&ce->guc_state.lock);
> > @@ -2293,7 +2536,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
> >   static void remove_from_context(struct i915_request *rq)
> >   {
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	spin_lock_irq(&ce->guc_state.lock);
> > @@ -2712,7 +2957,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
> >   static void guc_bump_inflight_request_prio(struct i915_request *rq,
> >   					   int prio)
> >   {
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> >   	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
> >   	/* Short circuit function */
> > @@ -2735,7 +2980,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
> >   static void guc_retire_inflight_request_prio(struct i915_request *rq)
> >   {
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> >   	spin_lock(&ce->guc_state.lock);
> >   	guc_prio_fini(rq, ce);
> > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > index 7bd9ed20623e..8950785e55d6 100644
> > --- a/drivers/gpu/drm/i915/i915_request.h
> > +++ b/drivers/gpu/drm/i915/i915_request.h
> > @@ -139,6 +139,14 @@ enum {
> >   	 * the GPU. Here we track such boost requests on a per-request basis.
> >   	 */
> >   	I915_FENCE_FLAG_BOOST,
> > +
> > +	/*
> > +	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
> > +	 * parent-child relationship (parallel submission, multi-lrc) should
> > +	 * trigger a submission to the GuC rather than just moving the context
> > +	 * tail.
> > +	 */
> > +	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> >   };
> >   /**
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 10/26] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-10-13 18:03           ` [Intel-gfx] " Matthew Brost
@ 2021-10-13 19:11             ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-13 19:11 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On 10/13/2021 11:03, Matthew Brost wrote:
> On Fri, Oct 08, 2021 at 09:40:43AM -0700, John Harrison wrote:
>> On 10/7/2021 18:21, Matthew Brost wrote:
>>> On Thu, Oct 07, 2021 at 03:03:04PM -0700, John Harrison wrote:
>>>> On 10/4/2021 15:06, Matthew Brost wrote:
>>>>> Assign contexts in parent-child relationship consecutive guc_ids. This
>>>>> is accomplished by partitioning guc_id space between ones that need to
>>>>> be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
>>>>> available guc_ids). The consecutive search is implemented via the bitmap
>>>>> API.
>>>>>
>>>>> This is a precursor to the full GuC multi-lrc implementation but aligns
>>>>> to how the GuC multi-lrc interface is defined - guc_ids must be consecutive
>>>>> when using the GuC multi-lrc interface.
>>>>>
>>>>> v2:
>>>>>     (Daniel Vetter)
>>>>>      - Explicitly state why we assign consecutive guc_ids
>>>>> v3:
>>>>>     (John Harrison)
>>>>>      - Bring back in spin lock
>>>>>
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
>>>>>     .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 104 ++++++++++++++----
>>>>>     2 files changed, 86 insertions(+), 24 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>>>> index 25a598e2b6e8..a9f4ec972bfb 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>>>> @@ -76,9 +76,13 @@ struct intel_guc {
>>>>>     		 */
>>>>>     		spinlock_t lock;
>>>>>     		/**
>>>>> -		 * @guc_ids: used to allocate new guc_ids
>>>>> +		 * @guc_ids: used to allocate new guc_ids, single-lrc
>>>>>     		 */
>>>>>     		struct ida guc_ids;
>>>>> +		/**
>>>>> +		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
>>>>> +		 */
>>>>> +		unsigned long *guc_ids_bitmap;
>>>>>     		/**
>>>>>     		 * @guc_id_list: list of intel_context with valid guc_ids but no
>>>>>     		 * refs
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> index 1f2809187513..79e7732e83b2 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> @@ -128,6 +128,16 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
>>>>>     #define GUC_REQUEST_SIZE 64 /* bytes */
>>>>> +/*
>>>>> + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
>>>>> + * per the GuC submission interface. A different allocation algorithm is used
>>>>> + * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
>>>>> + * partition the guc_id space. We believe the number of multi-lrc contexts in
>>>>> + * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
>>>>> + * multi-lrc.
>>>>> + */
>>>>> +#define NUMBER_MULTI_LRC_GUC_ID		(GUC_MAX_LRC_DESCRIPTORS / 16)
>>>>> +
>>>>>     /*
>>>>>      * Below is a set of functions which control the GuC scheduling state which
>>>>>      * require a lock.
>>>>> @@ -1206,6 +1216,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>>>>     	INIT_WORK(&guc->submission_state.destroyed_worker,
>>>>>     		  destroyed_worker_func);
>>>>> +	guc->submission_state.guc_ids_bitmap =
>>>>> +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID, GFP_KERNEL);
>>>>> +	if (!guc->submission_state.guc_ids_bitmap)
>>>>> +		return -ENOMEM;
>>>>> +
>>>>>     	return 0;
>>>>>     }
>>>>> @@ -1217,6 +1232,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>>>>>     	guc_lrc_desc_pool_destroy(guc);
>>>>>     	guc_flush_destroyed_contexts(guc);
>>>>>     	i915_sched_engine_put(guc->sched_engine);
>>>>> +	bitmap_free(guc->submission_state.guc_ids_bitmap);
>>>>>     }
>>>>>     static inline void queue_request(struct i915_sched_engine *sched_engine,
>>>>> @@ -1268,18 +1284,43 @@ static void guc_submit_request(struct i915_request *rq)
>>>>>     	spin_unlock_irqrestore(&sched_engine->lock, flags);
>>>>>     }
>>>>> -static int new_guc_id(struct intel_guc *guc)
>>>>> +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>>>     {
>>>>> -	return ida_simple_get(&guc->submission_state.guc_ids, 0,
>>>>> -			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
>>>>> -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>>>>> +	int ret;
>>>>> +
>>>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>>>> +
>>>>> +	if (intel_context_is_parent(ce))
>>>>> +		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
>>>>> +					      NUMBER_MULTI_LRC_GUC_ID,
>>>>> +					      order_base_2(ce->parallel.number_children
>>>>> +							   + 1));
>>>>> +	else
>>>>> +		ret = ida_simple_get(&guc->submission_state.guc_ids,
>>>>> +				     NUMBER_MULTI_LRC_GUC_ID,
>>>>> +				     GUC_MAX_LRC_DESCRIPTORS,
>>>>> +				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
>>>>> +				     __GFP_NOWARN);
>>>>> +	if (unlikely(ret < 0))
>>>>> +		return ret;
>>>>> +
>>>>> +	ce->guc_id.id = ret;
>>>>> +	return 0;
>>>>>     }
>>>>>     static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>>>     {
>>>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>>>> +
>>>>>     	if (!context_guc_id_invalid(ce)) {
>>>>> -		ida_simple_remove(&guc->submission_state.guc_ids,
>>>>> -				  ce->guc_id.id);
>>>>> +		if (intel_context_is_parent(ce))
>>>>> +			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
>>>>> +					      ce->guc_id.id,
>>>>> +					      order_base_2(ce->parallel.number_children
>>>>> +							   + 1));
>>>> There was a discussion on the previous revision about adding a BUG_ON to
>>>> ensure that number_children cannot change between the bitmap alloc and the
>>>> bitmap release. I'm not seeing the new BUG_ON mentioned in this patch.
>>>>
>>> I thought you meant to add a BUG_ON to ensure that, before we release a
>>> region / id, it is occupied? I looked in both the bitmap API and the ida
>>> API and neither has a function that checks whether a region / id is
>>> occupied, so I can't really add a BUG_ON for that.
>>>
>>> How would you add a BUG_ON to ensure the number of children cannot change
>>> between alloc and release? I don't follow how that would work.
>>>
>>> Matt
>> I was thinking that where number_children is modified, you have a
>> BUG_ON(guc_id_is_valid). That would ensure that the release has to match the
>> alloc. Hmm, you already have a BUG_ON about the parent/child not being
>> pinned in intel_context_bind_parent_child(), which I guess covers it because
>> you shouldn't have a guc_id if you aren't pinned, right? And that is the
>> only function which can modify number_children, yes? So maybe it's all good?
>>
> I think we are all good.
We are all awesome ;)

Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> Matt
>
>> John.
>>
>>>> John.
>>>>
>>>>
>>>>> +		else
>>>>> +			ida_simple_remove(&guc->submission_state.guc_ids,
>>>>> +					  ce->guc_id.id);
>>>>>     		reset_lrc_desc(guc, ce->guc_id.id);
>>>>>     		set_context_guc_id_invalid(ce);
>>>>>     	}
>>>>> @@ -1296,49 +1337,64 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>>>     	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>>>     }
>>>>> -static int steal_guc_id(struct intel_guc *guc)
>>>>> +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>>>     {
>>>>> -	struct intel_context *ce;
>>>>> -	int guc_id;
>>>>> +	struct intel_context *cn;
>>>>>     	lockdep_assert_held(&guc->submission_state.lock);
>>>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>>>> +	GEM_BUG_ON(intel_context_is_parent(ce));
>>>>>     	if (!list_empty(&guc->submission_state.guc_id_list)) {
>>>>> -		ce = list_first_entry(&guc->submission_state.guc_id_list,
>>>>> +		cn = list_first_entry(&guc->submission_state.guc_id_list,
>>>>>     				      struct intel_context,
>>>>>     				      guc_id.link);
>>>>> -		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
>>>>> -		GEM_BUG_ON(context_guc_id_invalid(ce));
>>>>> +		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
>>>>> +		GEM_BUG_ON(context_guc_id_invalid(cn));
>>>>> +		GEM_BUG_ON(intel_context_is_child(cn));
>>>>> +		GEM_BUG_ON(intel_context_is_parent(cn));
>>>>> -		list_del_init(&ce->guc_id.link);
>>>>> -		guc_id = ce->guc_id.id;
>>>>> +		list_del_init(&cn->guc_id.link);
>>>>> +		ce->guc_id = cn->guc_id;
>>>>>     		spin_lock(&ce->guc_state.lock);
>>>>> -		clr_context_registered(ce);
>>>>> +		clr_context_registered(cn);
>>>>>     		spin_unlock(&ce->guc_state.lock);
>>>>> -		set_context_guc_id_invalid(ce);
>>>>> -		return guc_id;
>>>>> +		set_context_guc_id_invalid(cn);
>>>>> +
>>>>> +		return 0;
>>>>>     	} else {
>>>>>     		return -EAGAIN;
>>>>>     	}
>>>>>     }
>>>>> -static int assign_guc_id(struct intel_guc *guc, u16 *out)
>>>>> +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>>>     {
>>>>>     	int ret;
>>>>>     	lockdep_assert_held(&guc->submission_state.lock);
>>>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>>>> -	ret = new_guc_id(guc);
>>>>> +	ret = new_guc_id(guc, ce);
>>>>>     	if (unlikely(ret < 0)) {
>>>>> -		ret = steal_guc_id(guc);
>>>>> +		if (intel_context_is_parent(ce))
>>>>> +			return -ENOSPC;
>>>>> +
>>>>> +		ret = steal_guc_id(guc, ce);
>>>>>     		if (ret < 0)
>>>>>     			return ret;
>>>>>     	}
>>>>> -	*out = ret;
>>>>> +	if (intel_context_is_parent(ce)) {
>>>>> +		struct intel_context *child;
>>>>> +		int i = 1;
>>>>> +
>>>>> +		for_each_child(ce, child)
>>>>> +			child->guc_id.id = ce->guc_id.id + i++;
>>>>> +	}
>>>>> +
>>>>>     	return 0;
>>>>>     }
>>>>> @@ -1356,7 +1412,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>>>     	might_lock(&ce->guc_state.lock);
>>>>>     	if (context_guc_id_invalid(ce)) {
>>>>> -		ret = assign_guc_id(guc, &ce->guc_id.id);
>>>>> +		ret = assign_guc_id(guc, ce);
>>>>>     		if (ret)
>>>>>     			goto out_unlock;
>>>>>     		ret = 1;	/* Indicates newly assigned guc_id */
>>>>> @@ -1398,8 +1454,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>>>     	unsigned long flags;
>>>>>     	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
>>>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>>>> -	if (unlikely(context_guc_id_invalid(ce)))
>>>>> +	if (unlikely(context_guc_id_invalid(ce) ||
>>>>> +		     intel_context_is_parent(ce)))
>>>>>     		return;
>>>>>     	spin_lock_irqsave(&guc->submission_state.lock, flags);


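For reference, the alloc/release pairing under discussion boils down to the
sketch below (a simplified restatement using the names from the quoted patch;
the wrapper function names are invented for illustration, and this is not the
exact driver code). The invariant John is probing is that order_base_2() must
see the same number_children value at release time as it did at allocation
time, otherwise the wrong region size is freed:

	#include <linux/bitmap.h>
	#include <linux/log2.h>

	/* Parent context: reserve a power-of-two block of consecutive guc_ids. */
	static int multi_lrc_alloc_guc_ids(struct intel_guc *guc,
					   struct intel_context *ce)
	{
		int ret;

		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
					      NUMBER_MULTI_LRC_GUC_ID,
					      order_base_2(ce->parallel.number_children + 1));
		if (ret < 0)
			return ret;	/* No free region of that order */

		ce->guc_id.id = ret;
		return 0;
	}

	/* Release must pass the same order, i.e. the same number_children. */
	static void multi_lrc_release_guc_ids(struct intel_guc *guc,
					      struct intel_context *ce)
	{
		bitmap_release_region(guc->submission_state.guc_ids_bitmap,
				      ce->guc_id.id,
				      order_base_2(ce->parallel.number_children + 1));
	}

As the discussion above concludes, the two order_base_2() computations stay
in sync because number_children only changes via
intel_context_bind_parent_child(), which asserts the contexts are not pinned,
and a guc_id is only held while pinned.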


* Re: [Intel-gfx]  ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev4)
  2021-10-13  0:15     ` Matthew Brost
@ 2021-10-13 19:24       ` John Harrison
  0 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-13 19:24 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, Patchwork

On 10/12/2021 17:15, Matthew Brost wrote:
> On Tue, Oct 12, 2021 at 03:15:00PM -0700, John Harrison wrote:
>> On 10/4/2021 15:21, Patchwork wrote:
>>> == Series Details ==
>>>
>>> Series: Parallel submission aka multi-bb execbuf (rev4)
>>> URL   : https://patchwork.freedesktop.org/series/92789/
>>> State : warning
>>>
>>> == Summary ==
>>>
>>> $ dim checkpatch origin/drm-tip
>>> e2a47a99bf9d drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
>>> f83d8f1539fa drm/i915/guc: Take GT PM ref when deregistering context
>>> -:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
>>> #79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
>>> +#define with_intel_gt_pm(gt, tmp) \
>>> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
>>> +	     intel_gt_pm_put(gt), tmp = 0)
>>>
>>> -:79: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
>>> #79: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
>>> +#define with_intel_gt_pm(gt, tmp) \
>>> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
>>> +	     intel_gt_pm_put(gt), tmp = 0)
>> Not sure what these two are complaining about? But 'gt' and 'tmp' should be
>> wrapped with parentheses when used?
>>
> Not sure, but I think this one is fine.
>
>>> total: 0 errors, 0 warnings, 2 checks, 290 lines checked
>>> 93e5284929b3 drm/i915/guc: Take engine PM when a context is pinned with GuC submission
>>> 4dd6554d994d drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
>>> 8629b55f536c drm/i915: Add logical engine mapping
>>> 8117ec0a1ca7 drm/i915: Expose logical engine instance to user
>>> aa8e1eb4dd4e drm/i915/guc: Introduce context parent-child relationship
>>> aaf50eacc2fd drm/i915/guc: Add multi-lrc context registration
>>> e5f6f50e66d1 drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
>>> adf21ba138f3 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
>>> 40ef33318b81 drm/i915/guc: Implement parallel context pin / unpin functions
>>> 1ad560c70346 drm/i915/guc: Implement multi-lrc submission
>>> -:364: CHECK:SPACING: spaces preferred around that '*' (ctx:ExV)
>>> #364: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:771:
>>> +		*wqi++ = child->ring->tail / sizeof(u64);
>>>    		^
>> This seems like a bogus warning.
>>
> Agree.
>
>>> total: 0 errors, 0 warnings, 1 checks, 570 lines checked
>>> 466c01457dec drm/i915/guc: Insert submit fences between requests in parent-child relationship
>>> 2ece815c1f18 drm/i915/guc: Implement multi-lrc reset
>>> 7add5784199f drm/i915/guc: Update debugfs for GuC multi-lrc
>>> -:23: CHECK:LINE_SPACING: Please don't use multiple blank lines
>>> #23: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3707:
>>> +
>> This should be fixed.
>>
> Done.
>   
>>> total: 0 errors, 0 warnings, 1 checks, 67 lines checked
>>> 966991d7bbed drm/i915: Fix bug in user proto-context creation that leaked contexts
>>> 0eb3d3bf0c84 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
>>> 68c6596b649a drm/i915/doc: Update parallel submit doc to point to i915_drm.h
>>> -:13: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
>>> #13:
>>> deleted file mode 100644
>>>
>>> total: 0 errors, 1 warnings, 0 checks, 10 lines checked
>>> 8290f5d15ca2 drm/i915/guc: Add basic GuC multi-lrc selftest
>>> -:22: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
>>> #22:
>>> new file mode 100644
>> These two can be ignored.
> Agree.
>
>>> total: 0 errors, 1 warnings, 0 checks, 190 lines checked
>>> ade3768c42d5 drm/i915/guc: Implement no mid batch preemption for multi-lrc
>>> 57882939d788 drm/i915: Multi-BB execbuf
>>> -:369: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
>>> #369: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1854:
>>> +#define for_each_batch_create_order(_eb, _i) \
>>> +	for (_i = 0; _i < (_eb)->num_batches; ++_i)
>> Again, not sure what the 'reuse' comment means, but should it also use '(_i)'?
>>
> I haven't been able to figure out how to fix these ones. I think you
> only need () if you deref the variable.
The () is to prevent any kind of operator precedence confusion when 
passing in something more exciting than a simple variable. Doesn't have 
to be a deref, it could be any operator. Granted, extremely unlikely for 
this particular macro but generally good practice just in case. E.g. 
someone passes in weird things like 'a, func()' as '_i'.

John.

>   
>>> -:371: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
>>> #371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
>>> +#define for_each_batch_add_order(_eb, _i) \
>>> +	BUILD_BUG_ON(!typecheck(int, _i)); \
>>> +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
>> This seems bogus. Wrapping it in a do/while will break the purpose!
>>
> Right. I added the BUILD_BUG_ON here because I did have a bug where I used
> an unsigned variable with this macro, and that breaks the macro.
>
> Matt
>
>>> -:371: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
>>> #371: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1856:
>>> +#define for_each_batch_add_order(_eb, _i) \
>>> +	BUILD_BUG_ON(!typecheck(int, _i)); \
>>> +	for (_i = (_eb)->num_batches - 1; _i >= 0; --_i)
>> As above.
>>
>>> total: 1 errors, 0 warnings, 2 checks, 1298 lines checked
>>> 28b699ece289 drm/i915/guc: Handle errors in multi-lrc requests
>>> 962e6b3dce59 drm/i915: Make request conflict tracking understand parallel submits
>>> 368ab12f5205 drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences
>>> b52570f01859 drm/i915: Enable multi-bb execbuf
>>> 8766155832d7 drm/i915/execlists: Weak parallel submission support for execlists
>>>
>>>


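The unsigned-counter failure Matt mentions is easy to see in a standalone
sketch (eb and process_batch() are illustrative stand-ins, not driver code):
with an unsigned index the reverse loop's 'i >= 0' test is always true, so
the counter wraps and the loop never terminates, which is exactly what the
BUILD_BUG_ON(!typecheck(int, _i)) in the macro rejects at compile time:

	#include <linux/build_bug.h>
	#include <linux/typecheck.h>

	unsigned int i;
	int j;

	/* BROKEN: when i reaches 0, --i wraps to UINT_MAX instead of
	 * going negative, so 'i >= 0' never becomes false.
	 */
	for (i = eb->num_batches - 1; i >= 0; --i)
		process_batch(eb, i);

	/* OK: j goes to -1 and the loop terminates as intended, which is
	 * why the macro insists on a signed int via typecheck().
	 */
	for (j = eb->num_batches - 1; j >= 0; --j)
		process_batch(eb, j);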

* Re: [PATCH 23/26] drm/i915: Make request conflict tracking understand parallel submits
  2021-10-13 17:51       ` [Intel-gfx] " Matthew Brost
@ 2021-10-13 19:25         ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-13 19:25 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On 10/13/2021 10:51, Matthew Brost wrote:
> On Tue, Oct 12, 2021 at 03:08:05PM -0700, John Harrison wrote:
>> On 10/4/2021 15:06, Matthew Brost wrote:
>>> If an object in the excl or shared slot is a composite fence from a
>>> parallel submit and the current request in the conflict tracking is from
>>> the same parallel context there is no need to enforce ordering as the
>>> ordering already implicit. Make the request conflict tracking understand
>> ordering already -> ordering is already
>>
>>> this by comparing the parents parallel fence values and skipping the
>> parents -> parent's
>>
>>> conflict insertion if the values match.
>> Presumably, this is to cope with the fact that the parallel submit fences do
>> not look like regular submission fences. And hence the existing code that
>> says 'new fence belongs to same context as old fence, so safe to ignore'
>> does not work with parallel submission. However, this change does not appear
>> to be adding parallel submit support to an existing 'same context' check. It
>> seems to be a brand new check that does not exist for single submission.
>> What makes parallel submit different? If we aren't skipping same context
>> fences for single submits, why do we need it for parallel? Conversely, if we
>> need it for parallel then why don't we need it for single?
>>
>> And if the single submission version is simply somewhere else in the code,
>> why do the parallel version here instead of at the same place?
>>
>> John.
>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_request.c | 43 +++++++++++++++++++----------
>>>    1 file changed, 29 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
>>> index e9bfa32f9270..cf89624020ad 100644
>>> --- a/drivers/gpu/drm/i915/i915_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_request.c
>>> @@ -1325,6 +1325,25 @@ i915_request_await_external(struct i915_request *rq, struct dma_fence *fence)
>>>    	return err;
>>>    }
>>> +static inline bool is_parallel_rq(struct i915_request *rq)
>>> +{
>>> +	return intel_context_is_parallel(rq->context);
>>> +}
>>> +
>>> +static inline struct intel_context *request_to_parent(struct i915_request *rq)
>>> +{
>>> +	return intel_context_to_parent(rq->context);
>>> +}
>>> +
>>> +static bool is_same_parallel_context(struct i915_request *to,
>>> +				     struct i915_request *from)
>>> +{
>>> +	if (is_parallel_rq(to))
>> Should this not say '&& is_parallel_rq(from)'?
>>
> Missed this one. That isn't necessary: if 'from' is not a parallel
> submit, the following compare of parents will always return false. I
> could add it if you insist, as either way works.
>
> Matt
It was more a question of whether req_to_parent() works fine
irrespective of whether the rq is a parent, child or single context.

John.

>
>>> +		return request_to_parent(to) == request_to_parent(from);
>>> +
>>> +	return false;
>>> +}
>>> +
>>>    int
>>>    i915_request_await_execution(struct i915_request *rq,
>>>    			     struct dma_fence *fence)
>>> @@ -1356,11 +1375,14 @@ i915_request_await_execution(struct i915_request *rq,
>>>    		 * want to run our callback in all cases.
>>>    		 */
>>> -		if (dma_fence_is_i915(fence))
>>> +		if (dma_fence_is_i915(fence)) {
>>> +			if (is_same_parallel_context(rq, to_request(fence)))
>>> +				continue;
>>>    			ret = __i915_request_await_execution(rq,
>>>    							     to_request(fence));
>>> -		else
>>> +		} else {
>>>    			ret = i915_request_await_external(rq, fence);
>>> +		}
>>>    		if (ret < 0)
>>>    			return ret;
>>>    	} while (--nchild);
>>> @@ -1461,10 +1483,13 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence)
>>>    						 fence))
>>>    			continue;
>>> -		if (dma_fence_is_i915(fence))
>>> +		if (dma_fence_is_i915(fence)) {
>>> +			if (is_same_parallel_context(rq, to_request(fence)))
>>> +				continue;
>>>    			ret = i915_request_await_request(rq, to_request(fence));
>>> -		else
>>> +		} else {
>>>    			ret = i915_request_await_external(rq, fence);
>>> +		}
>>>    		if (ret < 0)
>>>    			return ret;
>>> @@ -1539,16 +1564,6 @@ i915_request_await_object(struct i915_request *to,
>>>    	return ret;
>>>    }
>>> -static inline bool is_parallel_rq(struct i915_request *rq)
>>> -{
>>> -	return intel_context_is_parallel(rq->context);
>>> -}
>>> -
>>> -static inline struct intel_context *request_to_parent(struct i915_request *rq)
>>> -{
>>> -	return intel_context_to_parent(rq->context);
>>> -}
>>> -
>>>    static struct i915_request *
>>>    __i915_request_ensure_parallel_ordering(struct i915_request *rq,
>>>    					struct intel_timeline *timeline)


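On the req_to_parent() question: one plausible shape for the helper (a sketch
inferred from how it is used in the series; the 'parallel.parent' field name
is an assumption, not confirmed by the quoted code) handles parent, child and
single contexts uniformly:

	static inline struct intel_context *
	intel_context_to_parent(struct intel_context *ce)
	{
		/* A child resolves to its parent; a parent or an ordinary
		 * single context resolves to itself.
		 */
		if (intel_context_is_child(ce))
			return ce->parallel.parent;
		else
			return ce;
	}

With that shape, request_to_parent(to) == request_to_parent(from) compares
the shared parent for two requests of the same parallel submit, and simply
compares the contexts themselves for single submits, so the helper is safe
regardless of which kind of request it is given.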


* Re: [PATCH 23/26] drm/i915: Make request conflict tracking understand parallel submits
  2021-10-13  0:32       ` [Intel-gfx] " Matthew Brost
@ 2021-10-13 19:35         ` John Harrison
  -1 siblings, 0 replies; 165+ messages in thread
From: John Harrison @ 2021-10-13 19:35 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio

On 10/12/2021 17:32, Matthew Brost wrote:
> On Tue, Oct 12, 2021 at 03:08:05PM -0700, John Harrison wrote:
>> On 10/4/2021 15:06, Matthew Brost wrote:
>>> If an object in the excl or shared slot is a composite fence from a
>>> parallel submit and the current request in the conflict tracking is from
>>> the same parallel context there is no need to enforce ordering as the
>>> ordering already implicit. Make the request conflict tracking understand
>> ordering already -> ordering is already
>>
> Yep.
>
>>> this by comparing the parents parallel fence values and skipping the
>> parents -> parent's
>>
> Yep.
>
>>> conflict insertion if the values match.
>> Presumably, this is to cope with the fact that the parallel submit fences do
>> not look like regular submission fences. And hence the existing code that
>> says 'new fence belongs to same context as old fence, so safe to ignore'
>> does not work with parallel submission. However, this change does not appear
> Yes. The check for 'if (fence->context == rq->fence.context)' doesn't
> work with parallel submission, as each rq->fence.context corresponds to
> a timeline. With parallel submission each intel_context in the parallel
> submit has its own timeline (seqno), so the compare fails for different
> intel_contexts within the same parallel submit. This is the reason for
> the additional compare on parallel submits' parents: if they have the
> same parent, it is the same parallel submission and there is no need to
> enforce additional ordering.
>
>> to be adding parallel submit support to an existing 'same context' check. It
>> seems to be a brand new check that does not exist for single submission.
>> What makes parallel submit different? If we aren't skipping same context
>> fences for single submits, why do we need it for parallel? Conversely, if we
>> need it for parallel then why don't we need it for single?
>>
> I'm confused by what you are asking here. The existing same-context
> check is fine for parallel submits - it will just return true when we
> compare requests with the same intel_context, and the new additional
> check is only true for parallel submissions with the same parent.
>
>> And if the single submission version is simply somewhere else in the code,
>> why do the parallel version here instead of at the same place?
>>
> Again I'm confused by what you are asking. We might just need to sync on
> a quick call.
That's okay. I think I had partly confused myself ;).

I was just meaning that the parallel-compliant version of the 'ctxtA ==
ctxtB -> skip' test should be coded adjacent to the single-submission
version of the same test. I had somehow completely missed that the
single-submission version is indeed the line above in
i915_request_await_execution(). So the two are indeed right next to
each other.

It's all good :).

John.


>
> Matt
>   
>> John.
>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_request.c | 43 +++++++++++++++++++----------
>>>    1 file changed, 29 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
>>> index e9bfa32f9270..cf89624020ad 100644
>>> --- a/drivers/gpu/drm/i915/i915_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_request.c
>>> @@ -1325,6 +1325,25 @@ i915_request_await_external(struct i915_request *rq, struct dma_fence *fence)
>>>    	return err;
>>>    }
>>> +static inline bool is_parallel_rq(struct i915_request *rq)
>>> +{
>>> +	return intel_context_is_parallel(rq->context);
>>> +}
>>> +
>>> +static inline struct intel_context *request_to_parent(struct i915_request *rq)
>>> +{
>>> +	return intel_context_to_parent(rq->context);
>>> +}
>>> +
>>> +static bool is_same_parallel_context(struct i915_request *to,
>>> +				     struct i915_request *from)
>>> +{
>>> +	if (is_parallel_rq(to))
>> Should this not say '&& is_parallel_rq(from)'?
>>
>>> +		return request_to_parent(to) == request_to_parent(from);
>>> +
>>> +	return false;
>>> +}
>>> +
>>>    int
>>>    i915_request_await_execution(struct i915_request *rq,
>>>    			     struct dma_fence *fence)
>>> @@ -1356,11 +1375,14 @@ i915_request_await_execution(struct i915_request *rq,
>>>    		 * want to run our callback in all cases.
>>>    		 */
>>> -		if (dma_fence_is_i915(fence))
>>> +		if (dma_fence_is_i915(fence)) {
>>> +			if (is_same_parallel_context(rq, to_request(fence)))
>>> +				continue;
>>>    			ret = __i915_request_await_execution(rq,
>>>    							     to_request(fence));
>>> -		else
>>> +		} else {
>>>    			ret = i915_request_await_external(rq, fence);
>>> +		}
>>>    		if (ret < 0)
>>>    			return ret;
>>>    	} while (--nchild);
>>> @@ -1461,10 +1483,13 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence)
>>>    						 fence))
>>>    			continue;
>>> -		if (dma_fence_is_i915(fence))
>>> +		if (dma_fence_is_i915(fence)) {
>>> +			if (is_same_parallel_context(rq, to_request(fence)))
>>> +				continue;
>>>    			ret = i915_request_await_request(rq, to_request(fence));
>>> -		else
>>> +		} else {
>>>    			ret = i915_request_await_external(rq, fence);
>>> +		}
>>>    		if (ret < 0)
>>>    			return ret;
>>> @@ -1539,16 +1564,6 @@ i915_request_await_object(struct i915_request *to,
>>>    	return ret;
>>>    }
>>> -static inline bool is_parallel_rq(struct i915_request *rq)
>>> -{
>>> -	return intel_context_is_parallel(rq->context);
>>> -}
>>> -
>>> -static inline struct intel_context *request_to_parent(struct i915_request *rq)
>>> -{
>>> -	return intel_context_to_parent(rq->context);
>>> -}
>>> -
>>>    static struct i915_request *
>>>    __i915_request_ensure_parallel_ordering(struct i915_request *rq,
>>>    					struct intel_timeline *timeline)


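Putting the two checks side by side makes John's conclusion concrete: in
i915_request_await_execution() the skip logic ends up reading roughly as
below (condensed from the quoted patch plus the pre-existing same-context
test Matt cites; not a verbatim extract):

	do {
		/* ... fence = next component of the (possibly composite) input ... */

		/* Single submit: same timeline, so ordering is already implicit. */
		if (fence->context == rq->fence.context)
			continue;

		if (dma_fence_is_i915(fence)) {
			/* Parallel submit: same parent, ordering is already
			 * implicit even though the timelines differ.
			 */
			if (is_same_parallel_context(rq, to_request(fence)))
				continue;

			ret = __i915_request_await_execution(rq, to_request(fence));
		} else {
			ret = i915_request_await_external(rq, fence);
		}
		if (ret < 0)
			return ret;
	} while (--nchild);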


end of thread

Thread overview: 165+ messages
2021-10-04 22:06 [PATCH 00/26] Parallel submission aka multi-bb execbuf Matthew Brost
2021-10-04 22:06 ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 01/26] drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-07  3:06   ` John Harrison
2021-10-07  3:06     ` [Intel-gfx] " John Harrison
2021-10-07 15:05     ` Matthew Brost
2021-10-07 15:05       ` [Intel-gfx] " Matthew Brost
2021-10-07 18:13       ` John Harrison
2021-10-07 18:13         ` [Intel-gfx] " John Harrison
2021-10-04 22:06 ` [PATCH 02/26] drm/i915/guc: Take GT PM ref when deregistering context Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-07  3:37   ` John Harrison
2021-10-07  3:37     ` [Intel-gfx] " John Harrison
2021-10-08  1:28     ` Matthew Brost
2021-10-08  1:28       ` [Intel-gfx] " Matthew Brost
2021-10-08 18:23     ` Matthew Brost
2021-10-08 18:23       ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [Intel-gfx] [PATCH 03/26] drm/i915/guc: Take engine PM when a context is pinned with GuC submission Matthew Brost
2021-10-04 22:06   ` Matthew Brost
2021-10-07  3:45   ` John Harrison
2021-10-07  3:45     ` [Intel-gfx] " John Harrison
2021-10-07 15:19     ` Matthew Brost
2021-10-07 15:19       ` [Intel-gfx] " Matthew Brost
2021-10-07 18:15       ` John Harrison
2021-10-07 18:15         ` [Intel-gfx] " John Harrison
2021-10-08  1:23         ` Matthew Brost
2021-10-08  1:23           ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 04/26] drm/i915/guc: Don't call switch_to_kernel_context " Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-07  3:49   ` John Harrison
2021-10-07  3:49     ` [Intel-gfx] " John Harrison
2021-10-04 22:06 ` [Intel-gfx] [PATCH 05/26] drm/i915: Add logical engine mapping Matthew Brost
2021-10-04 22:06   ` Matthew Brost
2021-10-07 19:03   ` John Harrison
2021-10-07 19:03     ` [Intel-gfx] " John Harrison
2021-10-04 22:06 ` [PATCH 06/26] drm/i915: Expose logical engine instance to user Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 07/26] drm/i915/guc: Introduce context parent-child relationship Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-07 19:35   ` John Harrison
2021-10-07 19:35     ` [Intel-gfx] " John Harrison
2021-10-08 18:33     ` Matthew Brost
2021-10-08 18:33       ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 08/26] drm/i915/guc: Add multi-lrc context registration Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-07 19:50   ` John Harrison
2021-10-07 19:50     ` [Intel-gfx] " John Harrison
2021-10-08  1:31     ` Matthew Brost
2021-10-08  1:31       ` [Intel-gfx] " Matthew Brost
2021-10-08 17:20     ` John Harrison
2021-10-08 17:29       ` Matthew Brost
2021-10-04 22:06 ` [PATCH 09/26] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-07 20:23   ` John Harrison
2021-10-07 20:23     ` [Intel-gfx] " John Harrison
2021-10-04 22:06 ` [PATCH 10/26] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-07 22:03   ` John Harrison
2021-10-07 22:03     ` [Intel-gfx] " John Harrison
2021-10-08  1:21     ` Matthew Brost
2021-10-08  1:21       ` [Intel-gfx] " Matthew Brost
2021-10-08 16:40       ` John Harrison
2021-10-08 16:40         ` [Intel-gfx] " John Harrison
2021-10-13 18:03         ` Matthew Brost
2021-10-13 18:03           ` [Intel-gfx] " Matthew Brost
2021-10-13 19:11           ` John Harrison
2021-10-13 19:11             ` [Intel-gfx] " John Harrison
2021-10-04 22:06 ` [Intel-gfx] [PATCH 11/26] drm/i915/guc: Implement parallel context pin / unpin functions Matthew Brost
2021-10-04 22:06   ` Matthew Brost
2021-10-04 22:06 ` [Intel-gfx] [PATCH 12/26] drm/i915/guc: Implement multi-lrc submission Matthew Brost
2021-10-04 22:06   ` Matthew Brost
2021-10-05  7:55   ` [Intel-gfx] " kernel test robot
2021-10-05  7:55     ` kernel test robot
2021-10-05 10:37   ` kernel test robot
2021-10-05 10:37     ` kernel test robot
2021-10-08 17:20   ` John Harrison
2021-10-08 17:20     ` [Intel-gfx] " John Harrison
2021-10-13 18:24     ` Matthew Brost
2021-10-13 18:24       ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 13/26] drm/i915/guc: Insert submit fences between requests in parent-child relationship Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 14/26] drm/i915/guc: Implement multi-lrc reset Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-08 17:39   ` John Harrison
2021-10-08 17:39     ` [Intel-gfx] " John Harrison
2021-10-08 17:56     ` Matthew Brost
2021-10-08 17:56       ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 15/26] drm/i915/guc: Update debugfs for GuC multi-lrc Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-08 17:46   ` John Harrison
2021-10-08 17:46     ` [Intel-gfx] " John Harrison
2021-10-04 22:06 ` [PATCH 16/26] drm/i915: Fix bug in user proto-context creation that leaked contexts Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-08 17:49   ` John Harrison
2021-10-08 17:49     ` [Intel-gfx] " John Harrison
2021-10-04 22:06 ` [PATCH 17/26] drm/i915/guc: Connect UAPI to GuC multi-lrc interface Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-11 22:09   ` John Harrison
2021-10-11 22:09     ` [Intel-gfx] " John Harrison
2021-10-11 22:59     ` Matthew Brost
2021-10-11 22:59       ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 18/26] drm/i915/doc: Update parallel submit doc to point to i915_drm.h Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 19/26] drm/i915/guc: Add basic GuC multi-lrc selftest Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 20/26] drm/i915/guc: Implement no mid batch preemption for multi-lrc Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-11 23:32   ` John Harrison
2021-10-11 23:32     ` [Intel-gfx] " John Harrison
2021-10-13  1:52     ` Matthew Brost
2021-10-13  1:52       ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 21/26] drm/i915: Multi-BB execbuf Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-05  8:31   ` kernel test robot
2021-10-05  8:31     ` kernel test robot
2021-10-05 17:02   ` Matthew Brost
2021-10-06 20:46   ` Matthew Brost
2021-10-12 21:22   ` John Harrison
2021-10-12 21:22     ` [Intel-gfx] " John Harrison
2021-10-13  0:37     ` Matthew Brost
2021-10-13  0:37       ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 22/26] drm/i915/guc: Handle errors in multi-lrc requests Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-12 21:56   ` John Harrison
2021-10-12 21:56     ` [Intel-gfx] " John Harrison
2021-10-13  0:18     ` Matthew Brost
2021-10-13  0:18       ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 23/26] drm/i915: Make request conflict tracking understand parallel submits Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-12 22:08   ` John Harrison
2021-10-12 22:08     ` [Intel-gfx] " John Harrison
2021-10-13  0:32     ` Matthew Brost
2021-10-13  0:32       ` [Intel-gfx] " Matthew Brost
2021-10-13 19:35       ` John Harrison
2021-10-13 19:35         ` [Intel-gfx] " John Harrison
2021-10-13 17:51     ` Matthew Brost
2021-10-13 17:51       ` [Intel-gfx] " Matthew Brost
2021-10-13 19:25       ` John Harrison
2021-10-13 19:25         ` [Intel-gfx] " John Harrison
2021-10-04 22:06 ` [PATCH 24/26] drm/i915: Update I915_GEM_BUSY IOCTL to understand composite fences Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-11 22:15   ` Daniele Ceraolo Spurio
2021-10-11 22:15     ` [Intel-gfx] " Daniele Ceraolo Spurio
2021-10-12  7:53   ` Tvrtko Ursulin
2021-10-12 18:31     ` Matthew Brost
2021-10-04 22:06 ` [PATCH 25/26] drm/i915: Enable multi-bb execbuf Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-04 22:06 ` [PATCH 26/26] drm/i915/execlists: Weak parallel submission support for execlists Matthew Brost
2021-10-04 22:06   ` [Intel-gfx] " Matthew Brost
2021-10-04 22:21 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev4) Patchwork
2021-10-12 22:15   ` John Harrison
2021-10-13  0:15     ` Matthew Brost
2021-10-13 19:24       ` John Harrison
2021-10-04 22:23 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2021-10-04 22:26 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
2021-10-12 22:15   ` John Harrison
2021-10-13  0:12     ` Matthew Brost
2021-10-04 22:54 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2021-10-05  1:49 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev5) Patchwork
2021-10-05  1:51 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2021-10-05  1:54 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
2021-10-05  2:21 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2021-10-12 18:11 ` [PATCH 02/26] drm/i915/guc: Take GT PM ref when deregistering context Matthew Brost
2021-10-12 18:11   ` [Intel-gfx] " Matthew Brost
