* [PATCH 00/27] Parallel submission aka multi-bb execbuf
From: Matthew Brost @ 2021-08-20 22:44 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

As discussed in [1] we are introducing a new parallel submission uAPI
for the i915 which allows more than one BB to be submitted in a single
execbuf IOCTL. This is the implementation for both GuC and execlists.
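
For reference, below is a minimal userspace sketch of configuring a
2-wide parallel engine, assuming the i915_context_engines_parallel_submit
extension as proposed in [1] and added to i915_drm.h by this series; fd,
ctx_id, and the engine class / instances are placeholders:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int setup_parallel_engine(int fd, uint32_t ctx_id)
{
	/*
	 * Sketch only: slot 0 of the context's engine map becomes a 2-wide
	 * parallel engine (width = 2, num_siblings = 1) spanning two video
	 * engine instances.
	 */
	struct {
		struct i915_context_engines_parallel_submit p;
		struct i915_engine_class_instance engines[2]; /* width * num_siblings */
	} parallel;
	struct {
		struct i915_context_param_engines base;
		struct i915_engine_class_instance engines[1]; /* one slot in the map */
	} map;
	struct drm_i915_gem_context_param param;

	memset(&parallel, 0, sizeof(parallel));
	parallel.p.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT;
	parallel.p.engine_index = 0;  /* slot that becomes the parallel engine */
	parallel.p.width = 2;         /* two BBs per execbuf */
	parallel.p.num_siblings = 1;  /* one physical engine per column */
	parallel.engines[0].engine_class = I915_ENGINE_CLASS_VIDEO;
	parallel.engines[0].engine_instance = 0;
	parallel.engines[1].engine_class = I915_ENGINE_CLASS_VIDEO;
	parallel.engines[1].engine_instance = 1;

	memset(&map, 0, sizeof(map));
	map.base.extensions = (uintptr_t)&parallel;
	/* Slot 0 is a placeholder, replaced by the parallel extension above. */
	map.engines[0].engine_class = I915_ENGINE_CLASS_INVALID;
	map.engines[0].engine_instance = I915_ENGINE_CLASS_INVALID_NONE;

	memset(&param, 0, sizeof(param));
	param.ctx_id = ctx_id;
	param.param = I915_CONTEXT_PARAM_ENGINES;
	param.size = sizeof(map);
	param.value = (uintptr_t)&map;

	return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);
}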

In addition to the selftests in the series, an IGT is available,
implemented in the first 4 patches of [2].

Media UMD changes to land soon.

The first patch in the series is a squashed patch of [3] and does not
need to be reviewed here.

The execbuf IOCTL changes have been done in a single large patch (#24)
as all the changes flow together and I believe a single patch will be
better if someone has to look up this change in the future. It can be
split into a series of smaller patches if desired.
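
To illustrate the new flow, a sketch of a 2-BB submission against the
parallel engine configured above, assuming the BBs occupy the first N
slots of the object array when I915_EXEC_BATCH_FIRST is set (the last N
otherwise) and that N is implied by the width of the parallel context;
bb0/bb1 are placeholder GEM handles and the headers are the same as in
the earlier sketch:

static int submit_parallel(int fd, uint32_t ctx_id,
			   uint32_t bb0, uint32_t bb1)
{
	struct drm_i915_gem_exec_object2 objs[2];
	struct drm_i915_gem_execbuffer2 execbuf;

	memset(objs, 0, sizeof(objs));
	objs[0].handle = bb0;	/* BB for the parent context */
	objs[1].handle = bb1;	/* BB for the child context */

	memset(&execbuf, 0, sizeof(execbuf));
	execbuf.buffers_ptr = (uintptr_t)objs;
	execbuf.buffer_count = 2;
	execbuf.flags = I915_EXEC_BATCH_FIRST;	/* engine slot 0 */
	execbuf.rsvd1 = ctx_id;	/* context with the parallel engine map */

	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}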

This code is available in a public repo [4] for UMD teams to test their
code on.

v2: Drop the complicated state machine that blocked in the kernel if no
guc_ids were available, perma-pin parallel contexts, rework the execbuf
IOCTL to be a series of loops inside the IOCTL rather than 1 large one
on the outside, address Daniel Vetter's comments, rebase on [3]

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

[1] https://patchwork.freedesktop.org/series/92028/
[2] https://patchwork.freedesktop.org/series/93071/
[3] https://patchwork.freedesktop.org/series/93704/
[4] https://gitlab.freedesktop.org/mbrost/mbrost-drm-intel/-/tree/drm-intel-parallel

Matthew Brost (27):
  drm/i915/guc: Squash Clean up GuC CI failures, simplify locking, and
    kernel DOC
  drm/i915/guc: Allow flexible number of context ids
  drm/i915/guc: Connect the number of guc_ids to debugfs
  drm/i915/guc: Take GT PM ref when deregistering context
  drm/i915: Add GT PM unpark worker
  drm/i915/guc: Take engine PM when a context is pinned with GuC
    submission
  drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  drm/i915: Add logical engine mapping
  drm/i915: Expose logical engine instance to user
  drm/i915/guc: Introduce context parent-child relationship
  drm/i915/guc: Implement parallel context pin / unpin functions
  drm/i915/guc: Add multi-lrc context registration
  drm/i915/guc: Ensure GuC schedule operations do not operate on child
    contexts
  drm/i915/guc: Assign contexts in parent-child relationship consecutive
    guc_ids
  drm/i915/guc: Implement multi-lrc submission
  drm/i915/guc: Insert submit fences between requests in parent-child
    relationship
  drm/i915/guc: Implement multi-lrc reset
  drm/i915/guc: Update debugfs for GuC multi-lrc
  drm/i915: Fix bug in user proto-context creation that leaked contexts
  drm/i915/guc: Connect UAPI to GuC multi-lrc interface
  drm/i915/doc: Update parallel submit doc to point to i915_drm.h
  drm/i915/guc: Add basic GuC multi-lrc selftest
  drm/i915/guc: Implement no mid batch preemption for multi-lrc
  drm/i915: Multi-BB execbuf
  drm/i915/guc: Handle errors in multi-lrc requests
  drm/i915: Enable multi-bb execbuf
  drm/i915/execlists: Weak parallel submission support for execlists

 Documentation/gpu/rfc/i915_parallel_execbuf.h |  122 -
 Documentation/gpu/rfc/i915_scheduler.rst      |    4 +-
 drivers/gpu/drm/i915/Makefile                 |    1 +
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  222 +-
 .../gpu/drm/i915/gem/i915_gem_context_types.h |    6 +
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  765 ++++--
 drivers/gpu/drm/i915/gt/intel_context.c       |   58 +-
 drivers/gpu/drm/i915/gt/intel_context.h       |   52 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  148 +-
 drivers/gpu/drm/i915/gt/intel_engine.h        |   12 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   66 +-
 drivers/gpu/drm/i915/gt/intel_engine_pm.c     |    9 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   20 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |    5 +
 .../drm/i915/gt/intel_execlists_submission.c  |   68 +-
 drivers/gpu/drm/i915/gt/intel_gt.c            |    3 +
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         |    8 +
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |   23 +
 .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c |   35 +
 .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h |   40 +
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |   10 +
 drivers/gpu/drm/i915/gt/intel_lrc.c           |    7 +
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |   12 +-
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |    6 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |   29 +-
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |    1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.c        |   21 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   61 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |    2 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |   30 +-
 .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |   36 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   10 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 2312 +++++++++++++----
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     |  126 +
 .../drm/i915/gt/uc/selftest_guc_multi_lrc.c   |  180 ++
 drivers/gpu/drm/i915/i915_gpu_error.c         |   39 +-
 drivers/gpu/drm/i915/i915_query.c             |    2 +
 drivers/gpu/drm/i915/i915_request.c           |  120 +-
 drivers/gpu/drm/i915/i915_request.h           |   40 +-
 drivers/gpu/drm/i915/i915_trace.h             |   12 +-
 drivers/gpu/drm/i915/i915_vma.c               |   21 +-
 drivers/gpu/drm/i915/i915_vma.h               |   13 +-
 drivers/gpu/drm/i915/intel_wakeref.c          |    5 +
 drivers/gpu/drm/i915/intel_wakeref.h          |   13 +
 .../drm/i915/selftests/i915_live_selftests.h  |    2 +
 drivers/gpu/drm/i915/selftests/i915_request.c |  100 +
 .../i915/selftests/intel_scheduler_helpers.c  |   12 +
 .../i915/selftests/intel_scheduler_helpers.h  |    2 +
 include/uapi/drm/i915_drm.h                   |  136 +-
 49 files changed, 3928 insertions(+), 1099 deletions(-)
 delete mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
 create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
 create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c

-- 
2.32.0



* [PATCH 01/27] drm/i915/guc: Squash Clean up GuC CI failures, simplify locking, and kernel DOC
From: Matthew Brost @ 2021-08-20 22:44 UTC
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

https://patchwork.freedesktop.org/series/93704/

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |  19 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  81 +-
 .../drm/i915/gt/intel_execlists_submission.c  |   4 -
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   6 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  29 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  19 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |   6 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 952 +++++++++++-------
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 +++
 drivers/gpu/drm/i915/i915_gpu_error.c         |  39 +-
 drivers/gpu/drm/i915/i915_request.h           |  23 +-
 drivers/gpu/drm/i915/i915_trace.h             |  12 +-
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 drivers/gpu/drm/i915/selftests/i915_request.c | 100 ++
 .../i915/selftests/intel_scheduler_helpers.c  |  12 +
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 16 files changed, 965 insertions(+), 466 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 745e84c72c90..adfe49b53b1b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -394,19 +394,18 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 	spin_lock_init(&ce->guc_state.lock);
 	INIT_LIST_HEAD(&ce->guc_state.fences);
+	INIT_LIST_HEAD(&ce->guc_state.requests);
 
-	spin_lock_init(&ce->guc_active.lock);
-	INIT_LIST_HEAD(&ce->guc_active.requests);
-
-	ce->guc_id = GUC_INVALID_LRC_ID;
-	INIT_LIST_HEAD(&ce->guc_id_link);
+	ce->guc_id.id = GUC_INVALID_LRC_ID;
+	INIT_LIST_HEAD(&ce->guc_id.link);
 
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
 	 */
-	i915_sw_fence_init(&ce->guc_blocked, sw_fence_dummy_notify);
-	i915_sw_fence_commit(&ce->guc_blocked);
+	i915_sw_fence_init(&ce->guc_state.blocked_fence,
+			   sw_fence_dummy_notify);
+	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
 
 	i915_active_init(&ce->active,
 			 __intel_context_active, __intel_context_retire, 0);
@@ -520,15 +519,15 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 
 	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
 
-	spin_lock_irqsave(&ce->guc_active.lock, flags);
-	list_for_each_entry_reverse(rq, &ce->guc_active.requests,
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
 				    sched.link) {
 		if (i915_request_completed(rq))
 			break;
 
 		active = rq;
 	}
-	spin_unlock_irqrestore(&ce->guc_active.lock, flags);
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	return active;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index e54351a170e2..80bbdc7810f6 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -112,6 +112,7 @@ struct intel_context {
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
 #define CONTEXT_NOPREEMPT		8
 #define CONTEXT_LRCA_DIRTY		9
+#define CONTEXT_GUC_INIT		10
 
 	struct {
 		u64 timeout_us;
@@ -155,49 +156,73 @@ struct intel_context {
 	u8 wa_bb_page; /* if set, page num reserved for context workarounds */
 
 	struct {
-		/** lock: protects everything in guc_state */
+		/** @lock: protects everything in guc_state */
 		spinlock_t lock;
 		/**
-		 * sched_state: scheduling state of this context using GuC
+		 * @sched_state: scheduling state of this context using GuC
 		 * submission
 		 */
-		u16 sched_state;
+		u32 sched_state;
 		/*
-		 * fences: maintains of list of requests that have a submit
-		 * fence related to GuC submission
+		 * @fences: maintains a list of requests that are currently
+		 * being fenced until a GuC operation completes
 		 */
 		struct list_head fences;
+		/**
+		 * @blocked_fence: fence used to signal when the blocking of a
+		 * context's submissions is complete.
+		 */
+		struct i915_sw_fence blocked_fence;
+		/** @number_committed_requests: number of committed requests */
+		int number_committed_requests;
+		/** @requests: list of active requests on this context */
+		struct list_head requests;
+		/** @prio: the context's current GuC priority */
+		u8 prio;
+		/**
+		 * @prio_count: a counter of the number of requests inflight
+		 * in each priority bucket
+		 */
+		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
 	} guc_state;
 
 	struct {
-		/** lock: protects everything in guc_active */
-		spinlock_t lock;
-		/** requests: active requests on this context */
-		struct list_head requests;
-	} guc_active;
-
-	/* GuC scheduling state flags that do not require a lock. */
-	atomic_t guc_sched_state_no_lock;
-
-	/* GuC LRC descriptor ID */
-	u16 guc_id;
-
-	/* GuC LRC descriptor reference count */
-	atomic_t guc_id_ref;
+		/**
+		 * @id: unique handle which is used to communicate information
+		 * with the GuC about this context, protected by
+		 * guc->contexts_lock
+		 */
+		u16 id;
+		/**
+		 * @ref: the number of references to the guc_id, when
+		 * transitioning in and out of zero protected by
+		 * guc->contexts_lock
+		 */
+		atomic_t ref;
+		/**
+		 * @link: in guc->guc_id_list when the guc_id has no refs but is
+		 * still valid, protected by guc->contexts_lock
+		 */
+		struct list_head link;
+	} guc_id;
 
-	/*
-	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
+#ifdef CONFIG_DRM_I915_SELFTEST
+	/**
+	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
 	 */
-	struct list_head guc_id_link;
+	bool drop_schedule_enable;
 
-	/* GuC context blocked fence */
-	struct i915_sw_fence guc_blocked;
+	/**
+	 * @drop_schedule_disable: Force drop of schedule disable G2H for
+	 * selftest
+	 */
+	bool drop_schedule_disable;
 
-	/*
-	 * GuC priority management
+	/**
+	 * @drop_deregister: Force drop of deregister G2H for selftest
 	 */
-	u8 guc_prio;
-	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
+	bool drop_deregister;
+#endif
 };
 
 #endif /* __INTEL_CONTEXT_TYPES__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index de5f9c86b9a4..cafb0608ffb4 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2140,10 +2140,6 @@ static void __execlists_unhold(struct i915_request *rq)
 			if (p->flags & I915_DEPENDENCY_WEAK)
 				continue;
 
-			/* Propagate any change in error status */
-			if (rq->fence.error)
-				i915_request_set_error_once(w, rq->fence.error);
-
 			if (w->engine != rq->engine)
 				continue;
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 08f011f893b2..bf43bed905db 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -789,7 +789,7 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
 				if (err)
 					pr_err("[%s] Wait for request %lld:%lld [0x%04X] failed: %d!\n",
 					       engine->name, rq->fence.context,
-					       rq->fence.seqno, rq->context->guc_id, err);
+					       rq->fence.seqno, rq->context->guc_id.id, err);
 			}
 
 skip:
@@ -1098,7 +1098,7 @@ static int __igt_reset_engines(struct intel_gt *gt,
 				if (err)
 					pr_err("[%s] Wait for request %lld:%lld [0x%04X] failed: %d!\n",
 					       engine->name, rq->fence.context,
-					       rq->fence.seqno, rq->context->guc_id, err);
+					       rq->fence.seqno, rq->context->guc_id.id, err);
 			}
 
 			count++;
@@ -1108,7 +1108,7 @@ static int __igt_reset_engines(struct intel_gt *gt,
 					pr_err("i915_reset_engine(%s:%s): failed to reset request %lld:%lld [0x%04X]\n",
 					       engine->name, test_name,
 					       rq->fence.context,
-					       rq->fence.seqno, rq->context->guc_id);
+					       rq->fence.seqno, rq->context->guc_id.id);
 					i915_request_put(rq);
 
 					GEM_TRACE_DUMP();
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index b0977a3b699b..cdc6ae48a1e1 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
 	goto err_after;
 }
 
+static u32 safe_offset(u32 offset, u32 reg)
+{
+	/* XXX skip testing of watchdog - VLK-22772 */
+	if (offset == 0x178 || offset == 0x17c)
+		reg = 0;
+
+	return reg;
+}
+
+static int get_offset_mask(struct intel_engine_cs *engine)
+{
+	if (GRAPHICS_VER(engine->i915) < 12)
+		return 0xfff;
+
+	switch (engine->class) {
+	default:
+	case RENDER_CLASS:
+		return 0x07ff;
+	case COPY_ENGINE_CLASS:
+		return 0x0fff;
+	case VIDEO_DECODE_CLASS:
+	case VIDEO_ENHANCEMENT_CLASS:
+		return 0x3fff;
+	}
+}
+
 static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 {
 	struct i915_vma *batch;
@@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 		len = (len + 1) / 2;
 		*cs++ = MI_LOAD_REGISTER_IMM(len);
 		while (len--) {
-			*cs++ = hw[dw];
+			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
+					    hw[dw]);
 			*cs++ = poison;
 			dw += 2;
 		}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 2e27fe59786b..112dd29a63fe 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -41,6 +41,10 @@ struct intel_guc {
 	spinlock_t irq_lock;
 	unsigned int msg_enabled_mask;
 
+	/**
+	 * @outstanding_submission_g2h: number of outstanding G2H related to GuC
+	 * submission, used to determine if the GT is idle
+	 */
 	atomic_t outstanding_submission_g2h;
 
 	struct {
@@ -49,12 +53,16 @@ struct intel_guc {
 		void (*disable)(struct intel_guc *guc);
 	} interrupts;
 
-	/*
-	 * contexts_lock protects the pool of free guc ids and a linked list of
-	 * guc ids available to be stolen
+	/**
+	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
+	 * ce->guc_id.ref when transitioning in and out of zero
 	 */
 	spinlock_t contexts_lock;
+	/** @guc_ids: used to allocate new guc_ids */
 	struct ida guc_ids;
+	/**
+	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
+	 */
 	struct list_head guc_id_list;
 
 	bool submission_supported;
@@ -70,7 +78,10 @@ struct intel_guc {
 	struct i915_vma *lrc_desc_pool;
 	void *lrc_desc_pool_vaddr;
 
-	/* guc_id to intel_context lookup */
+	/**
+	 * @context_lookup: used to resolve intel_context from guc_id; if a
+	 * context is present in this structure it is registered with the GuC
+	 */
 	struct xarray context_lookup;
 
 	/* Control params for fw initialization */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 22b4733b55e2..20c710a74498 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -1042,9 +1042,9 @@ static void ct_incoming_request_worker_func(struct work_struct *w)
 		container_of(w, struct intel_guc_ct, requests.worker);
 	bool done;
 
-	done = ct_process_incoming_requests(ct);
-	if (!done)
-		queue_work(system_unbound_wq, &ct->requests.worker);
+	do {
+		done = ct_process_incoming_requests(ct);
+	} while (!done);
 }
 
 static int ct_handle_event(struct intel_guc_ct *ct, struct ct_incoming_msg *request)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 87d8dc8f51b9..46158d996bf6 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -28,21 +28,6 @@
 /**
  * DOC: GuC-based command submission
  *
- * IMPORTANT NOTE: GuC submission is currently not supported in i915. The GuC
- * firmware is moving to an updated submission interface and we plan to
- * turn submission back on when that lands. The below documentation (and related
- * code) matches the old submission model and will be updated as part of the
- * upgrade to the new flow.
- *
- * GuC stage descriptor:
- * During initialization, the driver allocates a static pool of 1024 such
- * descriptors, and shares them with the GuC. Currently, we only use one
- * descriptor. This stage descriptor lets the GuC know about the workqueue and
- * process descriptor. Theoretically, it also lets the GuC know about our HW
- * contexts (context ID, etc...), but we actually employ a kind of submission
- * where the GuC uses the LRCA sent via the work item instead. This is called
- * a "proxy" submission.
- *
  * The Scratch registers:
 * There are 16 MMIO-based registers starting from 0xC180. The kernel driver writes
  * a value to the action register (SOFT_SCRATCH_0) along with any data. It then
@@ -51,14 +36,82 @@
  * processes the request. The kernel driver polls waiting for this update and
  * then proceeds.
  *
- * Work Items:
- * There are several types of work items that the host may place into a
- * workqueue, each with its own requirements and limitations. Currently only
- * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
- * represents in-order queue. The kernel driver packs ring tail pointer and an
- * ELSP context descriptor dword into Work Item.
- * See guc_add_request()
+ * Command Transport buffers (CTBs):
+ * Covered in detail in other sections but CTBs (host-to-GuC, H2G, and
+ * GuC-to-host, G2H) are a message interface between the i915 and GuC used to
+ * control submissions.
+ *
+ * Context registration:
+ * Before a context can be submitted it must be registered with the GuC via a
+ * H2G. A unique guc_id is associated with each context. The context is either
+ * registered at request creation time (normal operation) or at submission time
+ * (abnormal operation, e.g. after a reset).
+ *
+ * Context submission:
+ * The i915 updates the LRC tail value in memory. Either a schedule enable H2G
+ * or context submit H2G is used to submit a context.
+ *
+ * Context unpin:
+ * To unpin a context a H2G is used to disable scheduling and when the
+ * corresponding G2H returns indicating the scheduling disable operation has
+ * completed it is safe to unpin the context. While a disable is in flight it
+ * isn't safe to resubmit the context so a fence is used to stall all future
+ * requests until the G2H is returned.
+ *
+ * Context deregistration:
+ * Before a context can be destroyed or we steal its guc_id we must deregister
+ * the context with the GuC via H2G. If stealing the guc_id it isn't safe to
+ * submit anything to this guc_id until the deregister completes so a fence is
+ * used to stall all requests associated with this guc_id until the
+ * corresponding G2H returns indicating the guc_id has been deregistered.
+ *
+ * guc_ids:
+ * Unique number associated with private GuC context data passed in during
+ * context registration / submission / deregistration. 64k available. Simple ida
+ * is used for allocation.
+ *
+ * Stealing guc_ids:
+ * If no guc_ids are available they can be stolen from another context at
+ * request creation time if that context is unpinned. If a guc_id can't be found
+ * we punt this problem to the user as we believe this is near impossible to hit
+ * during normal use cases.
+ *
+ * Locking:
+ * In the GuC submission code we have 3 basic spin locks which protect
+ * everything. Details about each below.
+ *
+ * sched_engine->lock
+ * This is the submission lock for all contexts that share an i915 schedule
+ * engine (sched_engine), thus only 1 context which shares a sched_engine can
+ * be submitting at a time. Currently only 1 sched_engine is used for all of
+ * GuC submission but that could change in the future.
  *
+ * guc->contexts_lock
+ * Protects guc_id allocation. Global lock, i.e. only 1 context that uses GuC
+ * submission can hold this at a time.
+ *
+ * ce->guc_state.lock
+ * Protects everything under ce->guc_state. Ensures that a context is in the
+ * correct state before issuing a H2G. e.g. we don't issue a schedule disable
+ * on a disabled context (bad idea), we don't issue a schedule enable when a
+ * schedule disable is inflight, etc... Also protects the list of inflight
+ * requests on the context and the priority management state. This lock is
+ * individual to each context.
+ *
+ * Lock ordering rules:
+ * sched_engine->lock -> ce->guc_state.lock
+ * guc->contexts_lock -> ce->guc_state.lock
+ *
+ * Reset races:
+ * When a GPU full reset is triggered it is assumed that some G2H responses to
+ * a H2G can be lost as the GuC is likely toast. Losing these G2H can prove
+ * fatal as we do certain operations upon receiving a G2H (e.g. destroy
+ * contexts, release guc_ids, etc...). Luckily when this occurs we can scrub
+ * context state and clean up appropriately, however this is quite racy. To
+ * avoid races the rule is to check for submission being disabled (i.e. check
+ * for mid reset) with the appropriate lock held. If submission is disabled
+ * don't send the H2G or update the context state. The reset code must disable
+ * submission and grab all these locks before scrubbing for the missing G2H.
  */
 
 /* GuC Virtual Engine */
@@ -73,167 +126,165 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
 /*
- * Below is a set of functions which control the GuC scheduling state which do
- * not require a lock as all state transitions are mutually exclusive. i.e. It
- * is not possible for the context pinning code and submission, for the same
- * context, to be executing simultaneously. We still need an atomic as it is
- * possible for some of the bits to changing at the same time though.
+ * Below is a set of functions which control the GuC scheduling state which
+ * require a lock.
  */
-#define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
-#define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
-#define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
-static inline bool context_enabled(struct intel_context *ce)
+#define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
+#define SCHED_STATE_DESTROYED				BIT(1)
+#define SCHED_STATE_PENDING_DISABLE			BIT(2)
+#define SCHED_STATE_BANNED				BIT(3)
+#define SCHED_STATE_ENABLED				BIT(4)
+#define SCHED_STATE_PENDING_ENABLE			BIT(5)
+#define SCHED_STATE_REGISTERED				BIT(6)
+#define SCHED_STATE_BLOCKED_SHIFT			7
+#define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
+#define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
+static void init_sched_state(struct intel_context *ce)
 {
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_ENABLED);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
-static inline void set_context_enabled(struct intel_context *ce)
+__maybe_unused
+static bool sched_state_is_init(struct intel_context *ce)
 {
-	atomic_or(SCHED_STATE_NO_LOCK_ENABLED, &ce->guc_sched_state_no_lock);
+	/*
+	 * XXX: Kernel contexts can have SCHED_STATE_REGISTERED after
+	 * suspend.
+	 */
+	return !(ce->guc_state.sched_state &
+		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
 }
 
-static inline void clr_context_enabled(struct intel_context *ce)
+static bool
+context_wait_for_deregister_to_register(struct intel_context *ce)
 {
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
-		   &ce->guc_sched_state_no_lock);
+	return ce->guc_state.sched_state &
+		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
 
-static inline bool context_pending_enable(struct intel_context *ce)
+static void
+set_context_wait_for_deregister_to_register(struct intel_context *ce)
 {
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_PENDING_ENABLE);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |=
+		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
 
-static inline void set_context_pending_enable(struct intel_context *ce)
+static void
+clr_context_wait_for_deregister_to_register(struct intel_context *ce)
 {
-	atomic_or(SCHED_STATE_NO_LOCK_PENDING_ENABLE,
-		  &ce->guc_sched_state_no_lock);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &=
+		~SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
 
-static inline void clr_context_pending_enable(struct intel_context *ce)
+static bool
+context_destroyed(struct intel_context *ce)
 {
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_PENDING_ENABLE,
-		   &ce->guc_sched_state_no_lock);
+	return ce->guc_state.sched_state & SCHED_STATE_DESTROYED;
 }
 
-static inline bool context_registered(struct intel_context *ce)
+static void
+set_context_destroyed(struct intel_context *ce)
 {
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_REGISTERED);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
 }
 
-static inline void set_context_registered(struct intel_context *ce)
+static bool context_pending_disable(struct intel_context *ce)
 {
-	atomic_or(SCHED_STATE_NO_LOCK_REGISTERED,
-		  &ce->guc_sched_state_no_lock);
+	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
 }
 
-static inline void clr_context_registered(struct intel_context *ce)
+static void set_context_pending_disable(struct intel_context *ce)
 {
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_REGISTERED,
-		   &ce->guc_sched_state_no_lock);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_PENDING_DISABLE;
 }
 
-/*
- * Below is a set of functions which control the GuC scheduling state which
- * require a lock, aside from the special case where the functions are called
- * from guc_lrc_desc_pin(). In that case it isn't possible for any other code
- * path to be executing on the context.
- */
-#define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
-#define SCHED_STATE_DESTROYED				BIT(1)
-#define SCHED_STATE_PENDING_DISABLE			BIT(2)
-#define SCHED_STATE_BANNED				BIT(3)
-#define SCHED_STATE_BLOCKED_SHIFT			4
-#define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
-#define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
-static inline void init_sched_state(struct intel_context *ce)
+static void clr_context_pending_disable(struct intel_context *ce)
 {
-	/* Only should be called from guc_lrc_desc_pin() */
-	atomic_set(&ce->guc_sched_state_no_lock, 0);
-	ce->guc_state.sched_state = 0;
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_DISABLE;
 }
 
-static inline bool
-context_wait_for_deregister_to_register(struct intel_context *ce)
+static bool context_banned(struct intel_context *ce)
 {
-	return ce->guc_state.sched_state &
-		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
+	return ce->guc_state.sched_state & SCHED_STATE_BANNED;
 }
 
-static inline void
-set_context_wait_for_deregister_to_register(struct intel_context *ce)
+static void set_context_banned(struct intel_context *ce)
 {
-	/* Only should be called from guc_lrc_desc_pin() without lock */
-	ce->guc_state.sched_state |=
-		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_BANNED;
 }
 
-static inline void
-clr_context_wait_for_deregister_to_register(struct intel_context *ce)
+static void clr_context_banned(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state &=
-		~SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
+	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
 }
 
-static inline bool
-context_destroyed(struct intel_context *ce)
+static bool context_enabled(struct intel_context *ce)
 {
-	return ce->guc_state.sched_state & SCHED_STATE_DESTROYED;
+	return ce->guc_state.sched_state & SCHED_STATE_ENABLED;
 }
 
-static inline void
-set_context_destroyed(struct intel_context *ce)
+static void set_context_enabled(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
+	ce->guc_state.sched_state |= SCHED_STATE_ENABLED;
 }
 
-static inline bool context_pending_disable(struct intel_context *ce)
+static void clr_context_enabled(struct intel_context *ce)
 {
-	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
 }
 
-static inline void set_context_pending_disable(struct intel_context *ce)
+static bool context_pending_enable(struct intel_context *ce)
+{
+	return ce->guc_state.sched_state & SCHED_STATE_PENDING_ENABLE;
+}
+
+static void set_context_pending_enable(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state |= SCHED_STATE_PENDING_DISABLE;
+	ce->guc_state.sched_state |= SCHED_STATE_PENDING_ENABLE;
 }
 
-static inline void clr_context_pending_disable(struct intel_context *ce)
+static void clr_context_pending_enable(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_DISABLE;
+	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_ENABLE;
 }
 
-static inline bool context_banned(struct intel_context *ce)
+static bool context_registered(struct intel_context *ce)
 {
-	return ce->guc_state.sched_state & SCHED_STATE_BANNED;
+	return ce->guc_state.sched_state & SCHED_STATE_REGISTERED;
 }
 
-static inline void set_context_banned(struct intel_context *ce)
+static void set_context_registered(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state |= SCHED_STATE_BANNED;
+	ce->guc_state.sched_state |= SCHED_STATE_REGISTERED;
 }
 
-static inline void clr_context_banned(struct intel_context *ce)
+static void clr_context_registered(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
+	ce->guc_state.sched_state &= ~SCHED_STATE_REGISTERED;
 }
 
-static inline u32 context_blocked(struct intel_context *ce)
+static u32 context_blocked(struct intel_context *ce)
 {
 	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
 		SCHED_STATE_BLOCKED_SHIFT;
 }
 
-static inline void incr_context_blocked(struct intel_context *ce)
+static void incr_context_blocked(struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->engine->sched_engine->lock);
 	lockdep_assert_held(&ce->guc_state.lock);
 
 	ce->guc_state.sched_state += SCHED_STATE_BLOCKED;
@@ -241,9 +292,8 @@ static inline void incr_context_blocked(struct intel_context *ce)
 	GEM_BUG_ON(!context_blocked(ce));	/* Overflow check */
 }
 
-static inline void decr_context_blocked(struct intel_context *ce)
+static void decr_context_blocked(struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->engine->sched_engine->lock);
 	lockdep_assert_held(&ce->guc_state.lock);
 
 	GEM_BUG_ON(!context_blocked(ce));	/* Underflow check */
@@ -251,22 +301,41 @@ static inline void decr_context_blocked(struct intel_context *ce)
 	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
 }
 
-static inline bool context_guc_id_invalid(struct intel_context *ce)
+static bool context_has_committed_requests(struct intel_context *ce)
+{
+	return !!ce->guc_state.number_committed_requests;
+}
+
+static void incr_context_committed_requests(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	++ce->guc_state.number_committed_requests;
+	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
+}
+
+static void decr_context_committed_requests(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	--ce->guc_state.number_committed_requests;
+	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
+}
+
+static bool context_guc_id_invalid(struct intel_context *ce)
 {
-	return ce->guc_id == GUC_INVALID_LRC_ID;
+	return ce->guc_id.id == GUC_INVALID_LRC_ID;
 }
 
-static inline void set_context_guc_id_invalid(struct intel_context *ce)
+static void set_context_guc_id_invalid(struct intel_context *ce)
 {
-	ce->guc_id = GUC_INVALID_LRC_ID;
+	ce->guc_id.id = GUC_INVALID_LRC_ID;
 }
 
-static inline struct intel_guc *ce_to_guc(struct intel_context *ce)
+static struct intel_guc *ce_to_guc(struct intel_context *ce)
 {
 	return &ce->engine->gt->uc.guc;
 }
 
-static inline struct i915_priolist *to_priolist(struct rb_node *rb)
+static struct i915_priolist *to_priolist(struct rb_node *rb)
 {
 	return rb_entry(rb, struct i915_priolist, node);
 }
@@ -280,7 +349,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 	return &base[index];
 }
 
-static inline struct intel_context *__get_context(struct intel_guc *guc, u32 id)
+static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
 {
 	struct intel_context *ce = xa_load(&guc->context_lookup, id);
 
@@ -310,12 +379,12 @@ static void guc_lrc_desc_pool_destroy(struct intel_guc *guc)
 	i915_vma_unpin_and_release(&guc->lrc_desc_pool, I915_VMA_RELEASE_MAP);
 }
 
-static inline bool guc_submission_initialized(struct intel_guc *guc)
+static bool guc_submission_initialized(struct intel_guc *guc)
 {
 	return !!guc->lrc_desc_pool_vaddr;
 }
 
-static inline void reset_lrc_desc(struct intel_guc *guc, u32 id)
+static void reset_lrc_desc(struct intel_guc *guc, u32 id)
 {
 	if (likely(guc_submission_initialized(guc))) {
 		struct guc_lrc_desc *desc = __get_lrc_desc(guc, id);
@@ -333,13 +402,13 @@ static inline void reset_lrc_desc(struct intel_guc *guc, u32 id)
 	}
 }
 
-static inline bool lrc_desc_registered(struct intel_guc *guc, u32 id)
+static bool lrc_desc_registered(struct intel_guc *guc, u32 id)
 {
 	return __get_context(guc, id);
 }
 
-static inline void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
-					   struct intel_context *ce)
+static void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
+				    struct intel_context *ce)
 {
 	unsigned long flags;
 
@@ -352,6 +421,12 @@ static inline void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
 	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
+static void decr_outstanding_submission_g2h(struct intel_guc *guc)
+{
+	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
+		wake_up_all(&guc->ct.wq);
+}
+
 static int guc_submission_send_busy_loop(struct intel_guc *guc,
 					 const u32 *action,
 					 u32 len,
@@ -360,11 +435,11 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
 {
 	int err;
 
-	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
-
-	if (!err && g2h_len_dw)
+	if (g2h_len_dw)
 		atomic_inc(&guc->outstanding_submission_g2h);
 
+	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
+
 	return err;
 }
 
@@ -430,6 +505,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	u32 g2h_len_dw = 0;
 	bool enabled;
 
+	lockdep_assert_held(&rq->engine->sched_engine->lock);
+
 	/*
 	 * Corner case where requests were sitting in the priority list or a
 	 * request resubmitted after the context was banned.
@@ -437,22 +514,24 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	if (unlikely(intel_context_is_banned(ce))) {
 		i915_request_put(i915_request_mark_eio(rq));
 		intel_engine_signal_breadcrumbs(ce->engine);
-		goto out;
+		return 0;
 	}
 
-	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
+	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
 	GEM_BUG_ON(context_guc_id_invalid(ce));
 
 	/*
 	 * Corner case where the GuC firmware was blown away and reloaded while
 	 * this context was pinned.
 	 */
-	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
+	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
 		err = guc_lrc_desc_pin(ce, false);
 		if (unlikely(err))
-			goto out;
+			return err;
 	}
 
+	spin_lock(&ce->guc_state.lock);
+
 	/*
 	 * The request / context will be run on the hardware when scheduling
 	 * gets enabled in the unblock.
@@ -464,14 +543,14 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 
 	if (!enabled) {
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
-		action[len++] = ce->guc_id;
+		action[len++] = ce->guc_id.id;
 		action[len++] = GUC_CONTEXT_ENABLE;
 		set_context_pending_enable(ce);
 		intel_context_get(ce);
 		g2h_len_dw = G2H_LEN_DW_SCHED_CONTEXT_MODE_SET;
 	} else {
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT;
-		action[len++] = ce->guc_id;
+		action[len++] = ce->guc_id.id;
 	}
 
 	err = intel_guc_send_nb(guc, action, len, g2h_len_dw);
@@ -487,16 +566,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		trace_i915_request_guc_submit(rq);
 
 out:
+	spin_unlock(&ce->guc_state.lock);
 	return err;
 }
 
-static inline void guc_set_lrc_tail(struct i915_request *rq)
+static void guc_set_lrc_tail(struct i915_request *rq)
 {
 	rq->context->lrc_reg_state[CTX_RING_TAIL] =
 		intel_ring_set_tail(rq->ring, rq->tail);
 }
 
-static inline int rq_prio(const struct i915_request *rq)
+static int rq_prio(const struct i915_request *rq)
 {
 	return rq->sched.attr.priority;
 }
@@ -596,10 +676,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 	unsigned long index, flags;
 	bool pending_disable, pending_enable, deregister, destroyed, banned;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		/* Flush context */
-		spin_lock_irqsave(&ce->guc_state.lock, flags);
-		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		/*
+		 * Corner case where the ref count on the context is zero but a
+		 * deregister G2H was lost. In this case we don't touch the ref
+		 * count and finish the destroy of the context.
+		 */
+		bool do_put = kref_get_unless_zero(&ce->ref);
+
+		xa_unlock(&guc->context_lookup);
+
+		spin_lock(&ce->guc_state.lock);
 
 		/*
 		 * Once we are at this point submission_disabled() is guaranteed
@@ -615,8 +703,12 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 		banned = context_banned(ce);
 		init_sched_state(ce);
 
+		spin_unlock(&ce->guc_state.lock);
+
+		GEM_BUG_ON(!do_put && !destroyed);
+
 		if (pending_enable || destroyed || deregister) {
-			atomic_dec(&guc->outstanding_submission_g2h);
+			decr_outstanding_submission_g2h(guc);
 			if (deregister)
 				guc_signal_context_fence(ce);
 			if (destroyed) {
@@ -635,17 +727,23 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 				intel_engine_signal_breadcrumbs(ce->engine);
 			}
 			intel_context_sched_disable_unpin(ce);
-			atomic_dec(&guc->outstanding_submission_g2h);
-			spin_lock_irqsave(&ce->guc_state.lock, flags);
+			decr_outstanding_submission_g2h(guc);
+
+			spin_lock(&ce->guc_state.lock);
 			guc_blocked_fence_complete(ce);
-			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+			spin_unlock(&ce->guc_state.lock);
 
 			intel_context_put(ce);
 		}
+
+		if (do_put)
+			intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
-static inline bool
+static bool
 submission_disabled(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
@@ -694,8 +792,6 @@ static void guc_flush_submissions(struct intel_guc *guc)
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -709,22 +805,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	spin_lock_irq(&guc_to_gt(guc)->irq_lock);
 	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
 
-	guc_flush_submissions(guc);
+	flush_work(&guc->ct.requests.worker);
 
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
 
@@ -742,7 +824,7 @@ guc_virtual_get_sibling(struct intel_engine_cs *ve, unsigned int sibling)
 	return NULL;
 }
 
-static inline struct intel_engine_cs *
+static struct intel_engine_cs *
 __context_to_physical_engine(struct intel_context *ce)
 {
 	struct intel_engine_cs *engine = ce->engine;
@@ -796,16 +878,14 @@ __unwind_incomplete_requests(struct intel_context *ce)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
-	spin_lock(&ce->guc_active.lock);
-	list_for_each_entry_safe(rq, rn,
-				 &ce->guc_active.requests,
-				 sched.link) {
+	spin_lock(&ce->guc_state.lock);
+	list_for_each_entry_safe_reverse(rq, rn,
+					 &ce->guc_state.requests,
+					 sched.link) {
 		if (i915_request_completed(rq))
 			continue;
 
 		list_del_init(&rq->sched.link);
-		spin_unlock(&ce->guc_active.lock);
-
 		__i915_request_unsubmit(rq);
 
 		/* Push the request back into the queue for later resubmission. */
@@ -816,29 +896,44 @@ __unwind_incomplete_requests(struct intel_context *ce)
 		}
 		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
 
-		list_add_tail(&rq->sched.link, pl);
+		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
-
-		spin_lock(&ce->guc_active.lock);
 	}
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
 	struct i915_request *rq;
+	unsigned long flags;
 	u32 head;
+	bool skip = false;
 
 	intel_context_get(ce);
 
 	/*
-	 * GuC will implicitly mark the context as non-schedulable
-	 * when it sends the reset notification. Make sure our state
-	 * reflects this change. The context will be marked enabled
-	 * on resubmission.
+	 * GuC will implicitly mark the context as non-schedulable when it sends
+	 * the reset notification. Make sure our state reflects this change. The
+	 * context will be marked enabled on resubmission.
+	 *
+	 * XXX: If the context is reset as a result of the request cancellation
+	 * this G2H is received after the schedule disable complete G2H which is
+	 * likely wrong as this creates a race between the request cancellation
+	 * code re-submitting the context and this G2H handler. This likely
+	 * should be fixed in the GuC but until if / when that gets fixed we
+	 * need to work around this. Convert this function to a NOP if a pending
+	 * enable is in flight as this indicates that a request cancellation has
+	 * occurred.
 	 */
-	clr_context_enabled(ce);
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	if (likely(!context_pending_enable(ce)))
+		clr_context_enabled(ce);
+	else
+		skip = true;
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	if (unlikely(skip))
+		goto out_put;
 
 	rq = intel_context_find_active_request(ce);
 	if (!rq) {
@@ -857,6 +952,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
 out_replay:
 	guc_reset_state(ce, head, stalled);
 	__unwind_incomplete_requests(ce);
+out_put:
 	intel_context_put(ce);
 }
 
@@ -864,16 +960,29 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
 
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
 	}
 
-	xa_for_each(&guc->context_lookup, index, ce)
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	xa_for_each(&guc->context_lookup, index, ce) {
+		if (!kref_get_unless_zero(&ce->ref))
+			continue;
+
+		xa_unlock(&guc->context_lookup);
+
 		if (intel_context_is_pinned(ce))
 			__guc_reset_context(ce, stalled);
 
+		intel_context_put(ce);
+
+		xa_lock(&guc->context_lookup);
+	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	/* GuC is blown away, drop all references to contexts */
 	xa_destroy(&guc->context_lookup);
 }
@@ -886,10 +995,10 @@ static void guc_cancel_context_requests(struct intel_context *ce)
 
 	/* Mark all executing requests as skipped. */
 	spin_lock_irqsave(&sched_engine->lock, flags);
-	spin_lock(&ce->guc_active.lock);
-	list_for_each_entry(rq, &ce->guc_active.requests, sched.link)
+	spin_lock(&ce->guc_state.lock);
+	list_for_each_entry(rq, &ce->guc_state.requests, sched.link)
 		i915_request_put(i915_request_mark_eio(rq));
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
@@ -948,11 +1057,24 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
+
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	xa_for_each(&guc->context_lookup, index, ce) {
+		if (!kref_get_unless_zero(&ce->ref))
+			continue;
+
+		xa_unlock(&guc->context_lookup);
 
-	xa_for_each(&guc->context_lookup, index, ce)
 		if (intel_context_is_pinned(ce))
 			guc_cancel_context_requests(ce);
 
+		intel_context_put(ce);
+
+		xa_lock(&guc->context_lookup);
+	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	guc_cancel_sched_engine_requests(guc->sched_engine);
 
 	/* GuC is blown away, drop all references to contexts */
@@ -1019,14 +1141,15 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 	i915_sched_engine_put(guc->sched_engine);
 }
 
-static inline void queue_request(struct i915_sched_engine *sched_engine,
-				 struct i915_request *rq,
-				 int prio)
+static void queue_request(struct i915_sched_engine *sched_engine,
+			  struct i915_request *rq,
+			  int prio)
 {
 	GEM_BUG_ON(!list_empty(&rq->sched.link));
 	list_add_tail(&rq->sched.link,
 		      i915_sched_lookup_priolist(sched_engine, prio));
 	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+	tasklet_hi_schedule(&sched_engine->tasklet);
 }
 
 static int guc_bypass_tasklet_submit(struct intel_guc *guc,
@@ -1077,12 +1200,12 @@ static int new_guc_id(struct intel_guc *guc)
 static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	if (!context_guc_id_invalid(ce)) {
-		ida_simple_remove(&guc->guc_ids, ce->guc_id);
-		reset_lrc_desc(guc, ce->guc_id);
+		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
+		reset_lrc_desc(guc, ce->guc_id.id);
 		set_context_guc_id_invalid(ce);
 	}
-	if (!list_empty(&ce->guc_id_link))
-		list_del_init(&ce->guc_id_link);
+	if (!list_empty(&ce->guc_id.link))
+		list_del_init(&ce->guc_id.link);
 }
 
 static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
@@ -1104,14 +1227,18 @@ static int steal_guc_id(struct intel_guc *guc)
 	if (!list_empty(&guc->guc_id_list)) {
 		ce = list_first_entry(&guc->guc_id_list,
 				      struct intel_context,
-				      guc_id_link);
+				      guc_id.link);
 
-		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
+		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
 		GEM_BUG_ON(context_guc_id_invalid(ce));
 
-		list_del_init(&ce->guc_id_link);
-		guc_id = ce->guc_id;
+		list_del_init(&ce->guc_id.link);
+		guc_id = ce->guc_id.id;
+
+		spin_lock(&ce->guc_state.lock);
 		clr_context_registered(ce);
+		spin_unlock(&ce->guc_state.lock);
+
 		set_context_guc_id_invalid(ce);
 		return guc_id;
 	} else {
@@ -1142,26 +1269,28 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	int ret = 0;
 	unsigned long flags, tries = PIN_GUC_ID_TRIES;
 
-	GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
+	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
 
 try_again:
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 
+	might_lock(&ce->guc_state.lock);
+
 	if (context_guc_id_invalid(ce)) {
-		ret = assign_guc_id(guc, &ce->guc_id);
+		ret = assign_guc_id(guc, &ce->guc_id.id);
 		if (ret)
 			goto out_unlock;
 		ret = 1;	/* Indicates newly assigned guc_id */
 	}
-	if (!list_empty(&ce->guc_id_link))
-		list_del_init(&ce->guc_id_link);
-	atomic_inc(&ce->guc_id_ref);
+	if (!list_empty(&ce->guc_id.link))
+		list_del_init(&ce->guc_id.link);
+	atomic_inc(&ce->guc_id.ref);
 
 out_unlock:
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 
 	/*
-	 * -EAGAIN indicates no guc_ids are available, let's retire any
+	 * -EAGAIN indicates no guc_id is available, let's retire any
 	 * outstanding requests to see if that frees up a guc_id. If the first
 	 * retire didn't help, insert a sleep with the timeslice duration before
 	 * attempting to retire more requests. Double the sleep period each
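
[Ed. note: the hunk ends mid-sentence; the comment goes on to say the sleep
doubles on each subsequent attempt. A sketch of that policy, where
try_pin_guc_id() and timeslice_us() are hypothetical stand-ins and only
PIN_GUC_ID_TRIES, intel_gt_retire_requests() and guc_to_gt() come from the
patch:

	static int pin_guc_id_sketch(struct intel_guc *guc,
				     struct intel_context *ce)
	{
		unsigned long tries = PIN_GUC_ID_TRIES;
		unsigned long sleep_us = timeslice_us(ce->engine); /* hypothetical */
		int ret;

		while ((ret = try_pin_guc_id(guc, ce)) == -EAGAIN && tries--) {
			if (tries != PIN_GUC_ID_TRIES - 1) {
				/* the first retire alone didn't help: back
				 * off, doubling the sleep each attempt */
				usleep_range(sleep_us, 2 * sleep_us);
				sleep_us *= 2;
			}
			intel_gt_retire_requests(guc_to_gt(guc));
		}

		return ret;
	}
]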
@@ -1189,15 +1318,15 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	unsigned long flags;
 
-	GEM_BUG_ON(atomic_read(&ce->guc_id_ref) < 0);
+	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
 
 	if (unlikely(context_guc_id_invalid(ce)))
 		return;
 
 	spin_lock_irqsave(&guc->contexts_lock, flags);
-	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id_link) &&
-	    !atomic_read(&ce->guc_id_ref))
-		list_add_tail(&ce->guc_id_link, &guc->guc_id_list);
+	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
+	    !atomic_read(&ce->guc_id.ref))
+		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 }
 
@@ -1220,14 +1349,19 @@ static int register_context(struct intel_context *ce, bool loop)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 	u32 offset = intel_guc_ggtt_offset(guc, guc->lrc_desc_pool) +
-		ce->guc_id * sizeof(struct guc_lrc_desc);
+		ce->guc_id.id * sizeof(struct guc_lrc_desc);
 	int ret;
 
 	trace_intel_context_register(ce);
 
-	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
-	if (likely(!ret))
+	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
+	if (likely(!ret)) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		set_context_registered(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	}
 
 	return ret;
 }
@@ -1285,22 +1419,19 @@ static void guc_context_policy_init(struct intel_engine_cs *engine,
 	desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
 }
 
-static inline u8 map_i915_prio_to_guc_prio(int prio);
-
 static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 {
 	struct intel_engine_cs *engine = ce->engine;
 	struct intel_runtime_pm *runtime_pm = engine->uncore->rpm;
 	struct intel_guc *guc = &engine->gt->uc.guc;
-	u32 desc_idx = ce->guc_id;
+	u32 desc_idx = ce->guc_id.id;
 	struct guc_lrc_desc *desc;
-	const struct i915_gem_context *ctx;
-	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
 	bool context_registered;
 	intel_wakeref_t wakeref;
 	int ret = 0;
 
 	GEM_BUG_ON(!engine->mask);
+	GEM_BUG_ON(!sched_state_is_init(ce));
 
 	/*
 	 * Ensure LRC + CT vmas are in the same region as write barrier is done
@@ -1311,12 +1442,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 
 	context_registered = lrc_desc_registered(guc, desc_idx);
 
-	rcu_read_lock();
-	ctx = rcu_dereference(ce->gem_context);
-	if (ctx)
-		prio = ctx->sched.priority;
-	rcu_read_unlock();
-
 	reset_lrc_desc(guc, desc_idx);
 	set_lrc_desc_registered(guc, desc_idx, ce);
 
@@ -1325,11 +1450,9 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	desc->engine_submit_mask = adjust_engine_mask(engine->class,
 						      engine->mask);
 	desc->hw_context_desc = ce->lrc.lrca;
-	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
-	desc->priority = ce->guc_prio;
+	desc->priority = ce->guc_state.prio;
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 	guc_context_policy_init(engine, desc);
-	init_sched_state(ce);
 
 	/*
 	 * The context_lookup xarray is used to determine if the hardware
@@ -1340,26 +1463,23 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	 * registering this context.
 	 */
 	if (context_registered) {
+		bool disabled;
+		unsigned long flags;
+
 		trace_intel_context_steal_guc_id(ce);
-		if (!loop) {
+		GEM_BUG_ON(!loop);
+
+		/* Seal race with Reset */
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
+		disabled = submission_disabled(guc);
+		if (likely(!disabled)) {
 			set_context_wait_for_deregister_to_register(ce);
 			intel_context_get(ce);
-		} else {
-			bool disabled;
-			unsigned long flags;
-
-			/* Seal race with Reset */
-			spin_lock_irqsave(&ce->guc_state.lock, flags);
-			disabled = submission_disabled(guc);
-			if (likely(!disabled)) {
-				set_context_wait_for_deregister_to_register(ce);
-				intel_context_get(ce);
-			}
-			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-			if (unlikely(disabled)) {
-				reset_lrc_desc(guc, desc_idx);
-				return 0;	/* Will get registered later */
-			}
+		}
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		if (unlikely(disabled)) {
+			reset_lrc_desc(guc, desc_idx);
+			return 0;	/* Will get registered later */
 		}
 
 		/*
@@ -1367,20 +1487,19 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 		 * context whose guc_id was stolen.
 		 */
 		with_intel_runtime_pm(runtime_pm, wakeref)
-			ret = deregister_context(ce, ce->guc_id, loop);
-		if (unlikely(ret == -EBUSY)) {
-			clr_context_wait_for_deregister_to_register(ce);
-			intel_context_put(ce);
-		} else if (unlikely(ret == -ENODEV)) {
+			ret = deregister_context(ce, ce->guc_id.id, loop);
+		if (unlikely(ret == -ENODEV)) {
 			ret = 0;	/* Will get registered later */
 		}
 	} else {
 		with_intel_runtime_pm(runtime_pm, wakeref)
 			ret = register_context(ce, loop);
-		if (unlikely(ret == -EBUSY))
+		if (unlikely(ret == -EBUSY)) {
+			reset_lrc_desc(guc, desc_idx);
+		} else if (unlikely(ret == -ENODEV)) {
 			reset_lrc_desc(guc, desc_idx);
-		else if (unlikely(ret == -ENODEV))
 			ret = 0;	/* Will get registered later */
+		}
 	}
 
 	return ret;
@@ -1440,7 +1559,7 @@ static void __guc_context_sched_enable(struct intel_guc *guc,
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET,
-		ce->guc_id,
+		ce->guc_id.id,
 		GUC_CONTEXT_ENABLE
 	};
 
@@ -1456,7 +1575,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET,
-		guc_id,	/* ce->guc_id not stable */
+		guc_id,	/* ce->guc_id.id not stable */
 		GUC_CONTEXT_DISABLE
 	};
 
@@ -1472,24 +1591,24 @@ static void guc_blocked_fence_complete(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 
-	if (!i915_sw_fence_done(&ce->guc_blocked))
-		i915_sw_fence_complete(&ce->guc_blocked);
+	if (!i915_sw_fence_done(&ce->guc_state.blocked_fence))
+		i915_sw_fence_complete(&ce->guc_state.blocked_fence);
 }
 
 static void guc_blocked_fence_reinit(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_blocked));
+	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_state.blocked_fence));
 
 	/*
 	 * This fence is always complete unless a pending schedule disable is
 	 * outstanding. We arm the fence here and complete it when we receive
 	 * the pending schedule disable complete message.
 	 */
-	i915_sw_fence_fini(&ce->guc_blocked);
-	i915_sw_fence_reinit(&ce->guc_blocked);
-	i915_sw_fence_await(&ce->guc_blocked);
-	i915_sw_fence_commit(&ce->guc_blocked);
+	i915_sw_fence_fini(&ce->guc_state.blocked_fence);
+	i915_sw_fence_reinit(&ce->guc_state.blocked_fence);
+	i915_sw_fence_await(&ce->guc_state.blocked_fence);
+	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
 }
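
[Ed. note: for readers unfamiliar with i915_sw_fence, the arm/complete
life-cycle the two functions above implement, reassembled from the calls in
this patch: await() adds a blocker, commit() arms the fence, complete()
drops a blocker and signals once none remain.

	/* arm: one outstanding blocker, fence is now "not done" */
	i915_sw_fence_fini(&ce->guc_state.blocked_fence);
	i915_sw_fence_reinit(&ce->guc_state.blocked_fence);
	i915_sw_fence_await(&ce->guc_state.blocked_fence);
	i915_sw_fence_commit(&ce->guc_state.blocked_fence);

	/* anyone blocking the context waits here */
	i915_sw_fence_wait(&ce->guc_state.blocked_fence);

	/* schedule-disable-done G2H: drop the blocker, waiters proceed */
	i915_sw_fence_complete(&ce->guc_state.blocked_fence);
]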
 
 static u16 prep_context_pending_disable(struct intel_context *ce)
@@ -1501,13 +1620,12 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
 	guc_blocked_fence_reinit(ce);
 	intel_context_get(ce);
 
-	return ce->guc_id;
+	return ce->guc_id.id;
 }
 
 static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
-	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
 	unsigned long flags;
 	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	intel_wakeref_t wakeref;
@@ -1516,20 +1634,14 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
-	/*
-	 * Sync with submission path, increment before below changes to context
-	 * state.
-	 */
-	spin_lock(&sched_engine->lock);
 	incr_context_blocked(ce);
-	spin_unlock(&sched_engine->lock);
 
 	enabled = context_enabled(ce);
 	if (unlikely(!enabled || submission_disabled(guc))) {
 		if (enabled)
 			clr_context_enabled(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-		return &ce->guc_blocked;
+		return &ce->guc_state.blocked_fence;
 	}
 
 	/*
@@ -1545,13 +1657,12 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 	with_intel_runtime_pm(runtime_pm, wakeref)
 		__guc_context_sched_disable(guc, ce, guc_id);
 
-	return &ce->guc_blocked;
+	return &ce->guc_state.blocked_fence;
 }
 
 static void guc_context_unblock(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
-	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
 	unsigned long flags;
 	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	intel_wakeref_t wakeref;
@@ -1562,6 +1673,9 @@ static void guc_context_unblock(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	if (unlikely(submission_disabled(guc) ||
+		     intel_context_is_banned(ce) ||
+		     context_guc_id_invalid(ce) ||
+		     !lrc_desc_registered(guc, ce->guc_id.id) ||
 		     !intel_context_is_pinned(ce) ||
 		     context_pending_disable(ce) ||
 		     context_blocked(ce) > 1)) {
@@ -1573,13 +1687,7 @@ static void guc_context_unblock(struct intel_context *ce)
 		intel_context_get(ce);
 	}
 
-	/*
-	 * Sync with submission path, decrement after above changes to context
-	 * state.
-	 */
-	spin_lock(&sched_engine->lock);
 	decr_context_blocked(ce);
-	spin_unlock(&sched_engine->lock);
 
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
@@ -1593,15 +1701,25 @@ static void guc_context_cancel_request(struct intel_context *ce,
 				       struct i915_request *rq)
 {
 	if (i915_sw_fence_signaled(&rq->submit)) {
-		struct i915_sw_fence *fence = guc_context_block(ce);
+		struct i915_sw_fence *fence;
 
+		intel_context_get(ce);
+		fence = guc_context_block(ce);
 		i915_sw_fence_wait(fence);
 		if (!i915_request_completed(rq)) {
 			__i915_request_skip(rq);
 			guc_reset_state(ce, intel_ring_wrap(ce->ring, rq->head),
 					true);
 		}
+
+		/*
+		 * XXX: Racey if context is reset, see comment in
+		 * __guc_reset_context().
+		 */
+		flush_work(&ce_to_guc(ce)->ct.requests.worker);
+
 		guc_context_unblock(ce);
+		intel_context_put(ce);
 	}
 }
 
@@ -1662,7 +1780,7 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
 		if (!context_guc_id_invalid(ce))
 			with_intel_runtime_pm(runtime_pm, wakeref)
 				__guc_context_set_preemption_timeout(guc,
-								     ce->guc_id,
+								     ce->guc_id.id,
 								     1);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	}
@@ -1678,8 +1796,10 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	bool enabled;
 
 	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
-	    !lrc_desc_registered(guc, ce->guc_id)) {
+	    !lrc_desc_registered(guc, ce->guc_id.id)) {
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		clr_context_enabled(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		goto unpin;
 	}
 
@@ -1689,14 +1809,11 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	/*
-	 * We have to check if the context has been disabled by another thread.
-	 * We also have to check if the context has been pinned again as another
-	 * pin operation is allowed to pass this function. Checking the pin
-	 * count, within ce->guc_state.lock, synchronizes this function with
-	 * guc_request_alloc ensuring a request doesn't slip through the
-	 * 'context_pending_disable' fence. Checking within the spin lock (can't
-	 * sleep) ensures another process doesn't pin this context and generate
-	 * a request before we set the 'context_pending_disable' flag here.
+	 * We have to check if the context has been disabled by another thread,
+	 * check if submission has been disabled to seal a race with reset and
+	 * finally check if any more requests have been committed to the
+	 * context, ensuring that a request doesn't slip through the
+	 * 'context_pending_disable' fence.
 	 */
 	enabled = context_enabled(ce);
 	if (unlikely(!enabled || submission_disabled(guc))) {
@@ -1705,7 +1822,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		goto unpin;
 	}
-	if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
+	if (unlikely(context_has_committed_requests(ce))) {
+		intel_context_sched_disable_unpin(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		return;
 	}
@@ -1721,24 +1839,24 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	intel_context_sched_disable_unpin(ce);
 }
 
-static inline void guc_lrc_desc_unpin(struct intel_context *ce)
+static void guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
-	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id));
-	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
+	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
+	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
 	GEM_BUG_ON(context_enabled(ce));
 
-	clr_context_registered(ce);
-	deregister_context(ce, ce->guc_id, true);
+	deregister_context(ce, ce->guc_id.id, true);
 }
 
 static void __guc_context_destroy(struct intel_context *ce)
 {
-	GEM_BUG_ON(ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
+	GEM_BUG_ON(ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
+		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
+		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
+		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
+	GEM_BUG_ON(ce->guc_state.number_committed_requests);
 
 	lrc_fini(ce);
 	intel_context_fini(ce);
@@ -1774,7 +1892,7 @@ static void guc_context_destroy(struct kref *kref)
 		__guc_context_destroy(ce);
 		return;
 	} else if (submission_disabled(guc) ||
-		   !lrc_desc_registered(guc, ce->guc_id)) {
+		   !lrc_desc_registered(guc, ce->guc_id.id)) {
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 		return;
@@ -1783,10 +1901,10 @@ static void guc_context_destroy(struct kref *kref)
 	/*
 	 * We have to acquire the context spinlock and check guc_id again, if it
 	 * is valid it hasn't been stolen and needs to be deregistered. We
-	 * delete this context from the list of unpinned guc_ids available to
+	 * delete this context from the list of unpinned guc_id available to
 	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
 	 * returns indicating this context has been deregistered the guc_id is
-	 * returned to the pool of available guc_ids.
+	 * returned to the pool of available guc_id.
 	 */
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 	if (context_guc_id_invalid(ce)) {
@@ -1795,15 +1913,17 @@ static void guc_context_destroy(struct kref *kref)
 		return;
 	}
 
-	if (!list_empty(&ce->guc_id_link))
-		list_del_init(&ce->guc_id_link);
+	if (!list_empty(&ce->guc_id.link))
+		list_del_init(&ce->guc_id.link);
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 
 	/* Seal race with Reset */
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	disabled = submission_disabled(guc);
-	if (likely(!disabled))
+	if (likely(!disabled)) {
 		set_context_destroyed(ce);
+		clr_context_registered(ce);
+	}
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	if (unlikely(disabled)) {
 		release_guc_id(guc, ce);
@@ -1839,24 +1959,27 @@ static void guc_context_set_prio(struct intel_guc *guc,
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_SET_CONTEXT_PRIORITY,
-		ce->guc_id,
+		ce->guc_id.id,
 		prio,
 	};
 
 	GEM_BUG_ON(prio < GUC_CLIENT_PRIORITY_KMD_HIGH ||
 		   prio > GUC_CLIENT_PRIORITY_NORMAL);
+	lockdep_assert_held(&ce->guc_state.lock);
 
-	if (ce->guc_prio == prio || submission_disabled(guc) ||
-	    !context_registered(ce))
+	if (ce->guc_state.prio == prio || submission_disabled(guc) ||
+	    !context_registered(ce)) {
+		ce->guc_state.prio = prio;
 		return;
+	}
 
 	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action), 0, true);
 
-	ce->guc_prio = prio;
+	ce->guc_state.prio = prio;
 	trace_intel_context_set_prio(ce);
 }
 
-static inline u8 map_i915_prio_to_guc_prio(int prio)
+static u8 map_i915_prio_to_guc_prio(int prio)
 {
 	if (prio == I915_PRIORITY_NORMAL)
 		return GUC_CLIENT_PRIORITY_KMD_NORMAL;
@@ -1868,31 +1991,29 @@ static inline u8 map_i915_prio_to_guc_prio(int prio)
 		return GUC_CLIENT_PRIORITY_KMD_HIGH;
 }
 
-static inline void add_context_inflight_prio(struct intel_context *ce,
-					     u8 guc_prio)
+static void add_context_inflight_prio(struct intel_context *ce, u8 guc_prio)
 {
-	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
+	lockdep_assert_held(&ce->guc_state.lock);
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
 
-	++ce->guc_prio_count[guc_prio];
+	++ce->guc_state.prio_count[guc_prio];
 
 	/* Overflow protection */
-	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_state.prio_count[guc_prio]);
 }
 
-static inline void sub_context_inflight_prio(struct intel_context *ce,
-					     u8 guc_prio)
+static void sub_context_inflight_prio(struct intel_context *ce, u8 guc_prio)
 {
-	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
+	lockdep_assert_held(&ce->guc_state.lock);
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
 
 	/* Underflow protection */
-	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_state.prio_count[guc_prio]);
 
-	--ce->guc_prio_count[guc_prio];
+	--ce->guc_state.prio_count[guc_prio];
 }
 
-static inline void update_context_prio(struct intel_context *ce)
+static void update_context_prio(struct intel_context *ce)
 {
 	struct intel_guc *guc = &ce->engine->gt->uc.guc;
 	int i;
@@ -1900,17 +2021,17 @@ static inline void update_context_prio(struct intel_context *ce)
 	BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH != 0);
 	BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH > GUC_CLIENT_PRIORITY_NORMAL);
 
-	lockdep_assert_held(&ce->guc_active.lock);
+	lockdep_assert_held(&ce->guc_state.lock);
 
-	for (i = 0; i < ARRAY_SIZE(ce->guc_prio_count); ++i) {
-		if (ce->guc_prio_count[i]) {
+	for (i = 0; i < ARRAY_SIZE(ce->guc_state.prio_count); ++i) {
+		if (ce->guc_state.prio_count[i]) {
 			guc_context_set_prio(guc, ce, i);
 			break;
 		}
 	}
 }
 
-static inline bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
+static bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
 {
 	/* Lower value is higher priority */
 	return new_guc_prio < old_guc_prio;
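
[Ed. note: a worked example of the bucket accounting, assuming the usual
enum values (KMD_HIGH=0, HIGH=1, KMD_NORMAL=2, NORMAL=3; only KMD_HIGH=0 is
guaranteed by the BUILD_BUG_ON above):

	/*
	 * Two inflight requests at GUC_CLIENT_PRIORITY_HIGH plus one at
	 * GUC_CLIENT_PRIORITY_NORMAL give prio_count = { 0, 2, 0, 1 }.
	 * update_context_prio() stops at the first non-empty bucket, so
	 * the context runs at HIGH; once both HIGH requests retire, the
	 * scan falls through and the context drops to NORMAL.
	 */
]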
@@ -1923,8 +2044,8 @@ static void add_to_context(struct i915_request *rq)
 
 	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
 
-	spin_lock(&ce->guc_active.lock);
-	list_move_tail(&rq->sched.link, &ce->guc_active.requests);
+	spin_lock(&ce->guc_state.lock);
+	list_move_tail(&rq->sched.link, &ce->guc_state.requests);
 
 	if (rq->guc_prio == GUC_PRIO_INIT) {
 		rq->guc_prio = new_guc_prio;
@@ -1936,12 +2057,12 @@ static void add_to_context(struct i915_request *rq)
 	}
 	update_context_prio(ce);
 
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 }
 
 static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->guc_active.lock);
+	lockdep_assert_held(&ce->guc_state.lock);
 
 	if (rq->guc_prio != GUC_PRIO_INIT &&
 	    rq->guc_prio != GUC_PRIO_FINI) {
@@ -1955,7 +2076,7 @@ static void remove_from_context(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
 
-	spin_lock_irq(&ce->guc_active.lock);
+	spin_lock_irq(&ce->guc_state.lock);
 
 	list_del_init(&rq->sched.link);
 	clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
@@ -1965,9 +2086,11 @@ static void remove_from_context(struct i915_request *rq)
 
 	guc_prio_fini(rq, ce);
 
-	spin_unlock_irq(&ce->guc_active.lock);
+	decr_context_committed_requests(ce);
 
-	atomic_dec(&ce->guc_id_ref);
+	spin_unlock_irq(&ce->guc_state.lock);
+
+	atomic_dec(&ce->guc_id.ref);
 	i915_request_notify_execute_cb_imm(rq);
 }
 
@@ -1994,6 +2117,14 @@ static const struct intel_context_ops guc_context_ops = {
 	.create_virtual = guc_create_virtual,
 };
 
+static void submit_work_cb(struct irq_work *wrk)
+{
+	struct i915_request *rq = container_of(wrk, typeof(*rq), submit_work);
+
+	might_lock(&rq->engine->sched_engine->lock);
+	i915_sw_fence_complete(&rq->submit);
+}
+
 static void __guc_signal_context_fence(struct intel_context *ce)
 {
 	struct i915_request *rq;
@@ -2003,8 +2134,12 @@ static void __guc_signal_context_fence(struct intel_context *ce)
 	if (!list_empty(&ce->guc_state.fences))
 		trace_intel_context_fence_release(ce);
 
+	/*
+	 * Use an IRQ to ensure locking order of sched_engine->lock ->
+	 * ce->guc_state.lock is preserved.
+	 */
 	list_for_each_entry(rq, &ce->guc_state.fences, guc_fence_link)
-		i915_sw_fence_complete(&rq->submit);
+		irq_work_queue(&rq->submit_work);
 
 	INIT_LIST_HEAD(&ce->guc_state.fences);
 }
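
[Ed. note: the three pieces of the deferred-submit mechanism are spread
across this patch; gathered into one sketch, with the ordering comments
added here:

	/* guc_request_alloc(), under ce->guc_state.lock: hold the request
	 * back until the outstanding G2H returns */
	init_irq_work(&rq->submit_work, submit_work_cb);
	i915_sw_fence_await(&rq->submit);
	list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);

	/* G2H handler, also under ce->guc_state.lock: completing the fence
	 * here could take sched_engine->lock (see the might_lock() in
	 * submit_work_cb()), inverting the established sched_engine->lock
	 * -> ce->guc_state.lock order, so defer it */
	irq_work_queue(&rq->submit_work);

	/* submit_work_cb(), run from the irq_work with no locks held */
	i915_sw_fence_complete(&rq->submit);
]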
@@ -2022,10 +2157,24 @@ static void guc_signal_context_fence(struct intel_context *ce)
 static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
 {
 	return (new_guc_id || test_bit(CONTEXT_LRCA_DIRTY, &ce->flags) ||
-		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id)) &&
+		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id)) &&
 		!submission_disabled(ce_to_guc(ce));
 }
 
+static void guc_context_init(struct intel_context *ce)
+{
+	const struct i915_gem_context *ctx;
+	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
+
+	rcu_read_lock();
+	ctx = rcu_dereference(ce->gem_context);
+	if (ctx)
+		prio = ctx->sched.priority;
+	rcu_read_unlock();
+
+	ce->guc_state.prio = map_i915_prio_to_guc_prio(prio);
+}
+
 static int guc_request_alloc(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
@@ -2057,14 +2206,17 @@ static int guc_request_alloc(struct i915_request *rq)
 
 	rq->reserved_space -= GUC_REQUEST_SIZE;
 
+	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))
+		guc_context_init(ce);
+
 	/*
 	 * Call pin_guc_id here rather than in the pinning step as with
 	 * dma_resv, contexts can be repeatedly pinned / unpinned trashing the
-	 * guc_ids and creating horrible race conditions. This is especially bad
-	 * when guc_ids are being stolen due to over subscription. By the time
+	 * guc_id and creating horrible race conditions. This is especially bad
+	 * when a guc_id is being stolen due to over subscription. By the time
 	 * this function is reached, it is guaranteed that the guc_id will be
 	 * persistent until the generated request is retired. Thus, sealing these
-	 * race conditions. It is still safe to fail here if guc_ids are
+	 * race conditions. It is still safe to fail here if guc_id are
 	 * exhausted and return -EAGAIN to the user indicating that they can try
 	 * again in the future.
 	 *
@@ -2074,7 +2226,7 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * decremented on each retire. When it is zero, a lock around the
 	 * increment (in pin_guc_id) is needed to seal a race with unpin_guc_id.
 	 */
-	if (atomic_add_unless(&ce->guc_id_ref, 1, 0))
+	if (atomic_add_unless(&ce->guc_id.ref, 1, 0))
 		goto out;
 
 	ret = pin_guc_id(guc, ce);	/* returns 1 if new guc_id assigned */
@@ -2087,7 +2239,7 @@ static int guc_request_alloc(struct i915_request *rq)
 				disable_submission(guc);
 				goto out;	/* GPU will be reset */
 			}
-			atomic_dec(&ce->guc_id_ref);
+			atomic_dec(&ce->guc_id.ref);
 			unpin_guc_id(guc, ce);
 			return ret;
 		}
@@ -2102,22 +2254,16 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * schedule enable or context registration if either G2H is pending
 	 * respectively. Once a G2H returns, the fence is released that is
 	 * blocking these requests (see guc_signal_context_fence).
-	 *
-	 * We can safely check the below fields outside of the lock as it isn't
-	 * possible for these fields to transition from being clear to set but
-	 * converse is possible, hence the need for the check within the lock.
 	 */
-	if (likely(!context_wait_for_deregister_to_register(ce) &&
-		   !context_pending_disable(ce)))
-		return 0;
-
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	if (context_wait_for_deregister_to_register(ce) ||
 	    context_pending_disable(ce)) {
+		init_irq_work(&rq->submit_work, submit_work_cb);
 		i915_sw_fence_await(&rq->submit);
 
 		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
 	}
+	incr_context_committed_requests(ce);
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	return 0;
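
[Ed. note: incr/decr_context_committed_requests() and
context_has_committed_requests() are not shown in this excerpt; their
likely definitions, assuming the documented locking, would be:

	static void incr_context_committed_requests(struct intel_context *ce)
	{
		lockdep_assert_held(&ce->guc_state.lock);
		++ce->guc_state.number_committed_requests;
	}

	static void decr_context_committed_requests(struct intel_context *ce)
	{
		lockdep_assert_held(&ce->guc_state.lock);
		--ce->guc_state.number_committed_requests;
		GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
	}

	static bool context_has_committed_requests(struct intel_context *ce)
	{
		return !!ce->guc_state.number_committed_requests;
	}
]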
@@ -2259,7 +2405,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 	     !new_guc_prio_higher(rq->guc_prio, new_guc_prio)))
 		return;
 
-	spin_lock(&ce->guc_active.lock);
+	spin_lock(&ce->guc_state.lock);
 	if (rq->guc_prio != GUC_PRIO_FINI) {
 		if (rq->guc_prio != GUC_PRIO_INIT)
 			sub_context_inflight_prio(ce, rq->guc_prio);
@@ -2267,16 +2413,16 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 		add_context_inflight_prio(ce, rq->guc_prio);
 		update_context_prio(ce);
 	}
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 }
 
 static void guc_retire_inflight_request_prio(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
 
-	spin_lock(&ce->guc_active.lock);
+	spin_lock(&ce->guc_state.lock);
 	guc_prio_fini(rq, ce);
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 }
 
 static void sanitize_hwsp(struct intel_engine_cs *engine)
@@ -2355,15 +2501,15 @@ static void guc_set_default_submission(struct intel_engine_cs *engine)
 	engine->submit_request = guc_submit_request;
 }
 
-static inline void guc_kernel_context_pin(struct intel_guc *guc,
-					  struct intel_context *ce)
+static void guc_kernel_context_pin(struct intel_guc *guc,
+				   struct intel_context *ce)
 {
 	if (context_guc_id_invalid(ce))
 		pin_guc_id(guc, ce);
 	guc_lrc_desc_pin(ce, true);
 }
 
-static inline void guc_init_lrc_mapping(struct intel_guc *guc)
+static void guc_init_lrc_mapping(struct intel_guc *guc)
 {
 	struct intel_gt *gt = guc_to_gt(guc);
 	struct intel_engine_cs *engine;
@@ -2466,7 +2612,7 @@ static void rcs_submission_override(struct intel_engine_cs *engine)
 	}
 }
 
-static inline void guc_default_irqs(struct intel_engine_cs *engine)
+static void guc_default_irqs(struct intel_engine_cs *engine)
 {
 	engine->irq_keep_mask = GT_RENDER_USER_INTERRUPT;
 	intel_engine_set_irq_handler(engine, cs_irq_handler);
@@ -2562,7 +2708,7 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
 	guc->submission_selected = __guc_submission_selected(guc);
 }
 
-static inline struct intel_context *
+static struct intel_context *
 g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 {
 	struct intel_context *ce;
@@ -2583,12 +2729,6 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 	return ce;
 }
 
-static void decr_outstanding_submission_g2h(struct intel_guc *guc)
-{
-	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
-		wake_up_all(&guc->ct.wq);
-}
-
 int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 					  const u32 *msg,
 					  u32 len)
@@ -2607,6 +2747,13 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 
 	trace_intel_context_deregister_done(ce);
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+	if (unlikely(ce->drop_deregister)) {
+		ce->drop_deregister = false;
+		return 0;
+	}
+#endif
+
 	if (context_wait_for_deregister_to_register(ce)) {
 		struct intel_runtime_pm *runtime_pm =
 			&ce->engine->gt->i915->runtime_pm;
@@ -2652,8 +2799,7 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 		     (!context_pending_enable(ce) &&
 		     !context_pending_disable(ce)))) {
 		drm_err(&guc_to_gt(guc)->i915->drm,
-			"Bad context sched_state 0x%x, 0x%x, desc_idx %u",
-			atomic_read(&ce->guc_sched_state_no_lock),
+			"Bad context sched_state 0x%x, desc_idx %u",
 			ce->guc_state.sched_state, desc_idx);
 		return -EPROTO;
 	}
@@ -2661,10 +2807,26 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 	trace_intel_context_sched_done(ce);
 
 	if (context_pending_enable(ce)) {
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_enable)) {
+			ce->drop_schedule_enable = false;
+			return 0;
+		}
+#endif
+
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		clr_context_pending_enable(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	} else if (context_pending_disable(ce)) {
 		bool banned;
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_disable)) {
+			ce->drop_schedule_disable = false;
+			return 0;
+		}
+#endif
+
 		/*
 		 * Unpin must be done before __guc_signal_context_fence,
 		 * otherwise a race exists between the requests getting
@@ -2721,7 +2883,12 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
-	if (likely(!intel_context_is_banned(ce))) {
+	/*
+	 * XXX: Racey if request cancellation has occurred, see comment in
+	 * __guc_reset_context().
+	 */
+	if (likely(!intel_context_is_banned(ce) &&
+		   !context_blocked(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
 	}
@@ -2797,33 +2964,48 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 	struct intel_context *ce;
 	struct i915_request *rq;
 	unsigned long index;
+	unsigned long flags;
 
 	/* Reset called during driver load? GuC not yet initialised! */
 	if (unlikely(!guc_submission_initialized(guc)))
 		return;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		if (!intel_context_is_pinned(ce))
+		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
+		xa_unlock(&guc->context_lookup);
+
+		if (!intel_context_is_pinned(ce))
+			goto next;
+
 		if (intel_engine_is_virtual(ce->engine)) {
 			if (!(ce->engine->mask & engine->mask))
-				continue;
+				goto next;
 		} else {
 			if (ce->engine != engine)
-				continue;
+				goto next;
 		}
 
-		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
+		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
 			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
 				continue;
 
 			intel_engine_set_hung_context(engine, ce);
 
 			/* Can only cope with one hang at a time... */
-			return;
+			intel_context_put(ce);
+			xa_lock(&guc->context_lookup);
+			goto done;
 		}
+next:
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
 	}
+done:
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
@@ -2839,23 +3021,34 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
 	if (unlikely(!guc_submission_initialized(guc)))
 		return;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		if (!intel_context_is_pinned(ce))
+		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
+		xa_unlock(&guc->context_lookup);
+
+		if (!intel_context_is_pinned(ce))
+			goto next;
+
 		if (intel_engine_is_virtual(ce->engine)) {
 			if (!(ce->engine->mask & engine->mask))
-				continue;
+				goto next;
 		} else {
 			if (ce->engine != engine)
-				continue;
+				goto next;
 		}
 
-		spin_lock_irqsave(&ce->guc_active.lock, flags);
-		intel_engine_dump_active_requests(&ce->guc_active.requests,
+		spin_lock(&ce->guc_state.lock);
+		intel_engine_dump_active_requests(&ce->guc_state.requests,
 						  hung_rq, m);
-		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
+		spin_unlock(&ce->guc_state.lock);
+
+next:
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 void intel_guc_submission_print_info(struct intel_guc *guc,
@@ -2881,25 +3074,24 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 
 		priolist_for_each_request(rq, pl)
 			drm_printf(p, "guc_id=%u, seqno=%llu\n",
-				   rq->context->guc_id,
+				   rq->context->guc_id.id,
 				   rq->fence.seqno);
 	}
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 	drm_printf(p, "\n");
 }
 
-static inline void guc_log_context_priority(struct drm_printer *p,
-					    struct intel_context *ce)
+static void guc_log_context_priority(struct drm_printer *p,
+				     struct intel_context *ce)
 {
 	int i;
 
-	drm_printf(p, "\t\tPriority: %d\n",
-		   ce->guc_prio);
+	drm_printf(p, "\t\tPriority: %d\n", ce->guc_state.prio);
 	drm_printf(p, "\t\tNumber Requests (lower index == higher priority)\n");
 	for (i = GUC_CLIENT_PRIORITY_KMD_HIGH;
 	     i < GUC_CLIENT_PRIORITY_NUM; ++i) {
 		drm_printf(p, "\t\tNumber requests in priority band[%d]: %d\n",
-			   i, ce->guc_prio_count[i]);
+			   i, ce->guc_state.prio_count[i]);
 	}
 	drm_printf(p, "\n");
 }
@@ -2909,9 +3101,11 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
+		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
 		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
 		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
 			   ce->ring->head,
@@ -2922,13 +3116,13 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 		drm_printf(p, "\t\tContext Pin Count: %u\n",
 			   atomic_read(&ce->pin_count));
 		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
-			   atomic_read(&ce->guc_id_ref));
-		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
-			   ce->guc_state.sched_state,
-			   atomic_read(&ce->guc_sched_state_no_lock));
+			   atomic_read(&ce->guc_id.ref));
+		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
+			   ce->guc_state.sched_state);
 
 		guc_log_context_priority(p, ce);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 static struct intel_context *
@@ -3036,3 +3230,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 
 	return false;
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftest_guc.c"
+#endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
new file mode 100644
index 000000000000..264e2f705c17
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include "selftests/intel_scheduler_helpers.h"
+
+static struct i915_request *nop_user_request(struct intel_context *ce,
+					     struct i915_request *from)
+{
+	struct i915_request *rq;
+	int ret;
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	if (from) {
+		ret = i915_sw_fence_await_dma_fence(&rq->submit,
+						    &from->fence, 0,
+						    I915_FENCE_GFP);
+		if (ret < 0) {
+			i915_request_put(rq);
+			return ERR_PTR(ret);
+		}
+	}
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static int intel_guc_scrub_ctbs(void *arg)
+{
+	struct intel_gt *gt = arg;
+	int ret = 0;
+	int i;
+	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct intel_context *ce;
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+	engine = intel_selftest_find_any_engine(gt);
+	if (!engine) {
+		ret = -ENODEV;
+		goto err;
+	}
+
+	/* Submit requests and inject errors forcing G2H to be dropped */
+	for (i = 0; i < 3; ++i) {
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce)) {
+			ret = PTR_ERR(ce);
+			pr_err("Failed to create context, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		switch (i) {
+		case 0:
+			ce->drop_schedule_enable = true;
+			break;
+		case 1:
+			ce->drop_schedule_disable = true;
+			break;
+		case 2:
+			ce->drop_deregister = true;
+			break;
+		}
+
+		rq = nop_user_request(ce, NULL);
+		intel_context_put(ce);
+
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			pr_err("Failed to create request, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		last[i] = rq;
+	}
+
+	for (i = 0; i < 3; ++i) {
+		ret = i915_request_wait(last[i], 0, HZ);
+		if (ret < 0) {
+			pr_err("Last request failed to complete: %d\n", ret);
+			goto err;
+		}
+		i915_request_put(last[i]);
+		last[i] = NULL;
+	}
+
+	/* Force all H2G / G2H to be submitted / processed */
+	intel_gt_retire_requests(gt);
+	msleep(500);
+
+	/* Scrub missing G2H */
+	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
+
+	ret = intel_gt_wait_for_idle(gt, HZ);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err;
+	}
+
+err:
+	for (i = 0; i < 3; ++i)
+		if (last[i])
+			i915_request_put(last[i]);
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+
+	return ret;
+}
+
+int intel_guc_live_selftests(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_scrub_ctbs),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 0f08bcfbe964..2819b69fbb72 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -49,8 +49,7 @@
 #include "i915_memcpy.h"
 #include "i915_scatterlist.h"
 
-#define ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
-#define ATOMIC_MAYFAIL (GFP_ATOMIC | __GFP_NOWARN)
+#define ATOMIC_MAYFAIL (GFP_NOWAIT | __GFP_NOWARN)
 
 static void __sg_set_buf(struct scatterlist *sg,
 			 void *addr, unsigned int len, loff_t it)
@@ -79,7 +78,7 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
 	if (e->cur == e->end) {
 		struct scatterlist *sgl;
 
-		sgl = (typeof(sgl))__get_free_page(ALLOW_FAIL);
+		sgl = (typeof(sgl))__get_free_page(ATOMIC_MAYFAIL);
 		if (!sgl) {
 			e->err = -ENOMEM;
 			return false;
@@ -99,10 +98,10 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
 	}
 
 	e->size = ALIGN(len + 1, SZ_64K);
-	e->buf = kmalloc(e->size, ALLOW_FAIL);
+	e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
 	if (!e->buf) {
 		e->size = PAGE_ALIGN(len + 1);
-		e->buf = kmalloc(e->size, GFP_KERNEL);
+		e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
 	}
 	if (!e->buf) {
 		e->err = -ENOMEM;
@@ -243,12 +242,12 @@ static bool compress_init(struct i915_vma_compress *c)
 {
 	struct z_stream_s *zstream = &c->zstream;
 
-	if (pool_init(&c->pool, ALLOW_FAIL))
+	if (pool_init(&c->pool, ATOMIC_MAYFAIL))
 		return false;
 
 	zstream->workspace =
 		kmalloc(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL),
-			ALLOW_FAIL);
+			ATOMIC_MAYFAIL);
 	if (!zstream->workspace) {
 		pool_fini(&c->pool);
 		return false;
@@ -256,7 +255,7 @@ static bool compress_init(struct i915_vma_compress *c)
 
 	c->tmp = NULL;
 	if (i915_has_memcpy_from_wc())
-		c->tmp = pool_alloc(&c->pool, ALLOW_FAIL);
+		c->tmp = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 
 	return true;
 }
@@ -280,7 +279,7 @@ static void *compress_next_page(struct i915_vma_compress *c,
 	if (dst->page_count >= dst->num_pages)
 		return ERR_PTR(-ENOSPC);
 
-	page = pool_alloc(&c->pool, ALLOW_FAIL);
+	page = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 	if (!page)
 		return ERR_PTR(-ENOMEM);
 
@@ -376,7 +375,7 @@ struct i915_vma_compress {
 
 static bool compress_init(struct i915_vma_compress *c)
 {
-	return pool_init(&c->pool, ALLOW_FAIL) == 0;
+	return pool_init(&c->pool, ATOMIC_MAYFAIL) == 0;
 }
 
 static bool compress_start(struct i915_vma_compress *c)
@@ -391,7 +390,7 @@ static int compress_page(struct i915_vma_compress *c,
 {
 	void *ptr;
 
-	ptr = pool_alloc(&c->pool, ALLOW_FAIL);
+	ptr = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 	if (!ptr)
 		return -ENOMEM;
 
@@ -997,7 +996,7 @@ i915_vma_coredump_create(const struct intel_gt *gt,
 
 	num_pages = min_t(u64, vma->size, vma->obj->base.size) >> PAGE_SHIFT;
 	num_pages = DIV_ROUND_UP(10 * num_pages, 8); /* worstcase zlib growth */
-	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ALLOW_FAIL);
+	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ATOMIC_MAYFAIL);
 	if (!dst)
 		return NULL;
 
@@ -1433,7 +1432,7 @@ capture_engine(struct intel_engine_cs *engine,
 	struct i915_request *rq = NULL;
 	unsigned long flags;
 
-	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
+	ee = intel_engine_coredump_alloc(engine, ATOMIC_MAYFAIL);
 	if (!ee)
 		return NULL;
 
@@ -1481,7 +1480,7 @@ gt_record_engines(struct intel_gt_coredump *gt,
 		struct intel_engine_coredump *ee;
 
 		/* Refill our page pool before entering atomic section */
-		pool_refill(&compress->pool, ALLOW_FAIL);
+		pool_refill(&compress->pool, ATOMIC_MAYFAIL);
 
 		ee = capture_engine(engine, compress);
 		if (!ee)
@@ -1507,7 +1506,7 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	const struct intel_uc *uc = &gt->_gt->uc;
 	struct intel_uc_coredump *error_uc;
 
-	error_uc = kzalloc(sizeof(*error_uc), ALLOW_FAIL);
+	error_uc = kzalloc(sizeof(*error_uc), ATOMIC_MAYFAIL);
 	if (!error_uc)
 		return NULL;
 
@@ -1518,8 +1517,8 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	 * As modparams are generally accessible from userspace, make
 	 * explicit copies of the firmware paths.
 	 */
-	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ALLOW_FAIL);
-	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ALLOW_FAIL);
+	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ATOMIC_MAYFAIL);
+	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ATOMIC_MAYFAIL);
 	error_uc->guc_log =
 		i915_vma_coredump_create(gt->_gt,
 					 uc->guc.log.vma, "GuC log buffer",
@@ -1778,7 +1777,7 @@ i915_vma_capture_prepare(struct intel_gt_coredump *gt)
 {
 	struct i915_vma_compress *compress;
 
-	compress = kmalloc(sizeof(*compress), ALLOW_FAIL);
+	compress = kmalloc(sizeof(*compress), ATOMIC_MAYFAIL);
 	if (!compress)
 		return NULL;
 
@@ -1811,11 +1810,11 @@ i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
 	if (IS_ERR(error))
 		return error;
 
-	error = i915_gpu_coredump_alloc(i915, ALLOW_FAIL);
+	error = i915_gpu_coredump_alloc(i915, ATOMIC_MAYFAIL);
 	if (!error)
 		return ERR_PTR(-ENOMEM);
 
-	error->gt = intel_gt_coredump_alloc(gt, ALLOW_FAIL);
+	error->gt = intel_gt_coredump_alloc(gt, ATOMIC_MAYFAIL);
 	if (error->gt) {
 		struct i915_vma_compress *compress;
 
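
[Ed. note: the hunks above are mechanical, so the rationale is easy to
miss. With GuC submission the capture path can apparently be entered from
atomic context (capture_engine() above now allocates while holding
spinlocks), so nothing here may sleep: GFP_NOWAIT never sleeps, and
__GFP_NOWARN suppresses the failure splat since every caller degrades
gracefully. Illustrative only:

	/* safe from softirq / under spinlocks; returns NULL instead of
	 * sleeping or warning when memory is tight */
	ee = kzalloc(sizeof(*ee), ATOMIC_MAYFAIL);
	if (!ee)
		return NULL;	/* capture is best-effort: give up quietly */
]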
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 1bc1349ba3c2..177eaf55adff 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -218,6 +218,11 @@ struct i915_request {
 	};
 	struct llist_head execute_cb;
 	struct i915_sw_fence semaphore;
+	/**
+	 * @submit_work: complete submit fence from an IRQ if needed for
+	 * locking hierarchy reasons.
+	 */
+	struct irq_work submit_work;
 
 	/*
 	 * A list of everyone we wait upon, and everyone who waits upon us.
@@ -285,18 +290,20 @@ struct i915_request {
 		struct hrtimer timer;
 	} watchdog;
 
-	/*
-	 * Requests may need to be stalled when using GuC submission waiting for
-	 * certain GuC operations to complete. If that is the case, stalled
-	 * requests are added to a per context list of stalled requests. The
-	 * below list_head is the link in that list.
+	/**
+	 * @guc_fence_link: Requests may need to be stalled when using GuC
+	 * submission waiting for certain GuC operations to complete. If that is
+	 * the case, stalled requests are added to a per context list of stalled
+	 * requests. The below list_head is the link in that list. Protected by
+	 * ce->guc_state.lock.
 	 */
 	struct list_head guc_fence_link;
 
 	/**
-	 * Priority level while the request is inflight. Differs from i915
-	 * scheduler priority. See comment above
-	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details.
+	 * @guc_prio: Priority level while the request is inflight. Differs from
+	 * i915 scheduler priority. See comment above
+	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details. Protected by
+	 * ce->guc_state.lock.
 	 */
 #define	GUC_PRIO_INIT	0xff
 #define	GUC_PRIO_FINI	0xfe
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 806ad688274b..ec7fe12b94aa 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -805,7 +805,7 @@ DECLARE_EVENT_CLASS(i915_request,
 			   __entry->dev = rq->engine->i915->drm.primary->index;
 			   __entry->class = rq->engine->uabi_class;
 			   __entry->instance = rq->engine->uabi_instance;
-			   __entry->guc_id = rq->context->guc_id;
+			   __entry->guc_id = rq->context->guc_id.id;
 			   __entry->ctx = rq->fence.context;
 			   __entry->seqno = rq->fence.seqno;
 			   __entry->tail = rq->tail;
@@ -903,23 +903,19 @@ DECLARE_EVENT_CLASS(intel_context,
 			     __field(u32, guc_id)
 			     __field(int, pin_count)
 			     __field(u32, sched_state)
-			     __field(u32, guc_sched_state_no_lock)
 			     __field(u8, guc_prio)
 			     ),
 
 		    TP_fast_assign(
-			   __entry->guc_id = ce->guc_id;
+			   __entry->guc_id = ce->guc_id.id;
 			   __entry->pin_count = atomic_read(&ce->pin_count);
 			   __entry->sched_state = ce->guc_state.sched_state;
-			   __entry->guc_sched_state_no_lock =
-			   atomic_read(&ce->guc_sched_state_no_lock);
-			   __entry->guc_prio = ce->guc_prio;
+			   __entry->guc_prio = ce->guc_state.prio;
 			   ),
 
-		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x,0x%x, guc_prio=%u",
+		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
 			      __entry->guc_id, __entry->pin_count,
 			      __entry->sched_state,
-			      __entry->guc_sched_state_no_lock,
 			      __entry->guc_prio)
 );
 
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index cfa5c4165a4f..3cf6758931f9 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
 selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
+selftest(guc, intel_guc_live_selftests)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index d67710d10615..e2c5db77f087 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -772,6 +772,98 @@ static int __cancel_completed(struct intel_engine_cs *engine)
 	return err;
 }
 
+static int __cancel_reset(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq, *nop;
+	unsigned long preempt_timeout_ms;
+	int err = 0;
+
+	preempt_timeout_ms = engine->props.preempt_timeout_ms;
+	engine->props.preempt_timeout_ms = 100;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		goto out_restore;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_NOOP);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling active request\n", engine->name);
+	i915_request_get(rq);
+	i915_request_add(rq);
+	if (!igt_wait_for_spinner(&spin, rq)) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("Failed to start spinner on %s\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	nop = intel_context_create_request(ce);
+	if (IS_ERR(nop)) {
+		err = PTR_ERR(nop);
+		goto out_rq;
+	}
+	i915_request_get(nop);
+	i915_request_add(nop);
+
+	i915_request_cancel(rq, -EINTR);
+
+	if (i915_request_wait(rq, 0, HZ) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel hung request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_nop;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+		goto out_nop;
+	}
+
+	if (i915_request_wait(nop, 0, HZ) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to complete nop request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_nop;
+	}
+
+	if (nop->fence.error != 0) {
+		pr_err("%s: Nop request errored (%u)\n",
+		       engine->name, nop->fence.error);
+		err = -EINVAL;
+	}
+
+out_nop:
+	i915_request_put(nop);
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+out_restore:
+	engine->props.preempt_timeout_ms = preempt_timeout_ms;
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
 static int live_cancel_request(void *arg)
 {
 	struct drm_i915_private *i915 = arg;
@@ -804,6 +896,14 @@ static int live_cancel_request(void *arg)
 			return err;
 		if (err2)
 			return err2;
+
+		/* Expects reset so call outside of igt_live_test_* */
+		err = __cancel_reset(engine);
+		if (err)
+			return err;
+
+		if (igt_flush_test(i915))
+			return -EIO;
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
index 4b328346b48a..310fb83c527e 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
@@ -14,6 +14,18 @@
 #define REDUCED_PREEMPT		10
 #define WAIT_FOR_RESET_TIME	10000
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, gt, id)
+		return engine;
+
+	pr_err("No valid engine found!\n");
+	return NULL;
+}
+
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 u32 modify_type)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
index 35c098601ac0..ae60bb507f45 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
@@ -10,6 +10,7 @@
 
 struct i915_request;
 struct intel_engine_cs;
+struct intel_gt;
 
 struct intel_selftest_saved_policy {
 	u32 flags;
@@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
 	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
 };
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 enum selftest_scheduler_modify modify_type);
-- 
2.32.0



* [Intel-gfx] [PATCH 01/27] drm/i915/guc: Squash Clean up GuC CI failures, simplify locking, and kernel DOC
@ 2021-08-20 22:44   ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

https://patchwork.freedesktop.org/series/93704/

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |  19 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  81 +-
 .../drm/i915/gt/intel_execlists_submission.c  |   4 -
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   6 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  29 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  19 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |   6 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 952 +++++++++++-------
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 +++
 drivers/gpu/drm/i915/i915_gpu_error.c         |  39 +-
 drivers/gpu/drm/i915/i915_request.h           |  23 +-
 drivers/gpu/drm/i915/i915_trace.h             |  12 +-
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 drivers/gpu/drm/i915/selftests/i915_request.c | 100 ++
 .../i915/selftests/intel_scheduler_helpers.c  |  12 +
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 16 files changed, 965 insertions(+), 466 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 745e84c72c90..adfe49b53b1b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -394,19 +394,18 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 	spin_lock_init(&ce->guc_state.lock);
 	INIT_LIST_HEAD(&ce->guc_state.fences);
+	INIT_LIST_HEAD(&ce->guc_state.requests);
 
-	spin_lock_init(&ce->guc_active.lock);
-	INIT_LIST_HEAD(&ce->guc_active.requests);
-
-	ce->guc_id = GUC_INVALID_LRC_ID;
-	INIT_LIST_HEAD(&ce->guc_id_link);
+	ce->guc_id.id = GUC_INVALID_LRC_ID;
+	INIT_LIST_HEAD(&ce->guc_id.link);
 
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
 	 */
-	i915_sw_fence_init(&ce->guc_blocked, sw_fence_dummy_notify);
-	i915_sw_fence_commit(&ce->guc_blocked);
+	i915_sw_fence_init(&ce->guc_state.blocked_fence,
+			   sw_fence_dummy_notify);
+	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
 
 	i915_active_init(&ce->active,
 			 __intel_context_active, __intel_context_retire, 0);
@@ -520,15 +519,15 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 
 	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
 
-	spin_lock_irqsave(&ce->guc_active.lock, flags);
-	list_for_each_entry_reverse(rq, &ce->guc_active.requests,
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
 				    sched.link) {
 		if (i915_request_completed(rq))
 			break;
 
 		active = rq;
 	}
-	spin_unlock_irqrestore(&ce->guc_active.lock, flags);
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	return active;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index e54351a170e2..80bbdc7810f6 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -112,6 +112,7 @@ struct intel_context {
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
 #define CONTEXT_NOPREEMPT		8
 #define CONTEXT_LRCA_DIRTY		9
+#define CONTEXT_GUC_INIT		10
 
 	struct {
 		u64 timeout_us;
@@ -155,49 +156,73 @@ struct intel_context {
 	u8 wa_bb_page; /* if set, page num reserved for context workarounds */
 
 	struct {
-		/** lock: protects everything in guc_state */
+		/** @lock: protects everything in guc_state */
 		spinlock_t lock;
 		/**
-		 * sched_state: scheduling state of this context using GuC
+		 * @sched_state: scheduling state of this context using GuC
 		 * submission
 		 */
-		u16 sched_state;
+		u32 sched_state;
 		/*
-		 * fences: maintains of list of requests that have a submit
-		 * fence related to GuC submission
+		 * @fences: maintains a list of requests that are currently being
+		 * fenced until a GuC operation completes
 		 */
 		struct list_head fences;
+		/**
+		 * @blocked_fence: fence used to signal when the blocking of a
+		 * context's submissions is complete.
+		 */
+		struct i915_sw_fence blocked_fence;
+		/** @number_committed_requests: number of committed requests */
+		int number_committed_requests;
+		/** @requests: list of active requests on this context */
+		struct list_head requests;
+		/** @prio: the context's current guc priority */
+		u8 prio;
+		/**
+		 * @prio_count: a counter of the number of requests inflight in
+		 * each priority bucket
+		 */
+		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
 	} guc_state;
 
 	struct {
-		/** lock: protects everything in guc_active */
-		spinlock_t lock;
-		/** requests: active requests on this context */
-		struct list_head requests;
-	} guc_active;
-
-	/* GuC scheduling state flags that do not require a lock. */
-	atomic_t guc_sched_state_no_lock;
-
-	/* GuC LRC descriptor ID */
-	u16 guc_id;
-
-	/* GuC LRC descriptor reference count */
-	atomic_t guc_id_ref;
+		/**
+		 * @id: unique handle which is used to communicate information
+		 * with the GuC about this context, protected by
+		 * guc->contexts_lock
+		 */
+		u16 id;
+		/**
+		 * @ref: the number of references to the guc_id; protected
+		 * by guc->contexts_lock when transitioning in and out
+		 * of zero
+		 */
+		atomic_t ref;
+		/**
+		 * @link: in guc->guc_id_list when the guc_id has no refs but is
+		 * still valid, protected by guc->contexts_lock
+		 */
+		struct list_head link;
+	} guc_id;
 
-	/*
-	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
+#ifdef CONFIG_DRM_I915_SELFTEST
+	/**
+	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
 	 */
-	struct list_head guc_id_link;
+	bool drop_schedule_enable;
 
-	/* GuC context blocked fence */
-	struct i915_sw_fence guc_blocked;
+	/**
+	 * @drop_schedule_disable: Force drop of schedule disable G2H for
+	 * selftest
+	 */
+	bool drop_schedule_disable;
 
-	/*
-	 * GuC priority management
+	/**
+	 * @drop_deregister: Force drop of deregister G2H for selftest
 	 */
-	u8 guc_prio;
-	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
+	bool drop_deregister;
+#endif
 };
 
 #endif /* __INTEL_CONTEXT_TYPES__ */
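
The three guc_id fields above form a small lifecycle: @id is assigned under
guc->contexts_lock, @ref counts active users, and @link parks the context on
guc->guc_id_list once the last reference drops so the id stays valid but
becomes stealable. A minimal sketch of the parking step (illustrative only,
not part of this patch; assumes the caller holds guc->contexts_lock):

	static void sketch_park_guc_id(struct intel_guc *guc,
				       struct intel_context *ce)
	{
		lockdep_assert_held(&guc->contexts_lock);

		/*
		 * Last reference dropped: keep the guc_id assigned but make
		 * it stealable by parking the context on guc->guc_id_list.
		 */
		if (ce->guc_id.id != GUC_INVALID_LRC_ID &&
		    !atomic_read(&ce->guc_id.ref) &&
		    list_empty(&ce->guc_id.link))
			list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
	}
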
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index de5f9c86b9a4..cafb0608ffb4 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2140,10 +2140,6 @@ static void __execlists_unhold(struct i915_request *rq)
 			if (p->flags & I915_DEPENDENCY_WEAK)
 				continue;
 
-			/* Propagate any change in error status */
-			if (rq->fence.error)
-				i915_request_set_error_once(w, rq->fence.error);
-
 			if (w->engine != rq->engine)
 				continue;
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 08f011f893b2..bf43bed905db 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -789,7 +789,7 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
 				if (err)
 					pr_err("[%s] Wait for request %lld:%lld [0x%04X] failed: %d!\n",
 					       engine->name, rq->fence.context,
-					       rq->fence.seqno, rq->context->guc_id, err);
+					       rq->fence.seqno, rq->context->guc_id.id, err);
 			}
 
 skip:
@@ -1098,7 +1098,7 @@ static int __igt_reset_engines(struct intel_gt *gt,
 				if (err)
 					pr_err("[%s] Wait for request %lld:%lld [0x%04X] failed: %d!\n",
 					       engine->name, rq->fence.context,
-					       rq->fence.seqno, rq->context->guc_id, err);
+					       rq->fence.seqno, rq->context->guc_id.id, err);
 			}
 
 			count++;
@@ -1108,7 +1108,7 @@ static int __igt_reset_engines(struct intel_gt *gt,
 					pr_err("i915_reset_engine(%s:%s): failed to reset request %lld:%lld [0x%04X]\n",
 					       engine->name, test_name,
 					       rq->fence.context,
-					       rq->fence.seqno, rq->context->guc_id);
+					       rq->fence.seqno, rq->context->guc_id.id);
 					i915_request_put(rq);
 
 					GEM_TRACE_DUMP();
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index b0977a3b699b..cdc6ae48a1e1 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
 	goto err_after;
 }
 
+static u32 safe_offset(u32 offset, u32 reg)
+{
+	/* XXX skip testing of watchdog - VLK-22772 */
+	if (offset == 0x178 || offset == 0x17c)
+		reg = 0;
+
+	return reg;
+}
+
+static int get_offset_mask(struct intel_engine_cs *engine)
+{
+	if (GRAPHICS_VER(engine->i915) < 12)
+		return 0xfff;
+
+	switch (engine->class) {
+	default:
+	case RENDER_CLASS:
+		return 0x07ff;
+	case COPY_ENGINE_CLASS:
+		return 0x0fff;
+	case VIDEO_DECODE_CLASS:
+	case VIDEO_ENHANCEMENT_CLASS:
+		return 0x3fff;
+	}
+}
+
 static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 {
 	struct i915_vma *batch;
@@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
 		len = (len + 1) / 2;
 		*cs++ = MI_LOAD_REGISTER_IMM(len);
 		while (len--) {
-			*cs++ = hw[dw];
+			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
+					    hw[dw]);
 			*cs++ = poison;
 			dw += 2;
 		}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 2e27fe59786b..112dd29a63fe 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -41,6 +41,10 @@ struct intel_guc {
 	spinlock_t irq_lock;
 	unsigned int msg_enabled_mask;
 
+	/**
+	 * @outstanding_submission_g2h: number of outstanding G2H related to GuC
+	 * submission, used to determine if the GT is idle
+	 */
 	atomic_t outstanding_submission_g2h;
 
 	struct {
@@ -49,12 +53,16 @@ struct intel_guc {
 		void (*disable)(struct intel_guc *guc);
 	} interrupts;
 
-	/*
-	 * contexts_lock protects the pool of free guc ids and a linked list of
-	 * guc ids available to be stolen
+	/**
+	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
+	 * ce->guc_id.ref when transitioning in and out of zero
 	 */
 	spinlock_t contexts_lock;
+	/** @guc_ids: used to allocate new guc_ids */
 	struct ida guc_ids;
+	/**
+	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
+	 */
 	struct list_head guc_id_list;
 
 	bool submission_supported;
@@ -70,7 +78,10 @@ struct intel_guc {
 	struct i915_vma *lrc_desc_pool;
 	void *lrc_desc_pool_vaddr;
 
-	/* guc_id to intel_context lookup */
+	/**
+	 * @context_lookup: used to resolve an intel_context from a guc_id; if
+	 * a context is present in this structure it is registered with the GuC
+	 */
 	struct xarray context_lookup;
 
 	/* Control params for fw initialization */
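
Taken together: guc_ids hands out fresh ids, guc_id_list holds valid but
unreferenced ids that may be stolen, and context_lookup maps an id back to
its context. A condensed sketch of the allocate-or-steal flow (simplified
from the new_guc_id() / steal_guc_id() paths later in this patch; error
handling elided, caller holds guc->contexts_lock):

	static int sketch_get_guc_id(struct intel_guc *guc)
	{
		int id;

		lockdep_assert_held(&guc->contexts_lock);

		/* Fast path: take a fresh id from the ida. */
		id = ida_simple_get(&guc->guc_ids, 0, GUC_MAX_LRC_DESCRIPTORS,
				    GFP_KERNEL | __GFP_RETRY_MAYFAIL |
				    __GFP_NOWARN);
		if (id >= 0)
			return id;

		/* Slow path: steal an id parked on guc_id_list (no refs). */
		if (!list_empty(&guc->guc_id_list)) {
			struct intel_context *ce =
				list_first_entry(&guc->guc_id_list,
						 struct intel_context,
						 guc_id.link);

			list_del_init(&ce->guc_id.link);
			id = ce->guc_id.id;
			ce->guc_id.id = GUC_INVALID_LRC_ID;
			return id;
		}

		return -EAGAIN;	/* caller retires requests and retries */
	}
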
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 22b4733b55e2..20c710a74498 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -1042,9 +1042,9 @@ static void ct_incoming_request_worker_func(struct work_struct *w)
 		container_of(w, struct intel_guc_ct, requests.worker);
 	bool done;
 
-	done = ct_process_incoming_requests(ct);
-	if (!done)
-		queue_work(system_unbound_wq, &ct->requests.worker);
+	do {
+		done = ct_process_incoming_requests(ct);
+	} while (!done);
 }
 
 static int ct_handle_event(struct intel_guc_ct *ct, struct ct_incoming_msg *request)
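
The worker change above swaps requeue-on-backlog for drain-in-place: instead
of rescheduling itself on system_unbound_wq when requests remain, the worker
now loops until ct_process_incoming_requests() reports done, so a single
flush_work() (as used by the reset code later in this patch) guarantees the
queue has been drained. The pattern, reduced to a sketch with a hypothetical
process_one() helper:

	static void sketch_drain_worker(struct work_struct *w)
	{
		bool done;

		/* Drain everything in a single invocation... */
		do {
			done = process_one();	/* hypothetical */
		} while (!done);

		/* ...so flush_work() on this work implies an empty queue. */
	}
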
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 87d8dc8f51b9..46158d996bf6 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -28,21 +28,6 @@
 /**
  * DOC: GuC-based command submission
  *
- * IMPORTANT NOTE: GuC submission is currently not supported in i915. The GuC
- * firmware is moving to an updated submission interface and we plan to
- * turn submission back on when that lands. The below documentation (and related
- * code) matches the old submission model and will be updated as part of the
- * upgrade to the new flow.
- *
- * GuC stage descriptor:
- * During initialization, the driver allocates a static pool of 1024 such
- * descriptors, and shares them with the GuC. Currently, we only use one
- * descriptor. This stage descriptor lets the GuC know about the workqueue and
- * process descriptor. Theoretically, it also lets the GuC know about our HW
- * contexts (context ID, etc...), but we actually employ a kind of submission
- * where the GuC uses the LRCA sent via the work item instead. This is called
- * a "proxy" submission.
- *
  * The Scratch registers:
  * There are 16 MMIO-based registers starting from 0xC180. The kernel driver writes
  * a value to the action register (SOFT_SCRATCH_0) along with any data. It then
@@ -51,14 +36,82 @@
  * processes the request. The kernel driver polls waiting for this update and
  * then proceeds.
  *
- * Work Items:
- * There are several types of work items that the host may place into a
- * workqueue, each with its own requirements and limitations. Currently only
- * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
- * represents in-order queue. The kernel driver packs ring tail pointer and an
- * ELSP context descriptor dword into Work Item.
- * See guc_add_request()
+ * Command Transport buffers (CTBs):
+ * Covered in detail in other sections but CTBs (host-to-GuC, H2G, and
+ * GuC-to-host, G2H) are a message interface between the i915 and the GuC
+ * used to control submissions.
+ *
+ * Context registration:
+ * Before a context can be submitted it must be registered with the GuC via an
+ * H2G. A unique guc_id is associated with each context. The context is either
+ * registered at request creation time (normal operation) or at submission time
+ * (abnormal operation, e.g. after a reset).
+ *
+ * Context submission:
+ * The i915 updates the LRC tail value in memory. Either a schedule enable H2G
+ * or context submit H2G is used to submit a context.
+ *
+ * Context unpin:
+ * To unpin a context, an H2G is used to disable scheduling, and when the
+ * corresponding G2H returns indicating the scheduling disable operation has
+ * completed it is safe to unpin the context. While a disable is in flight it
+ * isn't safe to resubmit the context, so a fence is used to stall all future
+ * requests until the G2H is returned.
+ *
+ * Context deregistration:
+ * Before a context can be destroyed or its guc_id stolen, the context must be
+ * deregistered with the GuC via an H2G. If stealing the guc_id, it isn't safe
+ * to submit anything to that guc_id until the deregister completes, so a fence
+ * is used to stall all requests associated with the guc_id until the
+ * corresponding G2H returns indicating the guc_id has been deregistered.
+ *
+ * guc_ids:
+ * A unique number associated with the private GuC context data passed in
+ * during context registration / submission / deregistration. 64k ids are
+ * available. A simple ida is used for allocation.
+ *
+ * Stealing guc_ids:
+ * If no guc_ids are available they can be stolen from another context at
+ * request creation time if that context is unpinned. If a guc_id can't be found
+ * we punt this problem to the user as we believe this is nearly impossible to
+ * hit during normal use cases.
+ *
+ * Locking:
+ * In the GuC submission code we have 3 basic spin locks which protect
+ * everything. Details about each below.
+ *
+ * sched_engine->lock
+ * This is the submission lock for all contexts that share an i915 schedule
+ * engine (sched_engine); thus only 1 context among those sharing a
+ * sched_engine can be submitting at a time. Currently only 1 sched_engine is
+ * used for all of GuC submission but that could change in the future.
  *
+ * guc->contexts_lock
+ * Protects guc_id allocation. A global lock, i.e. only 1 context that uses
+ * GuC submission can hold it at a time.
+ *
+ * ce->guc_state.lock
+ * Protects everything under ce->guc_state. Ensures that a context is in the
+ * correct state before issuing an H2G, e.g. we don't issue a schedule disable
+ * on a disabled context (bad idea) and we don't issue a schedule enable when a
+ * schedule disable is in flight. Also protects the list of inflight requests
+ * on the context and the priority management state. This lock is individual
+ * to each context.
+ *
+ * Lock ordering rules:
+ * sched_engine->lock -> ce->guc_state.lock
+ * guc->contexts_lock -> ce->guc_state.lock
+ *
+ * Reset races:
+ * When a full GPU reset is triggered it is assumed that some G2H responses to
+ * an H2G can be lost as the GuC is likely toast. Losing these G2H can prove
+ * fatal as we do certain operations upon receiving a G2H (e.g. destroy
+ * contexts, release guc_ids, etc...). Luckily, when this occurs we can scrub
+ * the context state and clean up appropriately; however, this is quite racy.
+ * To avoid races, the rule is to check for submission being disabled (i.e. a
+ * reset in progress) with the appropriate lock held. If submission is
+ * disabled, don't send the H2G or update the context state. The reset code
+ * must disable submission and grab all these locks before scrubbing for the
+ * missing G2H.
  */
 
 /* GuC Virtual Engine */
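
To make the lock ordering rules above concrete, this is the shape any path
needing both sched_engine->lock and ce->guc_state.lock must follow (an
illustrative sketch; __unwind_incomplete_requests() later in this patch is a
real instance):

	static void sketch_locked_submit(struct i915_sched_engine *sched_engine,
					 struct intel_context *ce)
	{
		unsigned long flags;

		spin_lock_irqsave(&sched_engine->lock, flags);	/* outer */
		spin_lock(&ce->guc_state.lock);			/* inner */

		/* ... issue H2Gs / update ce->guc_state here ... */

		spin_unlock(&ce->guc_state.lock);
		spin_unlock_irqrestore(&sched_engine->lock, flags);
	}
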
@@ -73,167 +126,165 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
 /*
- * Below is a set of functions which control the GuC scheduling state which do
- * not require a lock as all state transitions are mutually exclusive. i.e. It
- * is not possible for the context pinning code and submission, for the same
- * context, to be executing simultaneously. We still need an atomic as it is
- * possible for some of the bits to changing at the same time though.
+ * Below is a set of functions that control the GuC scheduling state and
+ * require ce->guc_state.lock to be held.
  */
-#define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
-#define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
-#define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
-static inline bool context_enabled(struct intel_context *ce)
+#define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
+#define SCHED_STATE_DESTROYED				BIT(1)
+#define SCHED_STATE_PENDING_DISABLE			BIT(2)
+#define SCHED_STATE_BANNED				BIT(3)
+#define SCHED_STATE_ENABLED				BIT(4)
+#define SCHED_STATE_PENDING_ENABLE			BIT(5)
+#define SCHED_STATE_REGISTERED				BIT(6)
+#define SCHED_STATE_BLOCKED_SHIFT			7
+#define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
+#define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
+static void init_sched_state(struct intel_context *ce)
 {
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_ENABLED);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
-static inline void set_context_enabled(struct intel_context *ce)
+__maybe_unused
+static bool sched_state_is_init(struct intel_context *ce)
 {
-	atomic_or(SCHED_STATE_NO_LOCK_ENABLED, &ce->guc_sched_state_no_lock);
+	/*
+	 * XXX: Kernel contexts can have SCHED_STATE_REGISTERED after suspend.
+	 */
+	return !(ce->guc_state.sched_state &
+		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
 }
 
-static inline void clr_context_enabled(struct intel_context *ce)
+static bool
+context_wait_for_deregister_to_register(struct intel_context *ce)
 {
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
-		   &ce->guc_sched_state_no_lock);
+	return ce->guc_state.sched_state &
+		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
 
-static inline bool context_pending_enable(struct intel_context *ce)
+static void
+set_context_wait_for_deregister_to_register(struct intel_context *ce)
 {
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_PENDING_ENABLE);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |=
+		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
 
-static inline void set_context_pending_enable(struct intel_context *ce)
+static void
+clr_context_wait_for_deregister_to_register(struct intel_context *ce)
 {
-	atomic_or(SCHED_STATE_NO_LOCK_PENDING_ENABLE,
-		  &ce->guc_sched_state_no_lock);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &=
+		~SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
 }
 
-static inline void clr_context_pending_enable(struct intel_context *ce)
+static bool
+context_destroyed(struct intel_context *ce)
 {
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_PENDING_ENABLE,
-		   &ce->guc_sched_state_no_lock);
+	return ce->guc_state.sched_state & SCHED_STATE_DESTROYED;
 }
 
-static inline bool context_registered(struct intel_context *ce)
+static void
+set_context_destroyed(struct intel_context *ce)
 {
-	return (atomic_read(&ce->guc_sched_state_no_lock) &
-		SCHED_STATE_NO_LOCK_REGISTERED);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
 }
 
-static inline void set_context_registered(struct intel_context *ce)
+static bool context_pending_disable(struct intel_context *ce)
 {
-	atomic_or(SCHED_STATE_NO_LOCK_REGISTERED,
-		  &ce->guc_sched_state_no_lock);
+	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
 }
 
-static inline void clr_context_registered(struct intel_context *ce)
+static void set_context_pending_disable(struct intel_context *ce)
 {
-	atomic_and((u32)~SCHED_STATE_NO_LOCK_REGISTERED,
-		   &ce->guc_sched_state_no_lock);
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_PENDING_DISABLE;
 }
 
-/*
- * Below is a set of functions which control the GuC scheduling state which
- * require a lock, aside from the special case where the functions are called
- * from guc_lrc_desc_pin(). In that case it isn't possible for any other code
- * path to be executing on the context.
- */
-#define SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER	BIT(0)
-#define SCHED_STATE_DESTROYED				BIT(1)
-#define SCHED_STATE_PENDING_DISABLE			BIT(2)
-#define SCHED_STATE_BANNED				BIT(3)
-#define SCHED_STATE_BLOCKED_SHIFT			4
-#define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
-#define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
-static inline void init_sched_state(struct intel_context *ce)
+static void clr_context_pending_disable(struct intel_context *ce)
 {
-	/* Only should be called from guc_lrc_desc_pin() */
-	atomic_set(&ce->guc_sched_state_no_lock, 0);
-	ce->guc_state.sched_state = 0;
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_DISABLE;
 }
 
-static inline bool
-context_wait_for_deregister_to_register(struct intel_context *ce)
+static bool context_banned(struct intel_context *ce)
 {
-	return ce->guc_state.sched_state &
-		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
+	return ce->guc_state.sched_state & SCHED_STATE_BANNED;
 }
 
-static inline void
-set_context_wait_for_deregister_to_register(struct intel_context *ce)
+static void set_context_banned(struct intel_context *ce)
 {
-	/* Only should be called from guc_lrc_desc_pin() without lock */
-	ce->guc_state.sched_state |=
-		SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_BANNED;
 }
 
-static inline void
-clr_context_wait_for_deregister_to_register(struct intel_context *ce)
+static void clr_context_banned(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state &=
-		~SCHED_STATE_WAIT_FOR_DEREGISTER_TO_REGISTER;
+	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
 }
 
-static inline bool
-context_destroyed(struct intel_context *ce)
+static bool context_enabled(struct intel_context *ce)
 {
-	return ce->guc_state.sched_state & SCHED_STATE_DESTROYED;
+	return ce->guc_state.sched_state & SCHED_STATE_ENABLED;
 }
 
-static inline void
-set_context_destroyed(struct intel_context *ce)
+static void set_context_enabled(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
+	ce->guc_state.sched_state |= SCHED_STATE_ENABLED;
 }
 
-static inline bool context_pending_disable(struct intel_context *ce)
+static void clr_context_enabled(struct intel_context *ce)
 {
-	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_ENABLED;
 }
 
-static inline void set_context_pending_disable(struct intel_context *ce)
+static bool context_pending_enable(struct intel_context *ce)
+{
+	return ce->guc_state.sched_state & SCHED_STATE_PENDING_ENABLE;
+}
+
+static void set_context_pending_enable(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state |= SCHED_STATE_PENDING_DISABLE;
+	ce->guc_state.sched_state |= SCHED_STATE_PENDING_ENABLE;
 }
 
-static inline void clr_context_pending_disable(struct intel_context *ce)
+static void clr_context_pending_enable(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_DISABLE;
+	ce->guc_state.sched_state &= ~SCHED_STATE_PENDING_ENABLE;
 }
 
-static inline bool context_banned(struct intel_context *ce)
+static bool context_registered(struct intel_context *ce)
 {
-	return ce->guc_state.sched_state & SCHED_STATE_BANNED;
+	return ce->guc_state.sched_state & SCHED_STATE_REGISTERED;
 }
 
-static inline void set_context_banned(struct intel_context *ce)
+static void set_context_registered(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state |= SCHED_STATE_BANNED;
+	ce->guc_state.sched_state |= SCHED_STATE_REGISTERED;
 }
 
-static inline void clr_context_banned(struct intel_context *ce)
+static void clr_context_registered(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	ce->guc_state.sched_state &= ~SCHED_STATE_BANNED;
+	ce->guc_state.sched_state &= ~SCHED_STATE_REGISTERED;
 }
 
-static inline u32 context_blocked(struct intel_context *ce)
+static u32 context_blocked(struct intel_context *ce)
 {
 	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
 		SCHED_STATE_BLOCKED_SHIFT;
 }
 
-static inline void incr_context_blocked(struct intel_context *ce)
+static void incr_context_blocked(struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->engine->sched_engine->lock);
 	lockdep_assert_held(&ce->guc_state.lock);
 
 	ce->guc_state.sched_state += SCHED_STATE_BLOCKED;
@@ -241,9 +292,8 @@ static inline void incr_context_blocked(struct intel_context *ce)
 	GEM_BUG_ON(!context_blocked(ce));	/* Overflow check */
 }
 
-static inline void decr_context_blocked(struct intel_context *ce)
+static void decr_context_blocked(struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->engine->sched_engine->lock);
 	lockdep_assert_held(&ce->guc_state.lock);
 
 	GEM_BUG_ON(!context_blocked(ce));	/* Underflow check */
@@ -251,22 +301,41 @@ static inline void decr_context_blocked(struct intel_context *ce)
 	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
 }
 
-static inline bool context_guc_id_invalid(struct intel_context *ce)
+static bool context_has_committed_requests(struct intel_context *ce)
+{
+	return !!ce->guc_state.number_committed_requests;
+}
+
+static void incr_context_committed_requests(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	++ce->guc_state.number_committed_requests;
+	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
+}
+
+static void decr_context_committed_requests(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	--ce->guc_state.number_committed_requests;
+	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
+}
+
+static bool context_guc_id_invalid(struct intel_context *ce)
 {
-	return ce->guc_id == GUC_INVALID_LRC_ID;
+	return ce->guc_id.id == GUC_INVALID_LRC_ID;
 }
 
-static inline void set_context_guc_id_invalid(struct intel_context *ce)
+static void set_context_guc_id_invalid(struct intel_context *ce)
 {
-	ce->guc_id = GUC_INVALID_LRC_ID;
+	ce->guc_id.id = GUC_INVALID_LRC_ID;
 }
 
-static inline struct intel_guc *ce_to_guc(struct intel_context *ce)
+static struct intel_guc *ce_to_guc(struct intel_context *ce)
 {
 	return &ce->engine->gt->uc.guc;
 }
 
-static inline struct i915_priolist *to_priolist(struct rb_node *rb)
+static struct i915_priolist *to_priolist(struct rb_node *rb)
 {
 	return rb_entry(rb, struct i915_priolist, node);
 }
@@ -280,7 +349,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 	return &base[index];
 }
 
-static inline struct intel_context *__get_context(struct intel_guc *guc, u32 id)
+static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
 {
 	struct intel_context *ce = xa_load(&guc->context_lookup, id);
 
@@ -310,12 +379,12 @@ static void guc_lrc_desc_pool_destroy(struct intel_guc *guc)
 	i915_vma_unpin_and_release(&guc->lrc_desc_pool, I915_VMA_RELEASE_MAP);
 }
 
-static inline bool guc_submission_initialized(struct intel_guc *guc)
+static bool guc_submission_initialized(struct intel_guc *guc)
 {
 	return !!guc->lrc_desc_pool_vaddr;
 }
 
-static inline void reset_lrc_desc(struct intel_guc *guc, u32 id)
+static void reset_lrc_desc(struct intel_guc *guc, u32 id)
 {
 	if (likely(guc_submission_initialized(guc))) {
 		struct guc_lrc_desc *desc = __get_lrc_desc(guc, id);
@@ -333,13 +402,13 @@ static inline void reset_lrc_desc(struct intel_guc *guc, u32 id)
 	}
 }
 
-static inline bool lrc_desc_registered(struct intel_guc *guc, u32 id)
+static bool lrc_desc_registered(struct intel_guc *guc, u32 id)
 {
 	return __get_context(guc, id);
 }
 
-static inline void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
-					   struct intel_context *ce)
+static void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
+				    struct intel_context *ce)
 {
 	unsigned long flags;
 
@@ -352,6 +421,12 @@ static inline void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
 	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
+static void decr_outstanding_submission_g2h(struct intel_guc *guc)
+{
+	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
+		wake_up_all(&guc->ct.wq);
+}
+
 static int guc_submission_send_busy_loop(struct intel_guc *guc,
 					 const u32 *action,
 					 u32 len,
@@ -360,11 +435,11 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
 {
 	int err;
 
-	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
-
-	if (!err && g2h_len_dw)
+	if (g2h_len_dw)
 		atomic_inc(&guc->outstanding_submission_g2h);
 
+	err = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
+
 	return err;
 }
 
@@ -430,6 +505,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	u32 g2h_len_dw = 0;
 	bool enabled;
 
+	lockdep_assert_held(&rq->engine->sched_engine->lock);
+
 	/*
 	 * Corner case where requests were sitting in the priority list or a
 	 * request resubmitted after the context was banned.
@@ -437,22 +514,24 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	if (unlikely(intel_context_is_banned(ce))) {
 		i915_request_put(i915_request_mark_eio(rq));
 		intel_engine_signal_breadcrumbs(ce->engine);
-		goto out;
+		return 0;
 	}
 
-	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
+	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
 	GEM_BUG_ON(context_guc_id_invalid(ce));
 
 	/*
 	 * Corner case where the GuC firmware was blown away and reloaded while
 	 * this context was pinned.
 	 */
-	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
+	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
 		err = guc_lrc_desc_pin(ce, false);
 		if (unlikely(err))
-			goto out;
+			return err;
 	}
 
+	spin_lock(&ce->guc_state.lock);
+
 	/*
 	 * The request / context will be run on the hardware when scheduling
 	 * gets enabled in the unblock.
@@ -464,14 +543,14 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 
 	if (!enabled) {
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
-		action[len++] = ce->guc_id;
+		action[len++] = ce->guc_id.id;
 		action[len++] = GUC_CONTEXT_ENABLE;
 		set_context_pending_enable(ce);
 		intel_context_get(ce);
 		g2h_len_dw = G2H_LEN_DW_SCHED_CONTEXT_MODE_SET;
 	} else {
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT;
-		action[len++] = ce->guc_id;
+		action[len++] = ce->guc_id.id;
 	}
 
 	err = intel_guc_send_nb(guc, action, len, g2h_len_dw);
@@ -487,16 +566,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		trace_i915_request_guc_submit(rq);
 
 out:
+	spin_unlock(&ce->guc_state.lock);
 	return err;
 }
 
-static inline void guc_set_lrc_tail(struct i915_request *rq)
+static void guc_set_lrc_tail(struct i915_request *rq)
 {
 	rq->context->lrc_reg_state[CTX_RING_TAIL] =
 		intel_ring_set_tail(rq->ring, rq->tail);
 }
 
-static inline int rq_prio(const struct i915_request *rq)
+static int rq_prio(const struct i915_request *rq)
 {
 	return rq->sched.attr.priority;
 }
@@ -596,10 +676,18 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 	unsigned long index, flags;
 	bool pending_disable, pending_enable, deregister, destroyed, banned;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		/* Flush context */
-		spin_lock_irqsave(&ce->guc_state.lock, flags);
-		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		/*
+		 * Corner case where the ref count on the object is zero but the
+		 * deregister G2H was lost. In this case we don't touch the ref
+		 * count and finish the destroy of the context.
+		 */
+		bool do_put = kref_get_unless_zero(&ce->ref);
+
+		xa_unlock(&guc->context_lookup);
+
+		spin_lock(&ce->guc_state.lock);
 
 		/*
 		 * Once we are at this point submission_disabled() is guaranteed
@@ -615,8 +703,12 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 		banned = context_banned(ce);
 		init_sched_state(ce);
 
+		spin_unlock(&ce->guc_state.lock);
+
+		GEM_BUG_ON(!do_put && !destroyed);
+
 		if (pending_enable || destroyed || deregister) {
-			atomic_dec(&guc->outstanding_submission_g2h);
+			decr_outstanding_submission_g2h(guc);
 			if (deregister)
 				guc_signal_context_fence(ce);
 			if (destroyed) {
@@ -635,17 +727,23 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 				intel_engine_signal_breadcrumbs(ce->engine);
 			}
 			intel_context_sched_disable_unpin(ce);
-			atomic_dec(&guc->outstanding_submission_g2h);
-			spin_lock_irqsave(&ce->guc_state.lock, flags);
+			decr_outstanding_submission_g2h(guc);
+
+			spin_lock(&ce->guc_state.lock);
 			guc_blocked_fence_complete(ce);
-			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+			spin_unlock(&ce->guc_state.lock);
 
 			intel_context_put(ce);
 		}
+
+		if (do_put)
+			intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
-static inline bool
+static bool
 submission_disabled(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
@@ -694,8 +792,6 @@ static void guc_flush_submissions(struct intel_guc *guc)
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
-	int i;
-
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
@@ -709,22 +805,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	spin_lock_irq(&guc_to_gt(guc)->irq_lock);
 	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
 
-	guc_flush_submissions(guc);
+	flush_work(&guc->ct.requests.worker);
 
-	/*
-	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
-	 * each pass as interrupt have been disabled. We always scrub for
-	 * outstanding G2H as it is possible for outstanding_submission_g2h to
-	 * be incremented after the context state update.
-	 */
-	for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
-		intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
-		intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
-		do {
-			wait_for_reset(guc, &guc->outstanding_submission_g2h);
-		} while (!list_empty(&guc->ct.requests.incoming));
-	}
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
 
@@ -742,7 +824,7 @@ guc_virtual_get_sibling(struct intel_engine_cs *ve, unsigned int sibling)
 	return NULL;
 }
 
-static inline struct intel_engine_cs *
+static struct intel_engine_cs *
 __context_to_physical_engine(struct intel_context *ce)
 {
 	struct intel_engine_cs *engine = ce->engine;
@@ -796,16 +878,14 @@ __unwind_incomplete_requests(struct intel_context *ce)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
-	spin_lock(&ce->guc_active.lock);
-	list_for_each_entry_safe(rq, rn,
-				 &ce->guc_active.requests,
-				 sched.link) {
+	spin_lock(&ce->guc_state.lock);
+	list_for_each_entry_safe_reverse(rq, rn,
+					 &ce->guc_state.requests,
+					 sched.link) {
 		if (i915_request_completed(rq))
 			continue;
 
 		list_del_init(&rq->sched.link);
-		spin_unlock(&ce->guc_active.lock);
-
 		__i915_request_unsubmit(rq);
 
 		/* Push the request back into the queue for later resubmission. */
@@ -816,29 +896,44 @@ __unwind_incomplete_requests(struct intel_context *ce)
 		}
 		GEM_BUG_ON(i915_sched_engine_is_empty(sched_engine));
 
-		list_add_tail(&rq->sched.link, pl);
+		list_add(&rq->sched.link, pl);
 		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
-
-		spin_lock(&ce->guc_active.lock);
 	}
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
 	struct i915_request *rq;
+	unsigned long flags;
 	u32 head;
+	bool skip = false;
 
 	intel_context_get(ce);
 
 	/*
-	 * GuC will implicitly mark the context as non-schedulable
-	 * when it sends the reset notification. Make sure our state
-	 * reflects this change. The context will be marked enabled
-	 * on resubmission.
+	 * GuC will implicitly mark the context as non-schedulable when it sends
+	 * the reset notification. Make sure our state reflects this change. The
+	 * context will be marked enabled on resubmission.
+	 *
+	 * XXX: If the context is reset as a result of the request cancellation
+	 * this G2H is received after the schedule disable complete G2H which is
+	 * likely wrong as this creates a race between the request cancellation
+	 * code re-submitting the context and this G2H handler. This should
+	 * likely be fixed in the GuC, but until it is we need to work around
+	 * it here. Convert this function to a NOP if a pending
+	 * enable is in flight as this indicates that a request cancellation has
+	 * occurred.
 	 */
-	clr_context_enabled(ce);
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	if (likely(!context_pending_enable(ce)))
+		clr_context_enabled(ce);
+	else
+		skip = true;
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	if (unlikely(skip))
+		goto out_put;
 
 	rq = intel_context_find_active_request(ce);
 	if (!rq) {
@@ -857,6 +952,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
 out_replay:
 	guc_reset_state(ce, head, stalled);
 	__unwind_incomplete_requests(ce);
+out_put:
 	intel_context_put(ce);
 }
 
@@ -864,16 +960,29 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
 
 	if (unlikely(!guc_submission_initialized(guc))) {
 		/* Reset called during driver load? GuC not yet initialised! */
 		return;
 	}
 
-	xa_for_each(&guc->context_lookup, index, ce)
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	xa_for_each(&guc->context_lookup, index, ce) {
+		if (!kref_get_unless_zero(&ce->ref))
+			continue;
+
+		xa_unlock(&guc->context_lookup);
+
 		if (intel_context_is_pinned(ce))
 			__guc_reset_context(ce, stalled);
 
+		intel_context_put(ce);
+
+		xa_lock(&guc->context_lookup);
+	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	/* GuC is blown away, drop all references to contexts */
 	xa_destroy(&guc->context_lookup);
 }
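
Both walks above (and the cancel path below) rely on the same idiom: pin each
context with kref_get_unless_zero() so it cannot be freed mid-walk, drop the
xa lock around the potentially-sleeping per-context work, then retake it to
continue iterating. Distilled into a sketch:

	static void sketch_for_each_context(struct intel_guc *guc)
	{
		struct intel_context *ce;
		unsigned long index, flags;

		xa_lock_irqsave(&guc->context_lookup, flags);
		xa_for_each(&guc->context_lookup, index, ce) {
			/* Skip contexts already being destroyed. */
			if (!kref_get_unless_zero(&ce->ref))
				continue;

			xa_unlock(&guc->context_lookup);

			/* ... per-context work that may sleep ... */

			intel_context_put(ce);
			xa_lock(&guc->context_lookup);
		}
		xa_unlock_irqrestore(&guc->context_lookup, flags);
	}
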
@@ -886,10 +995,10 @@ static void guc_cancel_context_requests(struct intel_context *ce)
 
 	/* Mark all executing requests as skipped. */
 	spin_lock_irqsave(&sched_engine->lock, flags);
-	spin_lock(&ce->guc_active.lock);
-	list_for_each_entry(rq, &ce->guc_active.requests, sched.link)
+	spin_lock(&ce->guc_state.lock);
+	list_for_each_entry(rq, &ce->guc_state.requests, sched.link)
 		i915_request_put(i915_request_mark_eio(rq));
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
@@ -948,11 +1057,24 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
+
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	xa_for_each(&guc->context_lookup, index, ce) {
+		if (!kref_get_unless_zero(&ce->ref))
+			continue;
+
+		xa_unlock(&guc->context_lookup);
 
-	xa_for_each(&guc->context_lookup, index, ce)
 		if (intel_context_is_pinned(ce))
 			guc_cancel_context_requests(ce);
 
+		intel_context_put(ce);
+
+		xa_lock(&guc->context_lookup);
+	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
 	guc_cancel_sched_engine_requests(guc->sched_engine);
 
 	/* GuC is blown away, drop all references to contexts */
@@ -1019,14 +1141,15 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 	i915_sched_engine_put(guc->sched_engine);
 }
 
-static inline void queue_request(struct i915_sched_engine *sched_engine,
-				 struct i915_request *rq,
-				 int prio)
+static void queue_request(struct i915_sched_engine *sched_engine,
+			  struct i915_request *rq,
+			  int prio)
 {
 	GEM_BUG_ON(!list_empty(&rq->sched.link));
 	list_add_tail(&rq->sched.link,
 		      i915_sched_lookup_priolist(sched_engine, prio));
 	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+	tasklet_hi_schedule(&sched_engine->tasklet);
 }
 
 static int guc_bypass_tasklet_submit(struct intel_guc *guc,
@@ -1077,12 +1200,12 @@ static int new_guc_id(struct intel_guc *guc)
 static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	if (!context_guc_id_invalid(ce)) {
-		ida_simple_remove(&guc->guc_ids, ce->guc_id);
-		reset_lrc_desc(guc, ce->guc_id);
+		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
+		reset_lrc_desc(guc, ce->guc_id.id);
 		set_context_guc_id_invalid(ce);
 	}
-	if (!list_empty(&ce->guc_id_link))
-		list_del_init(&ce->guc_id_link);
+	if (!list_empty(&ce->guc_id.link))
+		list_del_init(&ce->guc_id.link);
 }
 
 static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
@@ -1104,14 +1227,18 @@ static int steal_guc_id(struct intel_guc *guc)
 	if (!list_empty(&guc->guc_id_list)) {
 		ce = list_first_entry(&guc->guc_id_list,
 				      struct intel_context,
-				      guc_id_link);
+				      guc_id.link);
 
-		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
+		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
 		GEM_BUG_ON(context_guc_id_invalid(ce));
 
-		list_del_init(&ce->guc_id_link);
-		guc_id = ce->guc_id;
+		list_del_init(&ce->guc_id.link);
+		guc_id = ce->guc_id.id;
+
+		spin_lock(&ce->guc_state.lock);
 		clr_context_registered(ce);
+		spin_unlock(&ce->guc_state.lock);
+
 		set_context_guc_id_invalid(ce);
 		return guc_id;
 	} else {
@@ -1142,26 +1269,28 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	int ret = 0;
 	unsigned long flags, tries = PIN_GUC_ID_TRIES;
 
-	GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
+	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
 
 try_again:
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 
+	might_lock(&ce->guc_state.lock);
+
 	if (context_guc_id_invalid(ce)) {
-		ret = assign_guc_id(guc, &ce->guc_id);
+		ret = assign_guc_id(guc, &ce->guc_id.id);
 		if (ret)
 			goto out_unlock;
 		ret = 1;	/* Indicates newly assigned guc_id */
 	}
-	if (!list_empty(&ce->guc_id_link))
-		list_del_init(&ce->guc_id_link);
-	atomic_inc(&ce->guc_id_ref);
+	if (!list_empty(&ce->guc_id.link))
+		list_del_init(&ce->guc_id.link);
+	atomic_inc(&ce->guc_id.ref);
 
 out_unlock:
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 
 	/*
-	 * -EAGAIN indicates no guc_ids are available, let's retire any
+	 * -EAGAIN indicates no guc_ids are available, let's retire any
 	 * outstanding requests to see if that frees up a guc_id. If the first
 	 * retire didn't help, insert a sleep with the timeslice duration before
 	 * attempting to retire more requests. Double the sleep period each
@@ -1189,15 +1318,15 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	unsigned long flags;
 
-	GEM_BUG_ON(atomic_read(&ce->guc_id_ref) < 0);
+	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
 
 	if (unlikely(context_guc_id_invalid(ce)))
 		return;
 
 	spin_lock_irqsave(&guc->contexts_lock, flags);
-	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id_link) &&
-	    !atomic_read(&ce->guc_id_ref))
-		list_add_tail(&ce->guc_id_link, &guc->guc_id_list);
+	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
+	    !atomic_read(&ce->guc_id.ref))
+		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 }
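
The -EAGAIN comment above describes a retire-then-backoff policy. Roughly, as
a simplified sketch (try_assign_guc_id() is a hypothetical stand-in for the
locked assignment step; the real code scales the sleep from the engine's
timeslice and bounds the retries with PIN_GUC_ID_TRIES):

	static int sketch_pin_guc_id_backoff(struct intel_guc *guc,
					     struct intel_context *ce)
	{
		unsigned int sleep_ms = ce->engine->props.timeslice_duration_ms;
		int tries = PIN_GUC_ID_TRIES;
		int ret;

		while ((ret = try_assign_guc_id(guc, ce)) == -EAGAIN && --tries) {
			/* Retiring may unpin a context and free its guc_id. */
			intel_gt_retire_requests(guc_to_gt(guc));

			/* After the first retry, sleep and double the period. */
			if (tries < PIN_GUC_ID_TRIES - 1) {
				msleep(sleep_ms);
				sleep_ms *= 2;
			}
		}

		return ret;
	}
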
 
@@ -1220,14 +1349,19 @@ static int register_context(struct intel_context *ce, bool loop)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 	u32 offset = intel_guc_ggtt_offset(guc, guc->lrc_desc_pool) +
-		ce->guc_id * sizeof(struct guc_lrc_desc);
+		ce->guc_id.id * sizeof(struct guc_lrc_desc);
 	int ret;
 
 	trace_intel_context_register(ce);
 
-	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
-	if (likely(!ret))
+	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
+	if (likely(!ret)) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		set_context_registered(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	}
 
 	return ret;
 }
@@ -1285,22 +1419,19 @@ static void guc_context_policy_init(struct intel_engine_cs *engine,
 	desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
 }
 
-static inline u8 map_i915_prio_to_guc_prio(int prio);
-
 static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 {
 	struct intel_engine_cs *engine = ce->engine;
 	struct intel_runtime_pm *runtime_pm = engine->uncore->rpm;
 	struct intel_guc *guc = &engine->gt->uc.guc;
-	u32 desc_idx = ce->guc_id;
+	u32 desc_idx = ce->guc_id.id;
 	struct guc_lrc_desc *desc;
-	const struct i915_gem_context *ctx;
-	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
 	bool context_registered;
 	intel_wakeref_t wakeref;
 	int ret = 0;
 
 	GEM_BUG_ON(!engine->mask);
+	GEM_BUG_ON(!sched_state_is_init(ce));
 
 	/*
 	 * Ensure LRC + CT vmas are in the same region as the write barrier is done
@@ -1311,12 +1442,6 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 
 	context_registered = lrc_desc_registered(guc, desc_idx);
 
-	rcu_read_lock();
-	ctx = rcu_dereference(ce->gem_context);
-	if (ctx)
-		prio = ctx->sched.priority;
-	rcu_read_unlock();
-
 	reset_lrc_desc(guc, desc_idx);
 	set_lrc_desc_registered(guc, desc_idx, ce);
 
@@ -1325,11 +1450,9 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	desc->engine_submit_mask = adjust_engine_mask(engine->class,
 						      engine->mask);
 	desc->hw_context_desc = ce->lrc.lrca;
-	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
-	desc->priority = ce->guc_prio;
+	desc->priority = ce->guc_state.prio;
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 	guc_context_policy_init(engine, desc);
-	init_sched_state(ce);
 
 	/*
 	 * The context_lookup xarray is used to determine if the hardware
@@ -1340,26 +1463,23 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	 * registering this context.
 	 */
 	if (context_registered) {
+		bool disabled;
+		unsigned long flags;
+
 		trace_intel_context_steal_guc_id(ce);
-		if (!loop) {
+		GEM_BUG_ON(!loop);
+
+		/* Seal race with Reset */
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
+		disabled = submission_disabled(guc);
+		if (likely(!disabled)) {
 			set_context_wait_for_deregister_to_register(ce);
 			intel_context_get(ce);
-		} else {
-			bool disabled;
-			unsigned long flags;
-
-			/* Seal race with Reset */
-			spin_lock_irqsave(&ce->guc_state.lock, flags);
-			disabled = submission_disabled(guc);
-			if (likely(!disabled)) {
-				set_context_wait_for_deregister_to_register(ce);
-				intel_context_get(ce);
-			}
-			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-			if (unlikely(disabled)) {
-				reset_lrc_desc(guc, desc_idx);
-				return 0;	/* Will get registered later */
-			}
+		}
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		if (unlikely(disabled)) {
+			reset_lrc_desc(guc, desc_idx);
+			return 0;	/* Will get registered later */
 		}
 
 		/*
@@ -1367,20 +1487,19 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 		 * context whose guc_id was stolen.
 		 */
 		with_intel_runtime_pm(runtime_pm, wakeref)
-			ret = deregister_context(ce, ce->guc_id, loop);
-		if (unlikely(ret == -EBUSY)) {
-			clr_context_wait_for_deregister_to_register(ce);
-			intel_context_put(ce);
-		} else if (unlikely(ret == -ENODEV)) {
+			ret = deregister_context(ce, ce->guc_id.id, loop);
+		if (unlikely(ret == -ENODEV)) {
 			ret = 0;	/* Will get registered later */
 		}
 	} else {
 		with_intel_runtime_pm(runtime_pm, wakeref)
 			ret = register_context(ce, loop);
-		if (unlikely(ret == -EBUSY))
+		if (unlikely(ret == -EBUSY)) {
+			reset_lrc_desc(guc, desc_idx);
+		} else if (unlikely(ret == -ENODEV)) {
 			reset_lrc_desc(guc, desc_idx);
-		else if (unlikely(ret == -ENODEV))
 			ret = 0;	/* Will get registered later */
+		}
 	}
 
 	return ret;
@@ -1440,7 +1559,7 @@ static void __guc_context_sched_enable(struct intel_guc *guc,
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET,
-		ce->guc_id,
+		ce->guc_id.id,
 		GUC_CONTEXT_ENABLE
 	};
 
@@ -1456,7 +1575,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET,
-		guc_id,	/* ce->guc_id not stable */
+		guc_id,	/* ce->guc_id.id not stable */
 		GUC_CONTEXT_DISABLE
 	};
 
@@ -1472,24 +1591,24 @@ static void guc_blocked_fence_complete(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
 
-	if (!i915_sw_fence_done(&ce->guc_blocked))
-		i915_sw_fence_complete(&ce->guc_blocked);
+	if (!i915_sw_fence_done(&ce->guc_state.blocked_fence))
+		i915_sw_fence_complete(&ce->guc_state.blocked_fence);
 }
 
 static void guc_blocked_fence_reinit(struct intel_context *ce)
 {
 	lockdep_assert_held(&ce->guc_state.lock);
-	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_blocked));
+	GEM_BUG_ON(!i915_sw_fence_done(&ce->guc_state.blocked_fence));
 
 	/*
 	 * This fence is always complete unless a pending schedule disable is
 	 * outstanding. We arm the fence here and complete it when we receive
 	 * the pending schedule disable complete message.
 	 */
-	i915_sw_fence_fini(&ce->guc_blocked);
-	i915_sw_fence_reinit(&ce->guc_blocked);
-	i915_sw_fence_await(&ce->guc_blocked);
-	i915_sw_fence_commit(&ce->guc_blocked);
+	i915_sw_fence_fini(&ce->guc_state.blocked_fence);
+	i915_sw_fence_reinit(&ce->guc_state.blocked_fence);
+	i915_sw_fence_await(&ce->guc_state.blocked_fence);
+	i915_sw_fence_commit(&ce->guc_state.blocked_fence);
 }
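
End to end, the blocked_fence protocol is: arm the fence when a schedule
disable H2G is issued, make would-be submitters wait on it, and complete it
when the corresponding G2H arrives. Compressed into one illustrative function
(the real arm and complete sites are guc_blocked_fence_reinit() above and
guc_blocked_fence_complete()):

	static void sketch_blocked_fence_cycle(struct intel_context *ce)
	{
		lockdep_assert_held(&ce->guc_state.lock);

		/* Arm: fence is incomplete while the disable is in flight. */
		i915_sw_fence_fini(&ce->guc_state.blocked_fence);
		i915_sw_fence_reinit(&ce->guc_state.blocked_fence);
		i915_sw_fence_await(&ce->guc_state.blocked_fence);
		i915_sw_fence_commit(&ce->guc_state.blocked_fence);

		/* ... submitters i915_sw_fence_wait() on it meanwhile ... */

		/* Complete: the schedule disable G2H has been received. */
		if (!i915_sw_fence_done(&ce->guc_state.blocked_fence))
			i915_sw_fence_complete(&ce->guc_state.blocked_fence);
	}
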
 
 static u16 prep_context_pending_disable(struct intel_context *ce)
@@ -1501,13 +1620,12 @@ static u16 prep_context_pending_disable(struct intel_context *ce)
 	guc_blocked_fence_reinit(ce);
 	intel_context_get(ce);
 
-	return ce->guc_id;
+	return ce->guc_id.id;
 }
 
 static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
-	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
 	unsigned long flags;
 	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	intel_wakeref_t wakeref;
@@ -1516,20 +1634,14 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
-	/*
-	 * Sync with submission path, increment before below changes to context
-	 * state.
-	 */
-	spin_lock(&sched_engine->lock);
 	incr_context_blocked(ce);
-	spin_unlock(&sched_engine->lock);
 
 	enabled = context_enabled(ce);
 	if (unlikely(!enabled || submission_disabled(guc))) {
 		if (enabled)
 			clr_context_enabled(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-		return &ce->guc_blocked;
+		return &ce->guc_state.blocked_fence;
 	}
 
 	/*
@@ -1545,13 +1657,12 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 	with_intel_runtime_pm(runtime_pm, wakeref)
 		__guc_context_sched_disable(guc, ce, guc_id);
 
-	return &ce->guc_blocked;
+	return &ce->guc_state.blocked_fence;
 }
 
 static void guc_context_unblock(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
-	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
 	unsigned long flags;
 	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	intel_wakeref_t wakeref;
@@ -1562,6 +1673,9 @@ static void guc_context_unblock(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	if (unlikely(submission_disabled(guc) ||
+		     intel_context_is_banned(ce) ||
+		     context_guc_id_invalid(ce) ||
+		     !lrc_desc_registered(guc, ce->guc_id.id) ||
 		     !intel_context_is_pinned(ce) ||
 		     context_pending_disable(ce) ||
 		     context_blocked(ce) > 1)) {
@@ -1573,13 +1687,7 @@ static void guc_context_unblock(struct intel_context *ce)
 		intel_context_get(ce);
 	}
 
-	/*
-	 * Sync with submission path, decrement after above changes to context
-	 * state.
-	 */
-	spin_lock(&sched_engine->lock);
 	decr_context_blocked(ce);
-	spin_unlock(&sched_engine->lock);
 
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
@@ -1593,15 +1701,25 @@ static void guc_context_cancel_request(struct intel_context *ce,
 				       struct i915_request *rq)
 {
 	if (i915_sw_fence_signaled(&rq->submit)) {
-		struct i915_sw_fence *fence = guc_context_block(ce);
+		struct i915_sw_fence *fence;
 
+		intel_context_get(ce);
+		fence = guc_context_block(ce);
 		i915_sw_fence_wait(fence);
 		if (!i915_request_completed(rq)) {
 			__i915_request_skip(rq);
 			guc_reset_state(ce, intel_ring_wrap(ce->ring, rq->head),
 					true);
 		}
+
+		/*
+		 * XXX: Racy if the context is reset, see the comment in
+		 * __guc_reset_context().
+		 */
+		flush_work(&ce_to_guc(ce)->ct.requests.worker);
+
 		guc_context_unblock(ce);
+		intel_context_put(ce);
 	}
 }
 
@@ -1662,7 +1780,7 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
 		if (!context_guc_id_invalid(ce))
 			with_intel_runtime_pm(runtime_pm, wakeref)
 				__guc_context_set_preemption_timeout(guc,
-								     ce->guc_id,
+								     ce->guc_id.id,
 								     1);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	}
@@ -1678,8 +1796,10 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	bool enabled;
 
 	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
-	    !lrc_desc_registered(guc, ce->guc_id)) {
+	    !lrc_desc_registered(guc, ce->guc_id.id)) {
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		clr_context_enabled(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		goto unpin;
 	}
 
@@ -1689,14 +1809,11 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	/*
-	 * We have to check if the context has been disabled by another thread.
-	 * We also have to check if the context has been pinned again as another
-	 * pin operation is allowed to pass this function. Checking the pin
-	 * count, within ce->guc_state.lock, synchronizes this function with
-	 * guc_request_alloc ensuring a request doesn't slip through the
-	 * 'context_pending_disable' fence. Checking within the spin lock (can't
-	 * sleep) ensures another process doesn't pin this context and generate
-	 * a request before we set the 'context_pending_disable' flag here.
+	 * We have to check if the context has been disabled by another thread,
+	 * check if submission has been disabled to seal a race with reset, and
+	 * finally check if any more requests have been committed to the
+	 * context, ensuring that a request doesn't slip through the
+	 * 'context_pending_disable' fence.
 	 */
 	enabled = context_enabled(ce);
 	if (unlikely(!enabled || submission_disabled(guc))) {
@@ -1705,7 +1822,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		goto unpin;
 	}
-	if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
+	if (unlikely(context_has_committed_requests(ce))) {
+		intel_context_sched_disable_unpin(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 		return;
 	}
@@ -1721,24 +1839,24 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	intel_context_sched_disable_unpin(ce);
 }
 
-static inline void guc_lrc_desc_unpin(struct intel_context *ce)
+static void guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
-	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id));
-	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
+	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
+	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
 	GEM_BUG_ON(context_enabled(ce));
 
-	clr_context_registered(ce);
-	deregister_context(ce, ce->guc_id, true);
+	deregister_context(ce, ce->guc_id.id, true);
 }
 
 static void __guc_context_destroy(struct intel_context *ce)
 {
-	GEM_BUG_ON(ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
-		   ce->guc_prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
+	GEM_BUG_ON(ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_KMD_HIGH] ||
+		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_HIGH] ||
+		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_KMD_NORMAL] ||
+		   ce->guc_state.prio_count[GUC_CLIENT_PRIORITY_NORMAL]);
+	GEM_BUG_ON(ce->guc_state.number_committed_requests);
 
 	lrc_fini(ce);
 	intel_context_fini(ce);
@@ -1774,7 +1892,7 @@ static void guc_context_destroy(struct kref *kref)
 		__guc_context_destroy(ce);
 		return;
 	} else if (submission_disabled(guc) ||
-		   !lrc_desc_registered(guc, ce->guc_id)) {
+		   !lrc_desc_registered(guc, ce->guc_id.id)) {
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 		return;
@@ -1783,10 +1901,10 @@ static void guc_context_destroy(struct kref *kref)
 	/*
 	 * We have to acquire the context spinlock and check guc_id again, if it
 	 * is valid it hasn't been stolen and needs to be deregistered. We
-	 * delete this context from the list of unpinned guc_ids available to
+	 * delete this context from the list of unpinned guc_ids available to
 	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
 	 * returns indicating this context has been deregistered the guc_id is
-	 * returned to the pool of available guc_ids.
+	 * returned to the pool of available guc_ids.
 	 */
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 	if (context_guc_id_invalid(ce)) {
@@ -1795,15 +1913,17 @@ static void guc_context_destroy(struct kref *kref)
 		return;
 	}
 
-	if (!list_empty(&ce->guc_id_link))
-		list_del_init(&ce->guc_id_link);
+	if (!list_empty(&ce->guc_id.link))
+		list_del_init(&ce->guc_id.link);
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 
 	/* Seal race with Reset */
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	disabled = submission_disabled(guc);
-	if (likely(!disabled))
+	if (likely(!disabled)) {
 		set_context_destroyed(ce);
+		clr_context_registered(ce);
+	}
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	if (unlikely(disabled)) {
 		release_guc_id(guc, ce);
@@ -1839,24 +1959,27 @@ static void guc_context_set_prio(struct intel_guc *guc,
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_SET_CONTEXT_PRIORITY,
-		ce->guc_id,
+		ce->guc_id.id,
 		prio,
 	};
 
 	GEM_BUG_ON(prio < GUC_CLIENT_PRIORITY_KMD_HIGH ||
 		   prio > GUC_CLIENT_PRIORITY_NORMAL);
+	lockdep_assert_held(&ce->guc_state.lock);
 
-	if (ce->guc_prio == prio || submission_disabled(guc) ||
-	    !context_registered(ce))
+	if (ce->guc_state.prio == prio || submission_disabled(guc) ||
+	    !context_registered(ce)) {
+		ce->guc_state.prio = prio;
 		return;
+	}
 
 	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action), 0, true);
 
-	ce->guc_prio = prio;
+	ce->guc_state.prio = prio;
 	trace_intel_context_set_prio(ce);
 }
 
-static inline u8 map_i915_prio_to_guc_prio(int prio)
+static u8 map_i915_prio_to_guc_prio(int prio)
 {
 	if (prio == I915_PRIORITY_NORMAL)
 		return GUC_CLIENT_PRIORITY_KMD_NORMAL;
@@ -1868,31 +1991,29 @@ static inline u8 map_i915_prio_to_guc_prio(int prio)
 		return GUC_CLIENT_PRIORITY_KMD_HIGH;
 }
 
-static inline void add_context_inflight_prio(struct intel_context *ce,
-					     u8 guc_prio)
+static void add_context_inflight_prio(struct intel_context *ce, u8 guc_prio)
 {
-	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
+	lockdep_assert_held(&ce->guc_state.lock);
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
 
-	++ce->guc_prio_count[guc_prio];
+	++ce->guc_state.prio_count[guc_prio];
 
 	/* Overflow protection */
-	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_state.prio_count[guc_prio]);
 }
 
-static inline void sub_context_inflight_prio(struct intel_context *ce,
-					     u8 guc_prio)
+static void sub_context_inflight_prio(struct intel_context *ce, u8 guc_prio)
 {
-	lockdep_assert_held(&ce->guc_active.lock);
-	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_prio_count));
+	lockdep_assert_held(&ce->guc_state.lock);
+	GEM_BUG_ON(guc_prio >= ARRAY_SIZE(ce->guc_state.prio_count));
 
 	/* Underflow protection */
-	GEM_WARN_ON(!ce->guc_prio_count[guc_prio]);
+	GEM_WARN_ON(!ce->guc_state.prio_count[guc_prio]);
 
-	--ce->guc_prio_count[guc_prio];
+	--ce->guc_state.prio_count[guc_prio];
 }
 
-static inline void update_context_prio(struct intel_context *ce)
+static void update_context_prio(struct intel_context *ce)
 {
 	struct intel_guc *guc = &ce->engine->gt->uc.guc;
 	int i;
@@ -1900,17 +2021,17 @@ static inline void update_context_prio(struct intel_context *ce)
 	BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH != 0);
 	BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH > GUC_CLIENT_PRIORITY_NORMAL);
 
-	lockdep_assert_held(&ce->guc_active.lock);
+	lockdep_assert_held(&ce->guc_state.lock);
 
-	for (i = 0; i < ARRAY_SIZE(ce->guc_prio_count); ++i) {
-		if (ce->guc_prio_count[i]) {
+	for (i = 0; i < ARRAY_SIZE(ce->guc_state.prio_count); ++i) {
+		if (ce->guc_state.prio_count[i]) {
 			guc_context_set_prio(guc, ce, i);
 			break;
 		}
 	}
 }
 
-static inline bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
+static bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
 {
 	/* Lower value is higher priority */
 	return new_guc_prio < old_guc_prio;
@@ -1923,8 +2044,8 @@ static void add_to_context(struct i915_request *rq)
 
 	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
 
-	spin_lock(&ce->guc_active.lock);
-	list_move_tail(&rq->sched.link, &ce->guc_active.requests);
+	spin_lock(&ce->guc_state.lock);
+	list_move_tail(&rq->sched.link, &ce->guc_state.requests);
 
 	if (rq->guc_prio == GUC_PRIO_INIT) {
 		rq->guc_prio = new_guc_prio;
@@ -1936,12 +2057,12 @@ static void add_to_context(struct i915_request *rq)
 	}
 	update_context_prio(ce);
 
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 }
 
 static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
 {
-	lockdep_assert_held(&ce->guc_active.lock);
+	lockdep_assert_held(&ce->guc_state.lock);
 
 	if (rq->guc_prio != GUC_PRIO_INIT &&
 	    rq->guc_prio != GUC_PRIO_FINI) {
@@ -1955,7 +2076,7 @@ static void remove_from_context(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
 
-	spin_lock_irq(&ce->guc_active.lock);
+	spin_lock_irq(&ce->guc_state.lock);
 
 	list_del_init(&rq->sched.link);
 	clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
@@ -1965,9 +2086,11 @@ static void remove_from_context(struct i915_request *rq)
 
 	guc_prio_fini(rq, ce);
 
-	spin_unlock_irq(&ce->guc_active.lock);
+	decr_context_committed_requests(ce);
 
-	atomic_dec(&ce->guc_id_ref);
+	spin_unlock_irq(&ce->guc_state.lock);
+
+	atomic_dec(&ce->guc_id.ref);
 	i915_request_notify_execute_cb_imm(rq);
 }
 
@@ -1994,6 +2117,14 @@ static const struct intel_context_ops guc_context_ops = {
 	.create_virtual = guc_create_virtual,
 };
 
+static void submit_work_cb(struct irq_work *wrk)
+{
+	struct i915_request *rq = container_of(wrk, typeof(*rq), submit_work);
+
+	might_lock(&rq->engine->sched_engine->lock);
+	i915_sw_fence_complete(&rq->submit);
+}
+
 static void __guc_signal_context_fence(struct intel_context *ce)
 {
 	struct i915_request *rq;
@@ -2003,8 +2134,12 @@ static void __guc_signal_context_fence(struct intel_context *ce)
 	if (!list_empty(&ce->guc_state.fences))
 		trace_intel_context_fence_release(ce);
 
+	/*
+	 * Use an IRQ to ensure locking order of sched_engine->lock ->
+	 * ce->guc_state.lock is preserved.
+	 */
 	list_for_each_entry(rq, &ce->guc_state.fences, guc_fence_link)
-		i915_sw_fence_complete(&rq->submit);
+		irq_work_queue(&rq->submit_work);
 
 	INIT_LIST_HEAD(&ce->guc_state.fences);
 }
@@ -2022,10 +2157,24 @@ static void guc_signal_context_fence(struct intel_context *ce)
 static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
 {
 	return (new_guc_id || test_bit(CONTEXT_LRCA_DIRTY, &ce->flags) ||
-		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id)) &&
+		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id)) &&
 		!submission_disabled(ce_to_guc(ce));
 }
 
+static void guc_context_init(struct intel_context *ce)
+{
+	const struct i915_gem_context *ctx;
+	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
+
+	rcu_read_lock();
+	ctx = rcu_dereference(ce->gem_context);
+	if (ctx)
+		prio = ctx->sched.priority;
+	rcu_read_unlock();
+
+	ce->guc_state.prio = map_i915_prio_to_guc_prio(prio);
+}
+
 static int guc_request_alloc(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
@@ -2057,14 +2206,17 @@ static int guc_request_alloc(struct i915_request *rq)
 
 	rq->reserved_space -= GUC_REQUEST_SIZE;
 
+	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))
+		guc_context_init(ce);
+
 	/*
 	 * Call pin_guc_id here rather than in the pinning step as with
 	 * dma_resv, contexts can be repeatedly pinned / unpinned trashing the
-	 * guc_ids and creating horrible race conditions. This is especially bad
-	 * when guc_ids are being stolen due to over subscription. By the time
+	 * guc_id and creating horrible race conditions. This is especially bad
+	 * when guc_id are being stolen due to over subscription. By the time
 	 * this function is reached, it is guaranteed that the guc_id will be
 	 * persistent until the generated request is retired. Thus, sealing these
-	 * race conditions. It is still safe to fail here if guc_ids are
+	 * race conditions. It is still safe to fail here if guc_id are
 	 * exhausted and return -EAGAIN to the user indicating that they can try
 	 * again in the future.
 	 *
@@ -2074,7 +2226,7 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * decremented on each retire. When it is zero, a lock around the
 	 * increment (in pin_guc_id) is needed to seal a race with unpin_guc_id.
 	 */
-	if (atomic_add_unless(&ce->guc_id_ref, 1, 0))
+	if (atomic_add_unless(&ce->guc_id.ref, 1, 0))
 		goto out;
 
 	ret = pin_guc_id(guc, ce);	/* returns 1 if new guc_id assigned */
@@ -2087,7 +2239,7 @@ static int guc_request_alloc(struct i915_request *rq)
 				disable_submission(guc);
 				goto out;	/* GPU will be reset */
 			}
-			atomic_dec(&ce->guc_id_ref);
+			atomic_dec(&ce->guc_id.ref);
 			unpin_guc_id(guc, ce);
 			return ret;
 		}
@@ -2102,22 +2254,16 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * schedule enable or context registration if either G2H is pending
 	 * respectively. Once a G2H returns, the fence is released that is
 	 * blocking these requests (see guc_signal_context_fence).
-	 *
-	 * We can safely check the below fields outside of the lock as it isn't
-	 * possible for these fields to transition from being clear to set but
-	 * converse is possible, hence the need for the check within the lock.
 	 */
-	if (likely(!context_wait_for_deregister_to_register(ce) &&
-		   !context_pending_disable(ce)))
-		return 0;
-
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	if (context_wait_for_deregister_to_register(ce) ||
 	    context_pending_disable(ce)) {
+		init_irq_work(&rq->submit_work, submit_work_cb);
 		i915_sw_fence_await(&rq->submit);
 
 		list_add_tail(&rq->guc_fence_link, &ce->guc_state.fences);
 	}
+	incr_context_committed_requests(ce);
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	return 0;
@@ -2259,7 +2405,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 	     !new_guc_prio_higher(rq->guc_prio, new_guc_prio)))
 		return;
 
-	spin_lock(&ce->guc_active.lock);
+	spin_lock(&ce->guc_state.lock);
 	if (rq->guc_prio != GUC_PRIO_FINI) {
 		if (rq->guc_prio != GUC_PRIO_INIT)
 			sub_context_inflight_prio(ce, rq->guc_prio);
@@ -2267,16 +2413,16 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 		add_context_inflight_prio(ce, rq->guc_prio);
 		update_context_prio(ce);
 	}
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 }
 
 static void guc_retire_inflight_request_prio(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
 
-	spin_lock(&ce->guc_active.lock);
+	spin_lock(&ce->guc_state.lock);
 	guc_prio_fini(rq, ce);
-	spin_unlock(&ce->guc_active.lock);
+	spin_unlock(&ce->guc_state.lock);
 }
 
 static void sanitize_hwsp(struct intel_engine_cs *engine)
@@ -2355,15 +2501,15 @@ static void guc_set_default_submission(struct intel_engine_cs *engine)
 	engine->submit_request = guc_submit_request;
 }
 
-static inline void guc_kernel_context_pin(struct intel_guc *guc,
-					  struct intel_context *ce)
+static void guc_kernel_context_pin(struct intel_guc *guc,
+				   struct intel_context *ce)
 {
 	if (context_guc_id_invalid(ce))
 		pin_guc_id(guc, ce);
 	guc_lrc_desc_pin(ce, true);
 }
 
-static inline void guc_init_lrc_mapping(struct intel_guc *guc)
+static void guc_init_lrc_mapping(struct intel_guc *guc)
 {
 	struct intel_gt *gt = guc_to_gt(guc);
 	struct intel_engine_cs *engine;
@@ -2466,7 +2612,7 @@ static void rcs_submission_override(struct intel_engine_cs *engine)
 	}
 }
 
-static inline void guc_default_irqs(struct intel_engine_cs *engine)
+static void guc_default_irqs(struct intel_engine_cs *engine)
 {
 	engine->irq_keep_mask = GT_RENDER_USER_INTERRUPT;
 	intel_engine_set_irq_handler(engine, cs_irq_handler);
@@ -2562,7 +2708,7 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
 	guc->submission_selected = __guc_submission_selected(guc);
 }
 
-static inline struct intel_context *
+static struct intel_context *
 g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 {
 	struct intel_context *ce;
@@ -2583,12 +2729,6 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 	return ce;
 }
 
-static void decr_outstanding_submission_g2h(struct intel_guc *guc)
-{
-	if (atomic_dec_and_test(&guc->outstanding_submission_g2h))
-		wake_up_all(&guc->ct.wq);
-}
-
 int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 					  const u32 *msg,
 					  u32 len)
@@ -2607,6 +2747,13 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 
 	trace_intel_context_deregister_done(ce);
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+	if (unlikely(ce->drop_deregister)) {
+		ce->drop_deregister = false;
+		return 0;
+	}
+#endif
+
 	if (context_wait_for_deregister_to_register(ce)) {
 		struct intel_runtime_pm *runtime_pm =
 			&ce->engine->gt->i915->runtime_pm;
@@ -2652,8 +2799,7 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 		     (!context_pending_enable(ce) &&
 		     !context_pending_disable(ce)))) {
 		drm_err(&guc_to_gt(guc)->i915->drm,
-			"Bad context sched_state 0x%x, 0x%x, desc_idx %u",
-			atomic_read(&ce->guc_sched_state_no_lock),
+			"Bad context sched_state 0x%x, desc_idx %u",
 			ce->guc_state.sched_state, desc_idx);
 		return -EPROTO;
 	}
@@ -2661,10 +2807,26 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 	trace_intel_context_sched_done(ce);
 
 	if (context_pending_enable(ce)) {
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_enable)) {
+			ce->drop_schedule_enable = false;
+			return 0;
+		}
+#endif
+
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
 		clr_context_pending_enable(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	} else if (context_pending_disable(ce)) {
 		bool banned;
 
+#ifdef CONFIG_DRM_I915_SELFTEST
+		if (unlikely(ce->drop_schedule_disable)) {
+			ce->drop_schedule_disable = false;
+			return 0;
+		}
+#endif
+
 		/*
 		 * Unpin must be done before __guc_signal_context_fence,
 		 * otherwise a race exists between the requests getting
@@ -2721,7 +2883,12 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
-	if (likely(!intel_context_is_banned(ce))) {
+	/*
+	 * XXX: Racey if request cancellation has occurred, see comment in
+	 * __guc_reset_context().
+	 */
+	if (likely(!intel_context_is_banned(ce) &&
+		   !context_blocked(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
 	}
@@ -2797,33 +2964,48 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 	struct intel_context *ce;
 	struct i915_request *rq;
 	unsigned long index;
+	unsigned long flags;
 
 	/* Reset called during driver load? GuC not yet initialised! */
 	if (unlikely(!guc_submission_initialized(guc)))
 		return;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		if (!intel_context_is_pinned(ce))
+		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
+		xa_unlock(&guc->context_lookup);
+
+		if (!intel_context_is_pinned(ce))
+			goto next;
+
 		if (intel_engine_is_virtual(ce->engine)) {
 			if (!(ce->engine->mask & engine->mask))
-				continue;
+				goto next;
 		} else {
 			if (ce->engine != engine)
-				continue;
+				goto next;
 		}
 
-		list_for_each_entry(rq, &ce->guc_active.requests, sched.link) {
+		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
 			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
 				continue;
 
 			intel_engine_set_hung_context(engine, ce);
 
 			/* Can only cope with one hang at a time... */
-			return;
+			intel_context_put(ce);
+			xa_lock(&guc->context_lookup);
+			goto done;
 		}
+next:
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
+
 	}
+done:
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
@@ -2839,23 +3021,34 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
 	if (unlikely(!guc_submission_initialized(guc)))
 		return;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		if (!intel_context_is_pinned(ce))
+		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
+		xa_unlock(&guc->context_lookup);
+
+		if (!intel_context_is_pinned(ce))
+			goto next;
+
 		if (intel_engine_is_virtual(ce->engine)) {
 			if (!(ce->engine->mask & engine->mask))
-				continue;
+				goto next;
 		} else {
 			if (ce->engine != engine)
-				continue;
+				goto next;
 		}
 
-		spin_lock_irqsave(&ce->guc_active.lock, flags);
-		intel_engine_dump_active_requests(&ce->guc_active.requests,
+		spin_lock(&ce->guc_state.lock);
+		intel_engine_dump_active_requests(&ce->guc_state.requests,
 						  hung_rq, m);
-		spin_unlock_irqrestore(&ce->guc_active.lock, flags);
+		spin_unlock(&ce->guc_state.lock);
+
+next:
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 void intel_guc_submission_print_info(struct intel_guc *guc,
@@ -2881,25 +3074,24 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 
 		priolist_for_each_request(rq, pl)
 			drm_printf(p, "guc_id=%u, seqno=%llu\n",
-				   rq->context->guc_id,
+				   rq->context->guc_id.id,
 				   rq->fence.seqno);
 	}
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 	drm_printf(p, "\n");
 }
 
-static inline void guc_log_context_priority(struct drm_printer *p,
-					    struct intel_context *ce)
+static void guc_log_context_priority(struct drm_printer *p,
+				     struct intel_context *ce)
 {
 	int i;
 
-	drm_printf(p, "\t\tPriority: %d\n",
-		   ce->guc_prio);
+	drm_printf(p, "\t\tPriority: %d\n", ce->guc_state.prio);
 	drm_printf(p, "\t\tNumber Requests (lower index == higher priority)\n");
 	for (i = GUC_CLIENT_PRIORITY_KMD_HIGH;
 	     i < GUC_CLIENT_PRIORITY_NUM; ++i) {
 		drm_printf(p, "\t\tNumber requests in priority band[%d]: %d\n",
-			   i, ce->guc_prio_count[i]);
+			   i, ce->guc_state.prio_count[i]);
 	}
 	drm_printf(p, "\n");
 }
@@ -2909,9 +3101,11 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 {
 	struct intel_context *ce;
 	unsigned long index;
+	unsigned long flags;
 
+	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
+		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
 		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
 		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
 			   ce->ring->head,
@@ -2922,13 +3116,13 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 		drm_printf(p, "\t\tContext Pin Count: %u\n",
 			   atomic_read(&ce->pin_count));
 		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
-			   atomic_read(&ce->guc_id_ref));
-		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
-			   ce->guc_state.sched_state,
-			   atomic_read(&ce->guc_sched_state_no_lock));
+			   atomic_read(&ce->guc_id.ref));
+		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
+			   ce->guc_state.sched_state);
 
 		guc_log_context_priority(p, ce);
 	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 static struct intel_context *
@@ -3036,3 +3230,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 
 	return false;
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftest_guc.c"
+#endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
new file mode 100644
index 000000000000..264e2f705c17
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include "selftests/intel_scheduler_helpers.h"
+
+static struct i915_request *nop_user_request(struct intel_context *ce,
+					     struct i915_request *from)
+{
+	struct i915_request *rq;
+	int ret;
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	if (from) {
+		ret = i915_sw_fence_await_dma_fence(&rq->submit,
+						    &from->fence, 0,
+						    I915_FENCE_GFP);
+		if (ret < 0) {
+			i915_request_put(rq);
+			return ERR_PTR(ret);
+		}
+	}
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static int intel_guc_scrub_ctbs(void *arg)
+{
+	struct intel_gt *gt = arg;
+	int ret = 0;
+	int i;
+	struct i915_request *last[3] = {NULL, NULL, NULL}, *rq;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct intel_context *ce;
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+	engine = intel_selftest_find_any_engine(gt);
+
+	/* Submit requests and inject errors forcing G2H to be dropped */
+	for (i = 0; i < 3; ++i) {
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce)) {
+			ret = PTR_ERR(ce);
+			pr_err("Failed to create context, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		switch (i) {
+		case 0:
+			ce->drop_schedule_enable = true;
+			break;
+		case 1:
+			ce->drop_schedule_disable = true;
+			break;
+		case 2:
+			ce->drop_deregister = true;
+			break;
+		}
+
+		rq = nop_user_request(ce, NULL);
+		intel_context_put(ce);
+
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			pr_err("Failed to create request, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		last[i] = rq;
+	}
+
+	for (i = 0; i < 3; ++i) {
+		ret = i915_request_wait(last[i], 0, HZ);
+		if (ret < 0) {
+			pr_err("Last request failed to complete: %d\n", ret);
+			goto err;
+		}
+		i915_request_put(last[i]);
+		last[i] = NULL;
+	}
+
+	/* Force all H2G / G2H to be submitted / processed */
+	intel_gt_retire_requests(gt);
+	msleep(500);
+
+	/* Scrub missing G2H */
+	intel_gt_handle_error(engine->gt, -1, 0, "selftest reset");
+
+	ret = intel_gt_wait_for_idle(gt, HZ);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err;
+	}
+
+err:
+	for (i = 0; i < 3; ++i)
+		if (last[i])
+			i915_request_put(last[i]);
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+
+	return ret;
+}
+
+int intel_guc_live_selftests(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_scrub_ctbs),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 0f08bcfbe964..2819b69fbb72 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -49,8 +49,7 @@
 #include "i915_memcpy.h"
 #include "i915_scatterlist.h"
 
-#define ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
-#define ATOMIC_MAYFAIL (GFP_ATOMIC | __GFP_NOWARN)
+#define ATOMIC_MAYFAIL (GFP_NOWAIT | __GFP_NOWARN)
 
 static void __sg_set_buf(struct scatterlist *sg,
 			 void *addr, unsigned int len, loff_t it)
@@ -79,7 +78,7 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
 	if (e->cur == e->end) {
 		struct scatterlist *sgl;
 
-		sgl = (typeof(sgl))__get_free_page(ALLOW_FAIL);
+		sgl = (typeof(sgl))__get_free_page(ATOMIC_MAYFAIL);
 		if (!sgl) {
 			e->err = -ENOMEM;
 			return false;
@@ -99,10 +98,10 @@ static bool __i915_error_grow(struct drm_i915_error_state_buf *e, size_t len)
 	}
 
 	e->size = ALIGN(len + 1, SZ_64K);
-	e->buf = kmalloc(e->size, ALLOW_FAIL);
+	e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
 	if (!e->buf) {
 		e->size = PAGE_ALIGN(len + 1);
-		e->buf = kmalloc(e->size, GFP_KERNEL);
+		e->buf = kmalloc(e->size, ATOMIC_MAYFAIL);
 	}
 	if (!e->buf) {
 		e->err = -ENOMEM;
@@ -243,12 +242,12 @@ static bool compress_init(struct i915_vma_compress *c)
 {
 	struct z_stream_s *zstream = &c->zstream;
 
-	if (pool_init(&c->pool, ALLOW_FAIL))
+	if (pool_init(&c->pool, ATOMIC_MAYFAIL))
 		return false;
 
 	zstream->workspace =
 		kmalloc(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL),
-			ALLOW_FAIL);
+			ATOMIC_MAYFAIL);
 	if (!zstream->workspace) {
 		pool_fini(&c->pool);
 		return false;
@@ -256,7 +255,7 @@ static bool compress_init(struct i915_vma_compress *c)
 
 	c->tmp = NULL;
 	if (i915_has_memcpy_from_wc())
-		c->tmp = pool_alloc(&c->pool, ALLOW_FAIL);
+		c->tmp = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 
 	return true;
 }
@@ -280,7 +279,7 @@ static void *compress_next_page(struct i915_vma_compress *c,
 	if (dst->page_count >= dst->num_pages)
 		return ERR_PTR(-ENOSPC);
 
-	page = pool_alloc(&c->pool, ALLOW_FAIL);
+	page = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 	if (!page)
 		return ERR_PTR(-ENOMEM);
 
@@ -376,7 +375,7 @@ struct i915_vma_compress {
 
 static bool compress_init(struct i915_vma_compress *c)
 {
-	return pool_init(&c->pool, ALLOW_FAIL) == 0;
+	return pool_init(&c->pool, ATOMIC_MAYFAIL) == 0;
 }
 
 static bool compress_start(struct i915_vma_compress *c)
@@ -391,7 +390,7 @@ static int compress_page(struct i915_vma_compress *c,
 {
 	void *ptr;
 
-	ptr = pool_alloc(&c->pool, ALLOW_FAIL);
+	ptr = pool_alloc(&c->pool, ATOMIC_MAYFAIL);
 	if (!ptr)
 		return -ENOMEM;
 
@@ -997,7 +996,7 @@ i915_vma_coredump_create(const struct intel_gt *gt,
 
 	num_pages = min_t(u64, vma->size, vma->obj->base.size) >> PAGE_SHIFT;
 	num_pages = DIV_ROUND_UP(10 * num_pages, 8); /* worstcase zlib growth */
-	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ALLOW_FAIL);
+	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ATOMIC_MAYFAIL);
 	if (!dst)
 		return NULL;
 
@@ -1433,7 +1432,7 @@ capture_engine(struct intel_engine_cs *engine,
 	struct i915_request *rq = NULL;
 	unsigned long flags;
 
-	ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
+	ee = intel_engine_coredump_alloc(engine, ATOMIC_MAYFAIL);
 	if (!ee)
 		return NULL;
 
@@ -1481,7 +1480,7 @@ gt_record_engines(struct intel_gt_coredump *gt,
 		struct intel_engine_coredump *ee;
 
 		/* Refill our page pool before entering atomic section */
-		pool_refill(&compress->pool, ALLOW_FAIL);
+		pool_refill(&compress->pool, ATOMIC_MAYFAIL);
 
 		ee = capture_engine(engine, compress);
 		if (!ee)
@@ -1507,7 +1506,7 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	const struct intel_uc *uc = &gt->_gt->uc;
 	struct intel_uc_coredump *error_uc;
 
-	error_uc = kzalloc(sizeof(*error_uc), ALLOW_FAIL);
+	error_uc = kzalloc(sizeof(*error_uc), ATOMIC_MAYFAIL);
 	if (!error_uc)
 		return NULL;
 
@@ -1518,8 +1517,8 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	 * As modparams are generally accessible from userspace make
 	 * explicit copies of the firmware paths.
 	 */
-	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ALLOW_FAIL);
-	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ALLOW_FAIL);
+	error_uc->guc_fw.path = kstrdup(uc->guc.fw.path, ATOMIC_MAYFAIL);
+	error_uc->huc_fw.path = kstrdup(uc->huc.fw.path, ATOMIC_MAYFAIL);
 	error_uc->guc_log =
 		i915_vma_coredump_create(gt->_gt,
 					 uc->guc.log.vma, "GuC log buffer",
@@ -1778,7 +1777,7 @@ i915_vma_capture_prepare(struct intel_gt_coredump *gt)
 {
 	struct i915_vma_compress *compress;
 
-	compress = kmalloc(sizeof(*compress), ALLOW_FAIL);
+	compress = kmalloc(sizeof(*compress), ATOMIC_MAYFAIL);
 	if (!compress)
 		return NULL;
 
@@ -1811,11 +1810,11 @@ i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
 	if (IS_ERR(error))
 		return error;
 
-	error = i915_gpu_coredump_alloc(i915, ALLOW_FAIL);
+	error = i915_gpu_coredump_alloc(i915, ATOMIC_MAYFAIL);
 	if (!error)
 		return ERR_PTR(-ENOMEM);
 
-	error->gt = intel_gt_coredump_alloc(gt, ALLOW_FAIL);
+	error->gt = intel_gt_coredump_alloc(gt, ATOMIC_MAYFAIL);
 	if (error->gt) {
 		struct i915_vma_compress *compress;
 
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 1bc1349ba3c2..177eaf55adff 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -218,6 +218,11 @@ struct i915_request {
 	};
 	struct llist_head execute_cb;
 	struct i915_sw_fence semaphore;
+	/**
+	 * @submit_work: complete submit fence from an IRQ if needed for
+	 * locking hierarchy reasons.
+	 */
+	struct irq_work submit_work;
 
 	/*
 	 * A list of everyone we wait upon, and everyone who waits upon us.
@@ -285,18 +290,20 @@ struct i915_request {
 		struct hrtimer timer;
 	} watchdog;
 
-	/*
-	 * Requests may need to be stalled when using GuC submission waiting for
-	 * certain GuC operations to complete. If that is the case, stalled
-	 * requests are added to a per context list of stalled requests. The
-	 * below list_head is the link in that list.
+	/**
+	 * @guc_fence_link: Requests may need to be stalled when using GuC
+	 * submission waiting for certain GuC operations to complete. If that is
+	 * the case, stalled requests are added to a per context list of stalled
+	 * requests. The below list_head is the link in that list. Protected by
+	 * ce->guc_state.lock.
 	 */
 	struct list_head guc_fence_link;
 
 	/**
-	 * Priority level while the request is inflight. Differs from i915
-	 * scheduler priority. See comment above
-	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details.
+	 * @guc_prio: Priority level while the request is inflight. Differs from
+	 * i915 scheduler priority. See comment above
+	 * I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP for details. Protected by
+	 * ce->guc_state.lock.
 	 */
 #define	GUC_PRIO_INIT	0xff
 #define	GUC_PRIO_FINI	0xfe
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 806ad688274b..ec7fe12b94aa 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -805,7 +805,7 @@ DECLARE_EVENT_CLASS(i915_request,
 			   __entry->dev = rq->engine->i915->drm.primary->index;
 			   __entry->class = rq->engine->uabi_class;
 			   __entry->instance = rq->engine->uabi_instance;
-			   __entry->guc_id = rq->context->guc_id;
+			   __entry->guc_id = rq->context->guc_id.id;
 			   __entry->ctx = rq->fence.context;
 			   __entry->seqno = rq->fence.seqno;
 			   __entry->tail = rq->tail;
@@ -903,23 +903,19 @@ DECLARE_EVENT_CLASS(intel_context,
 			     __field(u32, guc_id)
 			     __field(int, pin_count)
 			     __field(u32, sched_state)
-			     __field(u32, guc_sched_state_no_lock)
 			     __field(u8, guc_prio)
 			     ),
 
 		    TP_fast_assign(
-			   __entry->guc_id = ce->guc_id;
+			   __entry->guc_id = ce->guc_id.id;
 			   __entry->pin_count = atomic_read(&ce->pin_count);
 			   __entry->sched_state = ce->guc_state.sched_state;
-			   __entry->guc_sched_state_no_lock =
-			   atomic_read(&ce->guc_sched_state_no_lock);
-			   __entry->guc_prio = ce->guc_prio;
+			   __entry->guc_prio = ce->guc_state.prio;
 			   ),
 
-		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x,0x%x, guc_prio=%u",
+		    TP_printk("guc_id=%d, pin_count=%d sched_state=0x%x, guc_prio=%u",
 			      __entry->guc_id, __entry->pin_count,
 			      __entry->sched_state,
-			      __entry->guc_sched_state_no_lock,
 			      __entry->guc_prio)
 );
 
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index cfa5c4165a4f..3cf6758931f9 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -47,5 +47,6 @@ selftest(execlists, intel_execlists_live_selftests)
 selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
+selftest(guc, intel_guc_live_selftests)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index d67710d10615..e2c5db77f087 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -772,6 +772,98 @@ static int __cancel_completed(struct intel_engine_cs *engine)
 	return err;
 }
 
+static int __cancel_reset(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq, *nop;
+	unsigned long preempt_timeout_ms;
+	int err = 0;
+
+	preempt_timeout_ms = engine->props.preempt_timeout_ms;
+	engine->props.preempt_timeout_ms = 100;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		goto out_restore;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_NOOP);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling active request\n", engine->name);
+	i915_request_get(rq);
+	i915_request_add(rq);
+	if (!igt_wait_for_spinner(&spin, rq)) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("Failed to start spinner on %s\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	nop = intel_context_create_request(ce);
+	if (IS_ERR(nop)) {
+		err = PTR_ERR(nop);
+		goto out_rq;
+	}
+	i915_request_get(nop);
+	i915_request_add(nop);
+
+	i915_request_cancel(rq, -EINTR);
+
+	if (i915_request_wait(rq, 0, HZ) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel hung request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_nop;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+		goto out_nop;
+	}
+
+	if (i915_request_wait(nop, 0, HZ) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to complete nop request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_nop;
+	}
+
+	if (nop->fence.error != 0) {
+		pr_err("%s: Nop request errored (%u)\n",
+		       engine->name, nop->fence.error);
+		err = -EINVAL;
+	}
+
+out_nop:
+	i915_request_put(nop);
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+out_restore:
+	engine->props.preempt_timeout_ms = preempt_timeout_ms;
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
 static int live_cancel_request(void *arg)
 {
 	struct drm_i915_private *i915 = arg;
@@ -804,6 +896,14 @@ static int live_cancel_request(void *arg)
 			return err;
 		if (err2)
 			return err2;
+
+		/* Expects reset so call outside of igt_live_test_* */
+		err = __cancel_reset(engine);
+		if (err)
+			return err;
+
+		if (igt_flush_test(i915))
+			return -EIO;
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
index 4b328346b48a..310fb83c527e 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
@@ -14,6 +14,18 @@
 #define REDUCED_PREEMPT		10
 #define WAIT_FOR_RESET_TIME	10000
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, gt, id)
+		return engine;
+
+	pr_err("No valid engine found!\n");
+	return NULL;
+}
+
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 u32 modify_type)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
index 35c098601ac0..ae60bb507f45 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
@@ -10,6 +10,7 @@
 
 struct i915_request;
 struct intel_engine_cs;
+struct intel_gt;
 
 struct intel_selftest_saved_policy {
 	u32 flags;
@@ -23,6 +24,7 @@ enum selftest_scheduler_modify {
 	SELFTEST_SCHEDULER_MODIFY_FAST_RESET,
 };
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 enum selftest_scheduler_modify modify_type);
-- 
2.32.0
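
The __guc_signal_context_fence() change above encodes a general locking
pattern: a completion callback that may need a lock ordered before the
lock currently held is bounced to an irq_work, so that it runs with
neither lock held. A minimal self-contained sketch of that pattern, with
purely illustrative names (my_request, my_ctx and complete_my_fence are
stand-ins, not i915 API):

#include <linux/irq_work.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct my_request {
	struct irq_work submit_work;
	struct list_head link;
};

struct my_ctx {
	spinlock_t state_lock;
	struct list_head fences;
};

/* Stand-in for i915_sw_fence_complete(); assume it may take a lock
 * that is ordered *before* state_lock. */
static void complete_my_fence(struct my_request *rq)
{
}

static void my_submit_work_cb(struct irq_work *wrk)
{
	struct my_request *rq = container_of(wrk, typeof(*rq), submit_work);

	/* No locks held here, so the documented lock order is safe. */
	complete_my_fence(rq);
}

static void my_request_init(struct my_request *rq)
{
	init_irq_work(&rq->submit_work, my_submit_work_cb);
	INIT_LIST_HEAD(&rq->link);
}

static void my_signal_fences(struct my_ctx *ctx)
{
	struct my_request *rq;

	lockdep_assert_held(&ctx->state_lock);

	/*
	 * Completing directly here would take the outer lock while
	 * holding state_lock, inverting the order; queue the irq_work
	 * instead and let the completion run with no locks held.
	 */
	list_for_each_entry(rq, &ctx->fences, link)
		irq_work_queue(&rq->submit_work);
}

The patch wires this shape up via rq->submit_work and submit_work_cb(),
initialising the irq_work in guc_request_alloc() only for requests that
are actually fenced.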


* [PATCH 02/27] drm/i915/guc: Allow flexible number of context ids
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Number of available GuC context ids might be limited.
Stop referring to the macro in code and use a variable instead. Note
that num_guc_ids bounds runtime guc_id allocation, while max_guc_ids
still sizes the statically allocated LRC descriptor pool.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h            |  4 ++++
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++++++------
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 112dd29a63fe..6fd2719d1b75 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -60,6 +60,10 @@ struct intel_guc {
 	spinlock_t contexts_lock;
 	/** @guc_ids: used to allocate new guc_ids */
 	struct ida guc_ids;
+	/** @num_guc_ids: number of guc_ids that can be used */
+	u32 num_guc_ids;
+	/** @max_guc_ids: max number of guc_ids that can be used */
+	u32 max_guc_ids;
 	/**
 	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
 	 */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 46158d996bf6..8235e49bb347 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -344,7 +344,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 {
 	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
 
-	GEM_BUG_ON(index >= GUC_MAX_LRC_DESCRIPTORS);
+	GEM_BUG_ON(index >= guc->max_guc_ids);
 
 	return &base[index];
 }
@@ -353,7 +353,7 @@ static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
 {
 	struct intel_context *ce = xa_load(&guc->context_lookup, id);
 
-	GEM_BUG_ON(id >= GUC_MAX_LRC_DESCRIPTORS);
+	GEM_BUG_ON(id >= guc->max_guc_ids);
 
 	return ce;
 }
@@ -363,8 +363,7 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
 	u32 size;
 	int ret;
 
-	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
-			  GUC_MAX_LRC_DESCRIPTORS);
+	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
 	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
 					     (void **)&guc->lrc_desc_pool_vaddr);
 	if (ret)
@@ -1193,7 +1192,7 @@ static void guc_submit_request(struct i915_request *rq)
 static int new_guc_id(struct intel_guc *guc)
 {
 	return ida_simple_get(&guc->guc_ids, 0,
-			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
+			      guc->num_guc_ids, GFP_KERNEL |
 			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
 }
 
@@ -2704,6 +2703,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
 
 void intel_guc_submission_init_early(struct intel_guc *guc)
 {
+	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
+	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
 	guc->submission_supported = __guc_submission_supported(guc);
 	guc->submission_selected = __guc_submission_selected(guc);
 }
@@ -2713,7 +2714,7 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 {
 	struct intel_context *ce;
 
-	if (unlikely(desc_idx >= GUC_MAX_LRC_DESCRIPTORS)) {
+	if (unlikely(desc_idx >= guc->max_guc_ids)) {
 		drm_err(&guc_to_gt(guc)->i915->drm,
 			"Invalid desc_idx %u", desc_idx);
 		return NULL;
@@ -3063,6 +3064,8 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 
 	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
 		   atomic_read(&guc->outstanding_submission_g2h));
+	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
+	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
 	drm_printf(p, "GuC tasklet count: %u\n\n",
 		   atomic_read(&sched_engine->tasklet.count));
 
-- 
2.32.0


* [PATCH 03/27] drm/i915/guc: Connect the number of guc_ids to debugfs
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

For testing purposes it may make sense to reduce the number of guc_ids
available to be allocated. Add debugfs support for setting the number of
guc_ids.
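
For example, assuming debugfs is mounted at /sys/kernel/debug and the
usual i915 GT debugfs layout (the exact path may vary by device and
kernel):

  echo 256 > /sys/kernel/debug/dri/0/gt/uc/guc_num_id
  cat /sys/kernel/debug/dri/0/gt/uc/guc_num_id

Writes are clamped to the range [256, max_guc_ids] by the set hook
below.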

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    | 31 +++++++++++++++++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  3 +-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
index 887c8c8f35db..b88d343ee432 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
@@ -71,12 +71,43 @@ static bool intel_eval_slpc_support(void *data)
 	return intel_guc_slpc_is_used(guc);
 }
 
+static int guc_num_id_get(void *data, u64 *val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	*val = guc->num_guc_ids;
+
+	return 0;
+}
+
+static int guc_num_id_set(void *data, u64 val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	if (val > guc->max_guc_ids)
+		val = guc->max_guc_ids;
+	else if (val < 256)
+		val = 256;
+
+	guc->num_guc_ids = val;
+
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
+
 void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
 {
 	static const struct debugfs_gt_file files[] = {
 		{ "guc_info", &guc_info_fops, NULL },
 		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
 		{ "guc_slpc_info", &guc_slpc_info_fops, &intel_eval_slpc_support},
+		{ "guc_num_id", &guc_num_id_fops, NULL },
 	};
 
 	if (!intel_guc_is_supported(guc))
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 8235e49bb347..68742b612692 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2716,7 +2716,8 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 
 	if (unlikely(desc_idx >= guc->max_guc_ids)) {
 		drm_err(&guc_to_gt(guc)->i915->drm,
-			"Invalid desc_idx %u", desc_idx);
+			"Invalid desc_idx %u, max %u",
+			desc_idx, guc->max_guc_ids);
 		return NULL;
 	}
 
-- 
2.32.0


* [PATCH 04/27] drm/i915/guc: Take GT PM ref when deregistering context
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Take a GT PM reference to prevent intel_gt_wait_for_idle from
short-circuiting while a context deregister H2G is in flight.

FIXME: Move locking / structure changes into different patch
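
The destroyed_worker introduced below exists because the final
reference to a context can be dropped from atomic context, where taking
the (sleeping) GT PM reference is not allowed. A minimal self-contained
sketch of that defer-to-worker shape, with purely illustrative names
(my_guc, my_ctx and the stubs are stand-ins, not i915 API):

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct my_ctx {
	struct list_head destroyed_link;
};

struct my_guc {
	spinlock_t lock;
	struct list_head destroyed_contexts;
	struct work_struct destroyed_worker;
};

/* Stand-ins for the GT PM get and the deregister H2G; both may sleep. */
static void my_pm_get(struct my_guc *guc) { }
static void my_deregister(struct my_guc *guc, struct my_ctx *ce) { }

/* Final-put path: may run in atomic context, so it must not sleep. */
static void my_context_destroy(struct my_guc *guc, struct my_ctx *ce)
{
	unsigned long flags;

	spin_lock_irqsave(&guc->lock, flags);
	list_add_tail(&ce->destroyed_link, &guc->destroyed_contexts);
	spin_unlock_irqrestore(&guc->lock, flags);

	/* Defer the sleeping PM get + H2G to process context. */
	queue_work(system_unbound_wq, &guc->destroyed_worker);
}

static void destroyed_worker_func(struct work_struct *w)
{
	struct my_guc *guc = container_of(w, typeof(*guc),
					  destroyed_worker);
	struct my_ctx *ce;
	unsigned long flags;

	my_pm_get(guc);		/* safe: process context */

	spin_lock_irqsave(&guc->lock, flags);
	while (!list_empty(&guc->destroyed_contexts)) {
		ce = list_first_entry(&guc->destroyed_contexts,
				      struct my_ctx, destroyed_link);
		list_del_init(&ce->destroyed_link);
		spin_unlock_irqrestore(&guc->lock, flags);

		/* Drop the lock around the sleeping call. */
		my_deregister(guc, ce);

		spin_lock_irqsave(&guc->lock, flags);
	}
	spin_unlock_irqrestore(&guc->lock, flags);
}

In the patch itself the PM reference is held until the G2H confirming
deregistration arrives; see the intel_gt_pm_put_async() call in the
destroyed path of scrub_guc_desc_for_outstanding_g2h().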

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |  13 +-
 drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  13 ++
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  46 ++--
 .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  13 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 212 +++++++++++-------
 8 files changed, 199 insertions(+), 106 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index adfe49b53b1b..c8595da64ad8 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 	ce->guc_id.id = GUC_INVALID_LRC_ID;
 	INIT_LIST_HEAD(&ce->guc_id.link);
 
+	INIT_LIST_HEAD(&ce->destroyed_link);
+
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 80bbdc7810f6..fd338a30617e 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -190,22 +190,29 @@ struct intel_context {
 		/**
 		 * @id: unique handle which is used to communicate information
 		 * with the GuC about this context, protected by
-		 * guc->contexts_lock
+		 * guc->submission_state.lock
 		 */
 		u16 id;
 		/**
 		 * @ref: the number of references to the guc_id, when
 		 * transitioning in and out of zero protected by
-		 * guc->contexts_lock
+		 * guc->submission_state.lock
 		 */
 		atomic_t ref;
 		/**
 		 * @link: in guc->guc_id_list when the guc_id has no refs but is
-		 * still valid, protected by guc->contexts_lock
+		 * still valid, protected by guc->submission_state.lock
 		 */
 		struct list_head link;
 	} guc_id;
 
+	/**
+	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
+	 * list when context is pending to be destroyed (deregistered with the
+	 * GuC), protected by guc->submission_state.lock
+	 */
+	struct list_head destroyed_link;
+
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
 	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
index 70ea46d6cfb0..17a5028ea177 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
@@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
 	return intel_wakeref_is_active(&engine->wakeref);
 }
 
+static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
+{
+	__intel_wakeref_get(&engine->wakeref);
+}
+
 static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
 {
 	intel_wakeref_get(&engine->wakeref);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index d0588d8aaa44..a17bf0d4592b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -41,6 +41,19 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
 	intel_wakeref_put_async(&gt->wakeref);
 }
 
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)
+#define with_intel_gt_pm_async(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)
+#define with_intel_gt_pm_if_awake(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)
+#define with_intel_gt_pm_if_awake_async(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)
+
 static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
 {
 	return intel_wakeref_wait_for_idle(&gt->wakeref);
diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index 8ff582222aff..ba10bd374cee 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -142,6 +142,7 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
 	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
 	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
+	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
 	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
 	INTEL_GUC_ACTION_LIMIT
 };
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 6fd2719d1b75..7358883f1540 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -53,21 +53,37 @@ struct intel_guc {
 		void (*disable)(struct intel_guc *guc);
 	} interrupts;
 
-	/**
-	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
-	 * ce->guc_id.ref when transitioning in and out of zero
-	 */
-	spinlock_t contexts_lock;
-	/** @guc_ids: used to allocate new guc_ids */
-	struct ida guc_ids;
-	/** @num_guc_ids: number of guc_ids that can be used */
-	u32 num_guc_ids;
-	/** @max_guc_ids: max number of guc_ids that can be used */
-	u32 max_guc_ids;
-	/**
-	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
-	 */
-	struct list_head guc_id_list;
+	struct {
+		/**
+		 * @lock: protects everything in submission_state, ce->guc_id,
+		 * and ce->destroyed_link
+		 */
+		spinlock_t lock;
+		/**
+		 * @guc_ids: used to allocate new guc_ids
+		 */
+		struct ida guc_ids;
+		/** @num_guc_ids: number of guc_ids that can be used */
+		u32 num_guc_ids;
+		/** @max_guc_ids: max number of guc_ids that can be used */
+		u32 max_guc_ids;
+		/**
+		 * @guc_id_list: list of intel_context with valid guc_ids but no
+		 * refs
+		 */
+		struct list_head guc_id_list;
+		/**
+		 * @destroyed_contexts: list of contexts waiting to be destroyed
+		 * (deregistered with the GuC)
+		 */
+		struct list_head destroyed_contexts;
+		/**
+		 * @destroyed_worker: worker to deregister contexts; needed as
+		 * we must take a GT PM reference, which can't be done from the
+		 * destroy function since it may run in atomic context (no sleeping)
+		 */
+		struct work_struct destroyed_worker;
+	} submission_state;
 
 	bool submission_supported;
 	bool submission_selected;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
index b88d343ee432..27655460ee84 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
@@ -78,7 +78,7 @@ static int guc_num_id_get(void *data, u64 *val)
 	if (!intel_guc_submission_is_used(guc))
 		return -ENODEV;
 
-	*val = guc->num_guc_ids;
+	*val = guc->submission_state.num_guc_ids;
 
 	return 0;
 }
@@ -86,16 +86,21 @@ static int guc_num_id_get(void *data, u64 *val)
 static int guc_num_id_set(void *data, u64 val)
 {
 	struct intel_guc *guc = data;
+	unsigned long flags;
 
 	if (!intel_guc_submission_is_used(guc))
 		return -ENODEV;
 
-	if (val > guc->max_guc_ids)
-		val = guc->max_guc_ids;
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+
+	if (val > guc->submission_state.max_guc_ids)
+		val = guc->submission_state.max_guc_ids;
 	else if (val < 256)
 		val = 256;
 
-	guc->num_guc_ids = val;
+	guc->submission_state.num_guc_ids = val;
+
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 68742b612692..f835e06e5f9f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -86,9 +86,9 @@
  * submitting at a time. Currently only 1 sched_engine used for all of GuC
  * submission but that could change in the future.
  *
- * guc->contexts_lock
- * Protects guc_id allocation. Global lock i.e. Only 1 context that uses GuC
- * submission can hold this at a time.
+ * guc->submission_state.lock
+ * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
+ * list.
  *
  * ce->guc_state.lock
  * Protects everything under ce->guc_state. Ensures that a context is in the
@@ -100,7 +100,7 @@
  *
  * Lock ordering rules:
  * sched_engine->lock -> ce->guc_state.lock
- * guc->contexts_lock -> ce->guc_state.lock
+ * guc->submission_state.lock -> ce->guc_state.lock
  *
  * Reset races:
  * When a GPU full reset is triggered it is assumed that some G2H responses to
@@ -344,7 +344,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 {
 	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
 
-	GEM_BUG_ON(index >= guc->max_guc_ids);
+	GEM_BUG_ON(index >= guc->submission_state.max_guc_ids);
 
 	return &base[index];
 }
@@ -353,7 +353,7 @@ static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
 {
 	struct intel_context *ce = xa_load(&guc->context_lookup, id);
 
-	GEM_BUG_ON(id >= guc->max_guc_ids);
+	GEM_BUG_ON(id >= guc->submission_state.max_guc_ids);
 
 	return ce;
 }
@@ -363,7 +363,8 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
 	u32 size;
 	int ret;
 
-	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
+	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
+			  guc->submission_state.max_guc_ids);
 	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
 					     (void **)&guc->lrc_desc_pool_vaddr);
 	if (ret)
@@ -711,6 +712,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 			if (deregister)
 				guc_signal_context_fence(ce);
 			if (destroyed) {
+				intel_gt_pm_put_async(guc_to_gt(guc));
 				release_guc_id(guc, ce);
 				__guc_context_destroy(ce);
 			}
@@ -789,6 +791,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
+static void guc_flush_destroyed_contexts(struct intel_guc *guc);
+
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
 	if (unlikely(!guc_submission_initialized(guc))) {
@@ -805,6 +809,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
 
 	flush_work(&guc->ct.requests.worker);
+	guc_flush_destroyed_contexts(guc);
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
 }
@@ -1102,6 +1107,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 	intel_gt_unpark_heartbeats(guc_to_gt(guc));
 }
 
+static void destroyed_worker_func(struct work_struct *w);
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
@@ -1124,9 +1131,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
 
 	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
 
-	spin_lock_init(&guc->contexts_lock);
-	INIT_LIST_HEAD(&guc->guc_id_list);
-	ida_init(&guc->guc_ids);
+	spin_lock_init(&guc->submission_state.lock);
+	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
+	ida_init(&guc->submission_state.guc_ids);
+	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
+	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
 
 	return 0;
 }
@@ -1137,6 +1146,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 		return;
 
 	guc_lrc_desc_pool_destroy(guc);
+	guc_flush_destroyed_contexts(guc);
 	i915_sched_engine_put(guc->sched_engine);
 }
 
@@ -1191,15 +1201,16 @@ static void guc_submit_request(struct i915_request *rq)
 
 static int new_guc_id(struct intel_guc *guc)
 {
-	return ida_simple_get(&guc->guc_ids, 0,
-			      guc->num_guc_ids, GFP_KERNEL |
+	return ida_simple_get(&guc->submission_state.guc_ids, 0,
+			      guc->submission_state.num_guc_ids, GFP_KERNEL |
 			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
 }
 
 static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	if (!context_guc_id_invalid(ce)) {
-		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
+		ida_simple_remove(&guc->submission_state.guc_ids,
+				  ce->guc_id.id);
 		reset_lrc_desc(guc, ce->guc_id.id);
 		set_context_guc_id_invalid(ce);
 	}
@@ -1211,9 +1222,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&guc->contexts_lock, flags);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
 	__release_guc_id(guc, ce);
-	spin_unlock_irqrestore(&guc->contexts_lock, flags);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
 static int steal_guc_id(struct intel_guc *guc)
@@ -1221,10 +1232,10 @@ static int steal_guc_id(struct intel_guc *guc)
 	struct intel_context *ce;
 	int guc_id;
 
-	lockdep_assert_held(&guc->contexts_lock);
+	lockdep_assert_held(&guc->submission_state.lock);
 
-	if (!list_empty(&guc->guc_id_list)) {
-		ce = list_first_entry(&guc->guc_id_list,
+	if (!list_empty(&guc->submission_state.guc_id_list)) {
+		ce = list_first_entry(&guc->submission_state.guc_id_list,
 				      struct intel_context,
 				      guc_id.link);
 
@@ -1249,7 +1260,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
 {
 	int ret;
 
-	lockdep_assert_held(&guc->contexts_lock);
+	lockdep_assert_held(&guc->submission_state.lock);
 
 	ret = new_guc_id(guc);
 	if (unlikely(ret < 0)) {
@@ -1271,7 +1282,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
 
 try_again:
-	spin_lock_irqsave(&guc->contexts_lock, flags);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
 
 	might_lock(&ce->guc_state.lock);
 
@@ -1286,7 +1297,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	atomic_inc(&ce->guc_id.ref);
 
 out_unlock:
-	spin_unlock_irqrestore(&guc->contexts_lock, flags);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
 	/*
 	 * -EAGAIN indicates no guc_id are available, let's retire any
@@ -1322,11 +1333,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	if (unlikely(context_guc_id_invalid(ce)))
 		return;
 
-	spin_lock_irqsave(&guc->contexts_lock, flags);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
 	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
 	    !atomic_read(&ce->guc_id.ref))
-		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
-	spin_unlock_irqrestore(&guc->contexts_lock, flags);
+		list_add_tail(&ce->guc_id.link,
+			      &guc->submission_state.guc_id_list);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
 static int __guc_action_register_context(struct intel_guc *guc,
@@ -1841,11 +1853,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
 static void guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
+	struct intel_gt *gt = guc_to_gt(guc);
+	unsigned long flags;
+	bool disabled;
 
+	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
 	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
 	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
 	GEM_BUG_ON(context_enabled(ce));
 
+	/* Seal race with Reset */
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	disabled = submission_disabled(guc);
+	if (likely(!disabled)) {
+		__intel_gt_pm_get(gt);
+		set_context_destroyed(ce);
+		clr_context_registered(ce);
+	}
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	if (unlikely(disabled)) {
+		release_guc_id(guc, ce);
+		__guc_context_destroy(ce);
+		return;
+	}
+
 	deregister_context(ce, ce->guc_id.id, true);
 }
 
@@ -1873,78 +1904,88 @@ static void __guc_context_destroy(struct intel_context *ce)
 	}
 }
 
+static void guc_flush_destroyed_contexts(struct intel_guc *guc)
+{
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+
+	GEM_BUG_ON(!submission_disabled(guc) &&
+		   guc_submission_initialized(guc));
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	list_for_each_entry_safe(ce, cn,
+				 &guc->submission_state.destroyed_contexts,
+				 destroyed_link) {
+		list_del_init(&ce->destroyed_link);
+		__release_guc_id(guc, ce);
+		__guc_context_destroy(ce);
+	}
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+}
+
+static void deregister_destroyed_contexts(struct intel_guc *guc)
+{
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	list_for_each_entry_safe(ce, cn,
+				 &guc->submission_state.destroyed_contexts,
+				 destroyed_link) {
+		list_del_init(&ce->destroyed_link);
+		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+		guc_lrc_desc_unpin(ce);
+		spin_lock_irqsave(&guc->submission_state.lock, flags);
+	}
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+}
+
+static void destroyed_worker_func(struct work_struct *w)
+{
+	struct intel_guc *guc = container_of(w, struct intel_guc,
+					     submission_state.destroyed_worker);
+	struct intel_gt *gt = guc_to_gt(guc);
+	int tmp;
+
+	with_intel_gt_pm(gt, tmp)
+		deregister_destroyed_contexts(guc);
+}
+
 static void guc_context_destroy(struct kref *kref)
 {
 	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
-	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	struct intel_guc *guc = ce_to_guc(ce);
-	intel_wakeref_t wakeref;
 	unsigned long flags;
-	bool disabled;
+	bool destroy;
 
 	/*
 	 * If the guc_id is invalid this context has been stolen and we can free
 	 * it immediately. Also can be freed immediately if the context is not
 	 * registered with the GuC or the GuC is in the middle of a reset.
 	 */
-	if (context_guc_id_invalid(ce)) {
-		__guc_context_destroy(ce);
-		return;
-	} else if (submission_disabled(guc) ||
-		   !lrc_desc_registered(guc, ce->guc_id.id)) {
-		release_guc_id(guc, ce);
-		__guc_context_destroy(ce);
-		return;
-	}
-
-	/*
-	 * We have to acquire the context spinlock and check guc_id again, if it
-	 * is valid it hasn't been stolen and needs to be deregistered. We
-	 * delete this context from the list of unpinned guc_id available to
-	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
-	 * returns indicating this context has been deregistered the guc_id is
-	 * returned to the pool of available guc_id.
-	 */
-	spin_lock_irqsave(&guc->contexts_lock, flags);
-	if (context_guc_id_invalid(ce)) {
-		spin_unlock_irqrestore(&guc->contexts_lock, flags);
-		__guc_context_destroy(ce);
-		return;
-	}
-
-	if (!list_empty(&ce->guc_id.link))
-		list_del_init(&ce->guc_id.link);
-	spin_unlock_irqrestore(&guc->contexts_lock, flags);
-
-	/* Seal race with Reset */
-	spin_lock_irqsave(&ce->guc_state.lock, flags);
-	disabled = submission_disabled(guc);
-	if (likely(!disabled)) {
-		set_context_destroyed(ce);
-		clr_context_registered(ce);
+	spin_lock_irqsave(&guc->submission_state.lock, flags);
+	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
+		!lrc_desc_registered(guc, ce->guc_id.id);
+	if (likely(!destroy)) {
+		if (!list_empty(&ce->guc_id.link))
+			list_del_init(&ce->guc_id.link);
+		list_add_tail(&ce->destroyed_link,
+			      &guc->submission_state.destroyed_contexts);
+	} else {
+		__release_guc_id(guc, ce);
 	}
-	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-	if (unlikely(disabled)) {
-		release_guc_id(guc, ce);
+	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+	if (unlikely(destroy)) {
 		__guc_context_destroy(ce);
 		return;
 	}
 
 	/*
-	 * We defer GuC context deregistration until the context is destroyed
-	 * in order to save on CTBs. With this optimization ideally we only need
-	 * 1 CTB to register the context during the first pin and 1 CTB to
-	 * deregister the context when the context is destroyed. Without this
-	 * optimization, a CTB would be needed every pin & unpin.
-	 *
-	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
-	 * from context_free_worker when runtime wakeref is not held.
-	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
-	 * in H2G CTB to deregister the context. A future patch may defer this
-	 * H2G CTB if the runtime wakeref is zero.
+	 * We use a worker to issue the H2G to deregister the context as we can
+	 * take the GT PM for the first time which isn't allowed from an atomic
+	 * context.
 	 */
-	with_intel_runtime_pm(runtime_pm, wakeref)
-		guc_lrc_desc_unpin(ce);
+	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
 }
 
 static int guc_context_alloc(struct intel_context *ce)
@@ -2703,8 +2744,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
 
 void intel_guc_submission_init_early(struct intel_guc *guc)
 {
-	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
-	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
+	guc->submission_state.max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
+	guc->submission_state.num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
 	guc->submission_supported = __guc_submission_supported(guc);
 	guc->submission_selected = __guc_submission_selected(guc);
 }
@@ -2714,10 +2755,10 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 {
 	struct intel_context *ce;
 
-	if (unlikely(desc_idx >= guc->max_guc_ids)) {
+	if (unlikely(desc_idx >= guc->submission_state.max_guc_ids)) {
 		drm_err(&guc_to_gt(guc)->i915->drm,
 			"Invalid desc_idx %u, max %u",
-			desc_idx, guc->max_guc_ids);
+			desc_idx, guc->submission_state.max_guc_ids);
 		return NULL;
 	}
 
@@ -2771,6 +2812,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 		intel_context_put(ce);
 	} else if (context_destroyed(ce)) {
 		/* Context has been destroyed */
+		intel_gt_pm_put_async(guc_to_gt(guc));
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 	}
@@ -3065,8 +3107,10 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 
 	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
 		   atomic_read(&guc->outstanding_submission_g2h));
-	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
-	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
+	drm_printf(p, "GuC Number GuC IDs: %u\n",
+		   guc->submission_state.num_guc_ids);
+	drm_printf(p, "GuC Max GuC IDs: %u\n",
+		   guc->submission_state.max_guc_ids);
 	drm_printf(p, "GuC tasklet count: %u\n\n",
 		   atomic_read(&sched_engine->tasklet.count));
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 05/27] drm/i915: Add GT PM unpark worker
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Sometimes it is desirable to queue up work for later if the GT PM isn't
currently held, and to run that work on the next GT PM unpark.

This is implemented with a per-GT list of all pending work, a callback
to add a work item to the list, and finally a wakeref post_get callback
that drains the list and queues each pending work item.

First user of this is deregistration of GuC contexts.
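
A rough usage sketch of the API added below. The worker body and the
"work" / "gt" variables are illustrative assumptions, not part of the
patch:

	static void example_unpark_worker(struct work_struct *w)
	{
		struct intel_gt_pm_unpark_work *unpark_work =
			container_of(w, struct intel_gt_pm_unpark_work, worker);

		/*
		 * The GT may have parked again before this runs, so guard
		 * the real work with a PM reference (e.g.
		 * with_intel_gt_pm_if_awake) and re-add the worker if work
		 * remains, as the GuC deregister worker below does.
		 */
	}

	/* One-time setup, e.g. at submission init */
	intel_gt_pm_unpark_work_init(&work, example_unpark_worker);

	/*
	 * Safe from atomic context: queues the worker immediately if the
	 * GT is awake, otherwise defers it to the next GT unpark.
	 */
	intel_gt_pm_unpark_work_add(gt, &work);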

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/Makefile                 |  1 +
 drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  8 ++++
 .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c | 35 ++++++++++++++++
 .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h | 40 +++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_gt_types.h      | 10 +++++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  8 ++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++--
 drivers/gpu/drm/i915/intel_wakeref.c          |  5 +++
 drivers/gpu/drm/i915/intel_wakeref.h          |  1 +
 10 files changed, 119 insertions(+), 7 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
 create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 642a5b5a1b81..579bdc069f25 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -103,6 +103,7 @@ gt-y += \
 	gt/intel_gt_clock_utils.o \
 	gt/intel_gt_irq.o \
 	gt/intel_gt_pm.o \
+	gt/intel_gt_pm_unpark_work.o \
 	gt/intel_gt_pm_irq.o \
 	gt/intel_gt_requests.o \
 	gt/intel_gtt.o \
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 62d40c986642..7e690e74baa2 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -29,6 +29,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
 
 	spin_lock_init(&gt->irq_lock);
 
+	spin_lock_init(&gt->pm_unpark_work_lock);
+	INIT_LIST_HEAD(&gt->pm_unpark_work_list);
+
 	INIT_LIST_HEAD(&gt->closed_vma);
 	spin_lock_init(&gt->closed_lock);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index dea8e2479897..564c11a3748b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -90,6 +90,13 @@ static int __gt_unpark(struct intel_wakeref *wf)
 	return 0;
 }
 
+static void __gt_unpark_work_queue(struct intel_wakeref *wf)
+{
+	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
+
+	intel_gt_pm_unpark_work_queue(gt);
+}
+
 static int __gt_park(struct intel_wakeref *wf)
 {
 	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
@@ -118,6 +125,7 @@ static int __gt_park(struct intel_wakeref *wf)
 
 static const struct intel_wakeref_ops wf_ops = {
 	.get = __gt_unpark,
+	.post_get = __gt_unpark_work_queue,
 	.put = __gt_park,
 };
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
new file mode 100644
index 000000000000..23162dbd0c35
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include "i915_drv.h"
+#include "intel_runtime_pm.h"
+#include "intel_gt_pm.h"
+
+void intel_gt_pm_unpark_work_queue(struct intel_gt *gt)
+{
+	struct intel_gt_pm_unpark_work *work, *next;
+	unsigned long flags;
+
+	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
+	list_for_each_entry_safe(work, next,
+				 &gt->pm_unpark_work_list, link) {
+		list_del_init(&work->link);
+		queue_work(system_unbound_wq, &work->worker);
+	}
+	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
+}
+
+void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
+				 struct intel_gt_pm_unpark_work *work)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
+	if (intel_gt_pm_is_awake(gt))
+		queue_work(system_unbound_wq, &work->worker);
+	else if (list_empty(&work->link))
+		list_add_tail(&work->link, &gt->pm_unpark_work_list);
+	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
+}
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
new file mode 100644
index 000000000000..eaf1dc313aa2
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#ifndef INTEL_GT_PM_UNPARK_WORK_H
+#define INTEL_GT_PM_UNPARK_WORK_H
+
+#include <linux/list.h>
+#include <linux/workqueue.h>
+
+struct intel_gt;
+
+/**
+ * struct intel_gt_pm_unpark_work - work to be scheduled when GT unparked
+ */
+struct intel_gt_pm_unpark_work {
+	/**
+	 * @link: link into gt->pm_unpark_work_list of workers that need to be
+	 * scheduled when GT is unparked, protected by gt->pm_unpark_work_lock
+	 */
+	struct list_head link;
+	/** @worker: will be scheduled when GT unparked */
+	struct work_struct worker;
+};
+
+void intel_gt_pm_unpark_work_queue(struct intel_gt *gt);
+
+void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
+				 struct intel_gt_pm_unpark_work *work);
+
+static inline void
+intel_gt_pm_unpark_work_init(struct intel_gt_pm_unpark_work *work,
+			     work_func_t fn)
+{
+	INIT_LIST_HEAD(&work->link);
+	INIT_WORK(&work->worker, fn);
+}
+
+#endif /* INTEL_GT_PM_UNPARK_WORK_H */
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index a81e21bf1bd1..4480312f0add 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -96,6 +96,16 @@ struct intel_gt {
 	struct intel_wakeref wakeref;
 	atomic_t user_wakeref;
 
+	/**
+	 * @pm_unpark_work_list: list of delayed work to be scheduled when the
+	 * GT is unparked, protected by pm_unpark_work_lock
+	 */
+	struct list_head pm_unpark_work_list;
+	/**
+	 * @pm_unpark_work_lock: protects pm_unpark_work_list
+	 */
+	spinlock_t pm_unpark_work_lock;
+
 	struct list_head closed_vma;
 	spinlock_t closed_lock; /* guards the list of closed_vma */
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 7358883f1540..023953e77553 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -19,6 +19,7 @@
 #include "intel_uc_fw.h"
 #include "i915_utils.h"
 #include "i915_vma.h"
+#include "gt/intel_gt_pm_unpark_work.h"
 
 struct __guc_ads_blob;
 
@@ -78,11 +79,12 @@ struct intel_guc {
 		 */
 		struct list_head destroyed_contexts;
 		/**
-		 * @destroyed_worker: worker to deregister contexts, needed as we
+		 * @destroyed_worker: Worker to deregister contexts, needed as we
 		 * need to take a GT PM reference and can't from destroy
-		 * function as it might be in an atomic context (no sleeping)
+		 * function as it might be in an atomic context (no sleeping).
+		 * Worker only issues deregister when GT is unparked.
 		 */
-		struct work_struct destroyed_worker;
+		struct intel_gt_pm_unpark_work destroyed_worker;
 	} submission_state;
 
 	bool submission_supported;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f835e06e5f9f..dbf919801de2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1135,7 +1135,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
 	ida_init(&guc->submission_state.guc_ids);
 	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
-	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
+	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
+				     destroyed_worker_func);
 
 	return 0;
 }
@@ -1942,13 +1943,18 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
 
 static void destroyed_worker_func(struct work_struct *w)
 {
-	struct intel_guc *guc = container_of(w, struct intel_guc,
+	struct intel_gt_pm_unpark_work *destroyed_worker =
+		container_of(w, struct intel_gt_pm_unpark_work, worker);
+	struct intel_guc *guc = container_of(destroyed_worker, struct intel_guc,
 					     submission_state.destroyed_worker);
 	struct intel_gt *gt = guc_to_gt(guc);
 	int tmp;
 
-	with_intel_gt_pm(gt, tmp)
+	with_intel_gt_pm_if_awake(gt, tmp)
 		deregister_destroyed_contexts(guc);
+
+	if (!list_empty(&guc->submission_state.destroyed_contexts))
+		intel_gt_pm_unpark_work_add(gt, destroyed_worker);
 }
 
 static void guc_context_destroy(struct kref *kref)
@@ -1985,7 +1991,8 @@ static void guc_context_destroy(struct kref *kref)
 	 * take the GT PM for the first time which isn't allowed from an atomic
 	 * context.
 	 */
-	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
+	intel_gt_pm_unpark_work_add(guc_to_gt(guc),
+				    &guc->submission_state.destroyed_worker);
 }
 
 static int guc_context_alloc(struct intel_context *ce)
diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
index dfd87d082218..282fc4f312e3 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.c
+++ b/drivers/gpu/drm/i915/intel_wakeref.c
@@ -24,6 +24,8 @@ static void rpm_put(struct intel_wakeref *wf)
 
 int __intel_wakeref_get_first(struct intel_wakeref *wf)
 {
+	bool do_post = false;
+
 	/*
 	 * Treat get/put as different subclasses, as we may need to run
 	 * the put callback from under the shrinker and do not want to
@@ -44,8 +46,11 @@ int __intel_wakeref_get_first(struct intel_wakeref *wf)
 		}
 
 		smp_mb__before_atomic(); /* release wf->count */
+		do_post = true;
 	}
 	atomic_inc(&wf->count);
+	if (do_post && wf->ops->post_get)
+		wf->ops->post_get(wf);
 	mutex_unlock(&wf->mutex);
 
 	INTEL_WAKEREF_BUG_ON(atomic_read(&wf->count) <= 0);
diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
index 545c8f277c46..ef7e6a698e8a 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.h
+++ b/drivers/gpu/drm/i915/intel_wakeref.h
@@ -30,6 +30,7 @@ typedef depot_stack_handle_t intel_wakeref_t;
 
 struct intel_wakeref_ops {
 	int (*get)(struct intel_wakeref *wf);
+	void (*post_get)(struct intel_wakeref *wf);
 	int (*put)(struct intel_wakeref *wf);
 };
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 06/27] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Take a PM reference to prevent intel_gt_wait_for_idle from
short-circuiting while scheduling of a user context could be enabled.

v2:
 (Daniel Vetter)
  - Add might_lock annotations to pin / unpin function
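
The wakeref mutex is only taken on the first get / last put, so a
mis-placed pin can go unnoticed for a long time. A minimal sketch of
the bug class the annotations catch (hypothetical caller and lock, not
from this series):

    spin_lock(&lock);      /* atomic context, sleeping not allowed */
    intel_context_pin(ce); /* may take engine->wakeref.mutex to wake */
    spin_unlock(&lock);

Without intel_engine_pm_might_get() this only trips lockdep when the
engine happens to be asleep; with the might_lock() annotation every
pin reports the illegal nesting.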

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |  3 ++
 drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 15 ++++++++
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
 drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
 5 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index c8595da64ad8..508cfe5770c0 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
 	if (err)
 		goto err_post_unpin;
 
+	intel_engine_pm_might_get(ce->engine);
+
 	if (unlikely(intel_context_is_closed(ce))) {
 		err = -ENOENT;
 		goto err_unlock;
@@ -313,6 +315,7 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub)
 		return;
 
 	CE_TRACE(ce, "unpin\n");
+	intel_engine_pm_might_put(ce->engine);
 	ce->ops->unpin(ce);
 	ce->ops->post_unpin(ce);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
index 17a5028ea177..3fe2ae1bcc26 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
@@ -9,6 +9,7 @@
 #include "i915_request.h"
 #include "intel_engine_types.h"
 #include "intel_wakeref.h"
+#include "intel_gt_pm.h"
 
 static inline bool
 intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
@@ -31,6 +32,13 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
 	return intel_wakeref_get_if_active(&engine->wakeref);
 }
 
+static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
+{
+	if (!intel_engine_is_virtual(engine))
+		intel_wakeref_might_get(&engine->wakeref);
+	intel_gt_pm_might_get(engine->gt);
+}
+
 static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
 {
 	intel_wakeref_put(&engine->wakeref);
@@ -52,6 +60,13 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
 	intel_wakeref_unlock_wait(&engine->wakeref);
 }
 
+static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
+{
+	if (!intel_engine_is_virtual(engine))
+		intel_wakeref_might_put(&engine->wakeref);
+	intel_gt_pm_might_put(engine->gt);
+}
+
 static inline struct i915_request *
 intel_engine_create_kernel_request(struct intel_engine_cs *engine)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index a17bf0d4592b..3c173033ce23 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
 	return intel_wakeref_get_if_active(&gt->wakeref);
 }
 
+static inline void intel_gt_pm_might_get(struct intel_gt *gt)
+{
+	intel_wakeref_might_get(&gt->wakeref);
+}
+
 static inline void intel_gt_pm_put(struct intel_gt *gt)
 {
 	intel_wakeref_put(&gt->wakeref);
@@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
 	intel_wakeref_put_async(&gt->wakeref);
 }
 
+static inline void intel_gt_pm_might_put(struct intel_gt *gt)
+{
+	intel_wakeref_might_put(&gt->wakeref);
+}
+
 #define with_intel_gt_pm(gt, tmp) \
 	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
 	     intel_gt_pm_put(gt), tmp = 0)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index dbf919801de2..e0eed70f9b92 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1550,7 +1550,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
 
 static int guc_context_pin(struct intel_context *ce, void *vaddr)
 {
-	return __guc_context_pin(ce, ce->engine, vaddr);
+	int ret = __guc_context_pin(ce, ce->engine, vaddr);
+
+	if (likely(!ret && !intel_context_is_barrier(ce)))
+		intel_engine_pm_get(ce->engine);
+
+	return ret;
 }
 
 static void guc_context_unpin(struct intel_context *ce)
@@ -1559,6 +1564,9 @@ static void guc_context_unpin(struct intel_context *ce)
 
 	unpin_guc_id(guc, ce);
 	lrc_unpin(ce);
+
+	if (likely(!intel_context_is_barrier(ce)))
+		intel_engine_pm_put_async(ce->engine);
 }
 
 static void guc_context_post_unpin(struct intel_context *ce)
@@ -2328,8 +2336,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
 static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
+	int ret = __guc_context_pin(ce, engine, vaddr);
+	intel_engine_mask_t tmp, mask = ce->engine->mask;
+
+	if (likely(!ret))
+		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
+			intel_engine_pm_get(engine);
 
-	return __guc_context_pin(ce, engine, vaddr);
+	return ret;
+}
+
+static void guc_virtual_context_unpin(struct intel_context *ce)
+{
+	intel_engine_mask_t tmp, mask = ce->engine->mask;
+	struct intel_engine_cs *engine;
+	struct intel_guc *guc = ce_to_guc(ce);
+
+	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_barrier(ce));
+
+	unpin_guc_id(guc, ce);
+	lrc_unpin(ce);
+
+	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
+		intel_engine_pm_put_async(engine);
 }
 
 static void guc_virtual_context_enter(struct intel_context *ce)
@@ -2366,7 +2396,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 
 	.pre_pin = guc_virtual_context_pre_pin,
 	.pin = guc_virtual_context_pin,
-	.unpin = guc_context_unpin,
+	.unpin = guc_virtual_context_unpin,
 	.post_unpin = guc_context_post_unpin,
 
 	.ban = guc_context_ban,
diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
index ef7e6a698e8a..dd530ae028e0 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.h
+++ b/drivers/gpu/drm/i915/intel_wakeref.h
@@ -124,6 +124,12 @@ enum {
 	__INTEL_WAKEREF_PUT_LAST_BIT__
 };
 
+static inline void
+intel_wakeref_might_get(struct intel_wakeref *wf)
+{
+	might_lock(&wf->mutex);
+}
+
 /**
  * intel_wakeref_put_flags: Release the wakeref
  * @wf: the wakeref
@@ -171,6 +177,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
 			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
 }
 
+static inline void
+intel_wakeref_might_put(struct intel_wakeref *wf)
+{
+	might_lock(&wf->mutex);
+}
+
 /**
  * intel_wakeref_lock: Lock the wakeref (mutex)
  * @wf: the wakeref
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 07/27] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Calling switch_to_kernel_context isn't needed if the engine PM reference
is taken while all contexts are pinned. By not calling
switch_to_kernel_context we save on issuing a request to the engine.

v2:
 (Daniel Vetter)
  - Add FIXME comment about pushing switch_to_kernel_context to backend

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/gt/intel_engine_pm.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index 1f07ac4e0672..11fee66daf60 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -162,6 +162,15 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
 	unsigned long flags;
 	bool result = true;
 
+	/*
+	 * No need to switch_to_kernel_context with GuC submission.
+	 *
+	 * FIXME: This is execlists-specific backend behavior in generic code;
+	 * it should be pushed to the backend.
+	 */
+	if (intel_engine_uses_guc(engine))
+		return true;
+
 	/* GPU is pointing to the void, as good as in the kernel context. */
 	if (intel_gt_is_wedged(engine->gt))
 		return true;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Add logical engine mapping. This is required for split-frame, as
workloads need to be placed on engines in a logically contiguous manner.

v2:
 (Daniel Vetter)
  - Add kernel doc for new fields
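
As a worked example (hypothetical fusing, not taken from this patch):
if the video decode class has physical instances 0, 2 and 3 present
and instance 1 fused off, populate_logical_ids() assigns:

    physical instance:  0  2  3
    logical instance:   0  1  2  (logical_mask BIT(0) / BIT(1) / BIT(2))

i.e. the logical space is always contiguous from 0 regardless of which
physical instances survived fusing.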

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  5 ++
 .../drm/i915/gt/intel_execlists_submission.c  |  1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
 5 files changed, 60 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 0d9105a31d84..4d790f9a65dd 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
 	GEM_DEBUG_WARN_ON(iir);
 }
 
-static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
+static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
+			      u8 logical_instance)
 {
 	const struct engine_info *info = &intel_engines[id];
 	struct drm_i915_private *i915 = gt->i915;
@@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
 
 	engine->class = info->class;
 	engine->instance = info->instance;
+	engine->logical_mask = BIT(logical_instance);
 	__sprint_engine_name(engine);
 
 	engine->props.heartbeat_interval_ms =
@@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
 	return info->engine_mask;
 }
 
+static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
+				 u8 class, const u8 *map, u8 num_instances)
+{
+	int i, j;
+	u8 current_logical_id = 0;
+
+	for (j = 0; j < num_instances; ++j) {
+		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
+			if (!HAS_ENGINE(gt, i) ||
+			    intel_engines[i].class != class)
+				continue;
+
+			if (intel_engines[i].instance == map[j]) {
+				logical_ids[intel_engines[i].instance] =
+					current_logical_id++;
+				break;
+			}
+		}
+	}
+}
+
+static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
+{
+	int i;
+	u8 map[MAX_ENGINE_INSTANCE + 1];
+
+	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
+		map[i] = i;
+	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
+}
+
 /**
  * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
  * @gt: pointer to struct intel_gt
@@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
 	struct drm_i915_private *i915 = gt->i915;
 	const unsigned int engine_mask = init_engine_mask(gt);
 	unsigned int mask = 0;
-	unsigned int i;
+	unsigned int i, class;
+	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
 	int err;
 
 	drm_WARN_ON(&i915->drm, engine_mask == 0);
@@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
 	if (i915_inject_probe_failure(i915))
 		return -ENODEV;
 
-	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
-		if (!HAS_ENGINE(gt, i))
-			continue;
+	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
+		setup_logical_ids(gt, logical_ids, class);
 
-		err = intel_engine_setup(gt, i);
-		if (err)
-			goto cleanup;
+		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
+			u8 instance = intel_engines[i].instance;
+
+			if (intel_engines[i].class != class ||
+			    !HAS_ENGINE(gt, i))
+				continue;
 
-		mask |= BIT(i);
+			err = intel_engine_setup(gt, i,
+						 logical_ids[instance]);
+			if (err)
+				goto cleanup;
+
+			mask |= BIT(i);
+		}
 	}
 
 	/*
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index ed91bcff20eb..fddf35546b58 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -266,6 +266,11 @@ struct intel_engine_cs {
 	unsigned int guc_id;
 
 	intel_engine_mask_t mask;
+	/**
+	 * @logical_mask: logical mask of engine, reported to user space via
+	 * query IOCTL and used to communicate with the GuC in logical space
+	 */
+	intel_engine_mask_t logical_mask;
 
 	u8 class;
 	u8 instance;
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index cafb0608ffb4..813a6de01382 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -3875,6 +3875,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 
 		ve->siblings[ve->num_siblings++] = sibling;
 		ve->base.mask |= sibling->mask;
+		ve->base.logical_mask |= sibling->logical_mask;
 
 		/*
 		 * All physical engines must be compatible for their emission
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 6926919bcac6..9f5f43a16182 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
 	for_each_engine(engine, gt, id) {
 		u8 guc_class = engine_class_to_guc_class(engine->class);
 
-		system_info->mapping_table[guc_class][engine->instance] =
+		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
 			engine->instance;
 	}
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e0eed70f9b92..ffafbac7335e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1401,23 +1401,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
 	return __guc_action_deregister_context(guc, guc_id, loop);
 }
 
-static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
-{
-	switch (class) {
-	case RENDER_CLASS:
-		return mask >> RCS0;
-	case VIDEO_ENHANCEMENT_CLASS:
-		return mask >> VECS0;
-	case VIDEO_DECODE_CLASS:
-		return mask >> VCS0;
-	case COPY_ENGINE_CLASS:
-		return mask >> BCS0;
-	default:
-		MISSING_CASE(class);
-		return 0;
-	}
-}
-
 static void guc_context_policy_init(struct intel_engine_cs *engine,
 				    struct guc_lrc_desc *desc)
 {
@@ -1459,8 +1442,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 
 	desc = __get_lrc_desc(guc, desc_idx);
 	desc->engine_class = engine_class_to_guc_class(engine->class);
-	desc->engine_submit_mask = adjust_engine_mask(engine->class,
-						      engine->mask);
+	desc->engine_submit_mask = engine->logical_mask;
 	desc->hw_context_desc = ce->lrc.lrca;
 	desc->priority = ce->guc_state.prio;
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
@@ -3260,6 +3242,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 		}
 
 		ve->base.mask |= sibling->mask;
+		ve->base.logical_mask |= sibling->logical_mask;
 
 		if (n != 0 && ve->base.class != sibling->class) {
 			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 09/27] drm/i915: Expose logical engine instance to user
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Expose the logical engine instance to user space via the query engine
info IOCTL. This is required for split-frame workloads as these need to
be placed on engines in a logically contiguous order. The logical
mapping can change based on fusing. Rather than requiring user space to
have knowledge of the fusing, we simply expose the logical mapping with
the existing query engine info IOCTL.

IGT: https://patchwork.freedesktop.org/patch/445637/?series=92854&rev=1
media UMD: link coming soon

v2:
 (Daniel Vetter)
  - Add IGT link, placeholder for media UMD
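
For reference, a minimal user-space sketch of the standard two-pass
query that now also reports the logical instance (assumes an open DRM
fd; error handling elided):

    struct drm_i915_query_item item = {
        .query_id = DRM_I915_QUERY_ENGINE_INFO,
    };
    struct drm_i915_query query = {
        .num_items = 1,
        .items_ptr = (uintptr_t)&item,
    };
    struct drm_i915_query_engine_info *info;
    unsigned int i;

    ioctl(fd, DRM_IOCTL_I915_QUERY, &query);    /* 1st pass: length */
    info = calloc(1, item.length);
    item.data_ptr = (uintptr_t)info;
    ioctl(fd, DRM_IOCTL_I915_QUERY, &query);    /* 2nd pass: data */

    for (i = 0; i < info->num_engines; i++) {
        struct drm_i915_engine_info *e = &info->engines[i];

        if (e->flags & I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE)
            printf("class %u instance %u -> logical %u\n",
                   e->engine.engine_class, e->engine.engine_instance,
                   e->logical_instance);
    }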

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/i915_query.c | 2 ++
 include/uapi/drm/i915_drm.h       | 8 +++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
index e49da36c62fb..8a72923fbdba 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -124,7 +124,9 @@ query_engine_info(struct drm_i915_private *i915,
 	for_each_uabi_engine(engine, i915) {
 		info.engine.engine_class = engine->uabi_class;
 		info.engine.engine_instance = engine->uabi_instance;
+		info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;
 		info.capabilities = engine->uabi_capabilities;
+		info.logical_instance = ilog2(engine->logical_mask);
 
 		if (copy_to_user(info_ptr, &info, sizeof(info)))
 			return -EFAULT;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index bde5860b3686..b1248a67b4f8 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -2726,14 +2726,20 @@ struct drm_i915_engine_info {
 
 	/** @flags: Engine flags. */
 	__u64 flags;
+#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
 
 	/** @capabilities: Capabilities of this engine. */
 	__u64 capabilities;
 #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
 #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
 
+	/** @logical_instance: Logical instance of engine */
+	__u16 logical_instance;
+
 	/** @rsvd1: Reserved fields. */
-	__u64 rsvd1[4];
+	__u16 rsvd1[3];
+	/** @rsvd2: Reserved fields. */
+	__u64 rsvd2[3];
 };
 
 /**
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 10/27] drm/i915/guc: Introduce context parent-child relationship
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Introduce a context parent-child relationship. Once this relationship is
created, all pinning / unpinning operations are directed to the parent
context. The parent context is responsible for pinning all of its
children and itself.

This is a precursor to the full GuC multi-lrc implementation but aligns
with how the GuC multi-lrc interface is defined - a single H2G is used to
register / deregister all of the contexts simultaneously.

Subsequent patches in the series will implement the pinning / unpinning
operations for parent / child contexts.

v2:
 (Daniel Vetter)
  - Add kernel doc, add wrapper to access parent to ensure safety
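
A minimal sketch (hypothetical call site, not from this series) of how
a backend would build a parallel set of width 3 with the new helper:

    struct intel_context *child;
    int i;

    /* parent already created for sibling 0 */
    for (i = 1; i < 3; i++) {
        child = intel_context_create(engine);   /* one per sibling */
        if (IS_ERR(child))
            return PTR_ERR(child);
        intel_context_bind_parent_child(parent, child);
    }
    /* from here on, pin / unpin the parent; children follow it */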

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 29 ++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h       | 39 +++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context_types.h | 23 +++++++++++
 3 files changed, 91 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 508cfe5770c0..00d1aee6d199 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -404,6 +404,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 	INIT_LIST_HEAD(&ce->destroyed_link);
 
+	INIT_LIST_HEAD(&ce->guc_child_list);
+
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
@@ -418,10 +420,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 void intel_context_fini(struct intel_context *ce)
 {
+	struct intel_context *child, *next;
+
 	if (ce->timeline)
 		intel_timeline_put(ce->timeline);
 	i915_vm_put(ce->vm);
 
+	/* Need to put the creation ref for the children */
+	if (intel_context_is_parent(ce))
+		for_each_child_safe(ce, child, next)
+			intel_context_put(child);
+
 	mutex_destroy(&ce->pin_mutex);
 	i915_active_fini(&ce->active);
 }
@@ -537,6 +546,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 	return active;
 }
 
+void intel_context_bind_parent_child(struct intel_context *parent,
+				     struct intel_context *child)
+{
+	/*
+	 * Callers responsibility to validate that this function is used
+	 * correctly but we use GEM_BUG_ON here ensure that they do.
+	 */
+	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
+	GEM_BUG_ON(intel_context_is_pinned(parent));
+	GEM_BUG_ON(intel_context_is_child(parent));
+	GEM_BUG_ON(intel_context_is_pinned(child));
+	GEM_BUG_ON(intel_context_is_child(child));
+	GEM_BUG_ON(intel_context_is_parent(child));
+
+	parent->guc_number_children++;
+	list_add_tail(&child->guc_child_link,
+		      &parent->guc_child_list);
+	child->parent = parent;
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_context.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index c41098950746..c2985822ab74 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -44,6 +44,45 @@ void intel_context_free(struct intel_context *ce);
 int intel_context_reconfigure_sseu(struct intel_context *ce,
 				   const struct intel_sseu sseu);
 
+static inline bool intel_context_is_child(struct intel_context *ce)
+{
+	return !!ce->parent;
+}
+
+static inline bool intel_context_is_parent(struct intel_context *ce)
+{
+	return !!ce->guc_number_children;
+}
+
+static inline bool intel_context_is_pinned(struct intel_context *ce);
+
+static inline struct intel_context *
+intel_context_to_parent(struct intel_context *ce)
+{
+	if (intel_context_is_child(ce)) {
+		/*
+		 * The parent holds ref count to the child so it is always safe
+		 * for the parent to access the child, but the child has pointer
+		 * to the parent without a ref. To ensure this is safe the child
+		 * should only access the parent pointer while the parent is
+		 * pinned.
+		 */
+		GEM_BUG_ON(!intel_context_is_pinned(ce->parent));
+
+		return ce->parent;
+	} else {
+		return ce;
+	}
+}
+
+void intel_context_bind_parent_child(struct intel_context *parent,
+				     struct intel_context *child);
+
+#define for_each_child(parent, ce)\
+	list_for_each_entry(ce, &(parent)->guc_child_list, guc_child_link)
+#define for_each_child_safe(parent, ce, cn)\
+	list_for_each_entry_safe(ce, cn, &(parent)->guc_child_list, guc_child_link)
+
 /**
  * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
  * @ce - the context
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index fd338a30617e..0fafc178cf2c 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -213,6 +213,29 @@ struct intel_context {
 	 */
 	struct list_head destroyed_link;
 
+	/** anonymous struct for parent / children only members */
+	struct {
+		union {
+			/**
+			 * @guc_child_list: parent's list of children
+			 * contexts, no protection as immutable after context
+			 * creation
+			 */
+			struct list_head guc_child_list;
+			/**
+			 * @guc_child_link: child's link into parent's list of
+			 * children
+			 */
+			struct list_head guc_child_link;
+		};
+
+		/** @parent: pointer to parent if child */
+		struct intel_context *parent;
+
+		/** @guc_number_children: number of children if parent */
+		u8 guc_number_children;
+	};
+
 #ifdef CONFIG_DRM_I915_SELFTEST
 	/**
 	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 11/27] drm/i915/guc: Implement parallel context pin / unpin functions
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Parallel contexts are perma-pinned by the upper layers, which makes the
backend implementation rather simple. The parent pins the guc_id and
children increment the parent's pin count on pin, ensuring all the
contexts are unpinned before we disable scheduling with the GuC or
deregister the context.

v2:
 (Daniel Vetter)
  - Perma-pin parallel contexts
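
The resulting reference flow, as an illustrative sketch for a parent
with two children (first-pin / last-unpin transitions shown):

    intel_context_pin(parent);   /* parent pin count 1, pins guc_id */
    intel_context_pin(child0);   /* bumps parent pin count to 2 */
    intel_context_pin(child1);   /* bumps parent pin count to 3 */
    /* ... submission runs ... */
    intel_context_unpin(child1); /* parent pin count back to 2 */
    intel_context_unpin(child0); /* parent pin count back to 1 */
    intel_context_unpin(parent); /* 0 - guc_id can now be released */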

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 70 +++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ffafbac7335e..14b24298cdd7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2395,6 +2395,76 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 	.get_sibling = guc_virtual_get_sibling,
 };
 
+/* Future patches will use this function */
+__maybe_unused
+static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
+{
+	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
+	struct intel_guc *guc = ce_to_guc(ce);
+	int ret;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	ret = pin_guc_id(guc, ce);
+	if (unlikely(ret < 0))
+		return ret;
+
+	return __guc_context_pin(ce, engine, vaddr);
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
+{
+	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	__intel_context_pin(ce->parent);
+	return __guc_context_pin(ce, engine, vaddr);
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static void guc_parent_context_unpin(struct intel_context *ce)
+{
+	struct intel_guc *guc = ce_to_guc(ce);
+
+	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_barrier(ce));
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	unpin_guc_id(guc, ce);
+	lrc_unpin(ce);
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static void guc_child_context_unpin(struct intel_context *ce)
+{
+	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_barrier(ce));
+	GEM_BUG_ON(!intel_context_is_child(ce));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	lrc_unpin(ce);
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static void guc_child_context_post_unpin(struct intel_context *ce)
+{
+	GEM_BUG_ON(!intel_context_is_child(ce));
+	GEM_BUG_ON(!intel_context_is_pinned(ce->parent));
+	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
+
+	lrc_post_unpin(ce);
+	intel_context_unpin(ce->parent);
+}
+
 static bool
 guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
 {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 12/27] drm/i915/guc: Add multi-lrc context registration
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Add the multi-lrc context registration H2G. In addition, a work queue and
process descriptor are set up during multi-lrc context registration, as
these data structures are needed for multi-lrc submission.
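
For reference, a minimal sketch of the layout this patch adds (assuming
4K pages; the offset math mirrors the diff below and is illustrative
only):

	/*
	 * The extra page trails the regular context state and is split
	 * between the process descriptor and the work queue.
	 */
	u32 pdesc_offset = ce->parent_page * PAGE_SIZE;  /* guc_process_desc */
	u32 wq_offset    = pdesc_offset + PAGE_SIZE / 2; /* GUC_WQ_SIZE bytes */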

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
 drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 109 +++++++++++++++++-
 4 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 0fafc178cf2c..6f567ebeb039 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -232,8 +232,20 @@ struct intel_context {
 		/** @parent: pointer to parent if child */
 		struct intel_context *parent;
 
+
+		/** @guc_wqi_head: head pointer in work queue */
+		u16 guc_wqi_head;
+		/** @guc_wqi_tail: tail pointer in work queue */
+		u16 guc_wqi_tail;
+
 		/** @guc_number_children: number of children if parent */
 		u8 guc_number_children;
+
+		/**
+		 * @parent_page: page in context used by parent for work queue
+		 * and process descriptor
+		 */
+		u8 parent_page;
 	};
 
 #ifdef CONFIG_DRM_I915_SELFTEST
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index bb4af4977920..0ddbad4e062a 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -861,6 +861,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
 		context_size += PAGE_SIZE;
 	}
 
+	if (intel_context_is_parent(ce)) {
+		ce->parent_page = context_size / PAGE_SIZE;
+		context_size += PAGE_SIZE;
+	}
+
 	obj = i915_gem_object_create_lmem(engine->i915, context_size, 0);
 	if (IS_ERR(obj))
 		obj = i915_gem_object_create_shmem(engine->i915, context_size);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index fa4be13c8854..0e600a3b8f1e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -52,7 +52,7 @@
 
 #define GUC_DOORBELL_INVALID		256
 
-#define GUC_WQ_SIZE			(PAGE_SIZE * 2)
+#define GUC_WQ_SIZE			(PAGE_SIZE / 2)
 
 /* Work queue item header definitions */
 #define WQ_STATUS_ACTIVE		1
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 14b24298cdd7..dbcb9ab28a9a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -340,6 +340,39 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
 	return rb_entry(rb, struct i915_priolist, node);
 }
 
+/*
+ * When using multi-lrc submission an extra page in the context state is
+ * reserved for the process descriptor and work queue.
+ *
+ * The layout of this page is below:
+ * 0						guc_process_desc
+ * ...						unused
+ * PAGE_SIZE / 2				work queue start
+ * ...						work queue
+ * PAGE_SIZE - 1				work queue end
+ */
+#define WQ_OFFSET	(PAGE_SIZE / 2)
+static u32 __get_process_desc_offset(struct intel_context *ce)
+{
+	GEM_BUG_ON(!ce->parent_page);
+
+	return ce->parent_page * PAGE_SIZE;
+}
+
+static u32 __get_wq_offset(struct intel_context *ce)
+{
+	return __get_process_desc_offset(ce) + WQ_OFFSET;
+}
+
+static struct guc_process_desc *
+__get_process_desc(struct intel_context *ce)
+{
+	return (struct guc_process_desc *)
+		(ce->lrc_reg_state +
+		 ((__get_process_desc_offset(ce) -
+		   LRC_STATE_OFFSET) / sizeof(u32)));
+}
+
 static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 {
 	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
@@ -1342,6 +1375,30 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
+static int __guc_action_register_multi_lrc(struct intel_guc *guc,
+					   struct intel_context *ce,
+					   u32 guc_id,
+					   u32 offset,
+					   bool loop)
+{
+	struct intel_context *child;
+	u32 action[4 + MAX_ENGINE_INSTANCE];
+	int len = 0;
+
+	GEM_BUG_ON(ce->guc_number_children > MAX_ENGINE_INSTANCE);
+
+	action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
+	action[len++] = guc_id;
+	action[len++] = ce->guc_number_children + 1;
+	action[len++] = offset;
+	for_each_child(ce, child) {
+		offset += sizeof(struct guc_lrc_desc);
+		action[len++] = offset;
+	}
+
+	return guc_submission_send_busy_loop(guc, action, len, 0, loop);
+}
+
 static int __guc_action_register_context(struct intel_guc *guc,
 					 u32 guc_id,
 					 u32 offset,
@@ -1364,9 +1421,15 @@ static int register_context(struct intel_context *ce, bool loop)
 		ce->guc_id.id * sizeof(struct guc_lrc_desc);
 	int ret;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	trace_intel_context_register(ce);
 
-	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
+	if (intel_context_is_parent(ce))
+		ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
+						      offset, loop);
+	else
+		ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
+						    loop);
 	if (likely(!ret)) {
 		unsigned long flags;
 
@@ -1396,6 +1459,7 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	trace_intel_context_deregister(ce);
 
 	return __guc_action_deregister_context(guc, guc_id, loop);
@@ -1423,6 +1487,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	struct guc_lrc_desc *desc;
 	bool context_registered;
 	intel_wakeref_t wakeref;
+	struct intel_context *child;
 	int ret = 0;
 
 	GEM_BUG_ON(!engine->mask);
@@ -1448,6 +1513,42 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 	guc_context_policy_init(engine, desc);
 
+	/*
+	 * The context is a parent, so we need to register a process
+	 * descriptor describing a work queue and register all child contexts.
+	 */
+	if (intel_context_is_parent(ce)) {
+		struct guc_process_desc *pdesc;
+
+		ce->guc_wqi_tail = 0;
+		ce->guc_wqi_head = 0;
+
+		desc->process_desc = i915_ggtt_offset(ce->state) +
+			__get_process_desc_offset(ce);
+		desc->wq_addr = i915_ggtt_offset(ce->state) +
+			__get_wq_offset(ce);
+		desc->wq_size = GUC_WQ_SIZE;
+
+		pdesc = __get_process_desc(ce);
+		memset(pdesc, 0, sizeof(*(pdesc)));
+		pdesc->stage_id = ce->guc_id.id;
+		pdesc->wq_base_addr = desc->wq_addr;
+		pdesc->wq_size_bytes = desc->wq_size;
+		pdesc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
+		pdesc->wq_status = WQ_STATUS_ACTIVE;
+
+		for_each_child(ce, child) {
+			desc = __get_lrc_desc(guc, child->guc_id.id);
+
+			desc->engine_class =
+				engine_class_to_guc_class(engine->class);
+			desc->hw_context_desc = child->lrc.lrca;
+			desc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
+			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
+			guc_context_policy_init(engine, desc);
+		}
+	}
+
 	/*
 	 * The context_lookup xarray is used to determine if the hardware
 	 * context is currently registered. There are two cases in which it
@@ -2858,6 +2959,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 		return NULL;
 	}
 
+	if (unlikely(intel_context_is_child(ce))) {
+		drm_err(&guc_to_gt(guc)->i915->drm,
+			"Context is child, desc_idx %u", desc_idx);
+		return NULL;
+	}
+
 	return ce;
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 13/27] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

In GuC parent-child contexts the parent context controls the scheduling,
so ensure that only the parent performs the scheduling operations.
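
As a quick illustration of the rule (hypothetical call site, not part of
the patch), every scheduling path first resolves a request to its
scheduling context, which is always the parent:

	/* Hypothetical call site, illustrative only. */
	struct intel_context *ce = request_to_scheduling_context(rq);

	/* Children never reach a scheduling operation directly. */
	GEM_BUG_ON(intel_context_is_child(ce));
	guc_context_sched_disable(ce);	/* always operates on the parent */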

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 ++++++++++++++-----
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index dbcb9ab28a9a..00d54bb00bfb 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -320,6 +320,12 @@ static void decr_context_committed_requests(struct intel_context *ce)
 	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
 }
 
+static struct intel_context *
+request_to_scheduling_context(struct i915_request *rq)
+{
+	return intel_context_to_parent(rq->context);
+}
+
 static bool context_guc_id_invalid(struct intel_context *ce)
 {
 	return ce->guc_id.id == GUC_INVALID_LRC_ID;
@@ -1684,6 +1690,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
 
 	GEM_BUG_ON(guc_id == GUC_INVALID_LRC_ID);
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	trace_intel_context_sched_disable(ce);
 
 	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action),
@@ -1898,6 +1905,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	u16 guc_id;
 	bool enabled;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
 	    !lrc_desc_registered(guc, ce->guc_id.id)) {
 		spin_lock_irqsave(&ce->guc_state.lock, flags);
@@ -2286,6 +2295,8 @@ static void guc_signal_context_fence(struct intel_context *ce)
 {
 	unsigned long flags;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	clr_context_wait_for_deregister_to_register(ce);
 	__guc_signal_context_fence(ce);
@@ -2315,7 +2326,7 @@ static void guc_context_init(struct intel_context *ce)
 
 static int guc_request_alloc(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	struct intel_guc *guc = ce_to_guc(ce);
 	unsigned long flags;
 	int ret;
@@ -2358,11 +2369,12 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * exhausted and return -EAGAIN to the user indicating that they can try
 	 * again in the future.
 	 *
-	 * There is no need for a lock here as the timeline mutex ensures at
-	 * most one context can be executing this code path at once. The
-	 * guc_id_ref is incremented once for every request in flight and
-	 * decremented on each retire. When it is zero, a lock around the
-	 * increment (in pin_guc_id) is needed to seal a race with unpin_guc_id.
+	 * There is no need for a lock here as the timeline mutex (or
+	 * parallel_submit mutex in the case of multi-lrc) ensures at most one
+	 * context can be executing this code path at once. The guc_id_ref is
+	 * incremented once for every request in flight and decremented on each
+	 * retire. When it is zero, a lock around the increment (in pin_guc_id)
+	 * is needed to seal a race with unpin_guc_id.
 	 */
 	if (atomic_add_unless(&ce->guc_id.ref, 1, 0))
 		goto out;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 14/27] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Assign contexts in a parent-child relationship consecutive guc_ids. This
is accomplished by partitioning the guc_id space between ids that need to
be consecutive (1/16 of the available guc_ids) and ids that do not (the
remaining 15/16). The consecutive search is implemented via the bitmap
API.

This is a precursor to the full GuC multi-lrc implementation but aligns
with how the GuC multi-lrc interface is defined - guc_ids must be
consecutive when using the GuC multi-lrc interface.
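
A worked example of the sizing below (illustrative arithmetic only; the
65536 total is an assumption, and the snippet leans on linux/bitmap.h
and linux/log2.h):

	/* with num_guc_ids = 65536 the multi-lrc pool is 65536 / 16 = 4096 */
	unsigned long *guc_ids_bitmap = bitmap_zalloc(4096, GFP_KERNEL);
	int ret;

	/*
	 * A parent with 3 children needs order_base_2(3 + 1) == 2, i.e. a
	 * 4-id, power-of-two aligned region claimed from the bitmap.
	 */
	ret = bitmap_find_free_region(guc_ids_bitmap, 4096,
				      order_base_2(3 + 1));
	/*
	 * On success ret is the parent's guc_id and the children get
	 * ret + 1 .. ret + 3; single-lrc contexts allocate from the ida
	 * over [4096, 65536).
	 */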

v2:
 (Daniel Vetter)
  - Explicitly state why we assign consecutive guc_ids

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 107 +++++++++++++-----
 2 files changed, 86 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 023953e77553..3f95b1b4f15c 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -61,9 +61,13 @@ struct intel_guc {
 		 */
 		spinlock_t lock;
 		/**
-		 * @guc_ids: used to allocate new guc_ids
+		 * @guc_ids: used to allocate new guc_ids, single-lrc
 		 */
 		struct ida guc_ids;
+		/**
+		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
+		 */
+		unsigned long *guc_ids_bitmap;
 		/** @num_guc_ids: number of guc_ids that can be used */
 		u32 num_guc_ids;
 		/** @max_guc_ids: max number of guc_ids that can be used */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 00d54bb00bfb..e9dfd43d29a0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -125,6 +125,18 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
+/*
+ * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
+ * per the GuC submission interface. A different allocation algorithm is used
+ * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
+ * partition the guc_id space. We believe the number of multi-lrc contexts in
+ * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
+ * multi-lrc.
+ */
+#define NUMBER_MULTI_LRC_GUC_ID(guc) \
+	((guc)->submission_state.num_guc_ids / 16 > 32 ? \
+	 (guc)->submission_state.num_guc_ids / 16 : 32)
+
 /*
  * Below is a set of functions which control the GuC scheduling state which
  * require a lock.
@@ -1176,6 +1188,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
 	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
 				     destroyed_worker_func);
+	guc->submission_state.guc_ids_bitmap =
+		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
+	if (!guc->submission_state.guc_ids_bitmap)
+		return -ENOMEM;
 
 	return 0;
 }
@@ -1188,6 +1204,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 	guc_lrc_desc_pool_destroy(guc);
 	guc_flush_destroyed_contexts(guc);
 	i915_sched_engine_put(guc->sched_engine);
+	bitmap_free(guc->submission_state.guc_ids_bitmap);
 }
 
 static void queue_request(struct i915_sched_engine *sched_engine,
@@ -1239,18 +1256,43 @@ static void guc_submit_request(struct i915_request *rq)
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
-static int new_guc_id(struct intel_guc *guc)
+static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
-	return ida_simple_get(&guc->submission_state.guc_ids, 0,
-			      guc->submission_state.num_guc_ids, GFP_KERNEL |
-			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+	int ret;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	if (intel_context_is_parent(ce))
+		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
+					      NUMBER_MULTI_LRC_GUC_ID(guc),
+					      order_base_2(ce->guc_number_children
+							   + 1));
+	else
+		ret = ida_simple_get(&guc->submission_state.guc_ids,
+				     NUMBER_MULTI_LRC_GUC_ID(guc),
+				     guc->submission_state.num_guc_ids,
+				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
+				     __GFP_NOWARN);
+	if (unlikely(ret < 0))
+		return ret;
+
+	ce->guc_id.id = ret;
+	return 0;
 }
 
 static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	if (!context_guc_id_invalid(ce)) {
-		ida_simple_remove(&guc->submission_state.guc_ids,
-				  ce->guc_id.id);
+		if (intel_context_is_parent(ce))
+			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
+					      ce->guc_id.id,
+					      order_base_2(ce->guc_number_children
+							   + 1));
+		else
+			ida_simple_remove(&guc->submission_state.guc_ids,
+					  ce->guc_id.id);
 		reset_lrc_desc(guc, ce->guc_id.id);
 		set_context_guc_id_invalid(ce);
 	}
@@ -1267,49 +1309,60 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 }
 
-static int steal_guc_id(struct intel_guc *guc)
+static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
-	struct intel_context *ce;
-	int guc_id;
+	struct intel_context *cn;
 
 	lockdep_assert_held(&guc->submission_state.lock);
+	GEM_BUG_ON(intel_context_is_child(ce));
+	GEM_BUG_ON(intel_context_is_parent(ce));
 
 	if (!list_empty(&guc->submission_state.guc_id_list)) {
-		ce = list_first_entry(&guc->submission_state.guc_id_list,
+		cn = list_first_entry(&guc->submission_state.guc_id_list,
 				      struct intel_context,
 				      guc_id.link);
 
-		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
-		GEM_BUG_ON(context_guc_id_invalid(ce));
-
-		list_del_init(&ce->guc_id.link);
-		guc_id = ce->guc_id.id;
+		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
+		GEM_BUG_ON(context_guc_id_invalid(cn));
+		GEM_BUG_ON(intel_context_is_child(cn));
+		GEM_BUG_ON(intel_context_is_parent(cn));
 
-		spin_lock(&ce->guc_state.lock);
-		clr_context_registered(ce);
-		spin_unlock(&ce->guc_state.lock);
+		list_del_init(&cn->guc_id.link);
+		ce->guc_id = cn->guc_id;
+		clr_context_registered(cn);
+		set_context_guc_id_invalid(cn);
 
-		set_context_guc_id_invalid(ce);
-		return guc_id;
+		return 0;
 	} else {
 		return -EAGAIN;
 	}
 }
 
-static int assign_guc_id(struct intel_guc *guc, u16 *out)
+static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	int ret;
 
 	lockdep_assert_held(&guc->submission_state.lock);
+	GEM_BUG_ON(intel_context_is_child(ce));
 
-	ret = new_guc_id(guc);
+	ret = new_guc_id(guc, ce);
 	if (unlikely(ret < 0)) {
-		ret = steal_guc_id(guc);
+		if (intel_context_is_parent(ce))
+			return -ENOSPC;
+
+		ret = steal_guc_id(guc, ce);
 		if (ret < 0)
 			return ret;
 	}
 
-	*out = ret;
+	if (intel_context_is_parent(ce)) {
+		struct intel_context *child;
+		int i = 1;
+
+		for_each_child(ce, child)
+			child->guc_id.id = ce->guc_id.id + i++;
+	}
+
 	return 0;
 }
 
@@ -1327,7 +1380,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	might_lock(&ce->guc_state.lock);
 
 	if (context_guc_id_invalid(ce)) {
-		ret = assign_guc_id(guc, &ce->guc_id.id);
+		ret = assign_guc_id(guc, ce);
 		if (ret)
 			goto out_unlock;
 		ret = 1;	/* Indicates newly assigned guc_id */
@@ -1369,8 +1422,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	unsigned long flags;
 
 	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
+	GEM_BUG_ON(intel_context_is_child(ce));
 
-	if (unlikely(context_guc_id_invalid(ce)))
+	if (unlikely(context_guc_id_invalid(ce) ||
+		     intel_context_is_parent(ce)))
 		return;
 
 	spin_lock_irqsave(&guc->submission_state.lock, flags);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 15/27] drm/i915/guc: Implement multi-lrc submission
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Implement multi-lrc submission via a single workqueue entry and a single
H2G. The workqueue entry contains an updated tail value for each request
(one per context in the multi-lrc submission), and all of these values
are updated simultaneously. As such, the tasklet and bypass path have
been updated to coalesce requests into a single submission.
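
For orientation, a hedged sketch of the multi-lrc work queue item built
in __guc_wq_item_append() below (dword layout per the shifts in the
diff; the authoritative encoding is the GuC firmware interface):

	/*
	 * One multi-lrc work queue item for a parent with N children,
	 * (N + 4) dwords total:
	 *
	 *   dw0:          WQ_TYPE_MULTI_LRC | (len - 1) << WQ_LEN_SHIFT
	 *   dw1:          parent lrca
	 *   dw2:          guc_id << WQ_GUC_ID_SHIFT |
	 *                 parent ring tail << WQ_RING_TAIL_SHIFT
	 *   dw3:          fence_id (always 0 for now)
	 *   dw4..dw(N+3): one ring tail per child
	 */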

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   6 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 312 +++++++++++++++---
 drivers/gpu/drm/i915/i915_request.h           |   8 +
 6 files changed, 317 insertions(+), 62 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
index fbfcae727d7f..879aef662b2e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
@@ -748,3 +748,24 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
 		}
 	}
 }
+
+void intel_guc_write_barrier(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+
+	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
+		GEM_BUG_ON(guc->send_regs.fw_domains);
+		/*
+		 * This register is used by the i915 and GuC for MMIO based
+		 * communication. Once we are in this code CTBs are the only
+		 * method the i915 uses to communicate with the GuC so it is
+		 * safe to write to this register (a value of 0 is NOP for MMIO
+		 * communication). If we ever start mixing CTBs and MMIOs a new
+		 * register will have to be chosen.
+		 */
+		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
+	} else {
+		/* wmb() sufficient for a barrier if in smem */
+		wmb();
+	}
+}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 3f95b1b4f15c..0ead2406d03c 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -37,6 +37,12 @@ struct intel_guc {
 	/* Global engine used to submit requests to GuC */
 	struct i915_sched_engine *sched_engine;
 	struct i915_request *stalled_request;
+	enum {
+		STALL_NONE,
+		STALL_REGISTER_CONTEXT,
+		STALL_MOVE_LRC_TAIL,
+		STALL_ADD_REQUEST,
+	} submission_stall_reason;
 
 	/* intel_guc_recv interrupt related state */
 	spinlock_t irq_lock;
@@ -332,4 +338,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
 
 void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
 
+void intel_guc_write_barrier(struct intel_guc *guc);
+
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 20c710a74498..10d1878d2826 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
 	return ++ct->requests.last_fence;
 }
 
-static void write_barrier(struct intel_guc_ct *ct)
-{
-	struct intel_guc *guc = ct_to_guc(ct);
-	struct intel_gt *gt = guc_to_gt(guc);
-
-	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
-		GEM_BUG_ON(guc->send_regs.fw_domains);
-		/*
-		 * This register is used by the i915 and GuC for MMIO based
-		 * communication. Once we are in this code CTBs are the only
-		 * method the i915 uses to communicate with the GuC so it is
-		 * safe to write to this register (a value of 0 is NOP for MMIO
-		 * communication). If we ever start mixing CTBs and MMIOs a new
-		 * register will have to be chosen.
-		 */
-		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
-	} else {
-		/* wmb() sufficient for a barrier if in smem */
-		wmb();
-	}
-}
-
 static int ct_write(struct intel_guc_ct *ct,
 		    const u32 *action,
 		    u32 len /* in dwords */,
@@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
 	 * make sure H2G buffer update and LRC tail update (if this triggering a
 	 * submission) are visible before updating the descriptor tail
 	 */
-	write_barrier(ct);
+	intel_guc_write_barrier(ct_to_guc(ct));
 
 	/* update local copies */
 	ctb->tail = tail;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 0e600a3b8f1e..6cd26dc060d1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -65,12 +65,14 @@
 #define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
 #define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
 #define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
-#define WQ_TARGET_SHIFT			10
+#define   WQ_TYPE_MULTI_LRC		(0x5 << WQ_TYPE_SHIFT)
+#define WQ_TARGET_SHIFT			8
 #define WQ_LEN_SHIFT			16
 #define WQ_NO_WCFLUSH_WAIT		(1 << 27)
 #define WQ_PRESENT_WORKLOAD		(1 << 28)
 
-#define WQ_RING_TAIL_SHIFT		20
+#define WQ_GUC_ID_SHIFT			0
+#define WQ_RING_TAIL_SHIFT		18
 #define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
 #define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e9dfd43d29a0..b107ad095248 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -391,6 +391,29 @@ __get_process_desc(struct intel_context *ce)
 		   LRC_STATE_OFFSET) / sizeof(u32)));
 }
 
+static u32 *get_wq_pointer(struct guc_process_desc *desc,
+			   struct intel_context *ce,
+			   u32 wqi_size)
+{
+	/*
+	 * Check for space in the work queue. Cache the head pointer in the
+	 * intel_context structure in order to reduce the number of accesses
+	 * to shared GPU memory, which may be across a PCIe bus.
+	 */
+#define AVAILABLE_SPACE	\
+	CIRC_SPACE(ce->guc_wqi_tail, ce->guc_wqi_head, GUC_WQ_SIZE)
+	if (wqi_size > AVAILABLE_SPACE) {
+		ce->guc_wqi_head = READ_ONCE(desc->head);
+
+		if (wqi_size > AVAILABLE_SPACE)
+			return NULL;
+	}
+#undef AVAILABLE_SPACE
+
+	return ((u32 *)__get_process_desc(ce)) +
+		((WQ_OFFSET + ce->guc_wqi_tail) / sizeof(u32));
+}
+
 static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 {
 	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
@@ -547,10 +570,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
 
 static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
 
-static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
+static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 {
 	int err = 0;
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u32 action[3];
 	int len = 0;
 	u32 g2h_len_dw = 0;
@@ -571,26 +594,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
 	GEM_BUG_ON(context_guc_id_invalid(ce));
 
-	/*
-	 * Corner case where the GuC firmware was blown away and reloaded while
-	 * this context was pinned.
-	 */
-	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
-		err = guc_lrc_desc_pin(ce, false);
-		if (unlikely(err))
-			return err;
-	}
-
 	spin_lock(&ce->guc_state.lock);
 
 	/*
 	 * The request / context will be run on the hardware when scheduling
-	 * gets enabled in the unblock.
+	 * gets enabled in the unblock. For multi-lrc we still submit the
+	 * context to move the LRC tails.
 	 */
-	if (unlikely(context_blocked(ce)))
+	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
 		goto out;
 
-	enabled = context_enabled(ce);
+	enabled = context_enabled(ce) || context_blocked(ce);
 
 	if (!enabled) {
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
@@ -609,6 +623,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		trace_intel_context_sched_enable(ce);
 		atomic_inc(&guc->outstanding_submission_g2h);
 		set_context_enabled(ce);
+
+		/*
+		 * Without multi-lrc, the KMD does the submission step (moving
+		 * the lrc tail), so enabling scheduling is sufficient to submit
+		 * the context. This isn't the case with multi-lrc submission,
+		 * as the GuC needs to move the tails, hence the need for another
+		 * H2G to submit a multi-lrc context after enabling scheduling.
+		 */
+		if (intel_context_is_parent(ce)) {
+			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
+			err = intel_guc_send_nb(guc, action, len - 1, 0);
+		}
 	} else if (!enabled) {
 		clr_context_pending_enable(ce);
 		intel_context_put(ce);
@@ -621,6 +647,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	return err;
 }
 
+static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
+{
+	int ret = __guc_add_request(guc, rq);
+
+	if (unlikely(ret == -EBUSY)) {
+		guc->stalled_request = rq;
+		guc->submission_stall_reason = STALL_ADD_REQUEST;
+	}
+
+	return ret;
+}
+
 static void guc_set_lrc_tail(struct i915_request *rq)
 {
 	rq->context->lrc_reg_state[CTX_RING_TAIL] =
@@ -632,6 +670,127 @@ static int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
+static bool is_multi_lrc_rq(struct i915_request *rq)
+{
+	return intel_context_is_child(rq->context) ||
+		intel_context_is_parent(rq->context);
+}
+
+static bool can_merge_rq(struct i915_request *rq,
+			 struct i915_request *last)
+{
+	return request_to_scheduling_context(rq) ==
+		request_to_scheduling_context(last);
+}
+
+static u32 wq_space_until_wrap(struct intel_context *ce)
+{
+	return (GUC_WQ_SIZE - ce->guc_wqi_tail);
+}
+
+static void write_wqi(struct guc_process_desc *desc,
+		      struct intel_context *ce,
+		      u32 wqi_size)
+{
+	/*
+	 * Ensure WQE are visible before updating tail
+	 */
+	intel_guc_write_barrier(ce_to_guc(ce));
+
+	ce->guc_wqi_tail = (ce->guc_wqi_tail + wqi_size) & (GUC_WQ_SIZE - 1);
+	WRITE_ONCE(desc->tail, ce->guc_wqi_tail);
+}
+
+static int guc_wq_noop_append(struct intel_context *ce)
+{
+	struct guc_process_desc *desc = __get_process_desc(ce);
+	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
+
+	if (!wqi)
+		return -EBUSY;
+
+	*wqi = WQ_TYPE_NOOP |
+		((wq_space_until_wrap(ce) / sizeof(u32) - 1) << WQ_LEN_SHIFT);
+	ce->guc_wqi_tail = 0;
+
+	return 0;
+}
+
+static int __guc_wq_item_append(struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+	struct intel_context *child;
+	struct guc_process_desc *desc = __get_process_desc(ce);
+	unsigned int wqi_size = (ce->guc_number_children + 4) *
+		sizeof(u32);
+	u32 *wqi;
+	int ret;
+
+	/* Ensure context is in correct state updating work queue */
+	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
+	GEM_BUG_ON(context_guc_id_invalid(ce));
+	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
+	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
+
+	/* Insert NOOP if this work queue item will wrap the tail pointer. */
+	if (wqi_size > wq_space_until_wrap(ce)) {
+		ret = guc_wq_noop_append(ce);
+		if (ret)
+			return ret;
+	}
+
+	wqi = get_wq_pointer(desc, ce, wqi_size);
+	if (!wqi)
+		return -EBUSY;
+
+	*wqi++ = WQ_TYPE_MULTI_LRC |
+		((wqi_size / sizeof(u32) - 1) << WQ_LEN_SHIFT);
+	*wqi++ = ce->lrc.lrca;
+	*wqi++ = (ce->guc_id.id << WQ_GUC_ID_SHIFT) |
+		 ((ce->ring->tail / sizeof(u64)) << WQ_RING_TAIL_SHIFT);
+	*wqi++ = 0;	/* fence_id */
+	for_each_child(ce, child)
+		*wqi++ = child->ring->tail / sizeof(u64);
+
+	write_wqi(desc, ce, wqi_size);
+
+	return 0;
+}
+
+static int guc_wq_item_append(struct intel_guc *guc,
+			      struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+	int ret = 0;
+
+	if (likely(!intel_context_is_banned(ce))) {
+		ret = __guc_wq_item_append(rq);
+
+		if (unlikely(ret == -EBUSY)) {
+			guc->stalled_request = rq;
+			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
+		}
+	}
+
+	return ret;
+}
+
+static bool multi_lrc_submit(struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	intel_ring_set_tail(rq->ring, rq->tail);
+
+	/*
+	 * We expect the front end (execbuf IOCTL) to set this flag on the last
+	 * request generated from a multi-BB submission. This indicates to the
+	 * backend (GuC interface) that we should submit this context thus
+	 * submitting all the requests generated in parallel.
+	 */
+	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
+		intel_context_is_banned(ce);
+}
+
 static int guc_dequeue_one_context(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
@@ -645,7 +804,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 	if (guc->stalled_request) {
 		submit = true;
 		last = guc->stalled_request;
-		goto resubmit;
+
+		switch (guc->submission_stall_reason) {
+		case STALL_REGISTER_CONTEXT:
+			goto register_context;
+		case STALL_MOVE_LRC_TAIL:
+			goto move_lrc_tail;
+		case STALL_ADD_REQUEST:
+			goto add_request;
+		default:
+			MISSING_CASE(guc->submission_stall_reason);
+		}
 	}
 
 	while ((rb = rb_first_cached(&sched_engine->queue))) {
@@ -653,8 +822,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 		struct i915_request *rq, *rn;
 
 		priolist_for_each_request_consume(rq, rn, p) {
-			if (last && rq->context != last->context)
-				goto done;
+			if (last && !can_merge_rq(rq, last))
+				goto register_context;
 
 			list_del_init(&rq->sched.link);
 
@@ -662,33 +831,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 
 			trace_i915_request_in(rq, 0);
 			last = rq;
-			submit = true;
+
+			if (is_multi_lrc_rq(rq)) {
+				/*
+				 * We need to coalesce all multi-lrc requests in
+				 * a relationship into a single H2G. We are
+				 * guaranteed that all of these requests will be
+				 * submitted sequentially.
+				 */
+				if (multi_lrc_submit(rq)) {
+					submit = true;
+					goto register_context;
+				}
+			} else {
+				submit = true;
+			}
 		}
 
 		rb_erase_cached(&p->node, &sched_engine->queue);
 		i915_priolist_free(p);
 	}
-done:
+
+register_context:
 	if (submit) {
-		guc_set_lrc_tail(last);
-resubmit:
+		struct intel_context *ce = request_to_scheduling_context(last);
+
+		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
+			     !intel_context_is_banned(ce))) {
+			ret = guc_lrc_desc_pin(ce, false);
+			if (unlikely(ret == -EPIPE)) {
+				goto deadlk;
+			} else if (ret == -EBUSY) {
+				guc->stalled_request = last;
+				guc->submission_stall_reason =
+					STALL_REGISTER_CONTEXT;
+				goto schedule_tasklet;
+			} else if (ret != 0) {
+				GEM_WARN_ON(ret);	/* Unexpected */
+				goto deadlk;
+			}
+		}
+
+move_lrc_tail:
+		if (is_multi_lrc_rq(last)) {
+			ret = guc_wq_item_append(guc, last);
+			if (ret == -EBUSY) {
+				goto schedule_tasklet;
+			} else if (ret != 0) {
+				GEM_WARN_ON(ret);	/* Unexpected */
+				goto deadlk;
+			}
+		} else {
+			guc_set_lrc_tail(last);
+		}
+
+add_request:
 		ret = guc_add_request(guc, last);
-		if (unlikely(ret == -EPIPE))
+		if (unlikely(ret == -EPIPE)) {
+			goto deadlk;
+		} else if (ret == -EBUSY) {
+			goto schedule_tasklet;
+		} else if (ret != 0) {
+			GEM_WARN_ON(ret);	/* Unexpected */
 			goto deadlk;
-		else if (ret == -EBUSY) {
-			tasklet_schedule(&sched_engine->tasklet);
-			guc->stalled_request = last;
-			return false;
 		}
 	}
 
 	guc->stalled_request = NULL;
+	guc->submission_stall_reason = STALL_NONE;
 	return submit;
 
 deadlk:
 	sched_engine->tasklet.callback = NULL;
 	tasklet_disable_nosync(&sched_engine->tasklet);
 	return false;
+
+schedule_tasklet:
+	tasklet_schedule(&sched_engine->tasklet);
+	return false;
 }
 
 static void guc_submission_tasklet(struct tasklet_struct *t)
@@ -1227,10 +1447,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
 
 	trace_i915_request_in(rq, 0);
 
-	guc_set_lrc_tail(rq);
-	ret = guc_add_request(guc, rq);
-	if (ret == -EBUSY)
-		guc->stalled_request = rq;
+	if (is_multi_lrc_rq(rq)) {
+		if (multi_lrc_submit(rq)) {
+			ret = guc_wq_item_append(guc, rq);
+			if (!ret)
+				ret = guc_add_request(guc, rq);
+		}
+	} else {
+		guc_set_lrc_tail(rq);
+		ret = guc_add_request(guc, rq);
+	}
 
 	if (unlikely(ret == -EPIPE))
 		disable_submission(guc);
@@ -1238,6 +1464,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
 	return ret;
 }
 
+static bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
+{
+	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	return submission_disabled(guc) || guc->stalled_request ||
+		!i915_sched_engine_is_empty(sched_engine) ||
+		!lrc_desc_registered(guc, ce->guc_id.id);
+}
+
 static void guc_submit_request(struct i915_request *rq)
 {
 	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
@@ -1247,8 +1483,7 @@ static void guc_submit_request(struct i915_request *rq)
 	/* Will be called from irq-context when using foreign fences. */
 	spin_lock_irqsave(&sched_engine->lock, flags);
 
-	if (submission_disabled(guc) || guc->stalled_request ||
-	    !i915_sched_engine_is_empty(sched_engine))
+	if (need_tasklet(guc, rq))
 		queue_request(sched_engine, rq, rq_prio(rq));
 	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
 		tasklet_hi_schedule(&sched_engine->tasklet);
@@ -2241,9 +2476,10 @@ static bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
 
 static void add_to_context(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
 
 	spin_lock(&ce->guc_state.lock);
@@ -2276,7 +2512,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
 
 static void remove_from_context(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	spin_lock_irq(&ce->guc_state.lock);
 
@@ -2692,7 +2930,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
 static void guc_bump_inflight_request_prio(struct i915_request *rq,
 					   int prio)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
 
 	/* Short circuit function */
@@ -2715,7 +2953,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 
 static void guc_retire_inflight_request_prio(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 
 	spin_lock(&ce->guc_state.lock);
 	guc_prio_fini(rq, ce);
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 177eaf55adff..8f0073e19079 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -139,6 +139,14 @@ enum {
 	 * the GPU. Here we track such boost requests on a per-request basis.
 	 */
 	I915_FENCE_FLAG_BOOST,
+
+	/*
+	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
+	 * parent-child relationship (parallel submission, multi-lrc) should
+	 * trigger a submission to the GuC rather than just moving the context
+	 * tail.
+	 */
+	I915_FENCE_FLAG_SUBMIT_PARALLEL,
 };
 
 /**
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [Intel-gfx] [PATCH 15/27] drm/i915/guc: Implement multi-lrc submission
@ 2021-08-20 22:44   ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Implement multi-lrc submission via a single workqueue entry and a single
H2G. The workqueue entry contains an updated tail value for each context
in the multi-lrc submission, and the GuC updates all of these tails
simultaneously. As such, the tasklet and bypass path have been updated
to coalesce requests into a single submission.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   6 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 312 +++++++++++++++---
 drivers/gpu/drm/i915/i915_request.h           |   8 +
 6 files changed, 317 insertions(+), 62 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
index fbfcae727d7f..879aef662b2e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
@@ -748,3 +748,24 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
 		}
 	}
 }
+
+void intel_guc_write_barrier(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+
+	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
+		GEM_BUG_ON(guc->send_regs.fw_domains);
+		/*
+		 * This register is used by the i915 and GuC for MMIO based
+		 * communication. Once we are in this code CTBs are the only
+		 * method the i915 uses to communicate with the GuC so it is
+		 * safe to write to this register (a value of 0 is NOP for MMIO
+		 * communication). If we ever start mixing CTBs and MMIOs a new
+		 * register will have to be chosen.
+		 */
+		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
+	} else {
+		/* wmb() sufficient for a barrier if in smem */
+		wmb();
+	}
+}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 3f95b1b4f15c..0ead2406d03c 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -37,6 +37,12 @@ struct intel_guc {
 	/* Global engine used to submit requests to GuC */
 	struct i915_sched_engine *sched_engine;
 	struct i915_request *stalled_request;
+	enum {
+		STALL_NONE,
+		STALL_REGISTER_CONTEXT,
+		STALL_MOVE_LRC_TAIL,
+		STALL_ADD_REQUEST,
+	} submission_stall_reason;
 
 	/* intel_guc_recv interrupt related state */
 	spinlock_t irq_lock;
@@ -332,4 +338,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
 
 void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
 
+void intel_guc_write_barrier(struct intel_guc *guc);
+
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 20c710a74498..10d1878d2826 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
 	return ++ct->requests.last_fence;
 }
 
-static void write_barrier(struct intel_guc_ct *ct)
-{
-	struct intel_guc *guc = ct_to_guc(ct);
-	struct intel_gt *gt = guc_to_gt(guc);
-
-	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
-		GEM_BUG_ON(guc->send_regs.fw_domains);
-		/*
-		 * This register is used by the i915 and GuC for MMIO based
-		 * communication. Once we are in this code CTBs are the only
-		 * method the i915 uses to communicate with the GuC so it is
-		 * safe to write to this register (a value of 0 is NOP for MMIO
-		 * communication). If we ever start mixing CTBs and MMIOs a new
-		 * register will have to be chosen.
-		 */
-		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
-	} else {
-		/* wmb() sufficient for a barrier if in smem */
-		wmb();
-	}
-}
-
 static int ct_write(struct intel_guc_ct *ct,
 		    const u32 *action,
 		    u32 len /* in dwords */,
@@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
 	 * make sure H2G buffer update and LRC tail update (if this triggering a
 	 * submission) are visible before updating the descriptor tail
 	 */
-	write_barrier(ct);
+	intel_guc_write_barrier(ct_to_guc(ct));
 
 	/* update local copies */
 	ctb->tail = tail;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 0e600a3b8f1e..6cd26dc060d1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -65,12 +65,14 @@
 #define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
 #define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
 #define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
-#define WQ_TARGET_SHIFT			10
+#define   WQ_TYPE_MULTI_LRC		(0x5 << WQ_TYPE_SHIFT)
+#define WQ_TARGET_SHIFT			8
 #define WQ_LEN_SHIFT			16
 #define WQ_NO_WCFLUSH_WAIT		(1 << 27)
 #define WQ_PRESENT_WORKLOAD		(1 << 28)
 
-#define WQ_RING_TAIL_SHIFT		20
+#define WQ_GUC_ID_SHIFT			0
+#define WQ_RING_TAIL_SHIFT		18
 #define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
 #define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e9dfd43d29a0..b107ad095248 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -391,6 +391,29 @@ __get_process_desc(struct intel_context *ce)
 		   LRC_STATE_OFFSET) / sizeof(u32)));
 }
 
+static u32 *get_wq_pointer(struct guc_process_desc *desc,
+			   struct intel_context *ce,
+			   u32 wqi_size)
+{
+	/*
+	 * Check for space in the work queue. Cache a value of the head pointer
+	 * in the intel_context structure in order to reduce the number of
+	 * accesses to shared GPU memory, which may be across a PCIe bus.
+	 */
+#define AVAILABLE_SPACE	\
+	CIRC_SPACE(ce->guc_wqi_tail, ce->guc_wqi_head, GUC_WQ_SIZE)
+	if (wqi_size > AVAILABLE_SPACE) {
+		ce->guc_wqi_head = READ_ONCE(desc->head);
+
+		if (wqi_size > AVAILABLE_SPACE)
+			return NULL;
+	}
+#undef AVAILABLE_SPACE
+
+	return ((u32 *)__get_process_desc(ce)) +
+		((WQ_OFFSET + ce->guc_wqi_tail) / sizeof(u32));
+}
+
 static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 {
 	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
@@ -547,10 +570,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
 
 static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
 
-static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
+static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 {
 	int err = 0;
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u32 action[3];
 	int len = 0;
 	u32 g2h_len_dw = 0;
@@ -571,26 +594,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
 	GEM_BUG_ON(context_guc_id_invalid(ce));
 
-	/*
-	 * Corner case where the GuC firmware was blown away and reloaded while
-	 * this context was pinned.
-	 */
-	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
-		err = guc_lrc_desc_pin(ce, false);
-		if (unlikely(err))
-			return err;
-	}
-
 	spin_lock(&ce->guc_state.lock);
 
 	/*
 	 * The request / context will be run on the hardware when scheduling
-	 * gets enabled in the unblock.
+	 * gets enabled in the unblock. For multi-lrc we still submit the
+	 * context to move the LRC tails.
 	 */
-	if (unlikely(context_blocked(ce)))
+	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
 		goto out;
 
-	enabled = context_enabled(ce);
+	enabled = context_enabled(ce) || context_blocked(ce);
 
 	if (!enabled) {
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
@@ -609,6 +623,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		trace_intel_context_sched_enable(ce);
 		atomic_inc(&guc->outstanding_submission_g2h);
 		set_context_enabled(ce);
+
+		/*
+		 * Without multi-lrc, the KMD does the submission step (moving the
+		 * lrc tail) so enabling scheduling is sufficient to submit the
+		 * context. This isn't the case in multi-lrc submission as the
+		 * GuC needs to move the tails, hence the need for another H2G
+		 * to submit a multi-lrc context after enabling scheduling.
+		 */
+		if (intel_context_is_parent(ce)) {
+			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
+			err = intel_guc_send_nb(guc, action, len - 1, 0);
+		}
 	} else if (!enabled) {
 		clr_context_pending_enable(ce);
 		intel_context_put(ce);
@@ -621,6 +647,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	return err;
 }
 
+static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
+{
+	int ret = __guc_add_request(guc, rq);
+
+	if (unlikely(ret == -EBUSY)) {
+		guc->stalled_request = rq;
+		guc->submission_stall_reason = STALL_ADD_REQUEST;
+	}
+
+	return ret;
+}
+
 static void guc_set_lrc_tail(struct i915_request *rq)
 {
 	rq->context->lrc_reg_state[CTX_RING_TAIL] =
@@ -632,6 +670,127 @@ static int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
+static bool is_multi_lrc_rq(struct i915_request *rq)
+{
+	return intel_context_is_child(rq->context) ||
+		intel_context_is_parent(rq->context);
+}
+
+static bool can_merge_rq(struct i915_request *rq,
+			 struct i915_request *last)
+{
+	return request_to_scheduling_context(rq) ==
+		request_to_scheduling_context(last);
+}
+
+static u32 wq_space_until_wrap(struct intel_context *ce)
+{
+	return (GUC_WQ_SIZE - ce->guc_wqi_tail);
+}
+
+static void write_wqi(struct guc_process_desc *desc,
+		      struct intel_context *ce,
+		      u32 wqi_size)
+{
+	/*
+	 * Ensure WQE are visible before updating tail
+	 */
+	intel_guc_write_barrier(ce_to_guc(ce));
+
+	ce->guc_wqi_tail = (ce->guc_wqi_tail + wqi_size) & (GUC_WQ_SIZE - 1);
+	WRITE_ONCE(desc->tail, ce->guc_wqi_tail);
+}
+
+static int guc_wq_noop_append(struct intel_context *ce)
+{
+	struct guc_process_desc *desc = __get_process_desc(ce);
+	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
+
+	if (!wqi)
+		return -EBUSY;
+
+	*wqi = WQ_TYPE_NOOP |
+		((wq_space_until_wrap(ce) / sizeof(u32) - 1) << WQ_LEN_SHIFT);
+	ce->guc_wqi_tail = 0;
+
+	return 0;
+}
+
+static int __guc_wq_item_append(struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+	struct intel_context *child;
+	struct guc_process_desc *desc = __get_process_desc(ce);
+	unsigned int wqi_size = (ce->guc_number_children + 4) *
+		sizeof(u32);
+	u32 *wqi;
+	int ret;
+
+	/* Ensure context is in the correct state before updating the work queue */
+	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
+	GEM_BUG_ON(context_guc_id_invalid(ce));
+	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
+	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
+
+	/* Insert NOOP if this work queue item will wrap the tail pointer. */
+	if (wqi_size > wq_space_until_wrap(ce)) {
+		ret = guc_wq_noop_append(ce);
+		if (ret)
+			return ret;
+	}
+
+	wqi = get_wq_pointer(desc, ce, wqi_size);
+	if (!wqi)
+		return -EBUSY;
+
+	*wqi++ = WQ_TYPE_MULTI_LRC |
+		((wqi_size / sizeof(u32) - 1) << WQ_LEN_SHIFT);
+	*wqi++ = ce->lrc.lrca;
+	*wqi++ = (ce->guc_id.id << WQ_GUC_ID_SHIFT) |
+		 ((ce->ring->tail / sizeof(u64)) << WQ_RING_TAIL_SHIFT);
+	*wqi++ = 0;	/* fence_id */
+	for_each_child(ce, child)
+		*wqi++ = child->ring->tail / sizeof(u64);
+
+	write_wqi(desc, ce, wqi_size);
+
+	return 0;
+}
+
+static int guc_wq_item_append(struct intel_guc *guc,
+			      struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+	int ret = 0;
+
+	if (likely(!intel_context_is_banned(ce))) {
+		ret = __guc_wq_item_append(rq);
+
+		if (unlikely(ret == -EBUSY)) {
+			guc->stalled_request = rq;
+			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
+		}
+	}
+
+	return ret;
+}
+
+static bool multi_lrc_submit(struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	intel_ring_set_tail(rq->ring, rq->tail);
+
+	/*
+	 * We expect the front end (execbuf IOCTL) to set this flag on the last
+	 * request generated from a multi-BB submission. This indicates to the
+	 * backend (GuC interface) that we should submit this context thus
+	 * submitting all the requests generated in parallel.
+	 */
+	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
+		intel_context_is_banned(ce);
+}
+
 static int guc_dequeue_one_context(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
@@ -645,7 +804,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 	if (guc->stalled_request) {
 		submit = true;
 		last = guc->stalled_request;
-		goto resubmit;
+
+		switch (guc->submission_stall_reason) {
+		case STALL_REGISTER_CONTEXT:
+			goto register_context;
+		case STALL_MOVE_LRC_TAIL:
+			goto move_lrc_tail;
+		case STALL_ADD_REQUEST:
+			goto add_request;
+		default:
+			MISSING_CASE(guc->submission_stall_reason);
+		}
 	}
 
 	while ((rb = rb_first_cached(&sched_engine->queue))) {
@@ -653,8 +822,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 		struct i915_request *rq, *rn;
 
 		priolist_for_each_request_consume(rq, rn, p) {
-			if (last && rq->context != last->context)
-				goto done;
+			if (last && !can_merge_rq(rq, last))
+				goto register_context;
 
 			list_del_init(&rq->sched.link);
 
@@ -662,33 +831,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 
 			trace_i915_request_in(rq, 0);
 			last = rq;
-			submit = true;
+
+			if (is_multi_lrc_rq(rq)) {
+				/*
+				 * We need to coalesce all multi-lrc requests in
+				 * a relationship into a single H2G. We are
+				 * guaranteed that all of these requests will be
+				 * submitted sequentially.
+				 */
+				if (multi_lrc_submit(rq)) {
+					submit = true;
+					goto register_context;
+				}
+			} else {
+				submit = true;
+			}
 		}
 
 		rb_erase_cached(&p->node, &sched_engine->queue);
 		i915_priolist_free(p);
 	}
-done:
+
+register_context:
 	if (submit) {
-		guc_set_lrc_tail(last);
-resubmit:
+		struct intel_context *ce = request_to_scheduling_context(last);
+
+		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
+			     !intel_context_is_banned(ce))) {
+			ret = guc_lrc_desc_pin(ce, false);
+			if (unlikely(ret == -EPIPE)) {
+				goto deadlk;
+			} else if (ret == -EBUSY) {
+				guc->stalled_request = last;
+				guc->submission_stall_reason =
+					STALL_REGISTER_CONTEXT;
+				goto schedule_tasklet;
+			} else if (ret != 0) {
+				GEM_WARN_ON(ret);	/* Unexpected */
+				goto deadlk;
+			}
+		}
+
+move_lrc_tail:
+		if (is_multi_lrc_rq(last)) {
+			ret = guc_wq_item_append(guc, last);
+			if (ret == -EBUSY) {
+				goto schedule_tasklet;
+			} else if (ret != 0) {
+				GEM_WARN_ON(ret);	/* Unexpected */
+				goto deadlk;
+			}
+		} else {
+			guc_set_lrc_tail(last);
+		}
+
+add_request:
 		ret = guc_add_request(guc, last);
-		if (unlikely(ret == -EPIPE))
+		if (unlikely(ret == -EPIPE)) {
+			goto deadlk;
+		} else if (ret == -EBUSY) {
+			goto schedule_tasklet;
+		} else if (ret != 0) {
+			GEM_WARN_ON(ret);	/* Unexpected */
 			goto deadlk;
-		else if (ret == -EBUSY) {
-			tasklet_schedule(&sched_engine->tasklet);
-			guc->stalled_request = last;
-			return false;
 		}
 	}
 
 	guc->stalled_request = NULL;
+	guc->submission_stall_reason = STALL_NONE;
 	return submit;
 
 deadlk:
 	sched_engine->tasklet.callback = NULL;
 	tasklet_disable_nosync(&sched_engine->tasklet);
 	return false;
+
+schedule_tasklet:
+	tasklet_schedule(&sched_engine->tasklet);
+	return false;
 }
 
 static void guc_submission_tasklet(struct tasklet_struct *t)
@@ -1227,10 +1447,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
 
 	trace_i915_request_in(rq, 0);
 
-	guc_set_lrc_tail(rq);
-	ret = guc_add_request(guc, rq);
-	if (ret == -EBUSY)
-		guc->stalled_request = rq;
+	if (is_multi_lrc_rq(rq)) {
+		if (multi_lrc_submit(rq)) {
+			ret = guc_wq_item_append(guc, rq);
+			if (!ret)
+				ret = guc_add_request(guc, rq);
+		}
+	} else {
+		guc_set_lrc_tail(rq);
+		ret = guc_add_request(guc, rq);
+	}
 
 	if (unlikely(ret == -EPIPE))
 		disable_submission(guc);
@@ -1238,6 +1464,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
 	return ret;
 }
 
+static bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
+{
+	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	return submission_disabled(guc) || guc->stalled_request ||
+		!i915_sched_engine_is_empty(sched_engine) ||
+		!lrc_desc_registered(guc, ce->guc_id.id);
+}
+
 static void guc_submit_request(struct i915_request *rq)
 {
 	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
@@ -1247,8 +1483,7 @@ static void guc_submit_request(struct i915_request *rq)
 	/* Will be called from irq-context when using foreign fences. */
 	spin_lock_irqsave(&sched_engine->lock, flags);
 
-	if (submission_disabled(guc) || guc->stalled_request ||
-	    !i915_sched_engine_is_empty(sched_engine))
+	if (need_tasklet(guc, rq))
 		queue_request(sched_engine, rq, rq_prio(rq));
 	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
 		tasklet_hi_schedule(&sched_engine->tasklet);
@@ -2241,9 +2476,10 @@ static bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
 
 static void add_to_context(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
 
 	spin_lock(&ce->guc_state.lock);
@@ -2276,7 +2512,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
 
 static void remove_from_context(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	spin_lock_irq(&ce->guc_state.lock);
 
@@ -2692,7 +2930,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
 static void guc_bump_inflight_request_prio(struct i915_request *rq,
 					   int prio)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
 
 	/* Short circuit function */
@@ -2715,7 +2953,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 
 static void guc_retire_inflight_request_prio(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 
 	spin_lock(&ce->guc_state.lock);
 	guc_prio_fini(rq, ce);
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 177eaf55adff..8f0073e19079 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -139,6 +139,14 @@ enum {
 	 * the GPU. Here we track such boost requests on a per-request basis.
 	 */
 	I915_FENCE_FLAG_BOOST,
+
+	/*
+	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
+	 * parent-child relationship (parallel submission, multi-lrc) should
+	 * trigger a submission to the GuC rather than just moving the context
+	 * tail.
+	 */
+	I915_FENCE_FLAG_SUBMIT_PARALLEL,
 };
 
 /**
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 16/27] drm/i915/guc: Insert submit fences between requests in parent-child relationship
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

The GuC must receive requests in the order submitted for contexts in a
parent-child relationship to function correctly. To ensure this, insert
a submit fence between the current request and the last request
submitted for contexts in a parent-child relationship. This is
conceptually similar to a single timeline.
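
As a condensed sketch of this rule (illustrative only -- the helper
below is not the patch's code, and it omits the scheduler dependency
added alongside the fence; the real logic lives in
__i915_request_ensure_parallel_ordering() below):

	/*
	 * Chain each new parallel request behind the last one submitted
	 * on the same parent so the GuC sees them in submission order.
	 */
	static void sketch_parallel_order(struct i915_request *rq)
	{
		struct intel_context *parent = request_to_parent(rq);
		struct i915_request *prev = parent->last_rq;

		if (prev) {
			/* rq must not be submitted before prev */
			if (!__i915_request_is_complete(prev))
				i915_sw_fence_await_sw_fence(&rq->submit,
							     &prev->submit,
							     &rq->submitq);
			i915_request_put(prev);
		}

		/* rq becomes the new tail of the chain */
		parent->last_rq = i915_request_get(rq);
	}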

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.h       |   5 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   5 +-
 drivers/gpu/drm/i915/i915_request.c           | 120 ++++++++++++++----
 4 files changed, 109 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index c2985822ab74..9dcc1b14697b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -75,6 +75,11 @@ intel_context_to_parent(struct intel_context *ce)
         }
 }
 
+static inline bool intel_context_is_parallel(struct intel_context *ce)
+{
+	return intel_context_is_child(ce) || intel_context_is_parent(ce);
+}
+
 void intel_context_bind_parent_child(struct intel_context *parent,
 				     struct intel_context *child);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 6f567ebeb039..a63329520c35 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -246,6 +246,13 @@ struct intel_context {
 		 * work queue descriptor
 		 */
 		u8 parent_page;
+
+		/**
+		 * @last_rq: last request submitted on a parallel context, used
+		 * to insert submit fences between requests in the parallel
+		 * context.
+		 */
+		struct i915_request *last_rq;
 	};
 
 #ifdef CONFIG_DRM_I915_SELFTEST
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b107ad095248..f0b60fecf253 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -672,8 +672,7 @@ static int rq_prio(const struct i915_request *rq)
 
 static bool is_multi_lrc_rq(struct i915_request *rq)
 {
-	return intel_context_is_child(rq->context) ||
-		intel_context_is_parent(rq->context);
+	return intel_context_is_parallel(rq->context);
 }
 
 static bool can_merge_rq(struct i915_request *rq,
@@ -2843,6 +2842,8 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 	GEM_BUG_ON(!intel_context_is_parent(ce));
 	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
 
+	if (ce->last_rq)
+		i915_request_put(ce->last_rq);
 	unpin_guc_id(guc, ce);
 	lrc_unpin(ce);
 }
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index ce446716d092..2e51c8999088 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1546,36 +1546,62 @@ i915_request_await_object(struct i915_request *to,
 	return ret;
 }
 
+static inline bool is_parallel_rq(struct i915_request *rq)
+{
+	return intel_context_is_parallel(rq->context);
+}
+
+static inline struct intel_context *request_to_parent(struct i915_request *rq)
+{
+	return intel_context_to_parent(rq->context);
+}
+
 static struct i915_request *
-__i915_request_add_to_timeline(struct i915_request *rq)
+__i915_request_ensure_parallel_ordering(struct i915_request *rq,
+					struct intel_timeline *timeline)
 {
-	struct intel_timeline *timeline = i915_request_timeline(rq);
 	struct i915_request *prev;
 
-	/*
-	 * Dependency tracking and request ordering along the timeline
-	 * is special cased so that we can eliminate redundant ordering
-	 * operations while building the request (we know that the timeline
-	 * itself is ordered, and here we guarantee it).
-	 *
-	 * As we know we will need to emit tracking along the timeline,
-	 * we embed the hooks into our request struct -- at the cost of
-	 * having to have specialised no-allocation interfaces (which will
-	 * be beneficial elsewhere).
-	 *
-	 * A second benefit to open-coding i915_request_await_request is
-	 * that we can apply a slight variant of the rules specialised
-	 * for timelines that jump between engines (such as virtual engines).
-	 * If we consider the case of virtual engine, we must emit a dma-fence
-	 * to prevent scheduling of the second request until the first is
-	 * complete (to maximise our greedy late load balancing) and this
-	 * precludes optimising to use semaphores serialisation of a single
-	 * timeline across engines.
-	 */
+	GEM_BUG_ON(!is_parallel_rq(rq));
+
+	prev = request_to_parent(rq)->last_rq;
+	if (prev) {
+		if (!__i915_request_is_complete(prev)) {
+			i915_sw_fence_await_sw_fence(&rq->submit,
+						     &prev->submit,
+						     &rq->submitq);
+
+			if (rq->engine->sched_engine->schedule)
+				__i915_sched_node_add_dependency(&rq->sched,
+								 &prev->sched,
+								 &rq->dep,
+								 0);
+		}
+		i915_request_put(prev);
+	}
+
+	request_to_parent(rq)->last_rq = i915_request_get(rq);
+
+	return to_request(__i915_active_fence_set(&timeline->last_request,
+						  &rq->fence));
+}
+
+static struct i915_request *
+__i915_request_ensure_ordering(struct i915_request *rq,
+			       struct intel_timeline *timeline)
+{
+	struct i915_request *prev;
+
+	GEM_BUG_ON(is_parallel_rq(rq));
+
 	prev = to_request(__i915_active_fence_set(&timeline->last_request,
 						  &rq->fence));
+
 	if (prev && !__i915_request_is_complete(prev)) {
 		bool uses_guc = intel_engine_uses_guc(rq->engine);
+		bool pow2 = is_power_of_2(READ_ONCE(prev->engine)->mask |
+					  rq->engine->mask);
+		bool same_context = prev->context == rq->context;
 
 		/*
 		 * The requests are supposed to be kept in order. However,
@@ -1583,13 +1609,11 @@ __i915_request_add_to_timeline(struct i915_request *rq)
 		 * is used as a barrier for external modification to this
 		 * context.
 		 */
-		GEM_BUG_ON(prev->context == rq->context &&
+		GEM_BUG_ON(same_context &&
 			   i915_seqno_passed(prev->fence.seqno,
 					     rq->fence.seqno));
 
-		if ((!uses_guc &&
-		     is_power_of_2(READ_ONCE(prev->engine)->mask | rq->engine->mask)) ||
-		    (uses_guc && prev->context == rq->context))
+		if ((same_context && uses_guc) || (!uses_guc && pow2))
 			i915_sw_fence_await_sw_fence(&rq->submit,
 						     &prev->submit,
 						     &rq->submitq);
@@ -1604,6 +1628,50 @@ __i915_request_add_to_timeline(struct i915_request *rq)
 							 0);
 	}
 
+	return prev;
+}
+
+static struct i915_request *
+__i915_request_add_to_timeline(struct i915_request *rq)
+{
+	struct intel_timeline *timeline = i915_request_timeline(rq);
+	struct i915_request *prev;
+
+	/*
+	 * Dependency tracking and request ordering along the timeline
+	 * is special cased so that we can eliminate redundant ordering
+	 * operations while building the request (we know that the timeline
+	 * itself is ordered, and here we guarantee it).
+	 *
+	 * As we know we will need to emit tracking along the timeline,
+	 * we embed the hooks into our request struct -- at the cost of
+	 * having to have specialised no-allocation interfaces (which will
+	 * be beneficial elsewhere).
+	 *
+	 * A second benefit to open-coding i915_request_await_request is
+	 * that we can apply a slight variant of the rules specialised
+	 * for timelines that jump between engines (such as virtual engines).
+	 * If we consider the case of virtual engine, we must emit a dma-fence
+	 * to prevent scheduling of the second request until the first is
+	 * complete (to maximise our greedy late load balancing) and this
+	 * precludes optimising to use semaphores serialisation of a single
+	 * timeline across engines.
+	 *
+	 * We do not order parallel submission requests on the timeline as each
+	 * parallel submission context has its own timeline, and the ordering
+	 * rule for parallel requests is that they must be submitted in the
+	 * order received from the execbuf IOCTL. So rather than using the
+	 * timeline, we store a pointer to the last request submitted in the
+	 * relationship in the parent context and insert a submission fence
+	 * between that request and the request passed into this function.
+	 * Alternatively, we use the completion fence if the gem context has
+	 * a single timeline and this is the first submission of an execbuf IOCTL.
+	 */
+	if (likely(!is_parallel_rq(rq)))
+		prev = __i915_request_ensure_ordering(rq, timeline);
+	else
+		prev = __i915_request_ensure_parallel_ordering(rq, timeline);
+
 	/*
 	 * Make sure that no request gazumped us - if it was allocated after
 	 * our i915_request_alloc() and called __i915_request_add() before
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [Intel-gfx] [PATCH 16/27] drm/i915/guc: Insert submit fences between requests in parent-child relationship
@ 2021-08-20 22:44   ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

The GuC must receive requests in the order submitted for contexts in a
parent-child relationship to function correctly. To ensure this, insert
a submit fence between the current request and the last request
submitted for contexts in a parent-child relationship. This is
conceptually similar to a single timeline.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.h       |   5 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   5 +-
 drivers/gpu/drm/i915/i915_request.c           | 120 ++++++++++++++----
 4 files changed, 109 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index c2985822ab74..9dcc1b14697b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -75,6 +75,11 @@ intel_context_to_parent(struct intel_context *ce)
         }
 }
 
+static inline bool intel_context_is_parallel(struct intel_context *ce)
+{
+	return intel_context_is_child(ce) || intel_context_is_parent(ce);
+}
+
 void intel_context_bind_parent_child(struct intel_context *parent,
 				     struct intel_context *child);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 6f567ebeb039..a63329520c35 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -246,6 +246,13 @@ struct intel_context {
 		 * work queue descriptor
 		 */
 		u8 parent_page;
+
+		/**
+		 * @last_rq: last request submitted on a parallel context, used
+		 * to insert submit fences between requests in the parallel
+		 * context.
+		 */
+		struct i915_request *last_rq;
 	};
 
 #ifdef CONFIG_DRM_I915_SELFTEST
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b107ad095248..f0b60fecf253 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -672,8 +672,7 @@ static int rq_prio(const struct i915_request *rq)
 
 static bool is_multi_lrc_rq(struct i915_request *rq)
 {
-	return intel_context_is_child(rq->context) ||
-		intel_context_is_parent(rq->context);
+	return intel_context_is_parallel(rq->context);
 }
 
 static bool can_merge_rq(struct i915_request *rq,
@@ -2843,6 +2842,8 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 	GEM_BUG_ON(!intel_context_is_parent(ce));
 	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
 
+	if (ce->last_rq)
+		i915_request_put(ce->last_rq);
 	unpin_guc_id(guc, ce);
 	lrc_unpin(ce);
 }
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index ce446716d092..2e51c8999088 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1546,36 +1546,62 @@ i915_request_await_object(struct i915_request *to,
 	return ret;
 }
 
+static inline bool is_parallel_rq(struct i915_request *rq)
+{
+	return intel_context_is_parallel(rq->context);
+}
+
+static inline struct intel_context *request_to_parent(struct i915_request *rq)
+{
+	return intel_context_to_parent(rq->context);
+}
+
 static struct i915_request *
-__i915_request_add_to_timeline(struct i915_request *rq)
+__i915_request_ensure_parallel_ordering(struct i915_request *rq,
+					struct intel_timeline *timeline)
 {
-	struct intel_timeline *timeline = i915_request_timeline(rq);
 	struct i915_request *prev;
 
-	/*
-	 * Dependency tracking and request ordering along the timeline
-	 * is special cased so that we can eliminate redundant ordering
-	 * operations while building the request (we know that the timeline
-	 * itself is ordered, and here we guarantee it).
-	 *
-	 * As we know we will need to emit tracking along the timeline,
-	 * we embed the hooks into our request struct -- at the cost of
-	 * having to have specialised no-allocation interfaces (which will
-	 * be beneficial elsewhere).
-	 *
-	 * A second benefit to open-coding i915_request_await_request is
-	 * that we can apply a slight variant of the rules specialised
-	 * for timelines that jump between engines (such as virtual engines).
-	 * If we consider the case of virtual engine, we must emit a dma-fence
-	 * to prevent scheduling of the second request until the first is
-	 * complete (to maximise our greedy late load balancing) and this
-	 * precludes optimising to use semaphores serialisation of a single
-	 * timeline across engines.
-	 */
+	GEM_BUG_ON(!is_parallel_rq(rq));
+
+	prev = request_to_parent(rq)->last_rq;
+	if (prev) {
+		if (!__i915_request_is_complete(prev)) {
+			i915_sw_fence_await_sw_fence(&rq->submit,
+						     &prev->submit,
+						     &rq->submitq);
+
+			if (rq->engine->sched_engine->schedule)
+				__i915_sched_node_add_dependency(&rq->sched,
+								 &prev->sched,
+								 &rq->dep,
+								 0);
+		}
+		i915_request_put(prev);
+	}
+
+	request_to_parent(rq)->last_rq = i915_request_get(rq);
+
+	return to_request(__i915_active_fence_set(&timeline->last_request,
+						  &rq->fence));
+}
+
+static struct i915_request *
+__i915_request_ensure_ordering(struct i915_request *rq,
+			       struct intel_timeline *timeline)
+{
+	struct i915_request *prev;
+
+	GEM_BUG_ON(is_parallel_rq(rq));
+
 	prev = to_request(__i915_active_fence_set(&timeline->last_request,
 						  &rq->fence));
+
 	if (prev && !__i915_request_is_complete(prev)) {
 		bool uses_guc = intel_engine_uses_guc(rq->engine);
+		bool pow2 = is_power_of_2(READ_ONCE(prev->engine)->mask |
+					  rq->engine->mask);
+		bool same_context = prev->context == rq->context;
 
 		/*
 		 * The requests are supposed to be kept in order. However,
@@ -1583,13 +1609,11 @@ __i915_request_add_to_timeline(struct i915_request *rq)
 		 * is used as a barrier for external modification to this
 		 * context.
 		 */
-		GEM_BUG_ON(prev->context == rq->context &&
+		GEM_BUG_ON(same_context &&
 			   i915_seqno_passed(prev->fence.seqno,
 					     rq->fence.seqno));
 
-		if ((!uses_guc &&
-		     is_power_of_2(READ_ONCE(prev->engine)->mask | rq->engine->mask)) ||
-		    (uses_guc && prev->context == rq->context))
+		if ((same_context && uses_guc) || (!uses_guc && pow2))
 			i915_sw_fence_await_sw_fence(&rq->submit,
 						     &prev->submit,
 						     &rq->submitq);
@@ -1604,6 +1628,50 @@ __i915_request_add_to_timeline(struct i915_request *rq)
 							 0);
 	}
 
+	return prev;
+}
+
+static struct i915_request *
+__i915_request_add_to_timeline(struct i915_request *rq)
+{
+	struct intel_timeline *timeline = i915_request_timeline(rq);
+	struct i915_request *prev;
+
+	/*
+	 * Dependency tracking and request ordering along the timeline
+	 * is special cased so that we can eliminate redundant ordering
+	 * operations while building the request (we know that the timeline
+	 * itself is ordered, and here we guarantee it).
+	 *
+	 * As we know we will need to emit tracking along the timeline,
+	 * we embed the hooks into our request struct -- at the cost of
+	 * having to have specialised no-allocation interfaces (which will
+	 * be beneficial elsewhere).
+	 *
+	 * A second benefit to open-coding i915_request_await_request is
+	 * that we can apply a slight variant of the rules specialised
+	 * for timelines that jump between engines (such as virtual engines).
+	 * If we consider the case of virtual engine, we must emit a dma-fence
+	 * to prevent scheduling of the second request until the first is
+	 * complete (to maximise our greedy late load balancing) and this
+	 * precludes optimising to use semaphores serialisation of a single
+	 * timeline across engines.
+	 *
+	 * We do not order parallel submission requests on the timeline as each
+	 * parallel submission context has its own timeline, and the ordering
+	 * rule for parallel requests is that they must be submitted in the
+	 * order received from the execbuf IOCTL. So rather than using the
+	 * timeline, we store a pointer to the last request submitted in the
+	 * relationship in the parent context and insert a submission fence
+	 * between that request and the request passed into this function.
+	 * Alternatively, we use the completion fence if the gem context has
+	 * a single timeline and this is the first submission of an execbuf IOCTL.
+	 */
+	if (likely(!is_parallel_rq(rq)))
+		prev = __i915_request_ensure_ordering(rq, timeline);
+	else
+		prev = __i915_request_ensure_parallel_ordering(rq, timeline);
+
 	/*
 	 * Make sure that no request gazumped us - if it was allocated after
 	 * our i915_request_alloc() and called __i915_request_add() before
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 17/27] drm/i915/guc: Implement multi-lrc reset
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Update context and full GPU reset to work with multi-lrc. The idea is
that the parent context tracks all the active requests in flight for
itself and its children. The parent context owns the reset, replaying /
canceling requests as needed.
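
A simplified sketch of the reset walk this patch adds (illustrative
only -- reset_one_context() is a made-up stand-in for the inline
per-context replay / skip logic in __guc_reset_context() below):

	static void sketch_reset_walk(struct intel_context *parent,
				      bool stalled)
	{
		struct intel_context *ce = parent;
		int i, number_children = parent->guc_number_children;

		/* visit the parent, then each child in turn */
		for (i = 0; i < number_children + 1; ++i) {
			if (intel_context_is_pinned(ce))
				reset_one_context(ce, stalled);
			if (i != number_children)
				ce = list_next_entry(ce, guc_child_link);
		}

		/* the parent tracks requests for the whole group */
		__unwind_incomplete_requests(parent);
	}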

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 11 ++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 63 +++++++++++++------
 2 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 00d1aee6d199..5615be32879c 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -528,20 +528,21 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
 
 struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 {
+	struct intel_context *parent = intel_context_to_parent(ce);
 	struct i915_request *rq, *active = NULL;
 	unsigned long flags;
 
 	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
 
-	spin_lock_irqsave(&ce->guc_state.lock, flags);
-	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
+	spin_lock_irqsave(&parent->guc_state.lock, flags);
+	list_for_each_entry_reverse(rq, &parent->guc_state.requests,
 				    sched.link) {
-		if (i915_request_completed(rq))
+		if (i915_request_completed(rq) && rq->context == ce)
 			break;
 
-		active = rq;
+		active = (rq->context == ce) ? rq : active;
 	}
-	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
 	return active;
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f0b60fecf253..e34e0ea9136a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -670,6 +670,11 @@ static int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
+static inline bool is_multi_lrc(struct intel_context *ce)
+{
+	return intel_context_is_parallel(ce);
+}
+
 static bool is_multi_lrc_rq(struct i915_request *rq)
 {
 	return intel_context_is_parallel(rq->context);
@@ -1179,10 +1184,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
+	bool local_stalled;
 	struct i915_request *rq;
 	unsigned long flags;
 	u32 head;
+	int i, number_children = ce->guc_number_children;
 	bool skip = false;
+	struct intel_context *parent = ce;
 
 	intel_context_get(ce);
 
@@ -1209,25 +1217,34 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
 	if (unlikely(skip))
 		goto out_put;
 
-	rq = intel_context_find_active_request(ce);
-	if (!rq) {
-		head = ce->ring->tail;
-		stalled = false;
-		goto out_replay;
-	}
+	for (i = 0; i < number_children + 1; ++i) {
+		if (!intel_context_is_pinned(ce))
+			goto next_context;
+
+		local_stalled = false;
+		rq = intel_context_find_active_request(ce);
+		if (!rq) {
+			head = ce->ring->tail;
+			goto out_replay;
+		}
 
-	if (!i915_request_started(rq))
-		stalled = false;
+		GEM_BUG_ON(i915_active_is_idle(&ce->active));
+		head = intel_ring_wrap(ce->ring, rq->head);
 
-	GEM_BUG_ON(i915_active_is_idle(&ce->active));
-	head = intel_ring_wrap(ce->ring, rq->head);
-	__i915_request_reset(rq, stalled);
+		if (i915_request_started(rq))
+			local_stalled = true;
 
+		__i915_request_reset(rq, local_stalled && stalled);
 out_replay:
-	guc_reset_state(ce, head, stalled);
-	__unwind_incomplete_requests(ce);
+		guc_reset_state(ce, head, local_stalled && stalled);
+next_context:
+		if (i != number_children)
+			ce = list_next_entry(ce, guc_child_link);
+	}
+
+	__unwind_incomplete_requests(parent);
 out_put:
-	intel_context_put(ce);
+	intel_context_put(parent);
 }
 
 void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
@@ -1248,7 +1265,8 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
 
 		xa_unlock(&guc->context_lookup);
 
-		if (intel_context_is_pinned(ce))
+		if (intel_context_is_pinned(ce) &&
+		    !intel_context_is_child(ce))
 			__guc_reset_context(ce, stalled);
 
 		intel_context_put(ce);
@@ -1340,7 +1358,8 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 
 		xa_unlock(&guc->context_lookup);
 
-		if (intel_context_is_pinned(ce))
+		if (intel_context_is_pinned(ce) &&
+		    !intel_context_is_child(ce))
 			guc_cancel_context_requests(ce);
 
 		intel_context_put(ce);
@@ -2031,6 +2050,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 	u16 guc_id;
 	bool enabled;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	incr_context_blocked(ce);
@@ -2068,6 +2089,7 @@ static void guc_context_unblock(struct intel_context *ce)
 	bool enable;
 
 	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
@@ -2099,11 +2121,14 @@ static void guc_context_unblock(struct intel_context *ce)
 static void guc_context_cancel_request(struct intel_context *ce,
 				       struct i915_request *rq)
 {
+	struct intel_context *block_context =
+		request_to_scheduling_context(rq);
+
 	if (i915_sw_fence_signaled(&rq->submit)) {
 		struct i915_sw_fence *fence;
 
 		intel_context_get(ce);
-		fence = guc_context_block(ce);
+		fence = guc_context_block(block_context);
 		i915_sw_fence_wait(fence);
 		if (!i915_request_completed(rq)) {
 			__i915_request_skip(rq);
@@ -2117,7 +2142,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
 		 */
 		flush_work(&ce_to_guc(ce)->ct.requests.worker);
 
-		guc_context_unblock(ce);
+		guc_context_unblock(block_context);
 		intel_context_put(ce);
 	}
 }
@@ -2143,6 +2168,8 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
 	intel_wakeref_t wakeref;
 	unsigned long flags;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	guc_flush_submissions(guc);
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [Intel-gfx] [PATCH 17/27] drm/i915/guc: Implement multi-lrc reset
@ 2021-08-20 22:44   ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Update context and full GPU reset to work with multi-lrc. The idea is
that the parent context tracks all the active requests in flight for
itself and its children. The parent context owns the reset, replaying /
canceling requests as needed.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 11 ++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 63 +++++++++++++------
 2 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 00d1aee6d199..5615be32879c 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -528,20 +528,21 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
 
 struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 {
+	struct intel_context *parent = intel_context_to_parent(ce);
 	struct i915_request *rq, *active = NULL;
 	unsigned long flags;
 
 	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
 
-	spin_lock_irqsave(&ce->guc_state.lock, flags);
-	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
+	spin_lock_irqsave(&parent->guc_state.lock, flags);
+	list_for_each_entry_reverse(rq, &parent->guc_state.requests,
 				    sched.link) {
-		if (i915_request_completed(rq))
+		if (i915_request_completed(rq) && rq->context == ce)
 			break;
 
-		active = rq;
+		active = (rq->context == ce) ? rq : active;
 	}
-	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
 	return active;
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f0b60fecf253..e34e0ea9136a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -670,6 +670,11 @@ static int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
+static inline bool is_multi_lrc(struct intel_context *ce)
+{
+	return intel_context_is_parallel(ce);
+}
+
 static bool is_multi_lrc_rq(struct i915_request *rq)
 {
 	return intel_context_is_parallel(rq->context);
@@ -1179,10 +1184,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
+	bool local_stalled;
 	struct i915_request *rq;
 	unsigned long flags;
 	u32 head;
+	int i, number_children = ce->guc_number_children;
 	bool skip = false;
+	struct intel_context *parent = ce;
 
 	intel_context_get(ce);
 
@@ -1209,25 +1217,34 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
 	if (unlikely(skip))
 		goto out_put;
 
-	rq = intel_context_find_active_request(ce);
-	if (!rq) {
-		head = ce->ring->tail;
-		stalled = false;
-		goto out_replay;
-	}
+	for (i = 0; i < number_children + 1; ++i) {
+		if (!intel_context_is_pinned(ce))
+			goto next_context;
+
+		local_stalled = false;
+		rq = intel_context_find_active_request(ce);
+		if (!rq) {
+			head = ce->ring->tail;
+			goto out_replay;
+		}
 
-	if (!i915_request_started(rq))
-		stalled = false;
+		GEM_BUG_ON(i915_active_is_idle(&ce->active));
+		head = intel_ring_wrap(ce->ring, rq->head);
 
-	GEM_BUG_ON(i915_active_is_idle(&ce->active));
-	head = intel_ring_wrap(ce->ring, rq->head);
-	__i915_request_reset(rq, stalled);
+		if (i915_request_started(rq))
+			local_stalled = true;
 
+		__i915_request_reset(rq, local_stalled && stalled);
 out_replay:
-	guc_reset_state(ce, head, stalled);
-	__unwind_incomplete_requests(ce);
+		guc_reset_state(ce, head, local_stalled && stalled);
+next_context:
+		if (i != number_children)
+			ce = list_next_entry(ce, guc_child_link);
+	}
+
+	__unwind_incomplete_requests(parent);
 out_put:
-	intel_context_put(ce);
+	intel_context_put(parent);
 }
 
 void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
@@ -1248,7 +1265,8 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
 
 		xa_unlock(&guc->context_lookup);
 
-		if (intel_context_is_pinned(ce))
+		if (intel_context_is_pinned(ce) &&
+		    !intel_context_is_child(ce))
 			__guc_reset_context(ce, stalled);
 
 		intel_context_put(ce);
@@ -1340,7 +1358,8 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 
 		xa_unlock(&guc->context_lookup);
 
-		if (intel_context_is_pinned(ce))
+		if (intel_context_is_pinned(ce) &&
+		    !intel_context_is_child(ce))
 			guc_cancel_context_requests(ce);
 
 		intel_context_put(ce);
@@ -2031,6 +2050,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 	u16 guc_id;
 	bool enabled;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	incr_context_blocked(ce);
@@ -2068,6 +2089,7 @@ static void guc_context_unblock(struct intel_context *ce)
 	bool enable;
 
 	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
@@ -2099,11 +2121,14 @@ static void guc_context_unblock(struct intel_context *ce)
 static void guc_context_cancel_request(struct intel_context *ce,
 				       struct i915_request *rq)
 {
+	struct intel_context *block_context =
+		request_to_scheduling_context(rq);
+
 	if (i915_sw_fence_signaled(&rq->submit)) {
 		struct i915_sw_fence *fence;
 
 		intel_context_get(ce);
-		fence = guc_context_block(ce);
+		fence = guc_context_block(block_context);
 		i915_sw_fence_wait(fence);
 		if (!i915_request_completed(rq)) {
 			__i915_request_skip(rq);
@@ -2117,7 +2142,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
 		 */
 		flush_work(&ce_to_guc(ce)->ct.requests.worker);
 
-		guc_context_unblock(ce);
+		guc_context_unblock(block_context);
 		intel_context_put(ce);
 	}
 }
@@ -2143,6 +2168,8 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
 	intel_wakeref_t wakeref;
 	unsigned long flags;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	guc_flush_submissions(guc);
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [Intel-gfx] [PATCH 18/27] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Display the workqueue status in debugfs for GuC contexts that are in a
parent-child relationship.
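
With this change, a parent context and its children appear together in
the dump, with the parent also carrying the work queue state. Roughly
like the following -- all values are invented for illustration and the
per-context fields are abbreviated:

	GuC lrc descriptor 1:
		HW Context Desc: 0x0000b000
		...
			WQI Head: 0
			WQI Tail: 64
			WQI Status: 1

	GuC lrc descriptor 2:
		HW Context Desc: 0x0000c000
		...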

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 51 ++++++++++++++-----
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e34e0ea9136a..07eee9a399c8 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3673,6 +3673,26 @@ static void guc_log_context_priority(struct drm_printer *p,
 	drm_printf(p, "\n");
 }
 
+
+static inline void guc_log_context(struct drm_printer *p,
+				   struct intel_context *ce)
+{
+	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
+	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
+	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
+		   ce->ring->head,
+		   ce->lrc_reg_state[CTX_RING_HEAD]);
+	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
+		   ce->ring->tail,
+		   ce->lrc_reg_state[CTX_RING_TAIL]);
+	drm_printf(p, "\t\tContext Pin Count: %u\n",
+		   atomic_read(&ce->pin_count));
+	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
+		   atomic_read(&ce->guc_id.ref));
+	drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
+		   ce->guc_state.sched_state);
+}
+
 void intel_guc_submission_print_context_info(struct intel_guc *guc,
 					     struct drm_printer *p)
 {
@@ -3682,22 +3702,25 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 
 	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
-		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
-		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
-		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
-			   ce->ring->head,
-			   ce->lrc_reg_state[CTX_RING_HEAD]);
-		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
-			   ce->ring->tail,
-			   ce->lrc_reg_state[CTX_RING_TAIL]);
-		drm_printf(p, "\t\tContext Pin Count: %u\n",
-			   atomic_read(&ce->pin_count));
-		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
-			   atomic_read(&ce->guc_id.ref));
-		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
-			   ce->guc_state.sched_state);
+		GEM_BUG_ON(intel_context_is_child(ce));
 
+		guc_log_context(p, ce);
 		guc_log_context_priority(p, ce);
+
+		if (intel_context_is_parent(ce)) {
+			struct guc_process_desc *desc = __get_process_desc(ce);
+			struct intel_context *child;
+
+			drm_printf(p, "\t\tWQI Head: %u\n",
+				   READ_ONCE(desc->head));
+			drm_printf(p, "\t\tWQI Tail: %u\n",
+				   READ_ONCE(desc->tail));
+			drm_printf(p, "\t\tWQI Status: %u\n\n",
+				   READ_ONCE(desc->wq_status));
+
+			for_each_child(ce, child)
+				guc_log_context(p, child);
+		}
 	}
 	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 19/27] drm/i915: Fix bug in user proto-context creation that leaked contexts
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Set the number of engines before attempting to create contexts so that
free_engines() can clean up properly. Previously, e->num_engines was only
set once every context had been created, so an error partway through the
loop left the count at zero and leaked the contexts created so far.
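
For reference, the cleanup helper this relies on (outside this diff)
walks exactly e->num_engines entries, roughly:

	/* Abridged sketch of the existing helper, not part of this diff. */
	static void __free_engines(struct i915_gem_engines *e, unsigned int count)
	{
		while (count--) {	/* with count == 0, nothing is released */
			if (!e->engines[count])
				continue;

			intel_context_put(e->engines[count]);
		}
		kfree(e);
	}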

Fixes: d4433c7600f7 ("drm/i915/gem: Use the proto-context to handle create parameters (v5)")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dbaeb924a437..bcaaf514876b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -944,6 +944,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 	unsigned int n;
 
 	e = alloc_engines(num_engines);
+	e->num_engines = num_engines;
 	for (n = 0; n < num_engines; n++) {
 		struct intel_context *ce;
 		int ret;
@@ -977,7 +978,6 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 			goto free_engines;
 		}
 	}
-	e->num_engines = num_engines;
 
 	return e;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [Intel-gfx] [PATCH 20/27] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Introduce a 'set parallel submit' extension to connect the uAPI to the
GuC multi-lrc interface. The kernel doc in the new uAPI explains it all.
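
A hypothetical userspace sketch of configuring engine slot 0 as a
two-wide parallel engine (width=2, num_siblings=1, matching Example 1 in
the new kernel doc). 'fd', 'ctx_id', the render engine instances, and the
usual includes (i915_drm.h, sys/ioctl.h, stdint.h) are assumptions for
illustration; also note the extension still returns -ENODEV in this patch
until it is enabled at the end of the series:

	/* Parallel slot: 2 contexts wide, 1 sibling each -> engines[2]. */
	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(p, 2) = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.engine_index = 0,
		.width = 2,
		.num_siblings = 1,
		.engines = {
			{ .engine_class = I915_ENGINE_CLASS_RENDER,
			  .engine_instance = 0 },
			{ .engine_class = I915_ENGINE_CLASS_RENDER,
			  .engine_instance = 1 },
		},
	};
	/* Slot 0 starts as an INVALID placeholder the extension fills in. */
	I915_DEFINE_CONTEXT_PARAM_ENGINES(e, 1) = {
		.extensions = (uintptr_t)&p,
		.engines = { { I915_ENGINE_CLASS_INVALID,
			       I915_ENGINE_CLASS_INVALID_NONE } },
	};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_ENGINES,
		.size = sizeof(e),
		.value = (uintptr_t)&e,
	};

	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);

Execbufs on slot 0 then implicitly carry width (2) BBs each, per the
kernel doc below.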

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: link to come

v2:
 (Daniel Vetter)
  - Add IGT link and placeholder for media UMD link

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 220 +++++++++++++++++-
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +-
 drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
 .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 ++++++++-
 include/uapi/drm/i915_drm.h                   | 128 ++++++++++
 9 files changed, 485 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index bcaaf514876b..de0fd145fb47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -522,9 +522,149 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
 	return 0;
 }
 
+static int
+set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
+				      void *data)
+{
+	struct i915_context_engines_parallel_submit __user *ext =
+		container_of_user(base, typeof(*ext), base);
+	const struct set_proto_ctx_engines *set = data;
+	struct drm_i915_private *i915 = set->i915;
+	u64 flags;
+	int err = 0, n, i, j;
+	u16 slot, width, num_siblings;
+	struct intel_engine_cs **siblings = NULL;
+	intel_engine_mask_t prev_mask;
+
+	/* Disabling for now */
+	return -ENODEV;
+
+	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
+		return -ENODEV;
+
+	if (get_user(slot, &ext->engine_index))
+		return -EFAULT;
+
+	if (get_user(width, &ext->width))
+		return -EFAULT;
+
+	if (get_user(num_siblings, &ext->num_siblings))
+		return -EFAULT;
+
+	if (slot >= set->num_engines) {
+		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
+			slot, set->num_engines);
+		return -EINVAL;
+	}
+
+	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
+		drm_dbg(&i915->drm,
+			"Invalid placement[%d], already occupied\n", slot);
+		return -EINVAL;
+	}
+
+	if (get_user(flags, &ext->flags))
+		return -EFAULT;
+
+	if (flags) {
+		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
+		return -EINVAL;
+	}
+
+	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
+		err = check_user_mbz(&ext->mbz64[n]);
+		if (err)
+			return err;
+	}
+
+	if (width < 2) {
+		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
+		return -EINVAL;
+	}
+
+	if (num_siblings < 1) {
+		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
+			num_siblings);
+		return -EINVAL;
+	}
+
+	siblings = kmalloc_array(num_siblings * width,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return -ENOMEM;
+
+	/* Create contexts / engines */
+	for (i = 0; i < width; ++i) {
+		intel_engine_mask_t current_mask = 0;
+		struct i915_engine_class_instance prev_engine;
+
+		for (j = 0; j < num_siblings; ++j) {
+			struct i915_engine_class_instance ci;
+
+			n = i * num_siblings + j;
+			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
+				err = -EFAULT;
+				goto out_err;
+			}
+
+			siblings[n] =
+				intel_engine_lookup_user(i915, ci.engine_class,
+							 ci.engine_instance);
+			if (!siblings[n]) {
+				drm_dbg(&i915->drm,
+					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
+					n, ci.engine_class, ci.engine_instance);
+				err = -EINVAL;
+				goto out_err;
+			}
+
+			if (n) {
+				if (prev_engine.engine_class !=
+				    ci.engine_class) {
+					drm_dbg(&i915->drm,
+						"Mismatched class %d, %d\n",
+						prev_engine.engine_class,
+						ci.engine_class);
+					err = -EINVAL;
+					goto out_err;
+				}
+			}
+
+			prev_engine = ci;
+			current_mask |= siblings[n]->logical_mask;
+		}
+
+		if (i > 0) {
+			if (current_mask != prev_mask << 1) {
+				drm_dbg(&i915->drm,
+					"Non contiguous logical mask 0x%x, 0x%x\n",
+					prev_mask, current_mask);
+				err = -EINVAL;
+				goto out_err;
+			}
+		}
+		prev_mask = current_mask;
+	}
+
+	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
+	set->engines[slot].num_siblings = num_siblings;
+	set->engines[slot].width = width;
+	set->engines[slot].siblings = siblings;
+
+	return 0;
+
+out_err:
+	kfree(siblings);
+
+	return err;
+}
+
 static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
 	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
 	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
+	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
+		set_proto_ctx_engines_parallel_submit,
 };
 
 static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
@@ -821,6 +961,25 @@ static int intel_context_set_gem(struct intel_context *ce,
 	return ret;
 }
 
+static void __unpin_engines(struct i915_gem_engines *e, unsigned int count)
+{
+	while (count--) {
+		struct intel_context *ce = e->engines[count], *child;
+
+		if (!ce || !test_bit(CONTEXT_PERMA_PIN, &ce->flags))
+			continue;
+
+		for_each_child(ce, child)
+			intel_context_unpin(child);
+		intel_context_unpin(ce);
+	}
+}
+
+static void unpin_engines(struct i915_gem_engines *e)
+{
+	__unpin_engines(e, e->num_engines);
+}
+
 static void __free_engines(struct i915_gem_engines *e, unsigned int count)
 {
 	while (count--) {
@@ -936,6 +1095,40 @@ static struct i915_gem_engines *default_engines(struct i915_gem_context *ctx,
 	return err;
 }
 
+static int perma_pin_contexts(struct intel_context *ce)
+{
+	struct intel_context *child;
+	int i = 0, j = 0, ret;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	ret = intel_context_pin(ce);
+	if (unlikely(ret))
+		return ret;
+
+	for_each_child(ce, child) {
+		ret = intel_context_pin(child);
+		if (unlikely(ret))
+			goto unwind;
+		++i;
+	}
+
+	set_bit(CONTEXT_PERMA_PIN, &ce->flags);
+
+	return 0;
+
+unwind:
+	intel_context_unpin(ce);
+	for_each_child(ce, child) {
+		if (j++ < i)
+			intel_context_unpin(child);
+		else
+			break;
+	}
+
+	return ret;
+}
+
 static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 					     unsigned int num_engines,
 					     struct i915_gem_proto_engine *pe)
@@ -946,7 +1139,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 	e = alloc_engines(num_engines);
 	e->num_engines = num_engines;
 	for (n = 0; n < num_engines; n++) {
-		struct intel_context *ce;
+		struct intel_context *ce, *child;
 		int ret;
 
 		switch (pe[n].type) {
@@ -956,7 +1149,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 
 		case I915_GEM_ENGINE_TYPE_BALANCED:
 			ce = intel_engine_create_virtual(pe[n].siblings,
-							 pe[n].num_siblings);
+							 pe[n].num_siblings, 0);
+			break;
+
+		case I915_GEM_ENGINE_TYPE_PARALLEL:
+			ce = intel_engine_create_parallel(pe[n].siblings,
+							  pe[n].num_siblings,
+							  pe[n].width);
 			break;
 
 		case I915_GEM_ENGINE_TYPE_INVALID:
@@ -977,6 +1176,22 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 			err = ERR_PTR(ret);
 			goto free_engines;
 		}
+		for_each_child(ce, child) {
+			ret = intel_context_set_gem(child, ctx, pe->sseu);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto free_engines;
+			}
+		}
+
+		/* XXX: Must be done after setting gem context */
+		if (pe[n].type == I915_GEM_ENGINE_TYPE_PARALLEL) {
+			ret = perma_pin_contexts(ce);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto free_engines;
+			}
+		}
 	}
 
 	return e;
@@ -1200,6 +1415,7 @@ static void context_close(struct i915_gem_context *ctx)
 
 	/* Flush any concurrent set_engines() */
 	mutex_lock(&ctx->engines_mutex);
+	unpin_engines(ctx->engines);
 	engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
 	i915_gem_context_set_closed(ctx);
 	mutex_unlock(&ctx->engines_mutex);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 94c03a97cb77..7b096d83bca1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -78,6 +78,9 @@ enum i915_gem_engine_type {
 
 	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
 	I915_GEM_ENGINE_TYPE_BALANCED,
+
+	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
+	I915_GEM_ENGINE_TYPE_PARALLEL,
 };
 
 /**
@@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
 	/** @num_siblings: Number of balanced siblings */
 	unsigned int num_siblings;
 
+	/** @width: Width of each sibling */
+	unsigned int width;
+
 	/** @siblings: Balanced siblings */
 	struct intel_engine_cs **siblings;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index a63329520c35..713d85b0b364 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -55,9 +55,13 @@ struct intel_context_ops {
 	void (*reset)(struct intel_context *ce);
 	void (*destroy)(struct kref *kref);
 
-	/* virtual engine/context interface */
+	/* virtual/parallel engine/context interface */
 	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
-						unsigned int count);
+						unsigned int count,
+						unsigned long flags);
+	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
+						 unsigned int num_siblings,
+						 unsigned int width);
 	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
 					       unsigned int sibling);
 };
@@ -113,6 +117,7 @@ struct intel_context {
 #define CONTEXT_NOPREEMPT		8
 #define CONTEXT_LRCA_DIRTY		9
 #define CONTEXT_GUC_INIT		10
+#define CONTEXT_PERMA_PIN		11
 
 	struct {
 		u64 timeout_us;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 87579affb952..43f16a8347ee 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
 	return intel_engine_has_preemption(engine);
 }
 
+#define FORCE_VIRTUAL	BIT(0)
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count);
+			    unsigned int count, unsigned long flags);
+
+static inline struct intel_context *
+intel_engine_create_parallel(struct intel_engine_cs **engines,
+			     unsigned int num_engines,
+			     unsigned int width)
+{
+	GEM_BUG_ON(!engines[0]->cops->create_parallel);
+	return engines[0]->cops->create_parallel(engines, num_engines, width);
+}
 
 static inline bool
 intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 4d790f9a65dd..f66c75c77584 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1923,16 +1923,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count)
+			    unsigned int count, unsigned long flags)
 {
 	if (count == 0)
 		return ERR_PTR(-EINVAL);
 
-	if (count == 1)
+	if (count == 1 && !(flags & FORCE_VIRTUAL))
 		return intel_context_create(siblings[0]);
 
 	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
-	return siblings[0]->cops->create_virtual(siblings, count);
+	return siblings[0]->cops->create_virtual(siblings, count, flags);
 }
 
 struct i915_request *
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 813a6de01382..d1e2d6f8ff81 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags);
 
 static struct i915_request *
 __active_request(const struct intel_timeline * const tl,
@@ -3782,7 +3783,8 @@ static void virtual_submit_request(struct i915_request *rq)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags)
 {
 	struct virtual_engine *ve;
 	unsigned int n;
diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index f12ffe797639..e876a9d88a5c 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
 	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
 
 	for (n = 0; n < nctx; n++) {
-		ve[n] = intel_engine_create_virtual(siblings, nsibling);
+		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ve[n])) {
 			err = PTR_ERR(ve[n]);
 			nctx = n;
@@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
 	 * restrict it to our desired engine within the virtual engine.
 	 */
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_close;
@@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
 		i915_request_add(rq);
 	}
 
-	ce = intel_engine_create_virtual(siblings, nsibling);
+	ce = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ce)) {
 		err = PTR_ERR(ce);
 		goto out;
@@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
 
 	/* XXX We do not handle oversubscription and fairness with normal rq */
 	for (n = 0; n < nsibling; n++) {
-		ce = intel_engine_create_virtual(siblings, nsibling);
+		ce = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ce)) {
 			err = PTR_ERR(ce);
 			goto out;
@@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
 	if (err)
 		goto out_scratch;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_scratch;
@@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
 	if (igt_spinner_init(&spin, gt))
 		return -ENOMEM;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_spin;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 07eee9a399c8..2554d0eb4afd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -121,7 +121,13 @@ struct guc_virtual_engine {
 };
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags);
+
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
@@ -2581,6 +2587,7 @@ static const struct intel_context_ops guc_context_ops = {
 	.destroy = guc_context_destroy,
 
 	.create_virtual = guc_create_virtual,
+	.create_parallel = guc_create_parallel,
 };
 
 static void submit_work_cb(struct irq_work *wrk)
@@ -2827,8 +2834,6 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 	.get_sibling = guc_virtual_get_sibling,
 };
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
@@ -2845,8 +2850,6 @@ static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
 	return __guc_context_pin(ce, engine, vaddr);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
@@ -2858,8 +2861,6 @@ static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
 	return __guc_context_pin(ce, engine, vaddr);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_parent_context_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
@@ -2875,8 +2876,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 	lrc_unpin(ce);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_child_context_unpin(struct intel_context *ce)
 {
 	GEM_BUG_ON(context_enabled(ce));
@@ -2887,8 +2886,6 @@ static void guc_child_context_unpin(struct intel_context *ce)
 	lrc_unpin(ce);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_child_context_post_unpin(struct intel_context *ce)
 {
 	GEM_BUG_ON(!intel_context_is_child(ce));
@@ -2899,6 +2896,98 @@ static void guc_child_context_post_unpin(struct intel_context *ce)
 	intel_context_unpin(ce->parent);
 }
 
+static void guc_child_context_destroy(struct kref *kref)
+{
+	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
+
+	__guc_context_destroy(ce);
+}
+
+static const struct intel_context_ops virtual_parent_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.pre_pin = guc_context_pre_pin,
+	.pin = guc_parent_context_pin,
+	.unpin = guc_parent_context_unpin,
+	.post_unpin = guc_context_post_unpin,
+
+	.ban = guc_context_ban,
+
+	.cancel_request = guc_context_cancel_request,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.sched_disable = guc_context_sched_disable,
+
+	.destroy = guc_context_destroy,
+
+	.get_sibling = guc_virtual_get_sibling,
+};
+
+static const struct intel_context_ops virtual_child_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.pre_pin = guc_context_pre_pin,
+	.pin = guc_child_context_pin,
+	.unpin = guc_child_context_unpin,
+	.post_unpin = guc_child_context_post_unpin,
+
+	.cancel_request = guc_context_cancel_request,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.destroy = guc_child_context_destroy,
+
+	.get_sibling = guc_virtual_get_sibling,
+};
+
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width)
+{
+	struct intel_engine_cs **siblings = NULL;
+	struct intel_context *parent = NULL, *ce, *err;
+	int i, j;
+
+	siblings = kmalloc_array(num_siblings,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < width; ++i) {
+		for (j = 0; j < num_siblings; ++j)
+			siblings[j] = engines[i * num_siblings + j];
+
+		ce = intel_engine_create_virtual(siblings, num_siblings,
+						 FORCE_VIRTUAL);
+		if (IS_ERR(ce)) {
+			err = ce;
+			goto unwind;
+		}
+
+		if (i == 0) {
+			parent = ce;
+			parent->ops = &virtual_parent_context_ops;
+		} else {
+			ce->ops = &virtual_child_context_ops;
+			intel_context_bind_parent_child(parent, ce);
+		}
+	}
+
+	kfree(siblings);
+	return parent;
+
+unwind:
+	if (parent)
+		intel_context_put(parent);
+	kfree(siblings);
+	return err;
+}
+
 static bool
 guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
 {
@@ -3726,7 +3815,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 }
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags)
 {
 	struct guc_virtual_engine *ve;
 	struct intel_guc *guc;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index b1248a67b4f8..b153f8215403 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1824,6 +1824,7 @@ struct drm_i915_gem_context_param {
  * Extensions:
  *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
  *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
+ *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
  */
 #define I915_CONTEXT_PARAM_ENGINES	0xa
 
@@ -2049,6 +2050,132 @@ struct i915_context_engines_bond {
 	struct i915_engine_class_instance engines[N__]; \
 } __attribute__((packed)) name__
 
+/**
+ * struct i915_context_engines_parallel_submit - Configure engine for
+ * parallel submission.
+ *
+ * Set up a slot in the context engine map to allow multiple BBs to be
+ * submitted in a single execbuf IOCTL. Those BBs will then be scheduled to run
+ * on the GPU in parallel. Multiple hardware contexts are created internally in
+ * the i915 to run these BBs. Once a slot is configured for N BBs, only N BBs
+ * can be submitted in each execbuf IOCTL, and this is implicit behavior, e.g.
+ * the user doesn't tell the execbuf IOCTL there are N BBs; the execbuf IOCTL
+ * knows how many BBs there are based on the slot's configuration. The N BBs
+ * are the last N buffer objects or the first N if I915_EXEC_BATCH_FIRST is set.
+ *
+ * The default placement behavior is to create implicit bonds between each
+ * context if each context maps to more than 1 physical engine (e.g. the
+ * context is a virtual engine). Also, we only allow contexts of the same
+ * engine class, and these contexts must be in logically contiguous order.
+ * Examples of the placement behavior are described below. Lastly, the default
+ * is to not allow BBs to be preempted mid-BB; rather, coordinated preemption
+ * is inserted on all hardware contexts between each set of BBs. Flags may be
+ * added in the future to change both of these default behaviors.
+ *
+ * Returns -EINVAL if the hardware context placement configuration is invalid
+ * or if the placement configuration isn't supported on the platform /
+ * submission interface.
+ * Returns -ENODEV if the extension isn't supported on the platform /
+ * submission interface.
+ *
+ * .. code-block:: none
+ *
+ *	Example 1 pseudo code:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=1,
+ *		     engines=CS[0],CS[1])
+ *
+ *	Results in the following valid placement:
+ *	CS[0], CS[1]
+ *
+ *	Example 2 pseudo code:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[2],CS[1],CS[3])
+ *
+ *	Results in the following valid placements:
+ *	CS[0], CS[1]
+ *	CS[2], CS[3]
+ *
+ *	This can also be thought of as 2 virtual engines described by a 2-D
+ *	array in the engines field, with bonds placed between each index of
+ *	the virtual engines, e.g. CS[0] is bonded to CS[1] and CS[2] is
+ *	bonded to CS[3].
+ *	VE[0] = CS[0], CS[2]
+ *	VE[1] = CS[1], CS[3]
+ *
+ *	Example 3 pseudo code:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[1],CS[1],CS[3])
+ *
+ *	Results in the following valid and invalid placements:
+ *	CS[0], CS[1]
+ *	CS[1], CS[3] - Not logically contiguous, returns -EINVAL
+ */
+struct i915_context_engines_parallel_submit {
+	/**
+	 * @base: base user extension.
+	 */
+	struct i915_user_extension base;
+
+	/**
+	 * @engine_index: slot for parallel engine
+	 */
+	__u16 engine_index;
+
+	/**
+	 * @width: number of contexts per parallel engine
+	 */
+	__u16 width;
+
+	/**
+	 * @num_siblings: number of siblings per context
+	 */
+	__u16 num_siblings;
+
+	/**
+	 * @mbz16: reserved for future use; must be zero
+	 */
+	__u16 mbz16;
+
+	/**
+	 * @flags: all undefined flags must be zero; currently no flags are defined
+	 */
+	__u64 flags;
+
+	/**
+	 * @mbz64: reserved for future use; must be zero
+	 */
+	__u64 mbz64[3];
+
+	/**
+	 * @engines: 2-d array of engine instances to configure parallel engine
+	 *
+	 * length = width (i) * num_siblings (j)
+	 * index = j + i * num_siblings
+	 */
+	struct i915_engine_class_instance engines[0];
+
+} __packed;
+
+#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
+	struct i915_user_extension base; \
+	__u16 engine_index; \
+	__u16 width; \
+	__u16 num_siblings; \
+	__u16 mbz16; \
+	__u64 flags; \
+	__u64 mbz64[3]; \
+	struct i915_engine_class_instance engines[N__]; \
+} __attribute__((packed)) name__
+
 /**
  * DOC: Context Engine Map uAPI
  *
@@ -2108,6 +2235,7 @@ struct i915_context_param_engines {
 	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
 #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
 #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
+#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
 	struct i915_engine_class_instance engines[0];
 } __attribute__((packed));
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 20/27] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
@ 2021-08-20 22:44   ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Introduce 'set parallel submit' extension to connect UAPI to GuC
multi-lrc interface. Kernel doc in new uAPI should explain it all.

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: link to come

v2:
 (Daniel Vetter)
  - Add IGT link and placeholder for media UMD link

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 220 +++++++++++++++++-
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +-
 drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
 .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 ++++++++-
 include/uapi/drm/i915_drm.h                   | 128 ++++++++++
 9 files changed, 485 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index bcaaf514876b..de0fd145fb47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -522,9 +522,149 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
 	return 0;
 }
 
+static int
+set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
+				      void *data)
+{
+	struct i915_context_engines_parallel_submit __user *ext =
+		container_of_user(base, typeof(*ext), base);
+	const struct set_proto_ctx_engines *set = data;
+	struct drm_i915_private *i915 = set->i915;
+	u64 flags;
+	int err = 0, n, i, j;
+	u16 slot, width, num_siblings;
+	struct intel_engine_cs **siblings = NULL;
+	intel_engine_mask_t prev_mask;
+
+	/* Disabling for now */
+	return -ENODEV;
+
+	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
+		return -ENODEV;
+
+	if (get_user(slot, &ext->engine_index))
+		return -EFAULT;
+
+	if (get_user(width, &ext->width))
+		return -EFAULT;
+
+	if (get_user(num_siblings, &ext->num_siblings))
+		return -EFAULT;
+
+	if (slot >= set->num_engines) {
+		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
+			slot, set->num_engines);
+		return -EINVAL;
+	}
+
+	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
+		drm_dbg(&i915->drm,
+			"Invalid placement[%d], already occupied\n", slot);
+		return -EINVAL;
+	}
+
+	if (get_user(flags, &ext->flags))
+		return -EFAULT;
+
+	if (flags) {
+		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
+		return -EINVAL;
+	}
+
+	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
+		err = check_user_mbz(&ext->mbz64[n]);
+		if (err)
+			return err;
+	}
+
+	if (width < 2) {
+		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
+		return -EINVAL;
+	}
+
+	if (num_siblings < 1) {
+		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
+			num_siblings);
+		return -EINVAL;
+	}
+
+	siblings = kmalloc_array(num_siblings * width,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return -ENOMEM;
+
+	/* Create contexts / engines */
+	for (i = 0; i < width; ++i) {
+		intel_engine_mask_t current_mask = 0;
+		struct i915_engine_class_instance prev_engine;
+
+		for (j = 0; j < num_siblings; ++j) {
+			struct i915_engine_class_instance ci;
+
+			n = i * num_siblings + j;
+			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
+				err = -EFAULT;
+				goto out_err;
+			}
+
+			siblings[n] =
+				intel_engine_lookup_user(i915, ci.engine_class,
+							 ci.engine_instance);
+			if (!siblings[n]) {
+				drm_dbg(&i915->drm,
+					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
+					n, ci.engine_class, ci.engine_instance);
+				err = -EINVAL;
+				goto out_err;
+			}
+
+			if (n) {
+				if (prev_engine.engine_class !=
+				    ci.engine_class) {
+					drm_dbg(&i915->drm,
+						"Mismatched class %d, %d\n",
+						prev_engine.engine_class,
+						ci.engine_class);
+					err = -EINVAL;
+					goto out_err;
+				}
+			}
+
+			prev_engine = ci;
+			current_mask |= siblings[n]->logical_mask;
+		}
+
+		if (i > 0) {
+			if (current_mask != prev_mask << 1) {
+				drm_dbg(&i915->drm,
+					"Non contiguous logical mask 0x%x, 0x%x\n",
+					prev_mask, current_mask);
+				err = -EINVAL;
+				goto out_err;
+			}
+		}
+		prev_mask = current_mask;
+	}
+
+	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
+	set->engines[slot].num_siblings = num_siblings;
+	set->engines[slot].width = width;
+	set->engines[slot].siblings = siblings;
+
+	return 0;
+
+out_err:
+	kfree(siblings);
+
+	return err;
+}
+
 static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
 	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
 	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
+	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
+		set_proto_ctx_engines_parallel_submit,
 };
 
 static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
@@ -821,6 +961,25 @@ static int intel_context_set_gem(struct intel_context *ce,
 	return ret;
 }
 
+static void __unpin_engines(struct i915_gem_engines *e, unsigned int count)
+{
+	while (count--) {
+		struct intel_context *ce = e->engines[count], *child;
+
+		if (!ce || !test_bit(CONTEXT_PERMA_PIN, &ce->flags))
+			continue;
+
+		for_each_child(ce, child)
+			intel_context_unpin(child);
+		intel_context_unpin(ce);
+	}
+}
+
+static void unpin_engines(struct i915_gem_engines *e)
+{
+	__unpin_engines(e, e->num_engines);
+}
+
 static void __free_engines(struct i915_gem_engines *e, unsigned int count)
 {
 	while (count--) {
@@ -936,6 +1095,40 @@ static struct i915_gem_engines *default_engines(struct i915_gem_context *ctx,
 	return err;
 }
 
+static int perma_pin_contexts(struct intel_context *ce)
+{
+	struct intel_context *child;
+	int i = 0, j = 0, ret;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	ret = intel_context_pin(ce);
+	if (unlikely(ret))
+		return ret;
+
+	for_each_child(ce, child) {
+		ret = intel_context_pin(child);
+		if (unlikely(ret))
+			goto unwind;
+		++i;
+	}
+
+	set_bit(CONTEXT_PERMA_PIN, &ce->flags);
+
+	return 0;
+
+unwind:
+	intel_context_unpin(ce);
+	for_each_child(ce, child) {
+		if (j++ < i)
+			intel_context_unpin(child);
+		else
+			break;
+	}
+
+	return ret;
+}
+
 static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 					     unsigned int num_engines,
 					     struct i915_gem_proto_engine *pe)
@@ -946,7 +1139,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 	e = alloc_engines(num_engines);
 	e->num_engines = num_engines;
 	for (n = 0; n < num_engines; n++) {
-		struct intel_context *ce;
+		struct intel_context *ce, *child;
 		int ret;
 
 		switch (pe[n].type) {
@@ -956,7 +1149,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 
 		case I915_GEM_ENGINE_TYPE_BALANCED:
 			ce = intel_engine_create_virtual(pe[n].siblings,
-							 pe[n].num_siblings);
+							 pe[n].num_siblings, 0);
+			break;
+
+		case I915_GEM_ENGINE_TYPE_PARALLEL:
+			ce = intel_engine_create_parallel(pe[n].siblings,
+							  pe[n].num_siblings,
+							  pe[n].width);
 			break;
 
 		case I915_GEM_ENGINE_TYPE_INVALID:
@@ -977,6 +1176,22 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 			err = ERR_PTR(ret);
 			goto free_engines;
 		}
+		for_each_child(ce, child) {
+			ret = intel_context_set_gem(child, ctx, pe->sseu);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto free_engines;
+			}
+		}
+
+		/* XXX: Must be done after setting gem context */
+		if (pe[n].type == I915_GEM_ENGINE_TYPE_PARALLEL) {
+			ret = perma_pin_contexts(ce);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto free_engines;
+			}
+		}
 	}
 
 	return e;
@@ -1200,6 +1415,7 @@ static void context_close(struct i915_gem_context *ctx)
 
 	/* Flush any concurrent set_engines() */
 	mutex_lock(&ctx->engines_mutex);
+	unpin_engines(ctx->engines);
 	engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
 	i915_gem_context_set_closed(ctx);
 	mutex_unlock(&ctx->engines_mutex);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 94c03a97cb77..7b096d83bca1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -78,6 +78,9 @@ enum i915_gem_engine_type {
 
 	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
 	I915_GEM_ENGINE_TYPE_BALANCED,
+
+	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
+	I915_GEM_ENGINE_TYPE_PARALLEL,
 };
 
 /**
@@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
 	/** @num_siblings: Number of balanced siblings */
 	unsigned int num_siblings;
 
+	/** @width: Width of each sibling */
+	unsigned int width;
+
 	/** @siblings: Balanced siblings */
 	struct intel_engine_cs **siblings;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index a63329520c35..713d85b0b364 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -55,9 +55,13 @@ struct intel_context_ops {
 	void (*reset)(struct intel_context *ce);
 	void (*destroy)(struct kref *kref);
 
-	/* virtual engine/context interface */
+	/* virtual/parallel engine/context interface */
 	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
-						unsigned int count);
+						unsigned int count,
+						unsigned long flags);
+	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
+						 unsigned int num_siblings,
+						 unsigned int width);
 	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
 					       unsigned int sibling);
 };
@@ -113,6 +117,7 @@ struct intel_context {
 #define CONTEXT_NOPREEMPT		8
 #define CONTEXT_LRCA_DIRTY		9
 #define CONTEXT_GUC_INIT		10
+#define CONTEXT_PERMA_PIN		11
 
 	struct {
 		u64 timeout_us;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 87579affb952..43f16a8347ee 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
 	return intel_engine_has_preemption(engine);
 }
 
+#define FORCE_VIRTUAL	BIT(0)
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count);
+			    unsigned int count, unsigned long flags);
+
+static inline struct intel_context *
+intel_engine_create_parallel(struct intel_engine_cs **engines,
+			     unsigned int num_engines,
+			     unsigned int width)
+{
+	GEM_BUG_ON(!engines[0]->cops->create_parallel);
+	return engines[0]->cops->create_parallel(engines, num_engines, width);
+}
 
 static inline bool
 intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 4d790f9a65dd..f66c75c77584 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1923,16 +1923,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count)
+			    unsigned int count, unsigned long flags)
 {
 	if (count == 0)
 		return ERR_PTR(-EINVAL);
 
-	if (count == 1)
+	if (count == 1 && !(flags & FORCE_VIRTUAL))
 		return intel_context_create(siblings[0]);
 
 	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
-	return siblings[0]->cops->create_virtual(siblings, count);
+	return siblings[0]->cops->create_virtual(siblings, count, flags);
 }
 
 struct i915_request *
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 813a6de01382..d1e2d6f8ff81 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags);
 
 static struct i915_request *
 __active_request(const struct intel_timeline * const tl,
@@ -3782,7 +3783,8 @@ static void virtual_submit_request(struct i915_request *rq)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags)
 {
 	struct virtual_engine *ve;
 	unsigned int n;
diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index f12ffe797639..e876a9d88a5c 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
 	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
 
 	for (n = 0; n < nctx; n++) {
-		ve[n] = intel_engine_create_virtual(siblings, nsibling);
+		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ve[n])) {
 			err = PTR_ERR(ve[n]);
 			nctx = n;
@@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
 	 * restrict it to our desired engine within the virtual engine.
 	 */
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_close;
@@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
 		i915_request_add(rq);
 	}
 
-	ce = intel_engine_create_virtual(siblings, nsibling);
+	ce = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ce)) {
 		err = PTR_ERR(ce);
 		goto out;
@@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
 
 	/* XXX We do not handle oversubscription and fairness with normal rq */
 	for (n = 0; n < nsibling; n++) {
-		ce = intel_engine_create_virtual(siblings, nsibling);
+		ce = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ce)) {
 			err = PTR_ERR(ce);
 			goto out;
@@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
 	if (err)
 		goto out_scratch;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_scratch;
@@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
 	if (igt_spinner_init(&spin, gt))
 		return -ENOMEM;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_spin;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 07eee9a399c8..2554d0eb4afd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -121,7 +121,13 @@ struct guc_virtual_engine {
 };
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags);
+
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
@@ -2581,6 +2587,7 @@ static const struct intel_context_ops guc_context_ops = {
 	.destroy = guc_context_destroy,
 
 	.create_virtual = guc_create_virtual,
+	.create_parallel = guc_create_parallel,
 };
 
 static void submit_work_cb(struct irq_work *wrk)
@@ -2827,8 +2834,6 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 	.get_sibling = guc_virtual_get_sibling,
 };
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
@@ -2845,8 +2850,6 @@ static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
 	return __guc_context_pin(ce, engine, vaddr);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
@@ -2858,8 +2861,6 @@ static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
 	return __guc_context_pin(ce, engine, vaddr);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_parent_context_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
@@ -2875,8 +2876,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 	lrc_unpin(ce);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_child_context_unpin(struct intel_context *ce)
 {
 	GEM_BUG_ON(context_enabled(ce));
@@ -2887,8 +2886,6 @@ static void guc_child_context_unpin(struct intel_context *ce)
 	lrc_unpin(ce);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_child_context_post_unpin(struct intel_context *ce)
 {
 	GEM_BUG_ON(!intel_context_is_child(ce));
@@ -2899,6 +2896,98 @@ static void guc_child_context_post_unpin(struct intel_context *ce)
 	intel_context_unpin(ce->parent);
 }
 
+static void guc_child_context_destroy(struct kref *kref)
+{
+	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
+
+	__guc_context_destroy(ce);
+}
+
+static const struct intel_context_ops virtual_parent_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.pre_pin = guc_context_pre_pin,
+	.pin = guc_parent_context_pin,
+	.unpin = guc_parent_context_unpin,
+	.post_unpin = guc_context_post_unpin,
+
+	.ban = guc_context_ban,
+
+	.cancel_request = guc_context_cancel_request,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.sched_disable = guc_context_sched_disable,
+
+	.destroy = guc_context_destroy,
+
+	.get_sibling = guc_virtual_get_sibling,
+};
+
+static const struct intel_context_ops virtual_child_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.pre_pin = guc_context_pre_pin,
+	.pin = guc_child_context_pin,
+	.unpin = guc_child_context_unpin,
+	.post_unpin = guc_child_context_post_unpin,
+
+	.cancel_request = guc_context_cancel_request,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.destroy = guc_child_context_destroy,
+
+	.get_sibling = guc_virtual_get_sibling,
+};
+
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width)
+{
+	struct intel_engine_cs **siblings = NULL;
+	struct intel_context *parent = NULL, *ce, *err;
+	int i, j;
+
+	siblings = kmalloc_array(num_siblings,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < width; ++i) {
+		for (j = 0; j < num_siblings; ++j)
+			siblings[j] = engines[i * num_siblings + j];
+
+		ce = intel_engine_create_virtual(siblings, num_siblings,
+						 FORCE_VIRTUAL);
+		if (!ce) {
+			err = ERR_PTR(-ENOMEM);
+			goto unwind;
+		}
+
+		if (i == 0) {
+			parent = ce;
+			parent->ops = &virtual_parent_context_ops;
+		} else {
+			ce->ops = &virtual_child_context_ops;
+			intel_context_bind_parent_child(parent, ce);
+		}
+	}
+
+	kfree(siblings);
+	return parent;
+
+unwind:
+	if (parent)
+		intel_context_put(parent);
+	kfree(siblings);
+	return err;
+}
+
 static bool
 guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
 {
@@ -3726,7 +3815,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 }
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags)
 {
 	struct guc_virtual_engine *ve;
 	struct intel_guc *guc;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index b1248a67b4f8..b153f8215403 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1824,6 +1824,7 @@ struct drm_i915_gem_context_param {
  * Extensions:
  *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
  *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
+ *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
  */
 #define I915_CONTEXT_PARAM_ENGINES	0xa
 
@@ -2049,6 +2050,132 @@ struct i915_context_engines_bond {
 	struct i915_engine_class_instance engines[N__]; \
 } __attribute__((packed)) name__
 
+/**
+ * struct i915_context_engines_parallel_submit - Configure engine for
+ * parallel submission.
+ *
+ * Setup a slot in the context engine map to allow multiple BBs to be submitted
+ * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
+ * in parallel. Multiple hardware contexts are created internally in the i915
+ * run these BBs. Once a slot is configured for N BBs only N BBs can be
+ * submitted in each execbuf IOCTL and this is implicit behavior e.g. The user
+ * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
+ * many BBs there are based on the slot's configuration. The N BBs are the last
+ * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
+ *
+ * The default placement behavior is to create implicit bonds between each
+ * context if each context maps to more than 1 physical engine (e.g. context is
+ * a virtual engine). Also we only allow contexts of same engine class and these
+ * contexts must be in logically contiguous order. Examples of the placement
+ * behavior described below. Lastly, the default is to not allow BBs to
+ * preempted mid BB rather insert coordinated preemption on all hardware
+ * contexts between each set of BBs. Flags may be added in the future to change
+ * both of these default behaviors.
+ *
+ * Returns -EINVAL if hardware context placement configuration is invalid or if
+ * the placement configuration isn't supported on the platform / submission
+ * interface.
+ * Returns -ENODEV if extension isn't supported on the platform / submission
+ * interface.
+ *
+ * .. code-block:: none
+ *
+ *	Example 1 pseudo code:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=1,
+ *		     engines=CS[0],CS[1])
+ *
+ *	Results in the following valid placement:
+ *	CS[0], CS[1]
+ *
+ *	Example 2 pseudo code:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[2],CS[1],CS[3])
+ *
+ *	Results in the following valid placements:
+ *	CS[0], CS[1]
+ *	CS[2], CS[3]
+ *
+ *	This can also be thought of as 2 virtual engines described by 2-D array
+ *	in the engines the field with bonds placed between each index of the
+ *	virtual engines. e.g. CS[0] is bonded to CS[1], CS[2] is bonded to
+ *	CS[3].
+ *	VE[0] = CS[0], CS[2]
+ *	VE[1] = CS[1], CS[3]
+ *
+ *	Example 3 pseudo code:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[1],CS[1],CS[3])
+ *
+ *	Results in the following valid and invalid placements:
+ *	CS[0], CS[1]
+ *	CS[1], CS[3] - Not logical contiguous, return -EINVAL
+ */
+struct i915_context_engines_parallel_submit {
+	/**
+	 * @base: base user extension.
+	 */
+	struct i915_user_extension base;
+
+	/**
+	 * @engine_index: slot for parallel engine
+	 */
+	__u16 engine_index;
+
+	/**
+	 * @width: number of contexts per parallel engine
+	 */
+	__u16 width;
+
+	/**
+	 * @num_siblings: number of siblings per context
+	 */
+	__u16 num_siblings;
+
+	/**
+	 * @mbz16: reserved for future use; must be zero
+	 */
+	__u16 mbz16;
+
+	/**
+	 * @flags: all undefined flags must be zero; currently no flags are defined
+	 */
+	__u64 flags;
+
+	/**
+	 * @mbz64: reserved for future use; must be zero
+	 */
+	__u64 mbz64[3];
+
+	/**
+	 * @engines: 2-d array of engine instances to configure parallel engine
+	 *
+	 * length = width (i) * num_siblings (j)
+	 * index = j + i * num_siblings
+	 */
+	struct i915_engine_class_instance engines[0];
+
+} __packed;
+
+#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
+	struct i915_user_extension base; \
+	__u16 engine_index; \
+	__u16 width; \
+	__u16 num_siblings; \
+	__u16 mbz16; \
+	__u64 flags; \
+	__u64 mbz64[3]; \
+	struct i915_engine_class_instance engines[N__]; \
+} __attribute__((packed)) name__
+
 /**
  * DOC: Context Engine Map uAPI
  *
@@ -2108,6 +2235,7 @@ struct i915_context_param_engines {
 	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
 #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
 #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
+#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
 	struct i915_engine_class_instance engines[0];
 } __attribute__((packed));
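
For completeness, a minimal userspace sketch of configuring a slot with this
extension (not taken from any UMD; fd, ctx_id, and a part with four video
engines are assumed, error handling omitted) chains it off
I915_CONTEXT_PARAM_ENGINES, mirroring Example 2 above:

	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 4) = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.engine_index = 0,
		.width = 2,
		.num_siblings = 2,
		/* index = j + i * num_siblings: VE[0] = CS[0], CS[2] */
		.engines = {
			{ I915_ENGINE_CLASS_VIDEO, 0 },
			{ I915_ENGINE_CLASS_VIDEO, 2 },
			{ I915_ENGINE_CLASS_VIDEO, 1 },
			{ I915_ENGINE_CLASS_VIDEO, 3 },
		},
	};
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engine_map, 1) = {
		.extensions = (uintptr_t)&parallel,
		.engines = { { I915_ENGINE_CLASS_INVALID,
			       I915_ENGINE_CLASS_INVALID_NONE } },
	};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_ENGINES,
		.size = sizeof(engine_map),
		.value = (uintptr_t)&engine_map,
	};

	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);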
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 21/27] drm/i915/doc: Update parallel submit doc to point to i915_drm.h
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Update parallel submit doc to point to i915_drm.h

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 Documentation/gpu/rfc/i915_parallel_execbuf.h | 122 ------------------
 Documentation/gpu/rfc/i915_scheduler.rst      |   4 +-
 2 files changed, 2 insertions(+), 124 deletions(-)
 delete mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h

diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h b/Documentation/gpu/rfc/i915_parallel_execbuf.h
deleted file mode 100644
index 8cbe2c4e0172..000000000000
--- a/Documentation/gpu/rfc/i915_parallel_execbuf.h
+++ /dev/null
@@ -1,122 +0,0 @@
-/* SPDX-License-Identifier: MIT */
-/*
- * Copyright © 2021 Intel Corporation
- */
-
-#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
-
-/**
- * struct drm_i915_context_engines_parallel_submit - Configure engine for
- * parallel submission.
- *
- * Setup a slot in the context engine map to allow multiple BBs to be submitted
- * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
- * in parallel. Multiple hardware contexts are created internally in the i915
- * run these BBs. Once a slot is configured for N BBs only N BBs can be
- * submitted in each execbuf IOCTL and this is implicit behavior e.g. The user
- * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
- * many BBs there are based on the slot's configuration. The N BBs are the last
- * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
- *
- * The default placement behavior is to create implicit bonds between each
- * context if each context maps to more than 1 physical engine (e.g. context is
- * a virtual engine). Also we only allow contexts of same engine class and these
- * contexts must be in logically contiguous order. Examples of the placement
- * behavior described below. Lastly, the default is to not allow BBs to
- * preempted mid BB rather insert coordinated preemption on all hardware
- * contexts between each set of BBs. Flags may be added in the future to change
- * both of these default behaviors.
- *
- * Returns -EINVAL if hardware context placement configuration is invalid or if
- * the placement configuration isn't supported on the platform / submission
- * interface.
- * Returns -ENODEV if extension isn't supported on the platform / submission
- * interface.
- *
- * .. code-block:: none
- *
- *	Example 1 pseudo code:
- *	CS[X] = generic engine of same class, logical instance X
- *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
- *	set_engines(INVALID)
- *	set_parallel(engine_index=0, width=2, num_siblings=1,
- *		     engines=CS[0],CS[1])
- *
- *	Results in the following valid placement:
- *	CS[0], CS[1]
- *
- *	Example 2 pseudo code:
- *	CS[X] = generic engine of same class, logical instance X
- *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
- *	set_engines(INVALID)
- *	set_parallel(engine_index=0, width=2, num_siblings=2,
- *		     engines=CS[0],CS[2],CS[1],CS[3])
- *
- *	Results in the following valid placements:
- *	CS[0], CS[1]
- *	CS[2], CS[3]
- *
- *	This can also be thought of as 2 virtual engines described by 2-D array
- *	in the engines the field with bonds placed between each index of the
- *	virtual engines. e.g. CS[0] is bonded to CS[1], CS[2] is bonded to
- *	CS[3].
- *	VE[0] = CS[0], CS[2]
- *	VE[1] = CS[1], CS[3]
- *
- *	Example 3 pseudo code:
- *	CS[X] = generic engine of same class, logical instance X
- *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
- *	set_engines(INVALID)
- *	set_parallel(engine_index=0, width=2, num_siblings=2,
- *		     engines=CS[0],CS[1],CS[1],CS[3])
- *
- *	Results in the following valid and invalid placements:
- *	CS[0], CS[1]
- *	CS[1], CS[3] - Not logical contiguous, return -EINVAL
- */
-struct drm_i915_context_engines_parallel_submit {
-	/**
-	 * @base: base user extension.
-	 */
-	struct i915_user_extension base;
-
-	/**
-	 * @engine_index: slot for parallel engine
-	 */
-	__u16 engine_index;
-
-	/**
-	 * @width: number of contexts per parallel engine
-	 */
-	__u16 width;
-
-	/**
-	 * @num_siblings: number of siblings per context
-	 */
-	__u16 num_siblings;
-
-	/**
-	 * @mbz16: reserved for future use; must be zero
-	 */
-	__u16 mbz16;
-
-	/**
-	 * @flags: all undefined flags must be zero, currently not defined flags
-	 */
-	__u64 flags;
-
-	/**
-	 * @mbz64: reserved for future use; must be zero
-	 */
-	__u64 mbz64[3];
-
-	/**
-	 * @engines: 2-d array of engine instances to configure parallel engine
-	 *
-	 * length = width (i) * num_siblings (j)
-	 * index = j + i * num_siblings
-	 */
-	struct i915_engine_class_instance engines[0];
-
-} __packed;
-
diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
index cbda75065dad..d630f15ab795 100644
--- a/Documentation/gpu/rfc/i915_scheduler.rst
+++ b/Documentation/gpu/rfc/i915_scheduler.rst
@@ -135,8 +135,8 @@ Add I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT and
 drm_i915_context_engines_parallel_submit to the uAPI to implement this
 extension.
 
-.. kernel-doc:: Documentation/gpu/rfc/i915_parallel_execbuf.h
-        :functions: drm_i915_context_engines_parallel_submit
+.. kernel-doc:: include/uapi/drm/i915_drm.h
+        :functions: i915_context_engines_parallel_submit
 
 Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
 -------------------------------------------------------------------
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 22/27] drm/i915/guc: Add basic GuC multi-lrc selftest
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Add very basic (single submission) multi-lrc selftest.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   1 +
 .../drm/i915/gt/uc/selftest_guc_multi_lrc.c   | 180 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 3 files changed, 182 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 2554d0eb4afd..91330525330d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3924,4 +3924,5 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_guc.c"
+#include "selftest_guc_multi_lrc.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
new file mode 100644
index 000000000000..dacfc5dfadd6
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2019 Intel Corporation
+ */
+
+#include "selftests/igt_spinner.h"
+#include "selftests/igt_reset.h"
+#include "selftests/intel_scheduler_helpers.h"
+#include "gt/intel_engine_heartbeat.h"
+#include "gem/selftests/mock_context.h"
+
+static void logical_sort(struct intel_engine_cs **engines, int num_engines)
+{
+	struct intel_engine_cs *sorted[MAX_ENGINE_INSTANCE + 1];
+	int i, j;
+
+	for (i = 0; i < num_engines; ++i)
+		for (j = 0; j < num_engines; ++j) {
+			if (engines[j]->logical_mask & BIT(i)) {
+				sorted[i] = engines[j];
+				break;
+			}
+		}
+
+	/* Copy back the sorted pointer array, not the engine structs */
+	memcpy(engines, sorted,
+	       sizeof(struct intel_engine_cs *) * num_engines);
+}
+
+static struct intel_context *
+multi_lrc_create_parent(struct intel_gt *gt, u8 class,
+			unsigned long flags)
+{
+	struct intel_engine_cs *siblings[MAX_ENGINE_INSTANCE + 1];
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	int i = 0;
+
+	for_each_engine(engine, gt, id) {
+		if (engine->class != class)
+			continue;
+
+		siblings[i++] = engine;
+	}
+
+	if (i <= 1)
+		return NULL; /* not an error, just not enough engines */
+
+	logical_sort(siblings, i);
+
+	return intel_engine_create_parallel(siblings, 1, i);
+}
+
+static void multi_lrc_context_unpin(struct intel_context *ce)
+{
+	struct intel_context *child;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	for_each_child(ce, child)
+		intel_context_unpin(child);
+	intel_context_unpin(ce);
+}
+
+static void multi_lrc_context_put(struct intel_context *ce)
+{
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	/*
+	 * Only the parent gets the creation ref put in the uAPI, the parent
+	 * itself is responsible for creation ref put on the children.
+	 */
+	intel_context_put(ce);
+}
+
+static struct i915_request *
+multi_lrc_nop_request(struct intel_context *ce)
+{
+	struct intel_context *child;
+	struct i915_request *rq, *child_rq;
+	int i = 0;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	for_each_child(ce, child) {
+		child_rq = intel_context_create_request(child);
+		if (IS_ERR(child_rq))
+			goto child_error;
+
+		if (++i == ce->guc_number_children)
+			set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
+				&child_rq->fence.flags);
+		i915_request_add(child_rq);
+	}
+
+	return rq;
+
+child_error:
+	i915_request_put(rq);
+
+	return ERR_CAST(child_rq);
+}
+
+static int __intel_guc_multi_lrc_basic(struct intel_gt *gt, unsigned int class)
+{
+	struct intel_context *parent;
+	struct i915_request *rq;
+	int ret;
+
+	parent = multi_lrc_create_parent(gt, class, 0);
+	if (IS_ERR(parent)) {
+		pr_err("Failed creating contexts: %ld", PTR_ERR(parent));
+		return PTR_ERR(parent);
+	} else if (!parent) {
+		pr_debug("Not enough engines in class: %d",
+			 VIDEO_DECODE_CLASS);
+		return 0;
+	}
+
+	rq = multi_lrc_nop_request(parent);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		pr_err("Failed creating requests: %d", ret);
+		goto out;
+	}
+
+	ret = intel_selftest_wait_for_rq(rq);
+	if (ret)
+		pr_err("Failed waiting on request: %d", ret);
+
+	i915_request_put(rq);
+
+	if (ret >= 0) {
+		ret = intel_gt_wait_for_idle(gt, HZ * 5);
+		if (ret < 0)
+			pr_err("GT failed to idle: %d\n", ret);
+	}
+
+out:
+	multi_lrc_context_unpin(parent);
+	multi_lrc_context_put(parent);
+	return ret;
+}
+
+static int intel_guc_multi_lrc_basic(void *arg)
+{
+	struct intel_gt *gt = arg;
+	unsigned int class;
+	int ret;
+
+	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
+		ret = __intel_guc_multi_lrc_basic(gt, class);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+int intel_guc_multi_lrc_live_selftests(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_multi_lrc_basic),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index 3cf6758931f9..bdd290f2bf3c 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -48,5 +48,6 @@ selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
 selftest(slpc, intel_slpc_live_selftests)
 selftest(guc, intel_guc_live_selftests)
+selftest(guc_multi_lrc, intel_guc_multi_lrc_live_selftests)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 23/27] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
mid BB. To safely enable preemption at the BB boundary, a handshake
between the parent and child is needed. This is implemented via custom
emit_bb_start & emit_fini_breadcrumb functions and enabled by default
if a context is configured via the set parallel extension.
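
The handshake is simple enough to model in plain C. Below is an
illustrative sketch only (all names invented): the children publish a
join flag, the parent waits for every join and then publishes a go flag
that releases the children, mirroring the MI_SEMAPHORE_WAIT polls and
GGTT writes the patch emits into the rings. The same join/go exchange
repeats with the opposite values in the fini breadcrumbs to re-enable
preemption.

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>

	#define N_CHILDREN 2

	static atomic_uint children_go;               /* the "child go" slot */
	static atomic_uint children_join[N_CHILDREN]; /* one cacheline each on HW */

	static void *child(void *arg)
	{
		unsigned int idx = (unsigned int)(uintptr_t)arg;

		/* Signal the parent, then spin until the parent says go */
		atomic_store(&children_join[idx], 1);
		while (atomic_load(&children_go) != 1)
			;
		printf("child %u: runs its BB with preemption disabled\n", idx);
		return NULL;
	}

	int main(void)
	{
		pthread_t threads[N_CHILDREN];
		uintptr_t i;

		for (i = 0; i < N_CHILDREN; i++)
			pthread_create(&threads[i], NULL, child, (void *)i);

		/* Parent waits for all children to join ... */
		for (i = 0; i < N_CHILDREN; i++)
			while (atomic_load(&children_join[i]) != 1)
				;
		/* ... then tells them to go; all BBs now run back to back */
		atomic_store(&children_go, 1);
		printf("parent: runs its BB with preemption disabled\n");

		for (i = 0; i < N_CHILDREN; i++)
			pthread_join(threads[i], NULL);
		return 0;
	}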

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 283 +++++++++++++++++-
 4 files changed, 287 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 5615be32879c..2de62649e275 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -561,7 +561,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
 	GEM_BUG_ON(intel_context_is_child(child));
 	GEM_BUG_ON(intel_context_is_parent(child));
 
-	parent->guc_number_children++;
+	child->guc_child_index = parent->guc_number_children++;
 	list_add_tail(&child->guc_child_link,
 		      &parent->guc_child_list);
 	child->parent = parent;
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 713d85b0b364..727f91e7f7c2 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -246,6 +246,9 @@ struct intel_context {
 		/** @guc_number_children: number of children if parent */
 		u8 guc_number_children;
 
+		/** @guc_child_index: index into guc_child_list if child */
+		u8 guc_child_index;
+
 		/**
 		 * @parent_page: page in context used by parent for work queue,
 		 * work queue descriptor
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 6cd26dc060d1..9f61cfa5566a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -188,7 +188,7 @@ struct guc_process_desc {
 	u32 wq_status;
 	u32 engine_presence;
 	u32 priority;
-	u32 reserved[30];
+	u32 reserved[36];
 } __packed;
 
 #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 91330525330d..1a18f99bf12a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -11,6 +11,7 @@
 #include "gt/intel_context.h"
 #include "gt/intel_engine_pm.h"
 #include "gt/intel_engine_heartbeat.h"
+#include "gt/intel_gpu_commands.h"
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_irq.h"
 #include "gt/intel_gt_pm.h"
@@ -366,10 +367,14 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
 
 /*
  * When using multi-lrc submission an extra page in the context state is
- * reserved for the process descriptor and work queue.
+ * reserved for the process descriptor, work queue, and preempt BB boundary
+ * handshake between the parent + children contexts.
  *
  * The layout of this page is below:
  * 0						guc_process_desc
+ * + sizeof(struct guc_process_desc)		child go
+ * + CACHELINE_BYTES				child join[0]
+ * + CACHELINE_BYTES * n			child join[n - 1]
  * ...						unused
  * PAGE_SIZE / 2				work queue start
  * ...						work queue
@@ -1785,6 +1790,30 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
 	return __guc_action_deregister_context(guc, guc_id, loop);
 }
 
+static inline void clear_children_join_go_memory(struct intel_context *ce)
+{
+	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
+	u8 i;
+
+	for (i = 0; i < ce->guc_number_children + 1; ++i)
+		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
+}
+
+static inline u32 get_children_go_value(struct intel_context *ce)
+{
+	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
+
+	return mem[0];
+}
+
+static inline u32 get_children_join_value(struct intel_context *ce,
+					  u8 child_index)
+{
+	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
+
+	return mem[(child_index + 1) * (CACHELINE_BYTES / sizeof(u32))];
+}
+
 static void guc_context_policy_init(struct intel_engine_cs *engine,
 				    struct guc_lrc_desc *desc)
 {
@@ -1867,6 +1896,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 			guc_context_policy_init(engine, desc);
 		}
+
+		clear_children_join_go_memory(ce);
 	}
 
 	/*
@@ -2943,6 +2974,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
 	.get_sibling = guc_virtual_get_sibling,
 };
 
+/*
+ * The below override of the breadcrumbs is enabled when the user configures a
+ * context for parallel submission (multi-lrc, parent-child).
+ *
+ * The overridden breadcrumbs implement an algorithm which allows the GuC to
+ * safely preempt all the hw contexts configured for parallel submission
+ * between each BB. The contract between the i915 and GuC is that if the
+ * parent context can be preempted, all the children can be preempted, and
+ * the GuC will always try to preempt the parent before the children. A
+ * handshake between the parent / children breadcrumbs ensures the i915 holds
+ * up its end of the deal, creating a window to preempt between each set of
+ * BBs.
+ */
+static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
+						     u64 offset, u32 len,
+						     const unsigned int flags);
+static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
+						    u64 offset, u32 len,
+						    const unsigned int flags);
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs);
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
+						u32 *cs);
+
 static struct intel_context *
 guc_create_parallel(struct intel_engine_cs **engines,
 		    unsigned int num_siblings,
@@ -2978,6 +3034,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
 		}
 	}
 
+	parent->engine->emit_bb_start =
+		emit_bb_start_parent_no_preempt_mid_batch;
+	parent->engine->emit_fini_breadcrumb =
+		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
+	parent->engine->emit_fini_breadcrumb_dw =
+		12 + 4 * parent->guc_number_children;
+	for_each_child(parent, ce) {
+		ce->engine->emit_bb_start =
+			emit_bb_start_child_no_preempt_mid_batch;
+		ce->engine->emit_fini_breadcrumb =
+			emit_fini_breadcrumb_child_no_preempt_mid_batch;
+		ce->engine->emit_fini_breadcrumb_dw = 16;
+	}
+
 	kfree(siblings);
 	return parent;
 
@@ -3362,6 +3432,204 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
 	guc->submission_selected = __guc_submission_selected(guc);
 }
 
+static inline u32 get_children_go_addr(struct intel_context *ce)
+{
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	return i915_ggtt_offset(ce->state) +
+		__get_process_desc_offset(ce) +
+		sizeof(struct guc_process_desc);
+}
+
+static inline u32 get_children_join_addr(struct intel_context *ce,
+					 u8 child_index)
+{
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
+}
+
+#define PARENT_GO_BB			1
+#define PARENT_GO_FINI_BREADCRUMB	0
+#define CHILD_GO_BB			1
+#define CHILD_GO_FINI_BREADCRUMB	0
+static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
+						     u64 offset, u32 len,
+						     const unsigned int flags)
+{
+	struct intel_context *ce = rq->context;
+	u32 *cs;
+	u8 i;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	cs = intel_ring_begin(rq, 10 + 4 * ce->guc_number_children);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/* Wait on children */
+	for (i = 0; i < ce->guc_number_children; ++i) {
+		*cs++ = (MI_SEMAPHORE_WAIT |
+			 MI_SEMAPHORE_GLOBAL_GTT |
+			 MI_SEMAPHORE_POLL |
+			 MI_SEMAPHORE_SAD_EQ_SDD);
+		*cs++ = PARENT_GO_BB;
+		*cs++ = get_children_join_addr(ce, i);
+		*cs++ = 0;
+	}
+
+	/* Turn off preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+	*cs++ = MI_NOOP;
+
+	/* Tell children go */
+	cs = gen8_emit_ggtt_write(cs,
+				  CHILD_GO_BB,
+				  get_children_go_addr(ce),
+				  0);
+
+	/* Jump to batch */
+	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
+		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
+	*cs++ = lower_32_bits(offset);
+	*cs++ = upper_32_bits(offset);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
+						    u64 offset, u32 len,
+						    const unsigned int flags)
+{
+	struct intel_context *ce = rq->context;
+	u32 *cs;
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	cs = intel_ring_begin(rq, 12);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/* Signal parent */
+	cs = gen8_emit_ggtt_write(cs,
+				  PARENT_GO_BB,
+				  get_children_join_addr(ce->parent,
+							 ce->guc_child_index),
+				  0);
+
+	/* Wait on parent for go */
+	*cs++ = (MI_SEMAPHORE_WAIT |
+		 MI_SEMAPHORE_GLOBAL_GTT |
+		 MI_SEMAPHORE_POLL |
+		 MI_SEMAPHORE_SAD_EQ_SDD);
+	*cs++ = CHILD_GO_BB;
+	*cs++ = get_children_go_addr(ce->parent);
+	*cs++ = 0;
+
+	/* Turn off preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+
+	/* Jump to batch */
+	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
+		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
+	*cs++ = lower_32_bits(offset);
+	*cs++ = upper_32_bits(offset);
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+	u8 i;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	/* Wait on children */
+	for (i = 0; i < ce->guc_number_children; ++i) {
+		*cs++ = (MI_SEMAPHORE_WAIT |
+			 MI_SEMAPHORE_GLOBAL_GTT |
+			 MI_SEMAPHORE_POLL |
+			 MI_SEMAPHORE_SAD_EQ_SDD);
+		*cs++ = PARENT_GO_FINI_BREADCRUMB;
+		*cs++ = get_children_join_addr(ce, i);
+		*cs++ = 0;
+	}
+
+	/* Turn on preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+	*cs++ = MI_NOOP;
+
+	/* Tell children go */
+	cs = gen8_emit_ggtt_write(cs,
+				  CHILD_GO_FINI_BREADCRUMB,
+				  get_children_go_addr(ce),
+				  0);
+
+	/* Emit fini breadcrumb */
+	cs = gen8_emit_ggtt_write(cs,
+				  rq->fence.seqno,
+				  i915_request_active_timeline(rq)->hwsp_offset,
+				  0);
+
+	/* User interrupt */
+	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
+
+	rq->tail = intel_ring_offset(rq, cs);
+
+	return cs;
+}
+
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	/* Turn on preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+	*cs++ = MI_NOOP;
+
+	/* Signal parent */
+	cs = gen8_emit_ggtt_write(cs,
+				  PARENT_GO_FINI_BREADCRUMB,
+				  get_children_join_addr(ce->parent,
+							 ce->guc_child_index),
+				  0);
+
+	/* Wait on parent for go */
+	*cs++ = (MI_SEMAPHORE_WAIT |
+		 MI_SEMAPHORE_GLOBAL_GTT |
+		 MI_SEMAPHORE_POLL |
+		 MI_SEMAPHORE_SAD_EQ_SDD);
+	*cs++ = CHILD_GO_FINI_BREADCRUMB;
+	*cs++ = get_children_go_addr(ce->parent);
+	*cs++ = 0;
+
+	/* Emit fini breadcrumb */
+	cs = gen8_emit_ggtt_write(cs,
+				  rq->fence.seqno,
+				  i915_request_active_timeline(rq)->hwsp_offset,
+				  0);
+
+	/* User interrupt */
+	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
+
+	rq->tail = intel_ring_offset(rq, cs);
+
+	return cs;
+}
+
 static struct intel_context *
 g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 {
@@ -3807,6 +4075,19 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 			drm_printf(p, "\t\tWQI Status: %u\n\n",
 				   READ_ONCE(desc->wq_status));
 
+			drm_printf(p, "\t\tNumber Children: %u\n\n",
+				   ce->guc_number_children);
+			if (ce->engine->emit_bb_start ==
+			    emit_bb_start_parent_no_preempt_mid_batch) {
+				u8 i;
+
+				drm_printf(p, "\t\tChildren Go: %u\n\n",
+					   get_children_go_value(ce));
+				for (i = 0; i < ce->guc_number_children; ++i)
+					drm_printf(p, "\t\tChildren Join: %u\n",
+						   get_children_join_value(ce, i));
+			}
+
 			for_each_child(ce, child)
 				guc_log_context(p, child);
 		}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 24/27] drm/i915: Multi-BB execbuf
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Allow multiple batch buffers to be submitted in a single execbuf IOCTL
after a context has been configured with the 'set_parallel' extension.
The number of batches is implicit, based on the context's configuration.

This is implemented with a series of loops. First a loop is used to find
all the batches, a loop to pin all the HW contexts, a loop to generate
all the requests, a loop to submit all the requests, a loop to commit
all the requests, and finally a loop to tie the requests to the VMAs
they touch.

A composite fence is also created over the generated requests, both to
return to the user and to stick in the dma-resv slots.
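
For illustration, the userspace side of such a submission on a slot
configured with width = 2 could look as below (a sketch only; data_bo,
batch_bo, ctx_id, and fd are assumed to exist, error handling omitted).
The kernel infers the number of batches from the slot configuration; the
batches are the last N objects unless I915_EXEC_BATCH_FIRST is set:

	struct drm_i915_gem_exec_object2 objs[3] = {
		{ .handle = data_bo },     /* shared buffer, not a batch */
		{ .handle = batch_bo[0] }, /* BB for hw context 0 */
		{ .handle = batch_bo[1] }, /* BB for hw context 1 */
	};
	struct drm_i915_gem_execbuffer2 execbuf = {
		.buffers_ptr = (uintptr_t)objs,
		.buffer_count = 3,
		.flags = 0,	 /* engine map slot 0, the parallel slot */
		.rsvd1 = ctx_id, /* i915_execbuffer2_set_context_id() */
	};

	ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);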

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: link to come

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 765 ++++++++++++------
 drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  12 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
 drivers/gpu/drm/i915/i915_request.h           |   9 +
 drivers/gpu/drm/i915/i915_vma.c               |  21 +-
 drivers/gpu/drm/i915/i915_vma.h               |  13 +-
 7 files changed, 573 insertions(+), 257 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 8290bdadd167..481978974627 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -244,17 +244,23 @@ struct i915_execbuffer {
 	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
 	struct eb_vma *vma;
 
-	struct intel_engine_cs *engine; /** engine to queue the request to */
+	struct intel_gt *gt; /* gt for the execbuf */
 	struct intel_context *context; /* logical state for the request */
 	struct i915_gem_context *gem_context; /** caller's context */
 
-	struct i915_request *request; /** our request to build */
-	struct eb_vma *batch; /** identity of the batch obj/vma */
+	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1]; /** our requests to build */
+	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1]; /** identity of the batch obj/vma */
 	struct i915_vma *trampoline; /** trampoline used for chaining */
 
+	/** used for excl fence in dma_resv objects when > 1 BB submitted */
+	struct dma_fence *composite_fence;
+
 	/** actual size of execobj[] as we may extend it for the cmdparser */
 	unsigned int buffer_count;
 
+	/* number of batches in execbuf IOCTL */
+	unsigned int num_batches;
+
 	/** list of vma not yet bound during reservation phase */
 	struct list_head unbound;
 
@@ -281,7 +287,7 @@ struct i915_execbuffer {
 
 	u64 invalid_flags; /** Set of execobj.flags that are invalid */
 
-	u64 batch_len; /** Length of batch within object */
+	u64 batch_len[MAX_ENGINE_INSTANCE + 1]; /** Length of batch within object */
 	u32 batch_start_offset; /** Location within object of batch */
 	u32 batch_flags; /** Flags composed for emit_bb_start() */
 	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
@@ -299,14 +305,13 @@ struct i915_execbuffer {
 };
 
 static int eb_parse(struct i915_execbuffer *eb);
-static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
-					  bool throttle);
+static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
 static void eb_unpin_engine(struct i915_execbuffer *eb);
 
 static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
 {
-	return intel_engine_requires_cmd_parser(eb->engine) ||
-		(intel_engine_using_cmd_parser(eb->engine) &&
+	return intel_engine_requires_cmd_parser(eb->context->engine) ||
+		(intel_engine_using_cmd_parser(eb->context->engine) &&
 		 eb->args->batch_len);
 }
 
@@ -544,11 +549,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
 	return 0;
 }
 
-static void
+static inline bool
+is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
+{
+	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
+		buffer_idx < eb->num_batches :
+		buffer_idx >= eb->args->buffer_count - eb->num_batches;
+}
+
+static int
 eb_add_vma(struct i915_execbuffer *eb,
-	   unsigned int i, unsigned batch_idx,
+	   unsigned int *current_batch,
+	   unsigned int i,
 	   struct i915_vma *vma)
 {
+	struct drm_i915_private *i915 = eb->i915;
 	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
 	struct eb_vma *ev = &eb->vma[i];
 
@@ -575,15 +590,41 @@ eb_add_vma(struct i915_execbuffer *eb,
 	 * Note that actual hangs have only been observed on gen7, but for
 	 * paranoia do it everywhere.
 	 */
-	if (i == batch_idx) {
+	if (is_batch_buffer(eb, i)) {
 		if (entry->relocation_count &&
 		    !(ev->flags & EXEC_OBJECT_PINNED))
 			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
 		if (eb->reloc_cache.has_fence)
 			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
 
-		eb->batch = ev;
+		eb->batches[*current_batch] = ev;
+
+		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
+			drm_dbg(&i915->drm,
+				"Attempting to use self-modifying batch buffer\n");
+			return -EINVAL;
+		}
+
+		if (range_overflows_t(u64,
+				      eb->batch_start_offset,
+				      eb->args->batch_len,
+				      ev->vma->size)) {
+			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
+			return -EINVAL;
+		}
+
+		if (eb->args->batch_len == 0)
+			eb->batch_len[*current_batch] = ev->vma->size -
+				eb->batch_start_offset;
+		else
+			eb->batch_len[*current_batch] = eb->args->batch_len;
+		if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
+			drm_dbg(&i915->drm, "Invalid batch length\n");
+			return -EINVAL;
+		}
+
+		++*current_batch;
 	}
+
+	return 0;
 }
 
 static inline int use_cpu_reloc(const struct reloc_cache *cache,
@@ -727,14 +768,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 	} while (1);
 }
 
-static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
-{
-	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
-		return 0;
-	else
-		return eb->buffer_count - 1;
-}
-
 static int eb_select_context(struct i915_execbuffer *eb)
 {
 	struct i915_gem_context *ctx;
@@ -843,9 +876,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
 
 static int eb_lookup_vmas(struct i915_execbuffer *eb)
 {
-	struct drm_i915_private *i915 = eb->i915;
-	unsigned int batch = eb_batch_index(eb);
-	unsigned int i;
+	unsigned int i, current_batch = 0;
 	int err = 0;
 
 	INIT_LIST_HEAD(&eb->relocs);
@@ -865,7 +896,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 			goto err;
 		}
 
-		eb_add_vma(eb, i, batch, vma);
+		err = eb_add_vma(eb, &current_batch, i, vma);
+		if (err)
+			return err;
 
 		if (i915_gem_object_is_userptr(vma->obj)) {
 			err = i915_gem_object_userptr_submit_init(vma->obj);
@@ -888,26 +921,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 		}
 	}
 
-	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
-		drm_dbg(&i915->drm,
-			"Attempting to use self-modifying batch buffer\n");
-		return -EINVAL;
-	}
-
-	if (range_overflows_t(u64,
-			      eb->batch_start_offset, eb->batch_len,
-			      eb->batch->vma->size)) {
-		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
-		return -EINVAL;
-	}
-
-	if (eb->batch_len == 0)
-		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
-	if (unlikely(eb->batch_len == 0)) { /* impossible! */
-		drm_dbg(&i915->drm, "Invalid batch length\n");
-		return -EINVAL;
-	}
-
 	return 0;
 
 err:
@@ -1640,8 +1653,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
 	return 0;
 }
 
-static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
-					   struct i915_request *rq)
+static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
 {
 	bool have_copy = false;
 	struct eb_vma *ev;
@@ -1657,21 +1669,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	eb_release_vmas(eb, false);
 	i915_gem_ww_ctx_fini(&eb->ww);
 
-	if (rq) {
-		/* nonblocking is always false */
-		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
-				      MAX_SCHEDULE_TIMEOUT) < 0) {
-			i915_request_put(rq);
-			rq = NULL;
-
-			err = -EINTR;
-			goto err_relock;
-		}
-
-		i915_request_put(rq);
-		rq = NULL;
-	}
-
 	/*
 	 * We take 3 passes through the slowpatch.
 	 *
@@ -1698,28 +1695,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	if (!err)
 		err = eb_reinit_userptr(eb);
 
-err_relock:
 	i915_gem_ww_ctx_init(&eb->ww, true);
 	if (err)
 		goto out;
 
 	/* reacquire the objects */
 repeat_validate:
-	rq = eb_pin_engine(eb, false);
-	if (IS_ERR(rq)) {
-		err = PTR_ERR(rq);
-		rq = NULL;
+	err = eb_pin_engine(eb, false);
+	if (err)
 		goto err;
-	}
-
-	/* We didn't throttle, should be NULL */
-	GEM_WARN_ON(rq);
 
 	err = eb_validate_vmas(eb);
 	if (err)
 		goto err;
 
-	GEM_BUG_ON(!eb->batch);
+	GEM_BUG_ON(!eb->batches[0]);
 
 	list_for_each_entry(ev, &eb->relocs, reloc_link) {
 		if (!have_copy) {
@@ -1783,46 +1773,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 		}
 	}
 
-	if (rq)
-		i915_request_put(rq);
-
 	return err;
 }
 
 static int eb_relocate_parse(struct i915_execbuffer *eb)
 {
 	int err;
-	struct i915_request *rq = NULL;
 	bool throttle = true;
 
 retry:
-	rq = eb_pin_engine(eb, throttle);
-	if (IS_ERR(rq)) {
-		err = PTR_ERR(rq);
-		rq = NULL;
+	err = eb_pin_engine(eb, throttle);
+	if (err) {
 		if (err != -EDEADLK)
 			return err;
 
 		goto err;
 	}
 
-	if (rq) {
-		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
-
-		/* Need to drop all locks now for throttling, take slowpath */
-		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
-		if (err == -ETIME) {
-			if (nonblock) {
-				err = -EWOULDBLOCK;
-				i915_request_put(rq);
-				goto err;
-			}
-			goto slow;
-		}
-		i915_request_put(rq);
-		rq = NULL;
-	}
-
 	/* only throttle once, even if we didn't need to throttle */
 	throttle = false;
 
@@ -1862,7 +1829,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 
 slow:
-	err = eb_relocate_parse_slow(eb, rq);
+	err = eb_relocate_parse_slow(eb);
 	if (err)
 		/*
 		 * If the user expects the execobject.offset and
@@ -1876,11 +1843,31 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
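+/*
+ * "Create order" walks the requests parent first, children after; "add
+ * order" is the reverse, matching the order in which the per-timeline
+ * mutexes taken at request creation are dropped.
+ */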
+#define for_each_batch_create_order(_eb, _i) \
+	for (_i = 0; _i < _eb->num_batches; ++_i)
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = _eb->num_batches - 1; _i >= 0; --_i)
+
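+/*
+ * The request committed first is the last one in create order; external
+ * waits only need to be applied to it, as the remaining requests are
+ * ordered behind it with submit fences (see the submit-fence patch
+ * earlier in this series).
+ */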
+static struct i915_request *
+eb_find_first_request(struct i915_execbuffer *eb)
+{
+	int i;
+
+	for_each_batch_add_order(eb, i)
+		if (eb->requests[i])
+			return eb->requests[i];
+
+	GEM_BUG_ON("Request not found");
+
+	return NULL;
+}
+
 static int eb_move_to_gpu(struct i915_execbuffer *eb)
 {
 	const unsigned int count = eb->buffer_count;
 	unsigned int i = count;
-	int err = 0;
+	int err = 0, j;
 
 	while (i--) {
 		struct eb_vma *ev = &eb->vma[i];
@@ -1893,11 +1880,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		if (flags & EXEC_OBJECT_CAPTURE) {
 			struct i915_capture_list *capture;
 
-			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
-			if (capture) {
-				capture->next = eb->request->capture_list;
-				capture->vma = vma;
-				eb->request->capture_list = capture;
+			for_each_batch_create_order(eb, j) {
+				if (!eb->requests[j])
+					break;
+
+				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
+				if (capture) {
+					capture->next =
+						eb->requests[j]->capture_list;
+					capture->vma = vma;
+					eb->requests[j]->capture_list = capture;
+				}
 			}
 		}
 
@@ -1918,14 +1911,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 				flags &= ~EXEC_OBJECT_ASYNC;
 		}
 
+		/* We only need to await on the first request */
 		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
 			err = i915_request_await_object
-				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
+				(eb_find_first_request(eb), obj,
+				 flags & EXEC_OBJECT_WRITE);
 		}
 
-		if (err == 0)
-			err = i915_vma_move_to_active(vma, eb->request,
-						      flags | __EXEC_OBJECT_NO_RESERVE);
+		for_each_batch_add_order(eb, j) {
+			if (err)
+				break;
+			if (!eb->requests[j])
+				continue;
+
+			err = _i915_vma_move_to_active(vma, eb->requests[j],
+						       j ? NULL :
+						       eb->composite_fence ?
+						       eb->composite_fence :
+						       &eb->requests[j]->fence,
+						       flags | __EXEC_OBJECT_NO_RESERVE);
+		}
 	}
 
 #ifdef CONFIG_MMU_NOTIFIER
@@ -1956,11 +1961,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		goto err_skip;
 
 	/* Unconditionally flush any chipset caches (for streaming writes). */
-	intel_gt_chipset_flush(eb->engine->gt);
+	intel_gt_chipset_flush(eb->gt);
 	return 0;
 
 err_skip:
-	i915_request_set_error_once(eb->request, err);
+	for_each_batch_create_order(eb, j) {
+		if (!eb->requests[j])
+			break;
+
+		i915_request_set_error_once(eb->requests[j], err);
+	}
 	return err;
 }
 
@@ -2055,14 +2065,16 @@ static int eb_parse(struct i915_execbuffer *eb)
 	int err;
 
 	if (!eb_use_cmdparser(eb)) {
-		batch = eb_dispatch_secure(eb, eb->batch->vma);
+		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
 		if (IS_ERR(batch))
 			return PTR_ERR(batch);
 
 		goto secure_batch;
 	}
 
-	len = eb->batch_len;
+	GEM_BUG_ON(intel_context_is_parallel(eb->context));
+
+	len = eb->batch_len[0];
 	if (!CMDPARSER_USES_GGTT(eb->i915)) {
 		/*
 		 * ppGTT backed shadow buffers must be mapped RO, to prevent
@@ -2076,11 +2088,11 @@ static int eb_parse(struct i915_execbuffer *eb)
 	} else {
 		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
 	}
-	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
+	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
 		return -EINVAL;
 
 	if (!pool) {
-		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
+		pool = intel_gt_get_buffer_pool(eb->gt, len,
 						I915_MAP_WB);
 		if (IS_ERR(pool))
 			return PTR_ERR(pool);
@@ -2105,7 +2117,7 @@ static int eb_parse(struct i915_execbuffer *eb)
 		trampoline = shadow;
 
 		shadow = shadow_batch_pin(eb, pool->obj,
-					  &eb->engine->gt->ggtt->vm,
+					  &eb->gt->ggtt->vm,
 					  PIN_GLOBAL);
 		if (IS_ERR(shadow)) {
 			err = PTR_ERR(shadow);
@@ -2127,26 +2139,27 @@ static int eb_parse(struct i915_execbuffer *eb)
 	if (err)
 		goto err_trampoline;
 
-	err = intel_engine_cmd_parser(eb->engine,
-				      eb->batch->vma,
+	err = intel_engine_cmd_parser(eb->context->engine,
+				      eb->batches[0]->vma,
 				      eb->batch_start_offset,
-				      eb->batch_len,
+				      eb->batch_len[0],
 				      shadow, trampoline);
 	if (err)
 		goto err_unpin_batch;
 
-	eb->batch = &eb->vma[eb->buffer_count++];
-	eb->batch->vma = i915_vma_get(shadow);
-	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
+	eb->batches[0] = &eb->vma[eb->buffer_count++];
+	eb->batches[0]->vma = i915_vma_get(shadow);
+	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
 
 	eb->trampoline = trampoline;
 	eb->batch_start_offset = 0;
 
 secure_batch:
 	if (batch) {
-		eb->batch = &eb->vma[eb->buffer_count++];
-		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
-		eb->batch->vma = i915_vma_get(batch);
+		GEM_BUG_ON(intel_context_is_parallel(eb->context));
+		eb->batches[0] = &eb->vma[eb->buffer_count++];
+		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
+		eb->batches[0]->vma = i915_vma_get(batch);
 	}
 	return 0;
 
@@ -2162,19 +2175,18 @@ static int eb_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
-static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
+static int eb_request_submit(struct i915_execbuffer *eb,
+			     struct i915_request *rq,
+			     struct i915_vma *batch,
+			     u64 batch_len)
 {
 	int err;
 
-	if (intel_context_nopreempt(eb->context))
-		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
-
-	err = eb_move_to_gpu(eb);
-	if (err)
-		return err;
+	if (intel_context_nopreempt(rq->context))
+		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
 
 	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
-		err = i915_reset_gen7_sol_offsets(eb->request);
+		err = i915_reset_gen7_sol_offsets(rq);
 		if (err)
 			return err;
 	}
@@ -2185,26 +2197,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	 * allows us to determine if the batch is still waiting on the GPU
 	 * or actually running by checking the breadcrumb.
 	 */
-	if (eb->engine->emit_init_breadcrumb) {
-		err = eb->engine->emit_init_breadcrumb(eb->request);
+	if (rq->context->engine->emit_init_breadcrumb) {
+		err = rq->context->engine->emit_init_breadcrumb(rq);
 		if (err)
 			return err;
 	}
 
-	err = eb->engine->emit_bb_start(eb->request,
-					batch->node.start +
-					eb->batch_start_offset,
-					eb->batch_len,
-					eb->batch_flags);
+	err = rq->context->engine->emit_bb_start(rq,
+						 batch->node.start +
+						 eb->batch_start_offset,
+						 batch_len,
+						 eb->batch_flags);
 	if (err)
 		return err;
 
 	if (eb->trampoline) {
+		GEM_BUG_ON(intel_context_is_parallel(rq->context));
 		GEM_BUG_ON(eb->batch_start_offset);
-		err = eb->engine->emit_bb_start(eb->request,
-						eb->trampoline->node.start +
-						eb->batch_len,
-						0, 0);
+		err = rq->context->engine->emit_bb_start(rq,
+							 eb->trampoline->node.start +
+							 batch_len, 0, 0);
 		if (err)
 			return err;
 	}
@@ -2212,6 +2224,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	return 0;
 }
 
+static int eb_submit(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+	int err;
+
+	err = eb_move_to_gpu(eb);
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
+		if (!err)
+			err = eb_request_submit(eb, eb->requests[i],
+						eb->batches[i]->vma,
+						eb->batch_len[i]);
+	}
+
+	return err;
+}
+
 static int num_vcs_engines(const struct drm_i915_private *i915)
 {
 	return hweight_long(VDBOX_MASK(&i915->gt));
@@ -2277,26 +2310,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
 	return i915_request_get(rq);
 }
 
-static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
+static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
+			   bool throttle)
 {
-	struct intel_context *ce = eb->context;
 	struct intel_timeline *tl;
-	struct i915_request *rq = NULL;
-	int err;
-
-	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
-
-	if (unlikely(intel_context_is_banned(ce)))
-		return ERR_PTR(-EIO);
-
-	/*
-	 * Pinning the contexts may generate requests in order to acquire
-	 * GGTT space, so do this first before we reserve a seqno for
-	 * ourselves.
-	 */
-	err = intel_context_pin_ww(ce, &eb->ww);
-	if (err)
-		return ERR_PTR(err);
+	struct i915_request *rq = NULL;
 
 	/*
 	 * Take a local wakeref for preparing to dispatch the execbuf as
@@ -2307,33 +2325,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
 	 * taken on the engine, and the parent device.
 	 */
 	tl = intel_context_timeline_lock(ce);
-	if (IS_ERR(tl)) {
-		intel_context_unpin(ce);
-		return ERR_CAST(tl);
-	}
+	if (IS_ERR(tl))
+		return PTR_ERR(tl);
 
 	intel_context_enter(ce);
 	if (throttle)
 		rq = eb_throttle(eb, ce);
 	intel_context_timeline_unlock(tl);
 
+	if (rq) {
+		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
+		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
+
+		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
+				      timeout) < 0) {
+			i915_request_put(rq);
+
+			tl = intel_context_timeline_lock(ce);
+			intel_context_exit(ce);
+			intel_context_timeline_unlock(tl);
+
+			if (nonblock)
+				return -EWOULDBLOCK;
+			else
+				return -EINTR;
+		}
+		i915_request_put(rq);
+	}
+
+	return 0;
+}
+
+static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
+{
+	struct intel_context *ce = eb->context, *child;
+	int err;
+	int i = 0, j = 0;
+
+	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
+
+	if (unlikely(intel_context_is_banned(ce)))
+		return -EIO;
+
+	/*
+	 * Pinning the contexts may generate requests in order to acquire
+	 * GGTT space, so do this first before we reserve a seqno for
+	 * ourselves.
+	 */
+	err = intel_context_pin_ww(ce, &eb->ww);
+	if (err)
+		return err;
+	for_each_child(ce, child) {
+		err = intel_context_pin_ww(child, &eb->ww);
+		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
+	}
+
+	for_each_child(ce, child) {
+		err = eb_pin_timeline(eb, child, throttle);
+		if (err)
+			goto unwind;
+		++i;
+	}
+	err = eb_pin_timeline(eb, ce, throttle);
+	if (err)
+		goto unwind;
+
 	eb->args->flags |= __EXEC_ENGINE_PINNED;
-	return rq;
+	return 0;
+
+unwind:
+	for_each_child(ce, child) {
+		if (j++ < i) {
+			mutex_lock(&child->timeline->mutex);
+			intel_context_exit(child);
+			mutex_unlock(&child->timeline->mutex);
+		}
+	}
+	for_each_child(ce, child)
+		intel_context_unpin(child);
+	intel_context_unpin(ce);
+	return err;
 }
 
 static void eb_unpin_engine(struct i915_execbuffer *eb)
 {
-	struct intel_context *ce = eb->context;
-	struct intel_timeline *tl = ce->timeline;
+	struct intel_context *ce = eb->context, *child;
 
 	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
 		return;
 
 	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
 
-	mutex_lock(&tl->mutex);
+	for_each_child(ce, child) {
+		mutex_lock(&child->timeline->mutex);
+		intel_context_exit(child);
+		mutex_unlock(&child->timeline->mutex);
+
+		intel_context_unpin(child);
+	}
+
+	mutex_lock(&ce->timeline->mutex);
 	intel_context_exit(ce);
-	mutex_unlock(&tl->mutex);
+	mutex_unlock(&ce->timeline->mutex);
 
 	intel_context_unpin(ce);
 }
@@ -2384,7 +2477,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
 static int
 eb_select_engine(struct i915_execbuffer *eb)
 {
-	struct intel_context *ce;
+	struct intel_context *ce, *child;
 	unsigned int idx;
 	int err;
 
@@ -2397,6 +2490,20 @@ eb_select_engine(struct i915_execbuffer *eb)
 	if (IS_ERR(ce))
 		return PTR_ERR(ce);
 
+	if (intel_context_is_parallel(ce)) {
+		if (eb->buffer_count < ce->guc_number_children + 1) {
+			intel_context_put(ce);
+			return -EINVAL;
+		}
+		if (eb->batch_start_offset || eb->args->batch_len) {
+			intel_context_put(ce);
+			return -EINVAL;
+		}
+	}
+	eb->num_batches = ce->guc_number_children + 1;
+
+	for_each_child(ce, child)
+		intel_context_get(child);
 	intel_gt_pm_get(ce->engine->gt);
 
 	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
@@ -2404,6 +2511,13 @@ eb_select_engine(struct i915_execbuffer *eb)
 		if (err)
 			goto err;
 	}
+	for_each_child(ce, child) {
+		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
+			err = intel_context_alloc_state(child);
+			if (err)
+				goto err;
+		}
+	}
 
 	/*
 	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
@@ -2414,7 +2528,7 @@ eb_select_engine(struct i915_execbuffer *eb)
 		goto err;
 
 	eb->context = ce;
-	eb->engine = ce->engine;
+	eb->gt = ce->engine->gt;
 
 	/*
 	 * Make sure engine pool stays alive even if we call intel_context_put
@@ -2425,6 +2539,8 @@ eb_select_engine(struct i915_execbuffer *eb)
 
 err:
 	intel_gt_pm_put(ce->engine->gt);
+	for_each_child(ce, child)
+		intel_context_put(child);
 	intel_context_put(ce);
 	return err;
 }
@@ -2432,7 +2548,11 @@ eb_select_engine(struct i915_execbuffer *eb)
 static void
 eb_put_engine(struct i915_execbuffer *eb)
 {
-	intel_gt_pm_put(eb->engine->gt);
+	struct intel_context *child;
+
+	intel_gt_pm_put(eb->gt);
+	for_each_child(eb->context, child)
+		intel_context_put(child);
 	intel_context_put(eb->context);
 }
 
@@ -2655,7 +2775,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
 }
 
 static int
-await_fence_array(struct i915_execbuffer *eb)
+await_fence_array(struct i915_execbuffer *eb,
+		  struct i915_request *rq)
 {
 	unsigned int n;
 	int err;
@@ -2669,8 +2790,7 @@ await_fence_array(struct i915_execbuffer *eb)
 		if (!eb->fences[n].dma_fence)
 			continue;
 
-		err = i915_request_await_dma_fence(eb->request,
-						   eb->fences[n].dma_fence);
+		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
 		if (err < 0)
 			return err;
 	}
@@ -2678,9 +2798,9 @@ await_fence_array(struct i915_execbuffer *eb)
 	return 0;
 }
 
-static void signal_fence_array(const struct i915_execbuffer *eb)
+static void signal_fence_array(const struct i915_execbuffer *eb,
+			       struct dma_fence * const fence)
 {
-	struct dma_fence * const fence = &eb->request->fence;
 	unsigned int n;
 
 	for (n = 0; n < eb->num_fences; n++) {
@@ -2728,12 +2848,12 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
 			break;
 }
 
-static int eb_request_add(struct i915_execbuffer *eb, int err)
+static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
 {
-	struct i915_request *rq = eb->request;
 	struct intel_timeline * const tl = i915_request_timeline(rq);
 	struct i915_sched_attr attr = {};
 	struct i915_request *prev;
+	int err = 0;
 
 	lockdep_assert_held(&tl->mutex);
 	lockdep_unpin_lock(&tl->mutex, rq->cookie);
@@ -2745,11 +2865,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
 	/* Check that the context wasn't destroyed before submission */
 	if (likely(!intel_context_is_closed(eb->context))) {
 		attr = eb->gem_context->sched;
-	} else {
-		/* Serialise with context_close via the add_to_timeline */
-		i915_request_set_error_once(rq, -ENOENT);
-		__i915_request_skip(rq);
-		err = -ENOENT; /* override any transient errors */
 	}
 
 	__i915_request_queue(rq, &attr);
@@ -2763,6 +2878,44 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
 	return err;
 }
 
+static int eb_requests_add(struct i915_execbuffer *eb, int err)
+{
+	int i;
+
+	/*
+	 * We iterate in reverse order of creation so that the timeline
+	 * mutexes are released in the reverse order they were taken.
+	 */
+	for_each_batch_add_order(eb, i) {
+		struct i915_request *rq = eb->requests[i];
+
+		if (!rq)
+			continue;
+
+		if (unlikely(intel_context_is_closed(eb->context))) {
+			/* Serialise with context_close via the add_to_timeline */
+			i915_request_set_error_once(rq, -ENOENT);
+			__i915_request_skip(rq);
+			err = -ENOENT; /* override any transient errors */
+		}
+
+		if (intel_context_is_parallel(eb->context)) {
+			if (err) {
+				__i915_request_skip(rq);
+				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
+					&rq->fence.flags);
+			}
+			if (i == 0)
+				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
+					&rq->fence.flags);
+		}
+
+		err |= eb_request_add(eb, rq);
+	}
+
+	return err;
+}
+
 static const i915_user_extension_fn execbuf_extensions[] = {
 	[DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES] = parse_timeline_fences,
 };
@@ -2789,6 +2942,166 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
 				    eb);
 }
 
+static void eb_requests_get(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		i915_request_get(eb->requests[i]);
+	}
+}
+
+static void eb_requests_put(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		i915_request_put(eb->requests[i]);
+	}
+}
+
+static struct sync_file *
+eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	struct dma_fence_array *fence_array;
+	struct dma_fence **fences;
+	unsigned int i;
+
+	GEM_BUG_ON(!intel_context_is_parent(eb->context));
+
+	fences = kmalloc(eb->num_batches * sizeof(*fences), GFP_KERNEL);
+	if (!fences)
+		return ERR_PTR(-ENOMEM);
+
+	for_each_batch_create_order(eb, i)
+		fences[i] = &eb->requests[i]->fence;
+
+	fence_array = dma_fence_array_create(eb->num_batches,
+					     fences,
+					     eb->context->fence_context,
+					     eb->context->seqno++,
+					     false);
+	if (!fence_array) {
+		kfree(fences);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* Move ownership to the dma_fence_array created above */
+	for_each_batch_create_order(eb, i)
+		dma_fence_get(fences[i]);
+
+	if (out_fence_fd != -1) {
+		out_fence = sync_file_create(&fence_array->base);
+		/* sync_file now owns fence_array, drop creation ref */
+		dma_fence_put(&fence_array->base);
+		if (!out_fence)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	eb->composite_fence = &fence_array->base;
+
+	return out_fence;
+}
+
+static struct intel_context *
+eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
+{
+	struct intel_context *child;
+
+	if (likely(context_number == 0))
+		return eb->context;
+
+	for_each_child(eb->context, child)
+		if (!--context_number)
+			return child;
+
+	GEM_BUG_ON("Context not found");
+
+	return NULL;
+}
+
+static struct sync_file *
+eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
+		   int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	unsigned int i;
+	int err;
+
+	for_each_batch_create_order(eb, i) {
+		bool first_request_to_add = i + 1 == eb->num_batches;
+
+		/* Allocate a request for this batch buffer nice and early. */
+		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
+		if (IS_ERR(eb->requests[i])) {
+			out_fence = ERR_CAST(eb->requests[i]);
+			eb->requests[i] = NULL;
+			return out_fence;
+		}
+
+		if (unlikely(eb->gem_context->syncobj &&
+			     first_request_to_add)) {
+			struct dma_fence *fence;
+
+			fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
+			err = i915_request_await_dma_fence(eb->requests[i], fence);
+			dma_fence_put(fence);
+			if (err)
+				return ERR_PTR(err);
+		}
+
+		if (in_fence && first_request_to_add) {
+			if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
+				err = i915_request_await_execution(eb->requests[i],
+								   in_fence);
+			else
+				err = i915_request_await_dma_fence(eb->requests[i],
+								   in_fence);
+			if (err < 0)
+				return ERR_PTR(err);
+		}
+
+		if (eb->fences && first_request_to_add) {
+			err = await_fence_array(eb, eb->requests[i]);
+			if (err)
+				return ERR_PTR(err);
+		}
+
+		if (first_request_to_add) {
+			if (intel_context_is_parallel(eb->context)) {
+				out_fence = eb_composite_fence_create(eb, out_fence_fd);
+				if (IS_ERR(out_fence))
+					return ERR_PTR(-ENOMEM);
+			} else if (out_fence_fd != -1) {
+				out_fence = sync_file_create(&eb->requests[i]->fence);
+				if (!out_fence)
+					return ERR_PTR(-ENOMEM);
+			}
+		}
+
+		/*
+		 * Whilst this request exists, batch_obj will be on the
+		 * active_list, and so will hold the active reference. Only when
+		 * this request is retired will the batch_obj be moved onto
+		 * the inactive_list and lose its active reference. Hence we do
+		 * not need to explicitly hold another reference here.
+		 */
+		eb->requests[i]->batch = eb->batches[i]->vma;
+		if (eb->batch_pool) {
+			GEM_BUG_ON(intel_context_is_parallel(eb->context));
+			intel_gt_buffer_pool_mark_active(eb->batch_pool,
+							 eb->requests[i]);
+		}
+	}
+
+	return out_fence;
+}
+
 static int
 i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_file *file,
@@ -2799,7 +3112,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	struct i915_execbuffer eb;
 	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
-	struct i915_vma *batch;
 	int out_fence_fd = -1;
 	int err;
 
@@ -2823,12 +3135,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	eb.buffer_count = args->buffer_count;
 	eb.batch_start_offset = args->batch_start_offset;
-	eb.batch_len = args->batch_len;
 	eb.trampoline = NULL;
 
 	eb.fences = NULL;
 	eb.num_fences = 0;
 
+	memset(eb.requests, 0, sizeof(struct i915_request *) *
+	       ARRAY_SIZE(eb.requests));
+	eb.composite_fence = NULL;
+
 	eb.batch_flags = 0;
 	if (args->flags & I915_EXEC_SECURE) {
 		if (GRAPHICS_VER(i915) >= 11)
@@ -2912,70 +3227,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	ww_acquire_done(&eb.ww.ctx);
 
-	batch = eb.batch->vma;
-
-	/* Allocate a request for this batch buffer nice and early. */
-	eb.request = i915_request_create(eb.context);
-	if (IS_ERR(eb.request)) {
-		err = PTR_ERR(eb.request);
-		goto err_vma;
-	}
-
-	if (unlikely(eb.gem_context->syncobj)) {
-		struct dma_fence *fence;
-
-		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
-		err = i915_request_await_dma_fence(eb.request, fence);
-		dma_fence_put(fence);
-		if (err)
-			goto err_ext;
-	}
-
-	if (in_fence) {
-		if (args->flags & I915_EXEC_FENCE_SUBMIT)
-			err = i915_request_await_execution(eb.request,
-							   in_fence);
-		else
-			err = i915_request_await_dma_fence(eb.request,
-							   in_fence);
-		if (err < 0)
-			goto err_request;
-	}
-
-	if (eb.fences) {
-		err = await_fence_array(&eb);
-		if (err)
-			goto err_request;
-	}
-
-	if (out_fence_fd != -1) {
-		out_fence = sync_file_create(&eb.request->fence);
-		if (!out_fence) {
-			err = -ENOMEM;
+	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
+	if (IS_ERR(out_fence)) {
+		err = PTR_ERR(out_fence);
+		if (eb.requests[0])
 			goto err_request;
-		}
+		else
+			goto err_vma;
 	}
 
-	/*
-	 * Whilst this request exists, batch_obj will be on the
-	 * active_list, and so will hold the active reference. Only when this
-	 * request is retired will the the batch_obj be moved onto the
-	 * inactive_list and lose its active reference. Hence we do not need
-	 * to explicitly hold another reference here.
-	 */
-	eb.request->batch = batch;
-	if (eb.batch_pool)
-		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
-
-	trace_i915_request_queue(eb.request, eb.batch_flags);
-	err = eb_submit(&eb, batch);
+	err = eb_submit(&eb);
 
 err_request:
-	i915_request_get(eb.request);
-	err = eb_request_add(&eb, err);
+	eb_requests_get(&eb);
+	err = eb_requests_add(&eb, err);
 
 	if (eb.fences)
-		signal_fence_array(&eb);
+		signal_fence_array(&eb, eb.composite_fence ?
+				   eb.composite_fence :
+				   &eb.requests[0]->fence);
 
 	if (out_fence) {
 		if (err == 0) {
@@ -2990,10 +3260,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	if (unlikely(eb.gem_context->syncobj)) {
 		drm_syncobj_replace_fence(eb.gem_context->syncobj,
-					  &eb.request->fence);
+					  eb.composite_fence ?
+					  eb.composite_fence :
+					  &eb.requests[0]->fence);
 	}
 
-	i915_request_put(eb.request);
+	if (!out_fence && eb.composite_fence)
+		dma_fence_put(eb.composite_fence);
+
+	eb_requests_put(&eb);
 
 err_vma:
 	eb_release_vmas(&eb, true);
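
For reference, the composite fence built by eb_composite_fence_create()
above is a plain dma_fence_array. A minimal sketch of that pattern,
outside the execbuf code and with illustrative names (rqs, nfences,
fence_context and seqno are assumptions here, not i915 fields):

	struct dma_fence_array *array;
	struct dma_fence **fences;
	unsigned int i;

	fences = kmalloc_array(nfences, sizeof(*fences), GFP_KERNEL);
	if (!fences)
		return -ENOMEM;

	/* the array takes ownership of @fences and one ref per fence */
	for (i = 0; i < nfences; i++)
		fences[i] = dma_fence_get(&rqs[i]->fence);

	array = dma_fence_array_create(nfences, fences, fence_context,
				       seqno, false /* signal on all */);
	if (!array) {
		for (i = 0; i < nfences; i++)
			dma_fence_put(fences[i]);
		kfree(fences);
		return -ENOMEM;
	}

	/* &array->base is an ordinary struct dma_fence from here on */
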
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 9dcc1b14697b..1f6a5ae3e33e 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -237,7 +237,13 @@ intel_context_timeline_lock(struct intel_context *ce)
 	struct intel_timeline *tl = ce->timeline;
 	int err;
 
-	err = mutex_lock_interruptible(&tl->mutex);
+	if (intel_context_is_parent(ce))
+		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
+	else if (intel_context_is_child(ce))
+		err = mutex_lock_interruptible_nested(&tl->mutex,
+						      ce->guc_child_index + 1);
+	else
+		err = mutex_lock_interruptible(&tl->mutex);
 	if (err)
 		return ERR_PTR(err);
 
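The _nested variants above are purely lockdep annotation: parent and
child timeline mutexes share a single lock class, so holding several at
once requires distinct subclasses, taken in a fixed parent-then-children
order. A sketch with two same-class locks (tl_a and tl_b illustrative):

	mutex_lock_nested(&tl_a->mutex, 0);	/* parent, subclass 0 */
	mutex_lock_nested(&tl_b->mutex, 1);	/* child, subclass 1 */
	/* ... */
	mutex_unlock(&tl_b->mutex);
	mutex_unlock(&tl_a->mutex);
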
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 727f91e7f7c2..094fcfb5cbe1 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -261,6 +261,18 @@ struct intel_context {
 		 * context.
 		 */
 		struct i915_request *last_rq;
+
+		/**
+		 * @fence_context: fence context composite fence when doing
+		 * parallel submission
+		 */
+		u64 fence_context;
+
+		/**
+		 * @seqno: seqno for composite fence when doing parallel
+		 * submission
+		 */
+		u32 seqno;
 	};
 
 #ifdef CONFIG_DRM_I915_SELFTEST
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 1a18f99bf12a..2ef38557b0f0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3034,6 +3034,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
 		}
 	}
 
+	parent->fence_context = dma_fence_context_alloc(1);
+
 	parent->engine->emit_bb_start =
 		emit_bb_start_parent_no_preempt_mid_batch;
 	parent->engine->emit_fini_breadcrumb =
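
dma_fence_context_alloc() above reserves the fence context id under
which the parent's composite fences are published; each fence on that
context must then carry a monotonically increasing seqno. Sketch of the
pairing, using the fields added to intel_context_types.h earlier in this
patch:

	/* once, at parallel context creation */
	parent->fence_context = dma_fence_context_alloc(1);
	parent->seqno = 0;

	/* per execbuf, when building the composite dma_fence_array */
	u32 seqno = parent->seqno++;
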
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 8f0073e19079..602cc246ba85 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -147,6 +147,15 @@ enum {
 	 * tail.
 	 */
 	I915_FENCE_FLAG_SUBMIT_PARALLEL,
+
+	/*
+	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
+	 * parent-child relationship (parallel submission, multi-lrc) that
+	 * hit an error while generating requests in the execbuf IOCTL.
+	 * Indicates this request should be skipped as another request in
+	 * submission / relationship encountered an error.
+	 */
+	I915_FENCE_FLAG_SKIP_PARALLEL,
 };
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 4b7fc4647e46..90546fa58fc1 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
 	return i915_active_add_request(&vma->active, rq);
 }
 
-int i915_vma_move_to_active(struct i915_vma *vma,
-			    struct i915_request *rq,
-			    unsigned int flags)
+int _i915_vma_move_to_active(struct i915_vma *vma,
+			     struct i915_request *rq,
+			     struct dma_fence *fence,
+			     unsigned int flags)
 {
 	struct drm_i915_gem_object *obj = vma->obj;
 	int err;
@@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 			intel_frontbuffer_put(front);
 		}
 
-		dma_resv_add_excl_fence(vma->resv, &rq->fence);
-		obj->write_domain = I915_GEM_DOMAIN_RENDER;
-		obj->read_domains = 0;
+		if (fence) {
+			dma_resv_add_excl_fence(vma->resv, fence);
+			obj->write_domain = I915_GEM_DOMAIN_RENDER;
+			obj->read_domains = 0;
+		}
 	} else {
 		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
 			err = dma_resv_reserve_shared(vma->resv, 1);
@@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 				return err;
 		}
 
-		dma_resv_add_shared_fence(vma->resv, &rq->fence);
-		obj->write_domain = 0;
+		if (fence) {
+			dma_resv_add_shared_fence(vma->resv, fence);
+			obj->write_domain = 0;
+		}
 	}
 
 	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
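
The NULL-fence case above exists because a dma_resv object has a single
exclusive (write) slot: with several requests touching one object, only
the composite fence may occupy it, so child requests skip the dma_resv
bookkeeping entirely. For orientation, the two slot types are used
roughly like this (sketch against the dma_resv API of this era):

	/* writer: one exclusive fence supersedes all shared fences */
	dma_resv_add_excl_fence(obj->base.resv, fence);

	/* readers: shared slots must be reserved before adding */
	err = dma_resv_reserve_shared(obj->base.resv, 1);
	if (err)
		return err;
	dma_resv_add_shared_fence(obj->base.resv, fence);
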
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index ed69f66c7ab0..648dbe744c96 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
 
 int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
 					   struct i915_request *rq);
-int __must_check i915_vma_move_to_active(struct i915_vma *vma,
-					 struct i915_request *rq,
-					 unsigned int flags);
+int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
+					  struct i915_request *rq,
+					  struct dma_fence *fence,
+					  unsigned int flags);
+static inline int __must_check
+i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
+			unsigned int flags)
+{
+	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
+}
 
 #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
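
Existing callers are unaffected by the new fence argument: the inline
wrapper preserves the old behaviour, so

	i915_vma_move_to_active(vma, rq, flags);

remains equivalent to

	_i915_vma_move_to_active(vma, rq, &rq->fence, flags);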
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [Intel-gfx] [PATCH 24/27] drm/i915: Multi-BB execbuf
@ 2021-08-20 22:44   ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Allow multiple batch buffers to be submitted in a single execbuf IOCTL
after a context has been configured with the 'set_parallel' extension.
The number batches is implicit based on the contexts configuration.

This is implemented with a series of loops. First a loop is used to find
all the batches, a loop to pin all the HW contexts, a loop to generate
all the requests, a loop to submit all the requests, a loop to commit
all the requests, and finally a loop to tie the requests to the VMAs
they touch.

A composite fence is also created for the also the generated requests to
return to the user and to stick in dma resv slots.

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: link to come

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 765 ++++++++++++------
 drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  12 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
 drivers/gpu/drm/i915/i915_request.h           |   9 +
 drivers/gpu/drm/i915/i915_vma.c               |  21 +-
 drivers/gpu/drm/i915/i915_vma.h               |  13 +-
 7 files changed, 573 insertions(+), 257 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 8290bdadd167..481978974627 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -244,17 +244,23 @@ struct i915_execbuffer {
 	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
 	struct eb_vma *vma;
 
-	struct intel_engine_cs *engine; /** engine to queue the request to */
+	struct intel_gt *gt; /* gt for the execbuf */
 	struct intel_context *context; /* logical state for the request */
 	struct i915_gem_context *gem_context; /** caller's context */
 
-	struct i915_request *request; /** our request to build */
-	struct eb_vma *batch; /** identity of the batch obj/vma */
+	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1]; /** our requests to build */
+	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1]; /** identity of the batch obj/vma */
 	struct i915_vma *trampoline; /** trampoline used for chaining */
 
+	/** used for excl fence in dma_resv objects when > 1 BB submitted */
+	struct dma_fence *composite_fence;
+
 	/** actual size of execobj[] as we may extend it for the cmdparser */
 	unsigned int buffer_count;
 
+	/* number of batches in execbuf IOCTL */
+	unsigned int num_batches;
+
 	/** list of vma not yet bound during reservation phase */
 	struct list_head unbound;
 
@@ -281,7 +287,7 @@ struct i915_execbuffer {
 
 	u64 invalid_flags; /** Set of execobj.flags that are invalid */
 
-	u64 batch_len; /** Length of batch within object */
+	u64 batch_len[MAX_ENGINE_INSTANCE + 1]; /** Length of batch within object */
 	u32 batch_start_offset; /** Location within object of batch */
 	u32 batch_flags; /** Flags composed for emit_bb_start() */
 	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
@@ -299,14 +305,13 @@ struct i915_execbuffer {
 };
 
 static int eb_parse(struct i915_execbuffer *eb);
-static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
-					  bool throttle);
+static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
 static void eb_unpin_engine(struct i915_execbuffer *eb);
 
 static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
 {
-	return intel_engine_requires_cmd_parser(eb->engine) ||
-		(intel_engine_using_cmd_parser(eb->engine) &&
+	return intel_engine_requires_cmd_parser(eb->context->engine) ||
+		(intel_engine_using_cmd_parser(eb->context->engine) &&
 		 eb->args->batch_len);
 }
 
@@ -544,11 +549,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
 	return 0;
 }
 
-static void
+static inline bool
+is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
+{
+	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
+		buffer_idx < eb->num_batches :
+		buffer_idx >= eb->args->buffer_count - eb->num_batches;
+}
+
+static int
 eb_add_vma(struct i915_execbuffer *eb,
-	   unsigned int i, unsigned batch_idx,
+	   unsigned int *current_batch,
+	   unsigned int i,
 	   struct i915_vma *vma)
 {
+	struct drm_i915_private *i915 = eb->i915;
 	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
 	struct eb_vma *ev = &eb->vma[i];
 
@@ -575,15 +590,41 @@ eb_add_vma(struct i915_execbuffer *eb,
 	 * Note that actual hangs have only been observed on gen7, but for
 	 * paranoia do it everywhere.
 	 */
-	if (i == batch_idx) {
+	if (is_batch_buffer(eb, i)) {
 		if (entry->relocation_count &&
 		    !(ev->flags & EXEC_OBJECT_PINNED))
 			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
 		if (eb->reloc_cache.has_fence)
 			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
 
-		eb->batch = ev;
+		eb->batches[*current_batch] = ev;
+
+		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
+			drm_dbg(&i915->drm,
+				"Attempting to use self-modifying batch buffer\n");
+			return -EINVAL;
+		}
+
+		if (range_overflows_t(u64,
+				      eb->batch_start_offset,
+				      eb->args->batch_len,
+				      ev->vma->size)) {
+			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
+			return -EINVAL;
+		}
+
+		if (eb->args->batch_len == 0)
+			eb->batch_len[*current_batch] = ev->vma->size -
+				eb->batch_start_offset;
+		if (unlikely(eb->batch_len == 0)) { /* impossible! */
+			drm_dbg(&i915->drm, "Invalid batch length\n");
+			return -EINVAL;
+		}
+
+		++*current_batch;
 	}
+
+	return 0;
 }
 
 static inline int use_cpu_reloc(const struct reloc_cache *cache,
@@ -727,14 +768,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 	} while (1);
 }
 
-static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
-{
-	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
-		return 0;
-	else
-		return eb->buffer_count - 1;
-}
-
 static int eb_select_context(struct i915_execbuffer *eb)
 {
 	struct i915_gem_context *ctx;
@@ -843,9 +876,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
 
 static int eb_lookup_vmas(struct i915_execbuffer *eb)
 {
-	struct drm_i915_private *i915 = eb->i915;
-	unsigned int batch = eb_batch_index(eb);
-	unsigned int i;
+	unsigned int i, current_batch = 0;
 	int err = 0;
 
 	INIT_LIST_HEAD(&eb->relocs);
@@ -865,7 +896,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 			goto err;
 		}
 
-		eb_add_vma(eb, i, batch, vma);
+		err = eb_add_vma(eb, &current_batch, i, vma);
+		if (err)
+			return err;
 
 		if (i915_gem_object_is_userptr(vma->obj)) {
 			err = i915_gem_object_userptr_submit_init(vma->obj);
@@ -888,26 +921,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 		}
 	}
 
-	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
-		drm_dbg(&i915->drm,
-			"Attempting to use self-modifying batch buffer\n");
-		return -EINVAL;
-	}
-
-	if (range_overflows_t(u64,
-			      eb->batch_start_offset, eb->batch_len,
-			      eb->batch->vma->size)) {
-		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
-		return -EINVAL;
-	}
-
-	if (eb->batch_len == 0)
-		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
-	if (unlikely(eb->batch_len == 0)) { /* impossible! */
-		drm_dbg(&i915->drm, "Invalid batch length\n");
-		return -EINVAL;
-	}
-
 	return 0;
 
 err:
@@ -1640,8 +1653,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
 	return 0;
 }
 
-static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
-					   struct i915_request *rq)
+static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
 {
 	bool have_copy = false;
 	struct eb_vma *ev;
@@ -1657,21 +1669,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	eb_release_vmas(eb, false);
 	i915_gem_ww_ctx_fini(&eb->ww);
 
-	if (rq) {
-		/* nonblocking is always false */
-		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
-				      MAX_SCHEDULE_TIMEOUT) < 0) {
-			i915_request_put(rq);
-			rq = NULL;
-
-			err = -EINTR;
-			goto err_relock;
-		}
-
-		i915_request_put(rq);
-		rq = NULL;
-	}
-
 	/*
 	 * We take 3 passes through the slowpatch.
 	 *
@@ -1698,28 +1695,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	if (!err)
 		err = eb_reinit_userptr(eb);
 
-err_relock:
 	i915_gem_ww_ctx_init(&eb->ww, true);
 	if (err)
 		goto out;
 
 	/* reacquire the objects */
 repeat_validate:
-	rq = eb_pin_engine(eb, false);
-	if (IS_ERR(rq)) {
-		err = PTR_ERR(rq);
-		rq = NULL;
+	err = eb_pin_engine(eb, false);
+	if (err)
 		goto err;
-	}
-
-	/* We didn't throttle, should be NULL */
-	GEM_WARN_ON(rq);
 
 	err = eb_validate_vmas(eb);
 	if (err)
 		goto err;
 
-	GEM_BUG_ON(!eb->batch);
+	GEM_BUG_ON(!eb->batches[0]);
 
 	list_for_each_entry(ev, &eb->relocs, reloc_link) {
 		if (!have_copy) {
@@ -1783,46 +1773,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 		}
 	}
 
-	if (rq)
-		i915_request_put(rq);
-
 	return err;
 }
 
 static int eb_relocate_parse(struct i915_execbuffer *eb)
 {
 	int err;
-	struct i915_request *rq = NULL;
 	bool throttle = true;
 
 retry:
-	rq = eb_pin_engine(eb, throttle);
-	if (IS_ERR(rq)) {
-		err = PTR_ERR(rq);
-		rq = NULL;
+	err = eb_pin_engine(eb, throttle);
+	if (err) {
 		if (err != -EDEADLK)
 			return err;
 
 		goto err;
 	}
 
-	if (rq) {
-		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
-
-		/* Need to drop all locks now for throttling, take slowpath */
-		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
-		if (err == -ETIME) {
-			if (nonblock) {
-				err = -EWOULDBLOCK;
-				i915_request_put(rq);
-				goto err;
-			}
-			goto slow;
-		}
-		i915_request_put(rq);
-		rq = NULL;
-	}
-
 	/* only throttle once, even if we didn't need to throttle */
 	throttle = false;
 
@@ -1862,7 +1829,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 
 slow:
-	err = eb_relocate_parse_slow(eb, rq);
+	err = eb_relocate_parse_slow(eb);
 	if (err)
 		/*
 		 * If the user expects the execobject.offset and
@@ -1876,11 +1843,31 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
+#define for_each_batch_create_order(_eb, _i) \
+	for (_i = 0; _i < _eb->num_batches; ++_i)
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = _eb->num_batches - 1; _i >= 0; --_i)
+
+static struct i915_request *
+eb_find_first_request(struct i915_execbuffer *eb)
+{
+	int i;
+
+	for_each_batch_add_order(eb, i)
+		if (eb->requests[i])
+			return eb->requests[i];
+
+	GEM_BUG_ON("Request not found");
+
+	return NULL;
+}
+
 static int eb_move_to_gpu(struct i915_execbuffer *eb)
 {
 	const unsigned int count = eb->buffer_count;
 	unsigned int i = count;
-	int err = 0;
+	int err = 0, j;
 
 	while (i--) {
 		struct eb_vma *ev = &eb->vma[i];
@@ -1893,11 +1880,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		if (flags & EXEC_OBJECT_CAPTURE) {
 			struct i915_capture_list *capture;
 
-			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
-			if (capture) {
-				capture->next = eb->request->capture_list;
-				capture->vma = vma;
-				eb->request->capture_list = capture;
+			for_each_batch_create_order(eb, j) {
+				if (!eb->requests[j])
+					break;
+
+				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
+				if (capture) {
+					capture->next =
+						eb->requests[j]->capture_list;
+					capture->vma = vma;
+					eb->requests[j]->capture_list = capture;
+				}
 			}
 		}
 
@@ -1918,14 +1911,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 				flags &= ~EXEC_OBJECT_ASYNC;
 		}
 
+		/* We only need to await on the first request */
 		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
 			err = i915_request_await_object
-				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
+				(eb_find_first_request(eb), obj,
+				 flags & EXEC_OBJECT_WRITE);
 		}
 
-		if (err == 0)
-			err = i915_vma_move_to_active(vma, eb->request,
-						      flags | __EXEC_OBJECT_NO_RESERVE);
+		for_each_batch_add_order(eb, j) {
+			if (err)
+				break;
+			if (!eb->requests[j])
+				continue;
+
+			err = _i915_vma_move_to_active(vma, eb->requests[j],
+						       j ? NULL :
+						       eb->composite_fence ?
+						       eb->composite_fence :
+						       &eb->requests[j]->fence,
+						       flags | __EXEC_OBJECT_NO_RESERVE);
+		}
 	}
 
 #ifdef CONFIG_MMU_NOTIFIER
@@ -1956,11 +1961,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		goto err_skip;
 
 	/* Unconditionally flush any chipset caches (for streaming writes). */
-	intel_gt_chipset_flush(eb->engine->gt);
+	intel_gt_chipset_flush(eb->gt);
 	return 0;
 
 err_skip:
-	i915_request_set_error_once(eb->request, err);
+	for_each_batch_create_order(eb, j) {
+		if (!eb->requests[j])
+			break;
+
+		i915_request_set_error_once(eb->requests[j], err);
+	}
 	return err;
 }
 
@@ -2055,14 +2065,16 @@ static int eb_parse(struct i915_execbuffer *eb)
 	int err;
 
 	if (!eb_use_cmdparser(eb)) {
-		batch = eb_dispatch_secure(eb, eb->batch->vma);
+		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
 		if (IS_ERR(batch))
 			return PTR_ERR(batch);
 
 		goto secure_batch;
 	}
 
-	len = eb->batch_len;
+	GEM_BUG_ON(intel_context_is_parallel(eb->context));
+
+	len = eb->batch_len[0];
 	if (!CMDPARSER_USES_GGTT(eb->i915)) {
 		/*
 		 * ppGTT backed shadow buffers must be mapped RO, to prevent
@@ -2076,11 +2088,11 @@ static int eb_parse(struct i915_execbuffer *eb)
 	} else {
 		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
 	}
-	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
+	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
 		return -EINVAL;
 
 	if (!pool) {
-		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
+		pool = intel_gt_get_buffer_pool(eb->gt, len,
 						I915_MAP_WB);
 		if (IS_ERR(pool))
 			return PTR_ERR(pool);
@@ -2105,7 +2117,7 @@ static int eb_parse(struct i915_execbuffer *eb)
 		trampoline = shadow;
 
 		shadow = shadow_batch_pin(eb, pool->obj,
-					  &eb->engine->gt->ggtt->vm,
+					  &eb->gt->ggtt->vm,
 					  PIN_GLOBAL);
 		if (IS_ERR(shadow)) {
 			err = PTR_ERR(shadow);
@@ -2127,26 +2139,27 @@ static int eb_parse(struct i915_execbuffer *eb)
 	if (err)
 		goto err_trampoline;
 
-	err = intel_engine_cmd_parser(eb->engine,
-				      eb->batch->vma,
+	err = intel_engine_cmd_parser(eb->context->engine,
+				      eb->batches[0]->vma,
 				      eb->batch_start_offset,
-				      eb->batch_len,
+				      eb->batch_len[0],
 				      shadow, trampoline);
 	if (err)
 		goto err_unpin_batch;
 
-	eb->batch = &eb->vma[eb->buffer_count++];
-	eb->batch->vma = i915_vma_get(shadow);
-	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
+	eb->batches[0] = &eb->vma[eb->buffer_count++];
+	eb->batches[0]->vma = i915_vma_get(shadow);
+	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
 
 	eb->trampoline = trampoline;
 	eb->batch_start_offset = 0;
 
 secure_batch:
 	if (batch) {
-		eb->batch = &eb->vma[eb->buffer_count++];
-		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
-		eb->batch->vma = i915_vma_get(batch);
+		GEM_BUG_ON(intel_context_is_parallel(eb->context));
+		eb->batches[0] = &eb->vma[eb->buffer_count++];
+		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
+		eb->batches[0]->vma = i915_vma_get(batch);
 	}
 	return 0;
 
@@ -2162,19 +2175,18 @@ static int eb_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
-static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
+static int eb_request_submit(struct i915_execbuffer *eb,
+			     struct i915_request *rq,
+			     struct i915_vma *batch,
+			     u64 batch_len)
 {
 	int err;
 
-	if (intel_context_nopreempt(eb->context))
-		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
-
-	err = eb_move_to_gpu(eb);
-	if (err)
-		return err;
+	if (intel_context_nopreempt(rq->context))
+		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
 
 	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
-		err = i915_reset_gen7_sol_offsets(eb->request);
+		err = i915_reset_gen7_sol_offsets(rq);
 		if (err)
 			return err;
 	}
@@ -2185,26 +2197,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	 * allows us to determine if the batch is still waiting on the GPU
 	 * or actually running by checking the breadcrumb.
 	 */
-	if (eb->engine->emit_init_breadcrumb) {
-		err = eb->engine->emit_init_breadcrumb(eb->request);
+	if (rq->context->engine->emit_init_breadcrumb) {
+		err = rq->context->engine->emit_init_breadcrumb(rq);
 		if (err)
 			return err;
 	}
 
-	err = eb->engine->emit_bb_start(eb->request,
-					batch->node.start +
-					eb->batch_start_offset,
-					eb->batch_len,
-					eb->batch_flags);
+	err = rq->context->engine->emit_bb_start(rq,
+						 batch->node.start +
+						 eb->batch_start_offset,
+						 batch_len,
+						 eb->batch_flags);
 	if (err)
 		return err;
 
 	if (eb->trampoline) {
+		GEM_BUG_ON(intel_context_is_parallel(rq->context));
 		GEM_BUG_ON(eb->batch_start_offset);
-		err = eb->engine->emit_bb_start(eb->request,
-						eb->trampoline->node.start +
-						eb->batch_len,
-						0, 0);
+		err = rq->context->engine->emit_bb_start(rq,
+							 eb->trampoline->node.start +
+							 batch_len, 0, 0);
 		if (err)
 			return err;
 	}
@@ -2212,6 +2224,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	return 0;
 }
 
+static int eb_submit(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+	int err;
+
+	err = eb_move_to_gpu(eb);
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
+		if (!err)
+			err = eb_request_submit(eb, eb->requests[i],
+						eb->batches[i]->vma,
+						eb->batch_len[i]);
+	}
+
+	return err;
+}
+
 static int num_vcs_engines(const struct drm_i915_private *i915)
 {
 	return hweight_long(VDBOX_MASK(&i915->gt));
@@ -2277,26 +2310,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
 	return i915_request_get(rq);
 }
 
-static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
+static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
+			   bool throttle)
 {
-	struct intel_context *ce = eb->context;
 	struct intel_timeline *tl;
-	struct i915_request *rq = NULL;
-	int err;
-
-	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
-
-	if (unlikely(intel_context_is_banned(ce)))
-		return ERR_PTR(-EIO);
-
-	/*
-	 * Pinning the contexts may generate requests in order to acquire
-	 * GGTT space, so do this first before we reserve a seqno for
-	 * ourselves.
-	 */
-	err = intel_context_pin_ww(ce, &eb->ww);
-	if (err)
-		return ERR_PTR(err);
+	struct i915_request *rq;
 
 	/*
 	 * Take a local wakeref for preparing to dispatch the execbuf as
@@ -2307,33 +2325,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
 	 * taken on the engine, and the parent device.
 	 */
 	tl = intel_context_timeline_lock(ce);
-	if (IS_ERR(tl)) {
-		intel_context_unpin(ce);
-		return ERR_CAST(tl);
-	}
+	if (IS_ERR(tl))
+		return PTR_ERR(tl);
 
 	intel_context_enter(ce);
 	if (throttle)
 		rq = eb_throttle(eb, ce);
 	intel_context_timeline_unlock(tl);
 
+	if (rq) {
+		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
+		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
+
+		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
+				      timeout) < 0) {
+			i915_request_put(rq);
+
+			tl = intel_context_timeline_lock(ce);
+			intel_context_exit(ce);
+			intel_context_timeline_unlock(tl);
+
+			if (nonblock)
+				return -EWOULDBLOCK;
+			else
+				return -EINTR;
+		}
+		i915_request_put(rq);
+	}
+
+	return 0;
+}
+
+static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
+{
+	struct intel_context *ce = eb->context, *child;
+	int err;
+	int i = 0, j = 0;
+
+	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
+
+	if (unlikely(intel_context_is_banned(ce)))
+		return -EIO;
+
+	/*
+	 * Pinning the contexts may generate requests in order to acquire
+	 * GGTT space, so do this first before we reserve a seqno for
+	 * ourselves.
+	 */
+	err = intel_context_pin_ww(ce, &eb->ww);
+	if (err)
+		return err;
+	for_each_child(ce, child) {
+		err = intel_context_pin_ww(child, &eb->ww);
+		GEM_BUG_ON(err);	/* perma-pinned should incr a counter */
+	}
+
+	for_each_child(ce, child) {
+		err = eb_pin_timeline(eb, child, throttle);
+		if (err)
+			goto unwind;
+		++i;
+	}
+	err = eb_pin_timeline(eb, ce, throttle);
+	if (err)
+		goto unwind;
+
 	eb->args->flags |= __EXEC_ENGINE_PINNED;
-	return rq;
+	return 0;
+
+unwind:
+	for_each_child(ce, child) {
+		if (j++ < i) {
+			mutex_lock(&child->timeline->mutex);
+			intel_context_exit(child);
+			mutex_unlock(&child->timeline->mutex);
+		}
+	}
+	for_each_child(ce, child)
+		intel_context_unpin(child);
+	intel_context_unpin(ce);
+	return err;
 }
 
 static void eb_unpin_engine(struct i915_execbuffer *eb)
 {
-	struct intel_context *ce = eb->context;
-	struct intel_timeline *tl = ce->timeline;
+	struct intel_context *ce = eb->context, *child;
 
 	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
 		return;
 
 	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
 
-	mutex_lock(&tl->mutex);
+	for_each_child(ce, child) {
+		mutex_lock(&child->timeline->mutex);
+		intel_context_exit(child);
+		mutex_unlock(&child->timeline->mutex);
+
+		intel_context_unpin(child);
+	}
+
+	mutex_lock(&ce->timeline->mutex);
 	intel_context_exit(ce);
-	mutex_unlock(&tl->mutex);
+	mutex_unlock(&ce->timeline->mutex);
 
 	intel_context_unpin(ce);
 }
@@ -2384,7 +2477,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
 static int
 eb_select_engine(struct i915_execbuffer *eb)
 {
-	struct intel_context *ce;
+	struct intel_context *ce, *child;
 	unsigned int idx;
 	int err;
 
@@ -2397,6 +2490,20 @@ eb_select_engine(struct i915_execbuffer *eb)
 	if (IS_ERR(ce))
 		return PTR_ERR(ce);
 
+	if (intel_context_is_parallel(ce)) {
+		if (eb->buffer_count < ce->guc_number_children + 1) {
+			intel_context_put(ce);
+			return -EINVAL;
+		}
+		if (eb->batch_start_offset || eb->args->batch_len) {
+			intel_context_put(ce);
+			return -EINVAL;
+		}
+	}
+	eb->num_batches = ce->guc_number_children + 1;
+
+	for_each_child(ce, child)
+		intel_context_get(child);
 	intel_gt_pm_get(ce->engine->gt);
 
 	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
@@ -2404,6 +2511,13 @@ eb_select_engine(struct i915_execbuffer *eb)
 		if (err)
 			goto err;
 	}
+	for_each_child(ce, child) {
+		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
+			err = intel_context_alloc_state(child);
+			if (err)
+				goto err;
+		}
+	}
 
 	/*
 	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
@@ -2414,7 +2528,7 @@ eb_select_engine(struct i915_execbuffer *eb)
 		goto err;
 
 	eb->context = ce;
-	eb->engine = ce->engine;
+	eb->gt = ce->engine->gt;
 
 	/*
 	 * Make sure engine pool stays alive even if we call intel_context_put
@@ -2425,6 +2539,8 @@ eb_select_engine(struct i915_execbuffer *eb)
 
 err:
 	intel_gt_pm_put(ce->engine->gt);
+	for_each_child(ce, child)
+		intel_context_put(child);
 	intel_context_put(ce);
 	return err;
 }
@@ -2432,7 +2548,11 @@ eb_select_engine(struct i915_execbuffer *eb)
 static void
 eb_put_engine(struct i915_execbuffer *eb)
 {
-	intel_gt_pm_put(eb->engine->gt);
+	struct intel_context *child;
+
+	intel_gt_pm_put(eb->gt);
+	for_each_child(eb->context, child)
+		intel_context_put(child);
 	intel_context_put(eb->context);
 }
 
@@ -2655,7 +2775,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
 }
 
 static int
-await_fence_array(struct i915_execbuffer *eb)
+await_fence_array(struct i915_execbuffer *eb,
+		  struct i915_request *rq)
 {
 	unsigned int n;
 	int err;
@@ -2669,8 +2790,7 @@ await_fence_array(struct i915_execbuffer *eb)
 		if (!eb->fences[n].dma_fence)
 			continue;
 
-		err = i915_request_await_dma_fence(eb->request,
-						   eb->fences[n].dma_fence);
+		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
 		if (err < 0)
 			return err;
 	}
@@ -2678,9 +2798,9 @@ await_fence_array(struct i915_execbuffer *eb)
 	return 0;
 }
 
-static void signal_fence_array(const struct i915_execbuffer *eb)
+static void signal_fence_array(const struct i915_execbuffer *eb,
+			       struct dma_fence * const fence)
 {
-	struct dma_fence * const fence = &eb->request->fence;
 	unsigned int n;
 
 	for (n = 0; n < eb->num_fences; n++) {
@@ -2728,12 +2848,12 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
 			break;
 }
 
-static int eb_request_add(struct i915_execbuffer *eb, int err)
+static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
 {
-	struct i915_request *rq = eb->request;
 	struct intel_timeline * const tl = i915_request_timeline(rq);
 	struct i915_sched_attr attr = {};
 	struct i915_request *prev;
+	int err = 0;
 
 	lockdep_assert_held(&tl->mutex);
 	lockdep_unpin_lock(&tl->mutex, rq->cookie);
@@ -2745,11 +2865,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
 	/* Check that the context wasn't destroyed before submission */
 	if (likely(!intel_context_is_closed(eb->context))) {
 		attr = eb->gem_context->sched;
-	} else {
-		/* Serialise with context_close via the add_to_timeline */
-		i915_request_set_error_once(rq, -ENOENT);
-		__i915_request_skip(rq);
-		err = -ENOENT; /* override any transient errors */
 	}
 
 	__i915_request_queue(rq, &attr);
@@ -2763,6 +2878,44 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
 	return err;
 }
 
+static int eb_requests_add(struct i915_execbuffer *eb, int err)
+{
+	int i;
+
+	/*
+	 * We iterate in reverse order of creation to release timeline mutexes
+	 * in the same order.
+	 */
+	for_each_batch_add_order(eb, i) {
+		struct i915_request *rq = eb->requests[i];
+
+		if (!rq)
+			continue;
+
+		if (unlikely(intel_context_is_closed(eb->context))) {
+			/* Serialise with context_close via the add_to_timeline */
+			i915_request_set_error_once(rq, -ENOENT);
+			__i915_request_skip(rq);
+			err = -ENOENT; /* override any transient errors */
+		}
+
+		if (intel_context_is_parallel(eb->context)) {
+			if (err) {
+				__i915_request_skip(rq);
+				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
+					&rq->fence.flags);
+			}
+			if (i == 0)
+				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
+					&rq->fence.flags);
+		}
+
+		err |= eb_request_add(eb, rq);
+	}
+
+	return err;
+}
+
 static const i915_user_extension_fn execbuf_extensions[] = {
 	[DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES] = parse_timeline_fences,
 };
@@ -2789,6 +2942,166 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
 				    eb);
 }
 
+static void eb_requests_get(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		i915_request_get(eb->requests[i]);
+	}
+}
+
+static void eb_requests_put(struct i915_execbuffer *eb)
+{
+	unsigned int i;
+
+	for_each_batch_create_order(eb, i) {
+		if (!eb->requests[i])
+			break;
+
+		i915_request_put(eb->requests[i]);
+	}
+}
+
+static struct sync_file *
+eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	struct dma_fence_array *fence_array;
+	struct dma_fence **fences;
+	unsigned int i;
+
+	GEM_BUG_ON(!intel_context_is_parent(eb->context));
+
+	fences = kmalloc_array(eb->num_batches, sizeof(*fences), GFP_KERNEL);
+	if (!fences)
+		return ERR_PTR(-ENOMEM);
+
+	for_each_batch_create_order(eb, i)
+		fences[i] = &eb->requests[i]->fence;
+
+	fence_array = dma_fence_array_create(eb->num_batches,
+					     fences,
+					     eb->context->fence_context,
+					     eb->context->seqno,
+					     false);
+	if (!fence_array) {
+		kfree(fences);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* Move ownership to the dma_fence_array created above */
+	for_each_batch_create_order(eb, i)
+		dma_fence_get(fences[i]);
+
+	if (out_fence_fd != -1) {
+		out_fence = sync_file_create(&fence_array->base);
+		/* sync_file now owns fence_array, drop creation ref */
+		dma_fence_put(&fence_array->base);
+		if (!out_fence)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	eb->composite_fence = &fence_array->base;
+
+	return out_fence;
+}
+
+static struct intel_context *
+eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
+{
+	struct intel_context *child;
+
+	if (likely(context_number == 0))
+		return eb->context;
+
+	for_each_child(eb->context, child)
+		if (!--context_number)
+			return child;
+
+	GEM_BUG_ON("Context not found");
+
+	return NULL;
+}
+
+static struct sync_file *
+eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
+		   int out_fence_fd)
+{
+	struct sync_file *out_fence = NULL;
+	unsigned int i;
+	int err;
+
+	for_each_batch_create_order(eb, i) {
+		bool first_request_to_add = i + 1 == eb->num_batches;
+
+		/* Allocate a request for this batch buffer nice and early. */
+		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
+		if (IS_ERR(eb->requests[i])) {
+			/* PTR_ERR(NULL) is 0, so record the error first */
+			err = PTR_ERR(eb->requests[i]);
+			eb->requests[i] = NULL;
+			return ERR_PTR(err);
+		}
+
+		if (unlikely(eb->gem_context->syncobj &&
+			     first_request_to_add)) {
+			struct dma_fence *fence;
+
+			fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
+			err = i915_request_await_dma_fence(eb->requests[i], fence);
+			dma_fence_put(fence);
+			if (err)
+				return ERR_PTR(err);
+		}
+
+		if (in_fence && first_request_to_add) {
+			if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
+				err = i915_request_await_execution(eb->requests[i],
+								   in_fence);
+			else
+				err = i915_request_await_dma_fence(eb->requests[i],
+								   in_fence);
+			if (err < 0)
+				return ERR_PTR(err);
+		}
+
+		if (eb->fences && first_request_to_add) {
+			err = await_fence_array(eb, eb->requests[i]);
+			if (err)
+				return ERR_PTR(err);
+		}
+
+		if (first_request_to_add) {
+			if (intel_context_is_parallel(eb->context)) {
+				out_fence = eb_composite_fence_create(eb, out_fence_fd);
+				if (IS_ERR(out_fence))
+					return out_fence;
+			} else if (out_fence_fd != -1) {
+				out_fence = sync_file_create(&eb->requests[i]->fence);
+				if (!out_fence)
+					return ERR_PTR(-ENOMEM);
+			}
+		}
+
+		/*
+		 * Whilst this request exists, batch_obj will be on the
+		 * active_list, and so will hold the active reference. Only when
+		 * this request is retired will the batch_obj be moved onto
+		 * the inactive_list and lose its active reference. Hence we do
+		 * not need to explicitly hold another reference here.
+		 */
+		eb->requests[i]->batch = eb->batches[i]->vma;
+		if (eb->batch_pool) {
+			GEM_BUG_ON(intel_context_is_parallel(eb->context));
+			intel_gt_buffer_pool_mark_active(eb->batch_pool,
+							 eb->requests[i]);
+		}
+	}
+
+	return out_fence;
+}
+
 static int
 i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_file *file,
@@ -2799,7 +3112,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	struct i915_execbuffer eb;
 	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
-	struct i915_vma *batch;
 	int out_fence_fd = -1;
 	int err;
 
@@ -2823,12 +3135,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	eb.buffer_count = args->buffer_count;
 	eb.batch_start_offset = args->batch_start_offset;
-	eb.batch_len = args->batch_len;
 	eb.trampoline = NULL;
 
 	eb.fences = NULL;
 	eb.num_fences = 0;
 
+	memset(eb.requests, 0, sizeof(eb.requests));
+	eb.composite_fence = NULL;
+
 	eb.batch_flags = 0;
 	if (args->flags & I915_EXEC_SECURE) {
 		if (GRAPHICS_VER(i915) >= 11)
@@ -2912,70 +3227,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	ww_acquire_done(&eb.ww.ctx);
 
-	batch = eb.batch->vma;
-
-	/* Allocate a request for this batch buffer nice and early. */
-	eb.request = i915_request_create(eb.context);
-	if (IS_ERR(eb.request)) {
-		err = PTR_ERR(eb.request);
-		goto err_vma;
-	}
-
-	if (unlikely(eb.gem_context->syncobj)) {
-		struct dma_fence *fence;
-
-		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
-		err = i915_request_await_dma_fence(eb.request, fence);
-		dma_fence_put(fence);
-		if (err)
-			goto err_ext;
-	}
-
-	if (in_fence) {
-		if (args->flags & I915_EXEC_FENCE_SUBMIT)
-			err = i915_request_await_execution(eb.request,
-							   in_fence);
-		else
-			err = i915_request_await_dma_fence(eb.request,
-							   in_fence);
-		if (err < 0)
-			goto err_request;
-	}
-
-	if (eb.fences) {
-		err = await_fence_array(&eb);
-		if (err)
-			goto err_request;
-	}
-
-	if (out_fence_fd != -1) {
-		out_fence = sync_file_create(&eb.request->fence);
-		if (!out_fence) {
-			err = -ENOMEM;
+	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
+	if (IS_ERR(out_fence)) {
+		err = PTR_ERR(out_fence);
+		if (eb.requests[0])
 			goto err_request;
-		}
+		else
+			goto err_vma;
 	}
 
-	/*
-	 * Whilst this request exists, batch_obj will be on the
-	 * active_list, and so will hold the active reference. Only when this
-	 * request is retired will the the batch_obj be moved onto the
-	 * inactive_list and lose its active reference. Hence we do not need
-	 * to explicitly hold another reference here.
-	 */
-	eb.request->batch = batch;
-	if (eb.batch_pool)
-		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
-
-	trace_i915_request_queue(eb.request, eb.batch_flags);
-	err = eb_submit(&eb, batch);
+	err = eb_submit(&eb);
 
 err_request:
-	i915_request_get(eb.request);
-	err = eb_request_add(&eb, err);
+	eb_requests_get(&eb);
+	err = eb_requests_add(&eb, err);
 
 	if (eb.fences)
-		signal_fence_array(&eb);
+		signal_fence_array(&eb, eb.composite_fence ?
+				   eb.composite_fence :
+				   &eb.requests[0]->fence);
 
 	if (out_fence) {
 		if (err == 0) {
@@ -2990,10 +3260,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	if (unlikely(eb.gem_context->syncobj)) {
 		drm_syncobj_replace_fence(eb.gem_context->syncobj,
-					  &eb.request->fence);
+					  eb.composite_fence ?
+					  eb.composite_fence :
+					  &eb.requests[0]->fence);
 	}
 
-	i915_request_put(eb.request);
+	if (!out_fence && eb.composite_fence)
+		dma_fence_put(eb.composite_fence);
+
+	eb_requests_put(&eb);
 
 err_vma:
 	eb_release_vmas(&eb, true);
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 9dcc1b14697b..1f6a5ae3e33e 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -237,7 +237,13 @@ intel_context_timeline_lock(struct intel_context *ce)
 	struct intel_timeline *tl = ce->timeline;
 	int err;
 
-	err = mutex_lock_interruptible(&tl->mutex);
+	if (intel_context_is_parent(ce))
+		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
+	else if (intel_context_is_child(ce))
+		err = mutex_lock_interruptible_nested(&tl->mutex,
+						      ce->guc_child_index + 1);
+	else
+		err = mutex_lock_interruptible(&tl->mutex);
 	if (err)
 		return ERR_PTR(err);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 727f91e7f7c2..094fcfb5cbe1 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -261,6 +261,18 @@ struct intel_context {
 		 * context.
 		 */
 		struct i915_request *last_rq;
+
+		/**
+		 * @fence_context: fence context composite fence when doing
+		 * parallel submission
+		 */
+		u64 fence_context;
+
+		/**
+		 * @seqno: seqno for composite fence when doing parallel
+		 * submission
+		 */
+		u32 seqno;
 	};
 
 #ifdef CONFIG_DRM_I915_SELFTEST
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 1a18f99bf12a..2ef38557b0f0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3034,6 +3034,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
 		}
 	}
 
+	parent->fence_context = dma_fence_context_alloc(1);
+
 	parent->engine->emit_bb_start =
 		emit_bb_start_parent_no_preempt_mid_batch;
 	parent->engine->emit_fini_breadcrumb =
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 8f0073e19079..602cc246ba85 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -147,6 +147,15 @@ enum {
 	 * tail.
 	 */
 	I915_FENCE_FLAG_SUBMIT_PARALLEL,
+
+	/*
+	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
+	 * parent-child relationship (parallel submission, multi-lrc) that
+	 * hit an error while generating requests in the execbuf IOCTL.
+	 * Indicates this request should be skipped as another request in
+	 * the submission / relationship encountered an error.
+	 */
+	I915_FENCE_FLAG_SKIP_PARALLEL,
 };
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 4b7fc4647e46..90546fa58fc1 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
 	return i915_active_add_request(&vma->active, rq);
 }
 
-int i915_vma_move_to_active(struct i915_vma *vma,
-			    struct i915_request *rq,
-			    unsigned int flags)
+int _i915_vma_move_to_active(struct i915_vma *vma,
+			     struct i915_request *rq,
+			     struct dma_fence *fence,
+			     unsigned int flags)
 {
 	struct drm_i915_gem_object *obj = vma->obj;
 	int err;
@@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 			intel_frontbuffer_put(front);
 		}
 
-		dma_resv_add_excl_fence(vma->resv, &rq->fence);
-		obj->write_domain = I915_GEM_DOMAIN_RENDER;
-		obj->read_domains = 0;
+		if (fence) {
+			dma_resv_add_excl_fence(vma->resv, fence);
+			obj->write_domain = I915_GEM_DOMAIN_RENDER;
+			obj->read_domains = 0;
+		}
 	} else {
 		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
 			err = dma_resv_reserve_shared(vma->resv, 1);
@@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 				return err;
 		}
 
-		dma_resv_add_shared_fence(vma->resv, &rq->fence);
-		obj->write_domain = 0;
+		if (fence) {
+			dma_resv_add_shared_fence(vma->resv, fence);
+			obj->write_domain = 0;
+		}
 	}
 
 	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index ed69f66c7ab0..648dbe744c96 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
 
 int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
 					   struct i915_request *rq);
-int __must_check i915_vma_move_to_active(struct i915_vma *vma,
-					 struct i915_request *rq,
-					 unsigned int flags);
+int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
+					  struct i915_request *rq,
+					  struct dma_fence *fence,
+					  unsigned int flags);
+static inline int __must_check
+i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
+			unsigned int flags)
+{
+	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
+}
 
 #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 25/27] drm/i915/guc: Handle errors in multi-lrc requests
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

If an error occurs in the front end when multi-lrc requests are being
generated, we need to skip these in the backend but we still need to
emit the breadcrumb seqno writes. An issue arises because with multi-lrc
breadcrumbs there is a handshake between the parent and children to make
forward progress. If not all of the requests are present this handshake
doesn't work. To work around this, if a multi-lrc request has an error we
skip the handshake but still emit the breadcrumb seqno.
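
The constraint the skip path relies on: ring space for the fini
breadcrumb is reserved up front as ce->engine->emit_fini_breadcrumb_dw
dwords, so the skip path must consume exactly as many dwords as the
full handshake; zeroing the handshake region is safe because MI_NOOP
encodes as 0. A minimal standalone model of that accounting (the dword
counts and helper names below are illustrative, not the driver's):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define FINI_BREADCRUMB_DW 14	/* illustrative; per-engine in i915 */
  #define SEQNO_DW            6	/* trailing seqno write + user interrupt */

  static uint32_t *emit_handshake(uint32_t *cs)
  {
  	int i;

  	/* stand-in for the parent/child semaphore waits + "go" write */
  	for (i = 0; i < FINI_BREADCRUMB_DW - SEQNO_DW; i++)
  		*cs++ = 0xdead0000 | i;
  	return cs;
  }

  static uint32_t *emit_seqno(uint32_t *cs)
  {
  	int i;

  	for (i = 0; i < SEQNO_DW; i++)
  		*cs++ = 0xbeef0000 | i;
  	return cs;
  }

  static uint32_t *emit_fini_breadcrumb(uint32_t *cs, int skip)
  {
  	if (skip) {
  		/* MI_NOOP is 0: pad out the reserved handshake space */
  		memset(cs, 0, sizeof(*cs) * (FINI_BREADCRUMB_DW - SEQNO_DW));
  		cs += FINI_BREADCRUMB_DW - SEQNO_DW;
  	} else {
  		cs = emit_handshake(cs);
  	}
  	return emit_seqno(cs);	/* seqno lands at the same offset either way */
  }

  int main(void)
  {
  	uint32_t ring[FINI_BREADCRUMB_DW];

  	/* both paths consume exactly FINI_BREADCRUMB_DW dwords */
  	printf("skip:      %td dwords\n", emit_fini_breadcrumb(ring, 1) - ring);
  	printf("handshake: %td dwords\n", emit_fini_breadcrumb(ring, 0) - ring);
  	return 0;
  }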

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 61 ++++++++++++++++++-
 1 file changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 2ef38557b0f0..61e737fd1eee 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3546,8 +3546,8 @@ static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
 }
 
 static u32 *
-emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
-						 u32 *cs)
+__emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						   u32 *cs)
 {
 	struct intel_context *ce = rq->context;
 	u8 i;
@@ -3575,6 +3575,41 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
 				  get_children_go_addr(ce),
 				  0);
 
+	return cs;
+}
+
+/*
+ * If this is true, a submission of multi-lrc requests had an error and the
+ * requests need to be skipped. The front end (execbuf IOCTL) should've called
+ * i915_request_skip, which squashes the BB, but we still need to emit the fini
+ * breadcrumb seqno write. At this point we don't know how many of the
+ * requests in the multi-lrc submission were generated so we can't do the
+ * handshake between the parent and children (e.g. if 4 requests should be
+ * generated but the 2nd hit an error, only 1 would be seen by the GuC
+ * backend). Simply skip the handshake, but still emit the breadcrumb seqno,
+ * if an error has occurred on any of the requests in the submission /
+ * relationship.
+ */
+static inline bool skip_handshake(struct i915_request *rq)
+{
+	return test_bit(I915_FENCE_FLAG_SKIP_PARALLEL, &rq->fence.flags);
+}
+
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	if (unlikely(skip_handshake(rq))) {
+		memset(cs, 0, sizeof(u32) *
+		       (ce->engine->emit_fini_breadcrumb_dw - 6));
+		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
+	} else {
+		cs = __emit_fini_breadcrumb_parent_no_preempt_mid_batch(rq, cs);
+	}
+
 	/* Emit fini breadcrumb */
 	cs = gen8_emit_ggtt_write(cs,
 				  rq->fence.seqno,
@@ -3591,7 +3626,8 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
 }
 
 static u32 *
-emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
+__emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
+						  u32 *cs)
 {
 	struct intel_context *ce = rq->context;
 
@@ -3617,6 +3653,25 @@ emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs
 	*cs++ = get_children_go_addr(ce->parent);
 	*cs++ = 0;
 
+	return cs;
+}
+
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
+						u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	if (unlikely(skip_handshake(rq))) {
+		memset(cs, 0, sizeof(u32) *
+		       (ce->engine->emit_fini_breadcrumb_dw - 6));
+		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
+	} else {
+		cs = __emit_fini_breadcrumb_child_no_preempt_mid_batch(rq, cs);
+	}
+
 	/* Emit fini breadcrumb */
 	cs = gen8_emit_ggtt_write(cs,
 				  rq->fence.seqno,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 26/27] drm/i915: Enable multi-bb execbuf
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

Enable multi-bb execbuf by enabling the set_parallel extension.
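
With this enabled, userspace opts in roughly as follows - a sketch
assuming the uAPI shape this series adds to i915_drm.h ("drm/i915/guc:
Connect UAPI to GuC multi-lrc interface"); the wrapper struct mirrors
the I915_DEFINE_* helper pattern, the engine choices are made up, and
error handling is elided:

  struct {
  	struct i915_context_engines_parallel_submit p;
  	struct i915_engine_class_instance engines[2];
  } __attribute__((packed)) ext = {
  	.p.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
  	.p.engine_index = 0,
  	.p.width = 2,		/* two BBs per execbuf on this slot */
  	.p.num_siblings = 1,	/* no load balancing */
  	.engines = {
  		{ I915_ENGINE_CLASS_RENDER, 0 },
  		{ I915_ENGINE_CLASS_RENDER, 1 },
  	},
  };

The extension is chained into I915_CONTEXT_PARAM_ENGINES at context
create time; a single execbuf2 call on that slot then carries both
batches in its buffer list, with batch_start_offset and batch_len left
at 0 (enforced in eb_select_engine).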

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index de0fd145fb47..0aa095bed310 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -536,9 +536,6 @@ set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
 	struct intel_engine_cs **siblings = NULL;
 	intel_engine_mask_t prev_mask;
 
-	/* Disabling for now */
-	return -ENODEV;
-
 	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
 		return -ENODEV;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [PATCH 27/27] drm/i915/execlists: Weak parallel submission support for execlists
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
@ 2021-08-20 22:44   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-08-20 22:44 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

A weak implementation of parallel submission (multi-bb execbuf IOCTL)
for execlists. This does as little as possible to support the interface
for execlists - just passing submit fences between the requests
generated (a submit fence gates a request on another request being
submitted to the hardware, rather than completed, so the batches can run
concurrently); virtual engines are not allowed. This is on par with what
is there for the existing (hopefully soon to be deprecated) bonding
interface.

We perma-pin these execlists contexts to align with the GuC
implementation.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  9 ++-
 drivers/gpu/drm/i915/gt/intel_context.c       |  4 +-
 .../drm/i915/gt/intel_execlists_submission.c  | 57 ++++++++++++++++++-
 drivers/gpu/drm/i915/gt/intel_lrc.c           |  2 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  2 -
 5 files changed, 65 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 0aa095bed310..cb6ce2ee1d8b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -536,9 +536,6 @@ set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
 	struct intel_engine_cs **siblings = NULL;
 	intel_engine_mask_t prev_mask;
 
-	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
-		return -ENODEV;
-
 	if (get_user(slot, &ext->engine_index))
 		return -EFAULT;
 
@@ -548,6 +545,12 @@ set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
 	if (get_user(num_siblings, &ext->num_siblings))
 		return -EFAULT;
 
+	if (!intel_uc_uses_guc_submission(&i915->gt.uc) && num_siblings != 1) {
+		drm_dbg(&i915->drm, "Only 1 sibling (%d) supported in non-GuC mode\n",
+			num_siblings);
+		return -EINVAL;
+	}
+
 	if (slot >= set->num_engines) {
 		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
 			slot, set->num_engines);
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 2de62649e275..b0f0cac6a151 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -79,7 +79,8 @@ static int intel_context_active_acquire(struct intel_context *ce)
 
 	__i915_active_acquire(&ce->active);
 
-	if (intel_context_is_barrier(ce) || intel_engine_uses_guc(ce->engine))
+	if (intel_context_is_barrier(ce) || intel_engine_uses_guc(ce->engine) ||
+	    intel_context_is_parallel(ce))
 		return 0;
 
 	/* Preallocate tracking nodes */
@@ -554,7 +555,6 @@ void intel_context_bind_parent_child(struct intel_context *parent,
 	 * Callers responsibility to validate that this function is used
 	 * correctly but we use GEM_BUG_ON here ensure that they do.
 	 */
-	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
 	GEM_BUG_ON(intel_context_is_pinned(parent));
 	GEM_BUG_ON(intel_context_is_child(parent));
 	GEM_BUG_ON(intel_context_is_pinned(child));
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index d1e2d6f8ff81..8875d85a1677 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -927,8 +927,7 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
 
 static bool ctx_single_port_submission(const struct intel_context *ce)
 {
-	return (IS_ENABLED(CONFIG_DRM_I915_GVT) &&
-		intel_context_force_single_submission(ce));
+	return intel_context_force_single_submission(ce);
 }
 
 static bool can_merge_ctx(const struct intel_context *prev,
@@ -2598,6 +2597,59 @@ static void execlists_context_cancel_request(struct intel_context *ce,
 				      current->comm);
 }
 
+static struct intel_context *
+execlists_create_parallel(struct intel_engine_cs **engines,
+			  unsigned int num_siblings,
+			  unsigned int width)
+{
+	struct intel_engine_cs **siblings = NULL;
+	struct intel_context *parent = NULL, *ce, *err;
+	int i, j;
+
+	GEM_BUG_ON(num_siblings != 1);
+
+	siblings = kmalloc_array(num_siblings,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < width; ++i) {
+		for (j = 0; j < num_siblings; ++j)
+			siblings[j] = engines[i * num_siblings + j];
+
+		ce = intel_context_create(siblings[0]);
+		if (IS_ERR(ce)) {
+			/* intel_context_create() returns an ERR_PTR, not NULL */
+			err = ce;
+			goto unwind;
+		}
+
+		if (i == 0)
+			parent = ce;
+		else
+			intel_context_bind_parent_child(parent, ce);
+	}
+
+	parent->fence_context = dma_fence_context_alloc(1);
+
+	intel_context_set_nopreempt(parent);
+	intel_context_set_single_submission(parent);
+	for_each_child(parent, ce) {
+		intel_context_set_nopreempt(ce);
+		intel_context_set_single_submission(ce);
+	}
+
+	kfree(siblings);
+	return parent;
+
+unwind:
+	if (parent)
+		intel_context_put(parent);
+	kfree(siblings);
+	return err;
+}
+
 static const struct intel_context_ops execlists_context_ops = {
 	.flags = COPS_HAS_INFLIGHT,
 
@@ -2616,6 +2668,7 @@ static const struct intel_context_ops execlists_context_ops = {
 	.reset = lrc_reset,
 	.destroy = lrc_destroy,
 
+	.create_parallel = execlists_create_parallel,
 	.create_virtual = execlists_create_virtual,
 };
 
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 0ddbad4e062a..7e1153f84cca 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -983,6 +983,8 @@ lrc_pin(struct intel_context *ce,
 
 void lrc_unpin(struct intel_context *ce)
 {
+	if (unlikely(ce->last_rq))
+		i915_request_put(ce->last_rq);
 	check_redzone((void *)ce->lrc_reg_state - LRC_STATE_OFFSET,
 		      ce->engine);
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 61e737fd1eee..88cae956c468 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2901,8 +2901,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 	GEM_BUG_ON(!intel_context_is_parent(ce));
 	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
 
-	if (ce->last_rq)
-		i915_request_put(ce->last_rq);
 	unpin_guc_id(guc, ce);
 	lrc_unpin(ce);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 145+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev3)
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
                   ` (27 preceding siblings ...)
  (?)
@ 2021-08-20 23:13 ` Patchwork
  -1 siblings, 0 replies; 145+ messages in thread
From: Patchwork @ 2021-08-20 23:13 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev3)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
77206ddfd7e8 drm/i915/guc: Squash Clean up GuC CI failures, simplify locking, and kernel DOC
-:1400: WARNING:BRACES: braces {} are not necessary for single statement blocks
#1400: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1491:
+		if (unlikely(ret == -ENODEV)) {
 			ret = 0;	/* Will get registered later */
 		}

-:2274: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#2274: 
new file mode 100644

total: 0 errors, 2 warnings, 0 checks, 2619 lines checked
4a9a78aa06a7 drm/i915/guc: Allow flexible number of context ids
ce904de73b0d drm/i915/guc: Connect the number of guc_ids to debugfs
48e7785b7cb4 drm/i915/guc: Take GT PM ref when deregistering context
-:87: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#87: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:87: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#87: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:90: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#90: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:47:
+#define with_intel_gt_pm_async(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)

-:90: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#90: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:47:
+#define with_intel_gt_pm_async(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)

-:93: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#93: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:50:
+#define with_intel_gt_pm_if_awake(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:93: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#93: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:50:
+#define with_intel_gt_pm_if_awake(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:96: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#96: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:53:
+#define with_intel_gt_pm_if_awake_async(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)

-:96: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#96: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:53:
+#define with_intel_gt_pm_if_awake_async(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)

total: 0 errors, 0 warnings, 8 checks, 552 lines checked
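
The MACRO_ARG_REUSE checks above flag double evaluation: gt and tmp each
expand more than once, so an argument with side effects is evaluated
twice. A standalone illustration with stub functions, not driver code:

#include <stdio.h>

static int wakeref[2];
static void intel_gt_pm_get(int gt) { wakeref[gt]++; }
static void intel_gt_pm_put(int gt) { wakeref[gt]--; }

#define with_intel_gt_pm(gt, tmp) \
	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
	     intel_gt_pm_put(gt), tmp = 0)

int main(void)
{
	int gts[2] = { 0, 1 }, i = 0, tmp;

	with_intel_gt_pm(gts[i++], tmp)	/* gt expands twice... */
		;

	/* ...so get hit gts[0] but put hit gts[1]: prints gt0=1 gt1=-1 */
	printf("gt0=%d gt1=%d\n", wakeref[0], wakeref[1]);
	return 0;
}

In the driver the arguments are plain pointers, so this is benign there,
but the check is why such macros usually evaluate each argument once.
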
08b1d791ac1a drm/i915: Add GT PM unpark worker
-:71: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#71: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 213 lines checked
e0617461e1ec drm/i915/guc: Take engine PM when a context is pinned with GuC submission
cb16e92153bb drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
0222d44e38b8 drm/i915: Add logical engine mapping
b398df972aaa drm/i915: Expose logical engine instance to user
02d0b3f0499f drm/i915/guc: Introduce context parent-child relationship
-:105: ERROR:CODE_INDENT: code indent should use tabs where possible
#105: FILE: drivers/gpu/drm/i915/gt/intel_context.h:62:
+        if (intel_context_is_child(ce)) {$

-:105: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#105: FILE: drivers/gpu/drm/i915/gt/intel_context.h:62:
+        if (intel_context_is_child(ce)) {$

-:113: ERROR:CODE_INDENT: code indent should use tabs where possible
#113: FILE: drivers/gpu/drm/i915/gt/intel_context.h:70:
+                GEM_BUG_ON(!intel_context_is_pinned(ce->parent));$

-:113: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#113: FILE: drivers/gpu/drm/i915/gt/intel_context.h:70:
+                GEM_BUG_ON(!intel_context_is_pinned(ce->parent));$

-:115: ERROR:CODE_INDENT: code indent should use tabs where possible
#115: FILE: drivers/gpu/drm/i915/gt/intel_context.h:72:
+                return ce->parent;$

-:115: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#115: FILE: drivers/gpu/drm/i915/gt/intel_context.h:72:
+                return ce->parent;$

-:116: ERROR:CODE_INDENT: code indent should use tabs where possible
#116: FILE: drivers/gpu/drm/i915/gt/intel_context.h:73:
+        } else {$

-:116: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#116: FILE: drivers/gpu/drm/i915/gt/intel_context.h:73:
+        } else {$

-:117: ERROR:CODE_INDENT: code indent should use tabs where possible
#117: FILE: drivers/gpu/drm/i915/gt/intel_context.h:74:
+                return ce;$

-:117: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#117: FILE: drivers/gpu/drm/i915/gt/intel_context.h:74:
+                return ce;$

-:118: ERROR:CODE_INDENT: code indent should use tabs where possible
#118: FILE: drivers/gpu/drm/i915/gt/intel_context.h:75:
+        }$

-:118: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#118: FILE: drivers/gpu/drm/i915/gt/intel_context.h:75:
+        }$

-:144: WARNING:REPEATED_WORD: Possible repeated word: 'of'
#144: FILE: drivers/gpu/drm/i915/gt/intel_context_types.h:220:
+			 * @guc_child_list: parent's list of of children

total: 6 errors, 7 warnings, 0 checks, 125 lines checked
bff8a93386b8 drm/i915/guc: Implement parallel context pin / unpin functions
2b3d3ef5cb7c drm/i915/guc: Add multi-lrc context registration
-:20: CHECK:LINE_SPACING: Please don't use multiple blank lines
#20: FILE: drivers/gpu/drm/i915/gt/intel_context_types.h:235:
 
+

total: 0 errors, 0 warnings, 1 checks, 192 lines checked
f671cf6be29a drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
4397e2c32688 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
-:19: WARNING:TYPO_SPELLING: 'Explictly' may be misspelled - perhaps 'Explicitly'?
#19: 
  - Explictly state why we assign consecutive guc_ids
    ^^^^^^^^^

-:58: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'guc' - possible side-effects?
#58: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:136:
+#define NUMBER_MULTI_LRC_GUC_ID(guc) \
+	((guc)->submission_state.num_guc_ids / 16 > 32 ? \
+	 (guc)->submission_state.num_guc_ids / 16 : 32)

total: 0 errors, 1 warnings, 1 checks, 195 lines checked
279b43791c6d drm/i915/guc: Implement multi-lrc submission
-:236: ERROR:SPACING: spaces required around that '=' (ctx:VxW)
#236: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:655:
+		guc->stalled_request= rq;
 		                    ^

-:330: CHECK:SPACING: spaces preferred around that '*' (ctx:ExV)
#330: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:753:
+		*wqi++ = child->ring->tail / sizeof(u64);
 		^

total: 1 errors, 0 warnings, 1 checks, 545 lines checked
f0eaf349580b drm/i915/guc: Insert submit fences between requests in parent-child relationship
d20a9f8066d0 drm/i915/guc: Implement multi-lrc reset
28a6b590e69c drm/i915/guc: Update debugfs for GuC multi-lrc
-:19: CHECK:LINE_SPACING: Please don't use multiple blank lines
#19: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3676:
 
+

total: 0 errors, 0 warnings, 1 checks, 65 lines checked
f3e9fcd665da drm/i915: Fix bug in user proto-context creation that leaked contexts
24d24d71c346 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
d7635607ca11 drm/i915/doc: Update parallel submit doc to point to i915_drm.h
-:12: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#12: 
deleted file mode 100644

total: 0 errors, 1 warnings, 0 checks, 10 lines checked
166ae35845ec drm/i915/guc: Add basic GuC multi-lrc selftest
-:21: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#21: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 191 lines checked
2f8c2ac84f38 drm/i915/guc: Implement no mid batch preemption for multi-lrc
1deecbc917a6 drm/i915: Multi-BB execbuf
-:340: CHECK:MACRO_ARG_PRECEDENCE: Macro argument '_eb' may be better as '(_eb)' to avoid precedence issues
#340: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1846:
+#define for_each_batch_create_order(_eb, _i) \
+	for (_i = 0; _i < _eb->num_batches; ++_i)

-:340: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
#340: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1846:
+#define for_each_batch_create_order(_eb, _i) \
+	for (_i = 0; _i < _eb->num_batches; ++_i)

-:342: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#342: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1848:
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = _eb->num_batches - 1; _i >= 0; --_i)

-:342: CHECK:MACRO_ARG_PRECEDENCE: Macro argument '_eb' may be better as '(_eb)' to avoid precedence issues
#342: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1848:
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = _eb->num_batches - 1; _i >= 0; --_i)

-:342: CHECK:MACRO_ARG_REUSE: Macro argument reuse '_i' - possible side-effects?
#342: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1848:
+#define for_each_batch_add_order(_eb, _i) \
+	BUILD_BUG_ON(!typecheck(int, _i)); \
+	for (_i = _eb->num_batches - 1; _i >= 0; --_i)

-:905: WARNING:BLOCK_COMMENT_STYLE: Block comments should align the * on each line
#905: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:2886:
+	/*
+	* We iterate in reverse order of creation to release timeline mutexes in

-:968: CHECK:LINE_SPACING: Please use a blank line after function/struct/union/enum declarations
#968: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:2968:
+}
+static struct sync_file *

-:978: WARNING:ALLOC_WITH_MULTIPLY: Prefer kmalloc_array over kmalloc with multiply
#978: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:2978:
+	fences = kmalloc(eb->num_batches * sizeof(*fences), GFP_KERNEL);

-:1090: WARNING:REPEATED_WORD: Possible repeated word: 'the'
#1090: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:3090:
+		 * this request is retired will the the batch_obj be moved onto

total: 1 errors, 3 warnings, 5 checks, 1275 lines checked
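
The MULTISTATEMENT_MACRO_USE_DO_WHILE error above is the classic hazard
that the BUILD_BUG_ON() and the for loop are two separate statements, so
only the first one binds to a surrounding if. A standalone illustration
(stub macro, not the driver's):

#include <stdio.h>

/* two statements, not wrapped in do { } while (0) */
#define count_down(n, i) \
	printf("starting\n"); \
	for (i = (n); i > 0; --i)

int main(void)
{
	int i;

	if (0)			/* guards only the printf... */
		count_down(3, i)
			printf("%d\n", i);	/* ...this loop still runs */
	return 0;
}

A do { } while (0) wrapper can't be used here because the macro supplies
a for loop that takes a body, so the usual fix is to fold the type check
into the loop's init clause instead.
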
41dc4f27399e drm/i915/guc: Handle errors in multi-lrc requests
668649e1b343 drm/i915: Enable multi-bb execbuf
17b0d07e7aec drm/i915/execlists: Weak parallel submission support for execlists
-:112: WARNING:BRACES: braces {} are not necessary for any arm of this statement
#112: FILE: drivers/gpu/drm/i915/gt/intel_execlists_submission.c:2627:
+		if (i == 0) {
[...]
+		} else {
[...]

total: 0 errors, 1 warnings, 0 checks, 128 lines checked



^ permalink raw reply	[flat|nested] 145+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Parallel submission aka multi-bb execbuf (rev3)
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
                   ` (28 preceding siblings ...)
  (?)
@ 2021-08-20 23:14 ` Patchwork
  -1 siblings, 0 replies; 145+ messages in thread
From: Patchwork @ 2021-08-20 23:14 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev3)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
+drivers/gpu/drm/i915/display/intel_display.c:1902:21:    expected struct i915_vma *[assigned] vma
+drivers/gpu/drm/i915/display/intel_display.c:1902:21:    got void [noderef] __iomem *[assigned] iomem
+drivers/gpu/drm/i915/display/intel_display.c:1902:21: warning: incorrect type in assignment (different address spaces)
+drivers/gpu/drm/i915/gem/i915_gem_context.c:1418:26:    expected struct i915_gem_engines *e
+drivers/gpu/drm/i915/gem/i915_gem_context.c:1418:26:    got struct i915_gem_engines [noderef] __rcu *engines
+drivers/gpu/drm/i915/gem/i915_gem_context.c:1418:26: warning: incorrect type in argument 1 (different address spaces)
+drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:619:21: warning: Using plain integer as NULL pointer
+drivers/gpu/drm/i915/gt/intel_reset.c:1392:5: warning: context imbalance in 'intel_gt_reset_trylock' - different lock contexts for basic block
+drivers/gpu/drm/i915/gt/intel_ring_submission.c:1268:24: warning: Using plain integer as NULL pointer
+drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1496:6: warning: symbol 'need_tasklet' was not declared. Should it be static?
+drivers/gpu/drm/i915/i915_perf.c:1442:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/i915_perf.c:1496:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/intel_wakeref.c:142:19: warning: context imbalance in 'wakeref_auto_timeout' - unexpected unlock
+drivers/gpu/drm/i915/selftests/i915_syncmap.c:80:54: warning: dubious: x | !y
+./include/asm-generic/bitops/find.h:112:45: warning: shift count is negative (-262080)
+./include/asm-generic/bitops/find.h:32:31: warning: shift count is negative (-262080)
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen8_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen8_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen8_write8' - different lock contexts for basic block
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+./include/linux/stddef.h:17:9: this was the original definition
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined
+/usr/lib/gcc/x86_64-linux-gnu/8/include/stddef.h:417:9: warning: preprocessor token offsetof redefined



^ permalink raw reply	[flat|nested] 145+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for Parallel submission aka multi-bb execbuf (rev3)
  2021-08-20 22:44 ` [Intel-gfx] " Matthew Brost
                   ` (29 preceding siblings ...)
  (?)
@ 2021-08-20 23:45 ` Patchwork
  -1 siblings, 0 replies; 145+ messages in thread
From: Patchwork @ 2021-08-20 23:45 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 15300 bytes --]

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev3)
URL   : https://patchwork.freedesktop.org/series/92789/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_10505 -> Patchwork_20861
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes introduced by Patchwork_20861 need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_20861, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/index.html

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_20861:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_close_race@basic-threads:
    - fi-skl-guc:         [PASS][1] -> [DMESG-FAIL][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-skl-guc/igt@gem_close_race@basic-threads.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-skl-guc/igt@gem_close_race@basic-threads.html
    - fi-kbl-7567u:       [PASS][3] -> [DMESG-FAIL][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-kbl-7567u/igt@gem_close_race@basic-threads.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-kbl-7567u/igt@gem_close_race@basic-threads.html
    - fi-kbl-7500u:       [PASS][5] -> [DMESG-FAIL][6]
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-kbl-7500u/igt@gem_close_race@basic-threads.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-kbl-7500u/igt@gem_close_race@basic-threads.html

  * igt@gem_exec_gttfill@basic:
    - fi-icl-y:           [PASS][7] -> [DMESG-WARN][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-icl-y/igt@gem_exec_gttfill@basic.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-icl-y/igt@gem_exec_gttfill@basic.html
    - fi-bdw-5557u:       [PASS][9] -> [DMESG-WARN][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-bdw-5557u/igt@gem_exec_gttfill@basic.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-bdw-5557u/igt@gem_exec_gttfill@basic.html
    - fi-rkl-11600:       [PASS][11] -> [DMESG-WARN][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-rkl-11600/igt@gem_exec_gttfill@basic.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-rkl-11600/igt@gem_exec_gttfill@basic.html
    - fi-icl-u2:          [PASS][13] -> [DMESG-WARN][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-icl-u2/igt@gem_exec_gttfill@basic.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-icl-u2/igt@gem_exec_gttfill@basic.html
    - fi-tgl-1115g4:      NOTRUN -> [DMESG-WARN][15]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-tgl-1115g4/igt@gem_exec_gttfill@basic.html

  * igt@gem_render_tiled_blits@basic:
    - fi-ivb-3770:        [PASS][16] -> [FAIL][17] +2 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-ivb-3770/igt@gem_render_tiled_blits@basic.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-ivb-3770/igt@gem_render_tiled_blits@basic.html
    - fi-hsw-4770:        [PASS][18] -> [FAIL][19] +2 similar issues
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-hsw-4770/igt@gem_render_tiled_blits@basic.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-hsw-4770/igt@gem_render_tiled_blits@basic.html

  * igt@runner@aborted:
    - fi-rkl-11600:       NOTRUN -> [FAIL][20]
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-rkl-11600/igt@runner@aborted.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@gem_exec_gttfill@basic:
    - {fi-ehl-2}:         [PASS][21] -> [DMESG-WARN][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-ehl-2/igt@gem_exec_gttfill@basic.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-ehl-2/igt@gem_exec_gttfill@basic.html
    - {fi-jsl-1}:         [PASS][23] -> [DMESG-WARN][24]
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-jsl-1/igt@gem_exec_gttfill@basic.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-jsl-1/igt@gem_exec_gttfill@basic.html
    - {fi-tgl-dsi}:       [PASS][25] -> [DMESG-WARN][26]
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-tgl-dsi/igt@gem_exec_gttfill@basic.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-tgl-dsi/igt@gem_exec_gttfill@basic.html

  * igt@gem_render_tiled_blits@basic:
    - {fi-hsw-gt1}:       [PASS][27] -> [FAIL][28] +1 similar issue
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-hsw-gt1/igt@gem_render_tiled_blits@basic.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-hsw-gt1/igt@gem_render_tiled_blits@basic.html

  * igt@runner@aborted:
    - {fi-jsl-1}:         NOTRUN -> [FAIL][29]
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-jsl-1/igt@runner@aborted.html
    - {fi-ehl-2}:         NOTRUN -> [FAIL][30]
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-ehl-2/igt@runner@aborted.html

  
New tests
---------

  New tests have been introduced between CI_DRM_10505 and Patchwork_20861:

### New IGT tests (2) ###

  * igt@i915_selftest@live@guc:
    - Statuses : 15 pass(s)
    - Exec time: [0.51, 5.15] s

  * igt@i915_selftest@live@guc_multi_lrc:
    - Statuses : 15 pass(s)
    - Exec time: [0.47, 5.22] s

  

Known issues
------------

  Here are the changes found in Patchwork_20861 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@cs-gfx:
    - fi-rkl-guc:         NOTRUN -> [SKIP][31] ([fdo#109315]) +17 similar issues
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-rkl-guc/igt@amdgpu/amd_basic@cs-gfx.html

  * igt@debugfs_test@read_all_entries:
    - fi-tgl-1115g4:      NOTRUN -> [DMESG-WARN][32] ([i915#4002])
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-tgl-1115g4/igt@debugfs_test@read_all_entries.html

  * igt@gem_exec_gttfill@basic:
    - fi-cfl-8109u:       [PASS][33] -> [DMESG-WARN][34] ([i915#1610])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-cfl-8109u/igt@gem_exec_gttfill@basic.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-cfl-8109u/igt@gem_exec_gttfill@basic.html
    - fi-kbl-8809g:       [PASS][35] -> [DMESG-WARN][36] ([i915#1610])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-kbl-8809g/igt@gem_exec_gttfill@basic.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-kbl-8809g/igt@gem_exec_gttfill@basic.html
    - fi-kbl-guc:         [PASS][37] -> [DMESG-WARN][38] ([i915#1610])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-kbl-guc/igt@gem_exec_gttfill@basic.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-kbl-guc/igt@gem_exec_gttfill@basic.html
    - fi-cml-u2:          [PASS][39] -> [DMESG-WARN][40] ([i915#1610])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-cml-u2/igt@gem_exec_gttfill@basic.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-cml-u2/igt@gem_exec_gttfill@basic.html
    - fi-cfl-8700k:       [PASS][41] -> [DMESG-WARN][42] ([i915#1610])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-cfl-8700k/igt@gem_exec_gttfill@basic.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-cfl-8700k/igt@gem_exec_gttfill@basic.html
    - fi-apl-guc:         [PASS][43] -> [DMESG-WARN][44] ([i915#1610])
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-apl-guc/igt@gem_exec_gttfill@basic.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-apl-guc/igt@gem_exec_gttfill@basic.html
    - fi-cfl-guc:         [PASS][45] -> [DMESG-WARN][46] ([i915#1610])
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-cfl-guc/igt@gem_exec_gttfill@basic.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-cfl-guc/igt@gem_exec_gttfill@basic.html
    - fi-skl-6700k2:      [PASS][47] -> [DMESG-WARN][48] ([i915#1610])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-skl-6700k2/igt@gem_exec_gttfill@basic.html
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-skl-6700k2/igt@gem_exec_gttfill@basic.html

  * igt@runner@aborted:
    - fi-cfl-8700k:       NOTRUN -> [FAIL][49] ([i915#2426] / [i915#3363])
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-cfl-8700k/igt@runner@aborted.html
    - fi-cfl-8109u:       NOTRUN -> [FAIL][50] ([i915#2426] / [i915#3363])
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-cfl-8109u/igt@runner@aborted.html
    - fi-icl-u2:          NOTRUN -> [FAIL][51] ([i915#2426] / [i915#3363] / [i915#3690])
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-icl-u2/igt@runner@aborted.html
    - fi-bdw-5557u:       NOTRUN -> [FAIL][52] ([i915#2426])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-bdw-5557u/igt@runner@aborted.html
    - fi-kbl-7500u:       NOTRUN -> [FAIL][53] ([i915#2722] / [i915#3363])
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-kbl-7500u/igt@runner@aborted.html
    - fi-kbl-guc:         NOTRUN -> [FAIL][54] ([i915#2426] / [i915#3363])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-kbl-guc/igt@runner@aborted.html
    - fi-cml-u2:          NOTRUN -> [FAIL][55] ([i915#2082] / [i915#2426] / [i915#3363])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-cml-u2/igt@runner@aborted.html
    - fi-tgl-1115g4:      NOTRUN -> [FAIL][56] ([i915#3690])
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-tgl-1115g4/igt@runner@aborted.html
    - fi-cfl-guc:         NOTRUN -> [FAIL][57] ([i915#2426] / [i915#3363])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-cfl-guc/igt@runner@aborted.html
    - fi-icl-y:           NOTRUN -> [FAIL][58] ([i915#3690])
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-icl-y/igt@runner@aborted.html
    - fi-kbl-7567u:       NOTRUN -> [FAIL][59] ([i915#2722] / [i915#3363])
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-kbl-7567u/igt@runner@aborted.html
    - fi-skl-guc:         NOTRUN -> [FAIL][60] ([i915#2722] / [i915#3363])
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-skl-guc/igt@runner@aborted.html
    - fi-skl-6700k2:      NOTRUN -> [FAIL][61] ([i915#2426] / [i915#3363])
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-skl-6700k2/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@workarounds:
    - fi-rkl-guc:         [INCOMPLETE][62] ([i915#3920]) -> [PASS][63]
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-rkl-guc/igt@i915_selftest@live@workarounds.html
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-rkl-guc/igt@i915_selftest@live@workarounds.html

  
#### Warnings ####

  * igt@runner@aborted:
    - fi-kbl-8809g:       [FAIL][64] ([i915#180] / [i915#3363]) -> [FAIL][65] ([i915#2426] / [i915#3363])
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10505/fi-kbl-8809g/igt@runner@aborted.html
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/fi-kbl-8809g/igt@runner@aborted.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [i915#1610]: https://gitlab.freedesktop.org/drm/intel/issues/1610
  [i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
  [i915#2082]: https://gitlab.freedesktop.org/drm/intel/issues/2082
  [i915#2426]: https://gitlab.freedesktop.org/drm/intel/issues/2426
  [i915#2722]: https://gitlab.freedesktop.org/drm/intel/issues/2722
  [i915#2932]: https://gitlab.freedesktop.org/drm/intel/issues/2932
  [i915#3363]: https://gitlab.freedesktop.org/drm/intel/issues/3363
  [i915#3690]: https://gitlab.freedesktop.org/drm/intel/issues/3690
  [i915#3920]: https://gitlab.freedesktop.org/drm/intel/issues/3920
  [i915#4002]: https://gitlab.freedesktop.org/drm/intel/issues/4002


Participating hosts (40 -> 36)
------------------------------

  Additional (1): fi-tgl-1115g4 
  Missing    (5): fi-ilk-m540 fi-hsw-4200u fi-bsw-cyan bat-jsl-1 fi-bdw-samus 


Build changes
-------------

  * Linux: CI_DRM_10505 -> Patchwork_20861

  CI-20190529: 20190529
  CI_DRM_10505: 7c36ed237585ed2f645439e62dafccac070d5e33 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6181: e7a9ab2f21a67b1ab3f4093ec0bd775647308ba6 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_20861: 17b0d07e7aecf1b25b5386d7c461d6b8ab861fb3 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

17b0d07e7aec drm/i915/execlists: Weak parallel submission support for execlists
668649e1b343 drm/i915: Enable multi-bb execbuf
41dc4f27399e drm/i915/guc: Handle errors in multi-lrc requests
1deecbc917a6 drm/i915: Multi-BB execbuf
2f8c2ac84f38 drm/i915/guc: Implement no mid batch preemption for multi-lrc
166ae35845ec drm/i915/guc: Add basic GuC multi-lrc selftest
d7635607ca11 drm/i915/doc: Update parallel submit doc to point to i915_drm.h
24d24d71c346 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
f3e9fcd665da drm/i915: Fix bug in user proto-context creation that leaked contexts
28a6b590e69c drm/i915/guc: Update debugfs for GuC multi-lrc
d20a9f8066d0 drm/i915/guc: Implement multi-lrc reset
f0eaf349580b drm/i915/guc: Insert submit fences between requests in parent-child relationship
279b43791c6d drm/i915/guc: Implement multi-lrc submission
4397e2c32688 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
f671cf6be29a drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
2b3d3ef5cb7c drm/i915/guc: Add multi-lrc context registration
bff8a93386b8 drm/i915/guc: Implement parallel context pin / unpin functions
02d0b3f0499f drm/i915/guc: Introduce context parent-child relationship
b398df972aaa drm/i915: Expose logical engine instance to user
0222d44e38b8 drm/i915: Add logical engine mapping
cb16e92153bb drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
e0617461e1ec drm/i915/guc: Take engine PM when a context is pinned with GuC submission
08b1d791ac1a drm/i915: Add GT PM unpark worker
48e7785b7cb4 drm/i915/guc: Take GT PM ref when deregistering context
ce904de73b0d drm/i915/guc: Connect the number of guc_ids to debugfs
4a9a78aa06a7 drm/i915/guc: Allow flexible number of context ids
77206ddfd7e8 drm/i915/guc: Squash Clean up GuC CI failures, simplify locking, and kernel DOC

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20861/index.html

[-- Attachment #2: Type: text/html, Size: 18834 bytes --]

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 15/27] drm/i915/guc: Implement multi-lrc submission
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-08-21 14:04     ` kernel test robot
  -1 siblings, 0 replies; 145+ messages in thread
From: kernel test robot @ 2021-08-21 14:04 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: clang-built-linux, kbuild-all, daniel.vetter, tony.ye, zhengguo.xu

[-- Attachment #1: Type: text/plain, Size: 2755 bytes --]

Hi Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on drm-tip/drm-tip next-20210820]
[cannot apply to drm-exynos/exynos-drm-next tegra-drm/drm/tegra/for-next linus/master drm/drm-next v5.14-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-buildonly-randconfig-r002-20210821 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9e9d70591e72fc6762b4b9a226b68ed1307419bf)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/55917f6ee7575ffb033e6b19a9eb38c3210e14db
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
        git checkout 55917f6ee7575ffb033e6b19a9eb38c3210e14db
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1467:6: warning: no previous prototype for function 'need_tasklet' [-Wmissing-prototypes]
   bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
        ^
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1467:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
   ^
   static 
   1 warning generated.


vim +/need_tasklet +1467 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c

  1466	
> 1467	bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
  1468	{
  1469		struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
  1470		struct intel_context *ce = request_to_scheduling_context(rq);
  1471	
  1472		return submission_disabled(guc) || guc->stalled_request ||
  1473			!i915_sched_engine_is_empty(sched_engine) ||
  1474			!lrc_desc_registered(guc, ce->guc_id.id);
  1475	}
  1476	
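
The robot's suggestion is the natural fix here: need_tasklet() has no
callers outside intel_guc_submission.c, so giving it internal linkage
satisfies -Wmissing-prototypes. A minimal sketch of that change (an
editor's illustration, not the author's confirmed fix):

	/* Internal linkage: the helper is only used in this file. */
	static bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
	{
		struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
		struct intel_context *ce = request_to_scheduling_context(rq);

		return submission_disabled(guc) || guc->stalled_request ||
			!i915_sched_engine_is_empty(sched_engine) ||
			!lrc_desc_registered(guc, ce->guc_id.id);
	}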

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39742 bytes --]

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 24/27] drm/i915: Multi-BB execbuf
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
@ 2021-08-21 19:01     ` kernel test robot
  -1 siblings, 0 replies; 145+ messages in thread
From: kernel test robot @ 2021-08-21 19:01 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: clang-built-linux, kbuild-all, daniel.vetter, tony.ye, zhengguo.xu

[-- Attachment #1: Type: text/plain, Size: 4662 bytes --]

Hi Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on drm-tip/drm-tip drm-exynos/exynos-drm-next next-20210820]
[cannot apply to tegra-drm/drm/tegra/for-next linus/master drm/drm-next v5.14-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-buildonly-randconfig-r002-20210821 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9e9d70591e72fc6762b4b9a226b68ed1307419bf)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/7e7ae2111b2855ac3d63aa5c806c6936daaa6bbc
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
        git checkout 7e7ae2111b2855ac3d63aa5c806c6936daaa6bbc
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:608:20: warning: comparison of array 'eb->batch_len' equal to a null pointer is always false [-Wtautological-pointer-compare]
                   if (unlikely(eb->batch_len == 0)) { /* impossible! */
                                ~~~~^~~~~~~~~    ~
   include/linux/compiler.h:78:42: note: expanded from macro 'unlikely'
   # define unlikely(x)    __builtin_expect(!!(x), 0)
                                               ^
   1 warning generated.


vim +608 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c

   548	
   549	static int
   550	eb_add_vma(struct i915_execbuffer *eb,
   551		   unsigned int *current_batch,
   552		   unsigned int i,
   553		   struct i915_vma *vma)
   554	{
   555		struct drm_i915_private *i915 = eb->i915;
   556		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
   557		struct eb_vma *ev = &eb->vma[i];
   558	
   559		ev->vma = vma;
   560		ev->exec = entry;
   561		ev->flags = entry->flags;
   562	
   563		if (eb->lut_size > 0) {
   564			ev->handle = entry->handle;
   565			hlist_add_head(&ev->node,
   566				       &eb->buckets[hash_32(entry->handle,
   567							    eb->lut_size)]);
   568		}
   569	
   570		if (entry->relocation_count)
   571			list_add_tail(&ev->reloc_link, &eb->relocs);
   572	
   573		/*
   574		 * SNA is doing fancy tricks with compressing batch buffers, which leads
   575		 * to negative relocation deltas. Usually that works out ok since the
   576		 * relocate address is still positive, except when the batch is placed
   577		 * very low in the GTT. Ensure this doesn't happen.
   578		 *
   579		 * Note that actual hangs have only been observed on gen7, but for
   580		 * paranoia do it everywhere.
   581		 */
   582		if (is_batch_buffer(eb, i)) {
   583			if (entry->relocation_count &&
   584			    !(ev->flags & EXEC_OBJECT_PINNED))
   585				ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
   586			if (eb->reloc_cache.has_fence)
   587				ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
   588	
   589			eb->batches[*current_batch] = ev;
   590	
   591			if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
   592				drm_dbg(&i915->drm,
   593					"Attempting to use self-modifying batch buffer\n");
   594				return -EINVAL;
   595			}
   596	
   597			if (range_overflows_t(u64,
   598					      eb->batch_start_offset,
   599					      eb->args->batch_len,
   600					      ev->vma->size)) {
   601				drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
   602				return -EINVAL;
   603			}
   604	
   605			if (eb->args->batch_len == 0)
   606				eb->batch_len[*current_batch] = ev->vma->size -
   607					eb->batch_start_offset;
 > 608			if (unlikely(eb->batch_len == 0)) { /* impossible! */
   609				drm_dbg(&i915->drm, "Invalid batch length\n");
   610				return -EINVAL;
   611			}
   612	
   613			++*current_batch;
   614		}
   615	
   616		return 0;
   617	}
   618	
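
The warning is genuine: batch_len here is an array member of struct
i915_execbuffer, so "eb->batch_len == 0" compares the array's address
against NULL and is always false, silently skipping the zero-length
check. A plausible correction is to test the per-batch entry that was
just computed (an editor's sketch, assuming the entry is populated
earlier when args->batch_len is non-zero; not the author's confirmed
fix):

	if (eb->args->batch_len == 0)
		eb->batch_len[*current_batch] = ev->vma->size -
			eb->batch_start_offset;
	/* Index the entry for this batch rather than the array itself. */
	if (unlikely(eb->batch_len[*current_batch] == 0)) { /* impossible! */
		drm_dbg(&i915->drm, "Invalid batch length\n");
		return -EINVAL;
	}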

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39742 bytes --]

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 15/27] drm/i915/guc: Implement multi-lrc submission
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-08-22  2:18     ` kernel test robot
  -1 siblings, 0 replies; 145+ messages in thread
From: kernel test robot @ 2021-08-22  2:18 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: kbuild-all, daniel.vetter, tony.ye, zhengguo.xu

[-- Attachment #1: Type: text/plain, Size: 2233 bytes --]

Hi Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on drm-tip/drm-tip drm-exynos/exynos-drm-next next-20210820]
[cannot apply to tegra-drm/drm/tegra/for-next linus/master drm/drm-next v5.14-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-randconfig-a011-20210821 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/55917f6ee7575ffb033e6b19a9eb38c3210e14db
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
        git checkout 55917f6ee7575ffb033e6b19a9eb38c3210e14db
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1467:6: warning: no previous prototype for 'need_tasklet' [-Wmissing-prototypes]
    1467 | bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
         |      ^~~~~~~~~~~~


vim +/need_tasklet +1467 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c

  1466	
> 1467	bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
  1468	{
  1469		struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
  1470		struct intel_context *ce = request_to_scheduling_context(rq);
  1471	
  1472		return submission_disabled(guc) || guc->stalled_request ||
  1473			!i915_sched_engine_is_empty(sched_engine) ||
  1474			!lrc_desc_registered(guc, ce->guc_id.id);
  1475	}
  1476	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37328 bytes --]

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 20/27] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
  2021-08-20 22:44   ` Matthew Brost
@ 2021-08-29  4:00     ` kernel test robot
  -1 siblings, 0 replies; 145+ messages in thread
From: kernel test robot @ 2021-08-29  4:00 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: kbuild-all, daniel.vetter, tony.ye, zhengguo.xu

[-- Attachment #1: Type: text/plain, Size: 4120 bytes --]

Hi Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on drm-tip/drm-tip drm-exynos/exynos-drm-next next-20210827]
[cannot apply to tegra-drm/drm/tegra/for-next linus/master drm/drm-next v5.14-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-rhel-8.3-kselftests (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-348-gf0e6938b-dirty
        # https://github.com/0day-ci/linux/commit/0741c4627df7b17e3e1b06c5967aed4371c688f7
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
        git checkout 0741c4627df7b17e3e1b06c5967aed4371c688f7
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> drivers/gpu/drm/i915/gem/i915_gem_context.c:1411:26: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct i915_gem_engines *e @@     got struct i915_gem_engines [noderef] __rcu *engines @@
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1411:26: sparse:     expected struct i915_gem_engines *e
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1411:26: sparse:     got struct i915_gem_engines [noderef] __rcu *engines
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1626:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct i915_address_space *vm @@     got struct i915_address_space [noderef] __rcu *vm @@
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1626:34: sparse:     expected struct i915_address_space *vm
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1626:34: sparse:     got struct i915_address_space [noderef] __rcu *vm

vim +1411 drivers/gpu/drm/i915/gem/i915_gem_context.c

  1404	
  1405	static void context_close(struct i915_gem_context *ctx)
  1406	{
  1407		struct i915_address_space *vm;
  1408	
  1409		/* Flush any concurrent set_engines() */
  1410		mutex_lock(&ctx->engines_mutex);
> 1411		unpin_engines(ctx->engines);
  1412		engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
  1413		i915_gem_context_set_closed(ctx);
  1414		mutex_unlock(&ctx->engines_mutex);
  1415	
  1416		mutex_lock(&ctx->mutex);
  1417	
  1418		set_closed_name(ctx);
  1419	
  1420		vm = i915_gem_context_vm(ctx);
  1421		if (vm)
  1422			i915_vm_close(vm);
  1423	
  1424		if (ctx->syncobj)
  1425			drm_syncobj_put(ctx->syncobj);
  1426	
  1427		ctx->file_priv = ERR_PTR(-EBADF);
  1428	
  1429		/*
  1430		 * The LUT uses the VMA as a backpointer to unref the object,
  1431		 * so we need to clear the LUT before we close all the VMA (inside
  1432		 * the ppgtt).
  1433		 */
  1434		lut_close(ctx);
  1435	
  1436		spin_lock(&ctx->i915->gem.contexts.lock);
  1437		list_del(&ctx->link);
  1438		spin_unlock(&ctx->i915->gem.contexts.lock);
  1439	
  1440		mutex_unlock(&ctx->mutex);
  1441	
  1442		/*
  1443		 * If the user has disabled hangchecking, we can not be sure that
  1444		 * the batches will ever complete after the context is closed,
  1445		 * keeping the context and all resources pinned forever. So in this
  1446		 * case we opt to forcibly kill off all remaining requests on
  1447		 * context close.
  1448		 */
  1449		kill_context(ctx);
  1450	
  1451		i915_gem_context_put(ctx);
  1452	}
  1453	
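
sparse complains because ctx->engines carries the __rcu address-space
annotation but is passed to unpin_engines() as a plain pointer. With
engines_mutex held at this point, concurrent set_engines() is excluded,
so the conventional remedy is rcu_dereference_protected(), which both
documents the locking rule and strips the annotation. A minimal sketch,
assuming unpin_engines() keeps a plain struct i915_gem_engines *
parameter (an editor's illustration, not the author's confirmed fix):

	/* engines_mutex is held, so this unannotated access is safe. */
	mutex_lock(&ctx->engines_mutex);
	unpin_engines(rcu_dereference_protected(ctx->engines,
				lockdep_is_held(&ctx->engines_mutex)));
	engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));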

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 42173 bytes --]

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 20/27] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
@ 2021-08-29  4:00     ` kernel test robot
  0 siblings, 0 replies; 145+ messages in thread
From: kernel test robot @ 2021-08-29  4:00 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4215 bytes --]

Hi Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on drm-tip/drm-tip drm-exynos/exynos-drm-next next-20210827]
[cannot apply to tegra-drm/drm/tegra/for-next linus/master drm/drm-next v5.14-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-rhel-8.3-kselftests (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-348-gf0e6938b-dirty
        # https://github.com/0day-ci/linux/commit/0741c4627df7b17e3e1b06c5967aed4371c688f7
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
        git checkout 0741c4627df7b17e3e1b06c5967aed4371c688f7
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> drivers/gpu/drm/i915/gem/i915_gem_context.c:1411:26: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct i915_gem_engines *e @@     got struct i915_gem_engines [noderef] __rcu *engines @@
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1411:26: sparse:     expected struct i915_gem_engines *e
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1411:26: sparse:     got struct i915_gem_engines [noderef] __rcu *engines
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1626:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct i915_address_space *vm @@     got struct i915_address_space [noderef] __rcu *vm @@
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1626:34: sparse:     expected struct i915_address_space *vm
   drivers/gpu/drm/i915/gem/i915_gem_context.c:1626:34: sparse:     got struct i915_address_space [noderef] __rcu *vm

vim +1411 drivers/gpu/drm/i915/gem/i915_gem_context.c

  1404	
  1405	static void context_close(struct i915_gem_context *ctx)
  1406	{
  1407		struct i915_address_space *vm;
  1408	
  1409		/* Flush any concurrent set_engines() */
  1410		mutex_lock(&ctx->engines_mutex);
> 1411		unpin_engines(ctx->engines);
  1412		engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
  1413		i915_gem_context_set_closed(ctx);
  1414		mutex_unlock(&ctx->engines_mutex);
  1415	
  1416		mutex_lock(&ctx->mutex);
  1417	
  1418		set_closed_name(ctx);
  1419	
  1420		vm = i915_gem_context_vm(ctx);
  1421		if (vm)
  1422			i915_vm_close(vm);
  1423	
  1424		if (ctx->syncobj)
  1425			drm_syncobj_put(ctx->syncobj);
  1426	
  1427		ctx->file_priv = ERR_PTR(-EBADF);
  1428	
  1429		/*
  1430		 * The LUT uses the VMA as a backpointer to unref the object,
  1431		 * so we need to clear the LUT before we close all the VMA (inside
  1432		 * the ppgtt).
  1433		 */
  1434		lut_close(ctx);
  1435	
  1436		spin_lock(&ctx->i915->gem.contexts.lock);
  1437		list_del(&ctx->link);
  1438		spin_unlock(&ctx->i915->gem.contexts.lock);
  1439	
  1440		mutex_unlock(&ctx->mutex);
  1441	
  1442		/*
  1443		 * If the user has disabled hangchecking, we can not be sure that
  1444		 * the batches will ever complete after the context is closed,
  1445		 * keeping the context and all resources pinned forever. So in this
  1446		 * case we opt to forcibly kill off all remaining requests on
  1447		 * context close.
  1448		 */
  1449		kill_context(ctx);
  1450	
  1451		i915_gem_context_put(ctx);
  1452	}
  1453	
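
The warning is about address spaces: ctx->engines is annotated __rcu, so
passing it straight to unpin_engines() at line 1411 discards that
annotation. One plausible way to satisfy sparse, assuming engines_mutex
is what protects the pointer here (a sketch, not taken from the series):

        /* engines_mutex is held, so the __rcu pointer is stable */
        unpin_engines(rcu_dereference_protected(ctx->engines,
                                lockdep_is_held(&ctx->engines_mutex)));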

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 42173 bytes --]

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 24/27] drm/i915: Multi-BB execbuf
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
@ 2021-08-30  3:46     ` kernel test robot
  -1 siblings, 0 replies; 145+ messages in thread
From: kernel test robot @ 2021-08-30  3:46 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel
  Cc: kbuild-all, daniel.vetter, tony.ye, zhengguo.xu

[-- Attachment #1: Type: text/plain, Size: 4134 bytes --]

Hi Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on drm-tip/drm-tip drm-exynos/exynos-drm-next next-20210827]
[cannot apply to tegra-drm/drm/tegra/for-next linus/master drm/drm-next v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-rhel-8.3-kselftests (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-348-gf0e6938b-dirty
        # https://github.com/0day-ci/linux/commit/7e7ae2111b2855ac3d63aa5c806c6936daaa6bbc
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matthew-Brost/Parallel-submission-aka-multi-bb-execbuf/20210821-065348
        git checkout 7e7ae2111b2855ac3d63aa5c806c6936daaa6bbc
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:608:21: sparse: sparse: Using plain integer as NULL pointer

vim +608 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c

   548	
   549	static int
   550	eb_add_vma(struct i915_execbuffer *eb,
   551		   unsigned int *current_batch,
   552		   unsigned int i,
   553		   struct i915_vma *vma)
   554	{
   555		struct drm_i915_private *i915 = eb->i915;
   556		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
   557		struct eb_vma *ev = &eb->vma[i];
   558	
   559		ev->vma = vma;
   560		ev->exec = entry;
   561		ev->flags = entry->flags;
   562	
   563		if (eb->lut_size > 0) {
   564			ev->handle = entry->handle;
   565			hlist_add_head(&ev->node,
   566				       &eb->buckets[hash_32(entry->handle,
   567							    eb->lut_size)]);
   568		}
   569	
   570		if (entry->relocation_count)
   571			list_add_tail(&ev->reloc_link, &eb->relocs);
   572	
   573		/*
   574		 * SNA is doing fancy tricks with compressing batch buffers, which leads
   575		 * to negative relocation deltas. Usually that works out ok since the
   576		 * relocate address is still positive, except when the batch is placed
   577		 * very low in the GTT. Ensure this doesn't happen.
   578		 *
   579		 * Note that actual hangs have only been observed on gen7, but for
   580		 * paranoia do it everywhere.
   581		 */
   582		if (is_batch_buffer(eb, i)) {
   583			if (entry->relocation_count &&
   584			    !(ev->flags & EXEC_OBJECT_PINNED))
   585				ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
   586			if (eb->reloc_cache.has_fence)
   587				ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
   588	
   589			eb->batches[*current_batch] = ev;
   590	
   591			if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
   592				drm_dbg(&i915->drm,
   593					"Attempting to use self-modifying batch buffer\n");
   594				return -EINVAL;
   595			}
   596	
   597			if (range_overflows_t(u64,
   598					      eb->batch_start_offset,
   599					      eb->args->batch_len,
   600					      ev->vma->size)) {
   601				drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
   602				return -EINVAL;
   603			}
   604	
   605			if (eb->args->batch_len == 0)
   606				eb->batch_len[*current_batch] = ev->vma->size -
   607					eb->batch_start_offset;
 > 608			if (unlikely(eb->batch_len == 0)) { /* impossible! */
   609				drm_dbg(&i915->drm, "Invalid batch length\n");
   610				return -EINVAL;
   611			}
   612	
   613			++*current_batch;
   614		}
   615	
   616		return 0;
   617	}
   618	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 42173 bytes --]

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 02/27] drm/i915/guc: Allow flexible number of context ids
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-09 22:13   ` John Harrison
  2021-09-10  0:14     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-09 22:13 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Number of available GuC contexts ids might be limited.
> Stop referring in code to macro and use variable instead.
>
> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h            |  4 ++++
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++++++------
>   2 files changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 112dd29a63fe..6fd2719d1b75 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -60,6 +60,10 @@ struct intel_guc {
>   	spinlock_t contexts_lock;
>   	/** @guc_ids: used to allocate new guc_ids */
>   	struct ida guc_ids;
> +	/** @num_guc_ids: number of guc_ids that can be used */
> +	u32 num_guc_ids;
> +	/** @max_guc_ids: max number of guc_ids that can be used */
> +	u32 max_guc_ids;
How do these differ? The description needs to say how or why 'num' might 
be less than 'max'. And surely 'max' is not the count 'that can be 
used'? Num is how many can be used, but max is how many are physically 
possible or something?

John.

>   	/**
>   	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
>   	 */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 46158d996bf6..8235e49bb347 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -344,7 +344,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>   {
>   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
>   
> -	GEM_BUG_ON(index >= GUC_MAX_LRC_DESCRIPTORS);
> +	GEM_BUG_ON(index >= guc->max_guc_ids);
>   
>   	return &base[index];
>   }
> @@ -353,7 +353,7 @@ static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
>   {
>   	struct intel_context *ce = xa_load(&guc->context_lookup, id);
>   
> -	GEM_BUG_ON(id >= GUC_MAX_LRC_DESCRIPTORS);
> +	GEM_BUG_ON(id >= guc->max_guc_ids);
>   
>   	return ce;
>   }
> @@ -363,8 +363,7 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
>   	u32 size;
>   	int ret;
>   
> -	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
> -			  GUC_MAX_LRC_DESCRIPTORS);
> +	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
>   	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
>   					     (void **)&guc->lrc_desc_pool_vaddr);
>   	if (ret)
> @@ -1193,7 +1192,7 @@ static void guc_submit_request(struct i915_request *rq)
>   static int new_guc_id(struct intel_guc *guc)
>   {
>   	return ida_simple_get(&guc->guc_ids, 0,
> -			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
> +			      guc->num_guc_ids, GFP_KERNEL |
>   			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>   }
>   
> @@ -2704,6 +2703,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
>   
>   void intel_guc_submission_init_early(struct intel_guc *guc)
>   {
> +	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> +	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
>   	guc->submission_supported = __guc_submission_supported(guc);
>   	guc->submission_selected = __guc_submission_selected(guc);
>   }
> @@ -2713,7 +2714,7 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>   {
>   	struct intel_context *ce;
>   
> -	if (unlikely(desc_idx >= GUC_MAX_LRC_DESCRIPTORS)) {
> +	if (unlikely(desc_idx >= guc->max_guc_ids)) {
>   		drm_err(&guc_to_gt(guc)->i915->drm,
>   			"Invalid desc_idx %u", desc_idx);
>   		return NULL;
> @@ -3063,6 +3064,8 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>   
>   	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
>   		   atomic_read(&guc->outstanding_submission_g2h));
> +	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
> +	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
>   	drm_printf(p, "GuC tasklet count: %u\n\n",
>   		   atomic_read(&sched_engine->tasklet.count));
>   
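
For what it is worth, the distinction the diff itself implies (a reading
of the hunks above, not the author's wording): num_guc_ids caps what the
ida allocator may hand out, while max_guc_ids sizes the descriptor pool
and bounds-checks ids coming back from the GuC:

        /* allocation side: capped by the (possibly reduced) num_guc_ids */
        ida_simple_get(&guc->guc_ids, 0, guc->num_guc_ids,
                       GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);

        /* lookup/validation side: bounded by the pool size, max_guc_ids */
        GEM_BUG_ON(index >= guc->max_guc_ids);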


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 03/27] drm/i915/guc: Connect the number of guc_ids to debugfs
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-09 22:16   ` John Harrison
  2021-09-10  0:16     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-09 22:16 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> For testing purposes it may make sense to reduce the number of guc_ids
> available to be allocated. Add debugfs support for setting the number of
> guc_ids.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
It 'may make sense'? Is there an actual IGT/selftest test that uses this 
and needs it in order to not take several hours to run or something? 
Seems wrong to be adding a debugfs interface for something that would 
only be used for testing by internal developers who can quite easily just 
edit a define in the code to simplify their testing.

John.


> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    | 31 +++++++++++++++++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  3 +-
>   2 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> index 887c8c8f35db..b88d343ee432 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> @@ -71,12 +71,43 @@ static bool intel_eval_slpc_support(void *data)
>   	return intel_guc_slpc_is_used(guc);
>   }
>   
> +static int guc_num_id_get(void *data, u64 *val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	*val = guc->num_guc_ids;
> +
> +	return 0;
> +}
> +
> +static int guc_num_id_set(void *data, u64 val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	if (val > guc->max_guc_ids)
> +		val = guc->max_guc_ids;
> +	else if (val < 256)
> +		val = 256;
> +
> +	guc->num_guc_ids = val;
> +
> +	return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
> +
>   void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
>   {
>   	static const struct debugfs_gt_file files[] = {
>   		{ "guc_info", &guc_info_fops, NULL },
>   		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
>   		{ "guc_slpc_info", &guc_slpc_info_fops, &intel_eval_slpc_support},
> +		{ "guc_num_id", &guc_num_id_fops, NULL },
>   	};
>   
>   	if (!intel_guc_is_supported(guc))
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 8235e49bb347..68742b612692 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2716,7 +2716,8 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>   
>   	if (unlikely(desc_idx >= guc->max_guc_ids)) {
>   		drm_err(&guc_to_gt(guc)->i915->drm,
> -			"Invalid desc_idx %u", desc_idx);
> +			"Invalid desc_idx %u, max %u",
> +			desc_idx, guc->max_guc_ids);
>   		return NULL;
>   	}
>   
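
For reference, poking the knob from userspace would look roughly like the
snippet below; the exact debugfs path is an assumption based on where the
other GuC files are registered:

        #include <fcntl.h>
        #include <unistd.h>

        /* assumed path: /sys/kernel/debug/dri/0/gt/uc/guc_num_id */
        int fd = open("/sys/kernel/debug/dri/0/gt/uc/guc_num_id", O_WRONLY);

        if (fd >= 0) {
                /* guc_num_id_set() clamps values to [256, max_guc_ids] */
                write(fd, "256", 3);
                close(fd);
        }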


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 04/27] drm/i915/guc: Take GT PM ref when deregistering context
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-09 22:28   ` John Harrison
  2021-09-10  0:21     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-09 22:28 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Taking a PM reference to prevent intel_gt_wait_for_idle from
> short-circuiting while a deregister context H2G is in flight.
>
> FIXME: Move locking / structure changes into a different patch
This split needs to be done. It would also be helpful to have a more 
detailed explanation of what is going on and why the change is 
necessary. Is this a generic problem and the code currently in the tree 
is broken? Or is it something specific to parallel submission and isn't 
actually a problem until later in the series?

John.

>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  13 +-
>   drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
>   drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  13 ++
>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  46 ++--
>   .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  13 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 212 +++++++++++-------
>   8 files changed, 199 insertions(+), 106 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index adfe49b53b1b..c8595da64ad8 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   	ce->guc_id.id = GUC_INVALID_LRC_ID;
>   	INIT_LIST_HEAD(&ce->guc_id.link);
>   
> +	INIT_LIST_HEAD(&ce->destroyed_link);
> +
>   	/*
>   	 * Initialize fence to be complete as this is expected to be complete
>   	 * unless there is a pending schedule disable outstanding.
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 80bbdc7810f6..fd338a30617e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -190,22 +190,29 @@ struct intel_context {
>   		/**
>   		 * @id: unique handle which is used to communicate information
>   		 * with the GuC about this context, protected by
> -		 * guc->contexts_lock
> +		 * guc->submission_state.lock
>   		 */
>   		u16 id;
>   		/**
>   		 * @ref: the number of references to the guc_id, when
>   		 * transitioning in and out of zero protected by
> -		 * guc->contexts_lock
> +		 * guc->submission_state.lock
>   		 */
>   		atomic_t ref;
>   		/**
>   		 * @link: in guc->guc_id_list when the guc_id has no refs but is
> -		 * still valid, protected by guc->contexts_lock
> +		 * still valid, protected by guc->submission_state.lock
>   		 */
>   		struct list_head link;
>   	} guc_id;
>   
> +	/**
> +	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
> +	 * list when context is pending to be destroyed (deregistered with the
> +	 * GuC), protected by guc->submission_state.lock
> +	 */
> +	struct list_head destroyed_link;
> +
>   #ifdef CONFIG_DRM_I915_SELFTEST
>   	/**
>   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> index 70ea46d6cfb0..17a5028ea177 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> @@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
>   	return intel_wakeref_is_active(&engine->wakeref);
>   }
>   
> +static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
> +{
> +	__intel_wakeref_get(&engine->wakeref);
> +}
> +
>   static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
>   {
>   	intel_wakeref_get(&engine->wakeref);
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> index d0588d8aaa44..a17bf0d4592b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> @@ -41,6 +41,19 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
>   	intel_wakeref_put_async(&gt->wakeref);
>   }
>   
> +#define with_intel_gt_pm(gt, tmp) \
> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> +	     intel_gt_pm_put(gt), tmp = 0)
> +#define with_intel_gt_pm_async(gt, tmp) \
> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> +	     intel_gt_pm_put_async(gt), tmp = 0)
> +#define with_intel_gt_pm_if_awake(gt, tmp) \
> +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
> +	     intel_gt_pm_put(gt), tmp = 0)
> +#define with_intel_gt_pm_if_awake_async(gt, tmp) \
> +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
> +	     intel_gt_pm_put_async(gt), tmp = 0)
> +
>   static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
>   {
>   	return intel_wakeref_wait_for_idle(&gt->wakeref);
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index 8ff582222aff..ba10bd374cee 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -142,6 +142,7 @@ enum intel_guc_action {
>   	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
>   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
> +	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
>   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>   	INTEL_GUC_ACTION_LIMIT
>   };
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 6fd2719d1b75..7358883f1540 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -53,21 +53,37 @@ struct intel_guc {
>   		void (*disable)(struct intel_guc *guc);
>   	} interrupts;
>   
> -	/**
> -	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
> -	 * ce->guc_id.ref when transitioning in and out of zero
> -	 */
> -	spinlock_t contexts_lock;
> -	/** @guc_ids: used to allocate new guc_ids */
> -	struct ida guc_ids;
> -	/** @num_guc_ids: number of guc_ids that can be used */
> -	u32 num_guc_ids;
> -	/** @max_guc_ids: max number of guc_ids that can be used */
> -	u32 max_guc_ids;
> -	/**
> -	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> -	 */
> -	struct list_head guc_id_list;
> +	struct {
> +		/**
> +		 * @lock: protects everything in submission_state, ce->guc_id,
> +		 * and ce->destroyed_link
> +		 */
> +		spinlock_t lock;
> +		/**
> +		 * @guc_ids: used to allocate new guc_ids
> +		 */
> +		struct ida guc_ids;
> +		/** @num_guc_ids: number of guc_ids that can be used */
> +		u32 num_guc_ids;
> +		/** @max_guc_ids: max number of guc_ids that can be used */
> +		u32 max_guc_ids;
> +		/**
> +		 * @guc_id_list: list of intel_context with valid guc_ids but no
> +		 * refs
> +		 */
> +		struct list_head guc_id_list;
> +		/**
> +		 * @destroyed_contexts: list of contexts waiting to be destroyed
> +		 * (deregistered with the GuC)
> +		 */
> +		struct list_head destroyed_contexts;
> +		/**
> +		 * @destroyed_worker: worker to deregister contexts, need as we
> +		 * need to take a GT PM reference and can't from destroy
> +		 * function as it might be in an atomic context (no sleeping)
> +		 */
> +		struct work_struct destroyed_worker;
> +	} submission_state;
>   
>   	bool submission_supported;
>   	bool submission_selected;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> index b88d343ee432..27655460ee84 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> @@ -78,7 +78,7 @@ static int guc_num_id_get(void *data, u64 *val)
>   	if (!intel_guc_submission_is_used(guc))
>   		return -ENODEV;
>   
> -	*val = guc->num_guc_ids;
> +	*val = guc->submission_state.num_guc_ids;
>   
>   	return 0;
>   }
> @@ -86,16 +86,21 @@ static int guc_num_id_get(void *data, u64 *val)
>   static int guc_num_id_set(void *data, u64 val)
>   {
>   	struct intel_guc *guc = data;
> +	unsigned long flags;
>   
>   	if (!intel_guc_submission_is_used(guc))
>   		return -ENODEV;
>   
> -	if (val > guc->max_guc_ids)
> -		val = guc->max_guc_ids;
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +
> +	if (val > guc->submission_state.max_guc_ids)
> +		val = guc->submission_state.max_guc_ids;
>   	else if (val < 256)
>   		val = 256;
>   
> -	guc->num_guc_ids = val;
> +	guc->submission_state.num_guc_ids = val;
> +
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 68742b612692..f835e06e5f9f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -86,9 +86,9 @@
>    * submitting at a time. Currently only 1 sched_engine used for all of GuC
>    * submission but that could change in the future.
>    *
> - * guc->contexts_lock
> - * Protects guc_id allocation. Global lock i.e. Only 1 context that uses GuC
> - * submission can hold this at a time.
> + * guc->submission_state.lock
> + * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
> + * list.
>    *
>    * ce->guc_state.lock
>    * Protects everything under ce->guc_state. Ensures that a context is in the
> @@ -100,7 +100,7 @@
>    *
>    * Lock ordering rules:
>    * sched_engine->lock -> ce->guc_state.lock
> - * guc->contexts_lock -> ce->guc_state.lock
> + * guc->submission_state.lock -> ce->guc_state.lock
>    *
>    * Reset races:
>    * When a GPU full reset is triggered it is assumed that some G2H responses to
> @@ -344,7 +344,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>   {
>   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
>   
> -	GEM_BUG_ON(index >= guc->max_guc_ids);
> +	GEM_BUG_ON(index >= guc->submission_state.max_guc_ids);
>   
>   	return &base[index];
>   }
> @@ -353,7 +353,7 @@ static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
>   {
>   	struct intel_context *ce = xa_load(&guc->context_lookup, id);
>   
> -	GEM_BUG_ON(id >= guc->max_guc_ids);
> +	GEM_BUG_ON(id >= guc->submission_state.max_guc_ids);
>   
>   	return ce;
>   }
> @@ -363,7 +363,8 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
>   	u32 size;
>   	int ret;
>   
> -	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
> +	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
> +			  guc->submission_state.max_guc_ids);
>   	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
>   					     (void **)&guc->lrc_desc_pool_vaddr);
>   	if (ret)
> @@ -711,6 +712,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   			if (deregister)
>   				guc_signal_context_fence(ce);
>   			if (destroyed) {
> +				intel_gt_pm_put_async(guc_to_gt(guc));
>   				release_guc_id(guc, ce);
>   				__guc_context_destroy(ce);
>   			}
> @@ -789,6 +791,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> +static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> +
>   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   {
>   	if (unlikely(!guc_submission_initialized(guc))) {
> @@ -805,6 +809,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
>   
>   	flush_work(&guc->ct.requests.worker);
> +	guc_flush_destroyed_contexts(guc);
>   
>   	scrub_guc_desc_for_outstanding_g2h(guc);
>   }
> @@ -1102,6 +1107,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>   }
>   
> +static void destroyed_worker_func(struct work_struct *w);
> +
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
>    * at firmware loading time.
> @@ -1124,9 +1131,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   
>   	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
>   
> -	spin_lock_init(&guc->contexts_lock);
> -	INIT_LIST_HEAD(&guc->guc_id_list);
> -	ida_init(&guc->guc_ids);
> +	spin_lock_init(&guc->submission_state.lock);
> +	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> +	ida_init(&guc->submission_state.guc_ids);
> +	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> +	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
>   
>   	return 0;
>   }
> @@ -1137,6 +1146,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>   		return;
>   
>   	guc_lrc_desc_pool_destroy(guc);
> +	guc_flush_destroyed_contexts(guc);
>   	i915_sched_engine_put(guc->sched_engine);
>   }
>   
> @@ -1191,15 +1201,16 @@ static void guc_submit_request(struct i915_request *rq)
>   
>   static int new_guc_id(struct intel_guc *guc)
>   {
> -	return ida_simple_get(&guc->guc_ids, 0,
> -			      guc->num_guc_ids, GFP_KERNEL |
> +	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> +			      guc->submission_state.num_guc_ids, GFP_KERNEL |
>   			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>   }
>   
>   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	if (!context_guc_id_invalid(ce)) {
> -		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
> +		ida_simple_remove(&guc->submission_state.guc_ids,
> +				  ce->guc_id.id);
>   		reset_lrc_desc(guc, ce->guc_id.id);
>   		set_context_guc_id_invalid(ce);
>   	}
> @@ -1211,9 +1222,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	unsigned long flags;
>   
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   	__release_guc_id(guc, ce);
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
>   static int steal_guc_id(struct intel_guc *guc)
> @@ -1221,10 +1232,10 @@ static int steal_guc_id(struct intel_guc *guc)
>   	struct intel_context *ce;
>   	int guc_id;
>   
> -	lockdep_assert_held(&guc->contexts_lock);
> +	lockdep_assert_held(&guc->submission_state.lock);
>   
> -	if (!list_empty(&guc->guc_id_list)) {
> -		ce = list_first_entry(&guc->guc_id_list,
> +	if (!list_empty(&guc->submission_state.guc_id_list)) {
> +		ce = list_first_entry(&guc->submission_state.guc_id_list,
>   				      struct intel_context,
>   				      guc_id.link);
>   
> @@ -1249,7 +1260,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
>   {
>   	int ret;
>   
> -	lockdep_assert_held(&guc->contexts_lock);
> +	lockdep_assert_held(&guc->submission_state.lock);
>   
>   	ret = new_guc_id(guc);
>   	if (unlikely(ret < 0)) {
> @@ -1271,7 +1282,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
>   
>   try_again:
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   
>   	might_lock(&ce->guc_state.lock);
>   
> @@ -1286,7 +1297,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	atomic_inc(&ce->guc_id.ref);
>   
>   out_unlock:
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
>   	/*
>   	 * -EAGAIN indicates no guc_id are available, let's retire any
> @@ -1322,11 +1333,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	if (unlikely(context_guc_id_invalid(ce)))
>   		return;
>   
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
>   	    !atomic_read(&ce->guc_id.ref))
> -		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +		list_add_tail(&ce->guc_id.link,
> +			      &guc->submission_state.guc_id_list);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
>   static int __guc_action_register_context(struct intel_guc *guc,
> @@ -1841,11 +1853,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   static void guc_lrc_desc_unpin(struct intel_context *ce)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	unsigned long flags;
> +	bool disabled;
>   
> +	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
>   	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
>   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
>   	GEM_BUG_ON(context_enabled(ce));
>   
> +	/* Seal race with Reset */
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	disabled = submission_disabled(guc);
> +	if (likely(!disabled)) {
> +		__intel_gt_pm_get(gt);
> +		set_context_destroyed(ce);
> +		clr_context_registered(ce);
> +	}
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	if (unlikely(disabled)) {
> +		release_guc_id(guc, ce);
> +		__guc_context_destroy(ce);
> +		return;
> +	}
> +
>   	deregister_context(ce, ce->guc_id.id, true);
>   }
>   
> @@ -1873,78 +1904,88 @@ static void __guc_context_destroy(struct intel_context *ce)
>   	}
>   }
>   
> +static void guc_flush_destroyed_contexts(struct intel_guc *guc)
> +{
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	GEM_BUG_ON(!submission_disabled(guc) &&
> +		   guc_submission_initialized(guc));
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	list_for_each_entry_safe(ce, cn,
> +				 &guc->submission_state.destroyed_contexts,
> +				 destroyed_link) {
> +		list_del_init(&ce->destroyed_link);
> +		__release_guc_id(guc, ce);
> +		__guc_context_destroy(ce);
> +	}
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +}
> +
> +static void deregister_destroyed_contexts(struct intel_guc *guc)
> +{
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	list_for_each_entry_safe(ce, cn,
> +				 &guc->submission_state.destroyed_contexts,
> +				 destroyed_link) {
> +		list_del_init(&ce->destroyed_link);
> +		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +		guc_lrc_desc_unpin(ce);
> +		spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	}
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +}
> +
> +static void destroyed_worker_func(struct work_struct *w)
> +{
> +	struct intel_guc *guc = container_of(w, struct intel_guc,
> +					     submission_state.destroyed_worker);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	int tmp;
> +
> +	with_intel_gt_pm(gt, tmp)
> +		deregister_destroyed_contexts(guc);
> +}
> +
>   static void guc_context_destroy(struct kref *kref)
>   {
>   	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> -	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
>   	struct intel_guc *guc = ce_to_guc(ce);
> -	intel_wakeref_t wakeref;
>   	unsigned long flags;
> -	bool disabled;
> +	bool destroy;
>   
>   	/*
>   	 * If the guc_id is invalid this context has been stolen and we can free
>   	 * it immediately. Also can be freed immediately if the context is not
>   	 * registered with the GuC or the GuC is in the middle of a reset.
>   	 */
> -	if (context_guc_id_invalid(ce)) {
> -		__guc_context_destroy(ce);
> -		return;
> -	} else if (submission_disabled(guc) ||
> -		   !lrc_desc_registered(guc, ce->guc_id.id)) {
> -		release_guc_id(guc, ce);
> -		__guc_context_destroy(ce);
> -		return;
> -	}
> -
> -	/*
> -	 * We have to acquire the context spinlock and check guc_id again, if it
> -	 * is valid it hasn't been stolen and needs to be deregistered. We
> -	 * delete this context from the list of unpinned guc_id available to
> -	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
> -	 * returns indicating this context has been deregistered the guc_id is
> -	 * returned to the pool of available guc_id.
> -	 */
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> -	if (context_guc_id_invalid(ce)) {
> -		spin_unlock_irqrestore(&guc->contexts_lock, flags);
> -		__guc_context_destroy(ce);
> -		return;
> -	}
> -
> -	if (!list_empty(&ce->guc_id.link))
> -		list_del_init(&ce->guc_id.link);
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> -
> -	/* Seal race with Reset */
> -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> -	disabled = submission_disabled(guc);
> -	if (likely(!disabled)) {
> -		set_context_destroyed(ce);
> -		clr_context_registered(ce);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
> +		!lrc_desc_registered(guc, ce->guc_id.id);
> +	if (likely(!destroy)) {
> +		if (!list_empty(&ce->guc_id.link))
> +			list_del_init(&ce->guc_id.link);
> +		list_add_tail(&ce->destroyed_link,
> +			      &guc->submission_state.destroyed_contexts);
> +	} else {
> +		__release_guc_id(guc, ce);
>   	}
> -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> -	if (unlikely(disabled)) {
> -		release_guc_id(guc, ce);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +	if (unlikely(destroy)) {
>   		__guc_context_destroy(ce);
>   		return;
>   	}
>   
>   	/*
> -	 * We defer GuC context deregistration until the context is destroyed
> -	 * in order to save on CTBs. With this optimization ideally we only need
> -	 * 1 CTB to register the context during the first pin and 1 CTB to
> -	 * deregister the context when the context is destroyed. Without this
> -	 * optimization, a CTB would be needed every pin & unpin.
> -	 *
> -	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
> -	 * from context_free_worker when runtime wakeref is not held.
> -	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
> -	 * in H2G CTB to deregister the context. A future patch may defer this
> -	 * H2G CTB if the runtime wakeref is zero.
> +	 * We use a worker to issue the H2G to deregister the context as we can
> +	 * take the GT PM for the first time which isn't allowed from an atomic
> +	 * context.
>   	 */
> -	with_intel_runtime_pm(runtime_pm, wakeref)
> -		guc_lrc_desc_unpin(ce);
> +	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
>   }
>   
>   static int guc_context_alloc(struct intel_context *ce)
> @@ -2703,8 +2744,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
>   
>   void intel_guc_submission_init_early(struct intel_guc *guc)
>   {
> -	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> -	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> +	guc->submission_state.max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> +	guc->submission_state.num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
>   	guc->submission_supported = __guc_submission_supported(guc);
>   	guc->submission_selected = __guc_submission_selected(guc);
>   }
> @@ -2714,10 +2755,10 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>   {
>   	struct intel_context *ce;
>   
> -	if (unlikely(desc_idx >= guc->max_guc_ids)) {
> +	if (unlikely(desc_idx >= guc->submission_state.max_guc_ids)) {
>   		drm_err(&guc_to_gt(guc)->i915->drm,
>   			"Invalid desc_idx %u, max %u",
> -			desc_idx, guc->max_guc_ids);
> +			desc_idx, guc->submission_state.max_guc_ids);
>   		return NULL;
>   	}
>   
> @@ -2771,6 +2812,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
>   		intel_context_put(ce);
>   	} else if (context_destroyed(ce)) {
>   		/* Context has been destroyed */
> +		intel_gt_pm_put_async(guc_to_gt(guc));
>   		release_guc_id(guc, ce);
>   		__guc_context_destroy(ce);
>   	}
> @@ -3065,8 +3107,10 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>   
>   	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
>   		   atomic_read(&guc->outstanding_submission_g2h));
> -	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
> -	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
> +	drm_printf(p, "GuC Number GuC IDs: %u\n",
> +		   guc->submission_state.num_guc_ids);
> +	drm_printf(p, "GuC Max GuC IDs: %u\n",
> +		   guc->submission_state.max_guc_ids);
>   	drm_printf(p, "GuC tasklet count: %u\n\n",
>   		   atomic_read(&sched_engine->tasklet.count));
>   
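
To summarize the lifecycle the hunks above set up (a condensed sketch,
not new code): the GT PM reference is taken before the deregister H2G is
sent and only dropped once the G2H reply lands, which is what keeps
intel_gt_wait_for_idle() from short-circuiting in between.

        /* guc_lrc_desc_unpin(), before sending the H2G: */
        __intel_gt_pm_get(gt);
        set_context_destroyed(ce);
        deregister_context(ce, ce->guc_id.id, true);

        /* intel_guc_deregister_done_process_msg(), on the G2H reply: */
        intel_gt_pm_put_async(guc_to_gt(guc));
        release_guc_id(guc, ce);
        __guc_context_destroy(ce);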


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 05/27] drm/i915: Add GT PM unpark worker
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-09 22:36   ` John Harrison
  2021-09-10  0:34     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-09 22:36 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Sometimes it is desirable to queue work up for later if the GT PM isn't
> held and run that work on the next GT PM unpark.
What is the reason for doing this? Why is it important? Why not just 
take the GT PM at the time the work is requested?

>
> Implemented with a per-GT list of pending work items, a callback to add
> a work item to the list, and finally a wakeref post_get callback that
> drains the list and queues each item's worker.
>
> First user of this is deregistration of GuC contexts.
This statement should be in the first paragraph but needs to be more 
detailed - why is it necessary to add all this extra complexity rather 
than just taking the GT PM?
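
For readers following along, a minimal usage sketch of the new API (names
taken from the hunks quoted below; the callback body is hypothetical):

        static struct intel_gt_pm_unpark_work my_work;

        static void my_unpark_func(struct work_struct *w)
        {
                /* runs from system_unbound_wq once the GT is unparked */
        }

        /* one-time setup */
        intel_gt_pm_unpark_work_init(&my_work, my_unpark_func);

        /* queued immediately if the GT is awake, otherwise on next unpark */
        intel_gt_pm_unpark_work_add(gt, &my_work);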


>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/Makefile                 |  1 +
>   drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
>   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  8 ++++
>   .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c | 35 ++++++++++++++++
>   .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h | 40 +++++++++++++++++++
>   drivers/gpu/drm/i915/gt/intel_gt_types.h      | 10 +++++
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  8 ++--
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++--
>   drivers/gpu/drm/i915/intel_wakeref.c          |  5 +++
>   drivers/gpu/drm/i915/intel_wakeref.h          |  1 +
>   10 files changed, 119 insertions(+), 7 deletions(-)
>   create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
>   create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
>
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 642a5b5a1b81..579bdc069f25 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -103,6 +103,7 @@ gt-y += \
>   	gt/intel_gt_clock_utils.o \
>   	gt/intel_gt_irq.o \
>   	gt/intel_gt_pm.o \
> +	gt/intel_gt_pm_unpark_work.o \
>   	gt/intel_gt_pm_irq.o \
>   	gt/intel_gt_requests.o \
>   	gt/intel_gtt.o \
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> index 62d40c986642..7e690e74baa2 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> @@ -29,6 +29,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
>   
>   	spin_lock_init(&gt->irq_lock);
>   
> +	spin_lock_init(&gt->pm_unpark_work_lock);
> +	INIT_LIST_HEAD(&gt->pm_unpark_work_list);
> +
>   	INIT_LIST_HEAD(&gt->closed_vma);
>   	spin_lock_init(&gt->closed_lock);
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index dea8e2479897..564c11a3748b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -90,6 +90,13 @@ static int __gt_unpark(struct intel_wakeref *wf)
>   	return 0;
>   }
>   
> +static void __gt_unpark_work_queue(struct intel_wakeref *wf)
> +{
> +	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> +
> +	intel_gt_pm_unpark_work_queue(gt);
> +}
> +
>   static int __gt_park(struct intel_wakeref *wf)
>   {
>   	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> @@ -118,6 +125,7 @@ static int __gt_park(struct intel_wakeref *wf)
>   
>   static const struct intel_wakeref_ops wf_ops = {
>   	.get = __gt_unpark,
> +	.post_get = __gt_unpark_work_queue,
>   	.put = __gt_park,
>   };
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> new file mode 100644
> index 000000000000..23162dbd0c35
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> @@ -0,0 +1,35 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021 Intel Corporation
> + */
> +
> +#include "i915_drv.h"
> +#include "intel_runtime_pm.h"
> +#include "intel_gt_pm.h"
> +
> +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt)
> +{
> +	struct intel_gt_pm_unpark_work *work, *next;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> +	list_for_each_entry_safe(work, next,
> +				 &gt->pm_unpark_work_list, link) {
> +		list_del_init(&work->link);
> +		queue_work(system_unbound_wq, &work->worker);
> +	}
> +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> +}
> +
> +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> +				 struct intel_gt_pm_unpark_work *work)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> +	if (intel_gt_pm_is_awake(gt))
> +		queue_work(system_unbound_wq, &work->worker);
> +	else if (list_empty(&work->link))
> +		list_add_tail(&work->link, &gt->pm_unpark_work_list);
> +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> +}
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> new file mode 100644
> index 000000000000..eaf1dc313aa2
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2021 Intel Corporation
> + */
> +
> +#ifndef INTEL_GT_PM_UNPARK_WORK_H
> +#define INTEL_GT_PM_UNPARK_WORK_H
> +
> +#include <linux/list.h>
> +#include <linux/workqueue.h>
> +
> +struct intel_gt;
> +
> +/**
> + * struct intel_gt_pm_unpark_work - work to be scheduled when GT unparked
> + */
> +struct intel_gt_pm_unpark_work {
> +	/**
> +	 * @link: link into gt->pm_unpark_work_list of workers that need to be
> +	 * scheduled when the GT is unparked, protected by gt->pm_unpark_work_lock
> +	 */
> +	struct list_head link;
> +	/** @worker: will be scheduled when GT unparked */
> +	struct work_struct worker;
> +};
> +
> +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt);
> +
> +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> +				 struct intel_gt_pm_unpark_work *work);
> +
> +static inline void
> +intel_gt_pm_unpark_work_init(struct intel_gt_pm_unpark_work *work,
> +			     work_func_t fn)
> +{
> +	INIT_LIST_HEAD(&work->link);
> +	INIT_WORK(&work->worker, fn);
> +}
> +
> +#endif /* INTEL_GT_PM_UNPARK_WORK_H */
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> index a81e21bf1bd1..4480312f0add 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> @@ -96,6 +96,16 @@ struct intel_gt {
>   	struct intel_wakeref wakeref;
>   	atomic_t user_wakeref;
>   
> +	/**
> +	 * @pm_unpark_work_list: list of delayed work to be scheduled when the
> +	 * GT is unparked, protected by pm_unpark_work_lock
> +	 */
> +	struct list_head pm_unpark_work_list;
> +	/**
> +	 * @pm_unpark_work_lock: protects pm_unpark_work_list
> +	 */
> +	spinlock_t pm_unpark_work_lock;
> +
>   	struct list_head closed_vma;
>   	spinlock_t closed_lock; /* guards the list of closed_vma */
>   
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 7358883f1540..023953e77553 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -19,6 +19,7 @@
>   #include "intel_uc_fw.h"
>   #include "i915_utils.h"
>   #include "i915_vma.h"
> +#include "gt/intel_gt_pm_unpark_work.h"
>   
>   struct __guc_ads_blob;
>   
> @@ -78,11 +79,12 @@ struct intel_guc {
>   		 */
>   		struct list_head destroyed_contexts;
>   		/**
> -		 * @destroyed_worker: worker to deregister contexts, need as we
> +		 * @destroyed_worker: Worker to deregister contexts, need as we
>   		 * need to take a GT PM reference and can't from destroy
> -		 * function as it might be in an atomic context (no sleeping)
> +		 * function as it might be in an atomic context (no sleeping).
These corrections should be squashed into the previous patch that added 
these comments in the first place.

John.


> +		 * Worker only issues deregister when GT is unparked.
>   		 */
> -		struct work_struct destroyed_worker;
> +		struct intel_gt_pm_unpark_work destroyed_worker;
>   	} submission_state;
>   
>   	bool submission_supported;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f835e06e5f9f..dbf919801de2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1135,7 +1135,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
>   	ida_init(&guc->submission_state.guc_ids);
>   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> -	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
> +	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
> +				     destroyed_worker_func);
>   
>   	return 0;
>   }
> @@ -1942,13 +1943,18 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
>   
>   static void destroyed_worker_func(struct work_struct *w)
>   {
> -	struct intel_guc *guc = container_of(w, struct intel_guc,
> +	struct intel_gt_pm_unpark_work *destroyed_worker =
> +		container_of(w, struct intel_gt_pm_unpark_work, worker);
> +	struct intel_guc *guc = container_of(destroyed_worker, struct intel_guc,
>   					     submission_state.destroyed_worker);
>   	struct intel_gt *gt = guc_to_gt(guc);
>   	int tmp;
>   
> -	with_intel_gt_pm(gt, tmp)
> +	with_intel_gt_pm_if_awake(gt, tmp)
>   		deregister_destroyed_contexts(guc);
> +
> +	if (!list_empty(&guc->submission_state.destroyed_contexts))
> +		intel_gt_pm_unpark_work_add(gt, destroyed_worker);
>   }
>   
>   static void guc_context_destroy(struct kref *kref)
> @@ -1985,7 +1991,8 @@ static void guc_context_destroy(struct kref *kref)
>   	 * take the GT PM for the first time which isn't allowed from an atomic
>   	 * context.
>   	 */
> -	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
> +	intel_gt_pm_unpark_work_add(guc_to_gt(guc),
> +				    &guc->submission_state.destroyed_worker);
>   }
>   
>   static int guc_context_alloc(struct intel_context *ce)
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
> index dfd87d082218..282fc4f312e3 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.c
> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> @@ -24,6 +24,8 @@ static void rpm_put(struct intel_wakeref *wf)
>   
>   int __intel_wakeref_get_first(struct intel_wakeref *wf)
>   {
> +	bool do_post = false;
> +
>   	/*
>   	 * Treat get/put as different subclasses, as we may need to run
>   	 * the put callback from under the shrinker and do not want to
> @@ -44,8 +46,11 @@ int __intel_wakeref_get_first(struct intel_wakeref *wf)
>   		}
>   
>   		smp_mb__before_atomic(); /* release wf->count */
> +		do_post = true;
>   	}
>   	atomic_inc(&wf->count);
> +	if (do_post && wf->ops->post_get)
> +		wf->ops->post_get(wf);
>   	mutex_unlock(&wf->mutex);
>   
>   	INTEL_WAKEREF_BUG_ON(atomic_read(&wf->count) <= 0);
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> index 545c8f277c46..ef7e6a698e8a 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.h
> +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> @@ -30,6 +30,7 @@ typedef depot_stack_handle_t intel_wakeref_t;
>   
>   struct intel_wakeref_ops {
>   	int (*get)(struct intel_wakeref *wf);
> +	void (*post_get)(struct intel_wakeref *wf);
>   	int (*put)(struct intel_wakeref *wf);
>   };
>   


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 06/27] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-09 22:46   ` John Harrison
  2021-09-10  0:41     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-09 22:46 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Taking a PM reference to prevent intel_gt_wait_for_idle from
> short-circuiting while scheduling of a user context could be enabled.
As with the earlier PM patch, this needs more explanation of what the 
problem is and why it is only now a problem.


>
> v2:
>   (Daniel Vetter)
>    - Add might_lock annotations to pin / unpin function
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |  3 ++
>   drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 15 ++++++++
>   drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
>   drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
>   5 files changed, 73 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index c8595da64ad8..508cfe5770c0 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
>   	if (err)
>   		goto err_post_unpin;
>   
> +	intel_engine_pm_might_get(ce->engine);
> +
>   	if (unlikely(intel_context_is_closed(ce))) {
>   		err = -ENOENT;
>   		goto err_unlock;
> @@ -313,6 +315,7 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub)
>   		return;
>   
>   	CE_TRACE(ce, "unpin\n");
> +	intel_engine_pm_might_put(ce->engine);
>   	ce->ops->unpin(ce);
>   	ce->ops->post_unpin(ce);
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> index 17a5028ea177..3fe2ae1bcc26 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> @@ -9,6 +9,7 @@
>   #include "i915_request.h"
>   #include "intel_engine_types.h"
>   #include "intel_wakeref.h"
> +#include "intel_gt_pm.h"
>   
>   static inline bool
>   intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> @@ -31,6 +32,13 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
>   	return intel_wakeref_get_if_active(&engine->wakeref);
>   }
>   
> +static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
> +{
> +	if (!intel_engine_is_virtual(engine))
> +		intel_wakeref_might_get(&engine->wakeref);
Why doesn't this need to iterate through the physical engines of the 
virtual engine?

John.

> +	intel_gt_pm_might_get(engine->gt);
> +}
> +
>   static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
>   {
>   	intel_wakeref_put(&engine->wakeref);
> @@ -52,6 +60,13 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
>   	intel_wakeref_unlock_wait(&engine->wakeref);
>   }
>   
> +static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
> +{
> +	if (!intel_engine_is_virtual(engine))
> +		intel_wakeref_might_put(&engine->wakeref);
> +	intel_gt_pm_might_put(engine->gt);
> +}
> +
>   static inline struct i915_request *
>   intel_engine_create_kernel_request(struct intel_engine_cs *engine)
>   {
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> index a17bf0d4592b..3c173033ce23 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> @@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
>   	return intel_wakeref_get_if_active(&gt->wakeref);
>   }
>   
> +static inline void intel_gt_pm_might_get(struct intel_gt *gt)
> +{
> +	intel_wakeref_might_get(&gt->wakeref);
> +}
> +
>   static inline void intel_gt_pm_put(struct intel_gt *gt)
>   {
>   	intel_wakeref_put(&gt->wakeref);
> @@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
>   	intel_wakeref_put_async(&gt->wakeref);
>   }
>   
> +static inline void intel_gt_pm_might_put(struct intel_gt *gt)
> +{
> +	intel_wakeref_might_put(&gt->wakeref);
> +}
> +
>   #define with_intel_gt_pm(gt, tmp) \
>   	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
>   	     intel_gt_pm_put(gt), tmp = 0)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index dbf919801de2..e0eed70f9b92 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1550,7 +1550,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
>   
>   static int guc_context_pin(struct intel_context *ce, void *vaddr)
>   {
> -	return __guc_context_pin(ce, ce->engine, vaddr);
> +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> +
> +	if (likely(!ret && !intel_context_is_barrier(ce)))
> +		intel_engine_pm_get(ce->engine);
> +
> +	return ret;
>   }
>   
>   static void guc_context_unpin(struct intel_context *ce)
> @@ -1559,6 +1564,9 @@ static void guc_context_unpin(struct intel_context *ce)
>   
>   	unpin_guc_id(guc, ce);
>   	lrc_unpin(ce);
> +
> +	if (likely(!intel_context_is_barrier(ce)))
> +		intel_engine_pm_put_async(ce->engine);
>   }
>   
>   static void guc_context_post_unpin(struct intel_context *ce)
> @@ -2328,8 +2336,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
>   static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
>   {
>   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> +	int ret = __guc_context_pin(ce, engine, vaddr);
> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> +
> +	if (likely(!ret))
> +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> +			intel_engine_pm_get(engine);
>   
> -	return __guc_context_pin(ce, engine, vaddr);
> +	return ret;
> +}
> +
> +static void guc_virtual_context_unpin(struct intel_context *ce)
> +{
> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> +	struct intel_engine_cs *engine;
> +	struct intel_guc *guc = ce_to_guc(ce);
> +
> +	GEM_BUG_ON(context_enabled(ce));
> +	GEM_BUG_ON(intel_context_is_barrier(ce));
> +
> +	unpin_guc_id(guc, ce);
> +	lrc_unpin(ce);
> +
> +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> +		intel_engine_pm_put_async(engine);
>   }
>   
>   static void guc_virtual_context_enter(struct intel_context *ce)
> @@ -2366,7 +2396,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
>   
>   	.pre_pin = guc_virtual_context_pre_pin,
>   	.pin = guc_virtual_context_pin,
> -	.unpin = guc_context_unpin,
> +	.unpin = guc_virtual_context_unpin,
>   	.post_unpin = guc_context_post_unpin,
>   
>   	.ban = guc_context_ban,
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> index ef7e6a698e8a..dd530ae028e0 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.h
> +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> @@ -124,6 +124,12 @@ enum {
>   	__INTEL_WAKEREF_PUT_LAST_BIT__
>   };
>   
> +static inline void
> +intel_wakeref_might_get(struct intel_wakeref *wf)
> +{
> +	might_lock(&wf->mutex);
> +}
> +
>   /**
>    * intel_wakeref_put_flags: Release the wakeref
>    * @wf: the wakeref
> @@ -171,6 +177,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
>   			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
>   }
>   
> +static inline void
> +intel_wakeref_might_put(struct intel_wakeref *wf)
> +{
> +	might_lock(&wf->mutex);
> +}
> +
>   /**
>    * intel_wakeref_lock: Lock the wakeref (mutex)
>    * @wf: the wakeref


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 07/27] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-09 22:51   ` John Harrison
  2021-09-13 16:54     ` Matthew Brost
  2021-09-13 16:55     ` Matthew Brost
  -1 siblings, 2 replies; 145+ messages in thread
From: John Harrison @ 2021-09-09 22:51 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Calling switch_to_kernel_context isn't needed if the engine PM reference
> is taken while all contexts are pinned. By not calling
> switch_to_kernel_context we save on issuing a request to the engine.
I thought the intention of switch_to_kernel_context was to ensure that 
the GPU is not touching any user context and is basically idle. That is 
not a valid assumption with an external scheduler such as GuC. So why is 
the description above only mentioning PM references? What is the 
connection between the PM ref and switch_to_kernel_context?

Also, the comment in the code does not mention anything about PM 
references; it just says 'not necessary with GuC' with no explanation at all.


> v2:
>   (Daniel Vetter)
>    - Add FIXME comment about pushing switch_to_kernel_context to backend
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> index 1f07ac4e0672..11fee66daf60 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> @@ -162,6 +162,15 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>   	unsigned long flags;
>   	bool result = true;
>   
> +	/*
> +	 * No need to switch_to_kernel_context if GuC submission
> +	 *
> +	 * FIXME: This execlists specific backend behavior in generic code, this
"This execlists" -> "This is execlist"

"this should be" -> "it should be"

John.

> +	 * should be pushed to the backend.
> +	 */
> +	if (intel_engine_uses_guc(engine))
> +		return true;
> +
>   	/* GPU is pointing to the void, as good as in the kernel context. */
>   	if (intel_gt_is_wedged(engine->gt))
>   		return true;


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 02/27] drm/i915/guc: Allow flexible number of context ids
  2021-09-09 22:13   ` John Harrison
@ 2021-09-10  0:14     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-10  0:14 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Thu, Sep 09, 2021 at 03:13:29PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > The number of available GuC context ids might be limited.
> > Stop referring to the macro in code and use a variable instead.
> > 
> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h            |  4 ++++
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++++++------
> >   2 files changed, 13 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 112dd29a63fe..6fd2719d1b75 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -60,6 +60,10 @@ struct intel_guc {
> >   	spinlock_t contexts_lock;
> >   	/** @guc_ids: used to allocate new guc_ids */
> >   	struct ida guc_ids;
> > +	/** @num_guc_ids: number of guc_ids that can be used */
> > +	u32 num_guc_ids;
> > +	/** @max_guc_ids: max number of guc_ids that can be used */
> > +	u32 max_guc_ids;
> How do these differ? The description needs to say how or why 'num' might be
> less than 'max'. And surely 'max' is not the count 'that can be used'? Num
> is how many can be used, but max is how many are physically possible or
> something?
>

Max is the maximum possible per OS / PF / VF instance, while num is the
current setting. This makes more sense once SRIOV lands / if this is
connected to debugfs (e.g. only num can be set via debugfs).
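
To sketch the intended invariant (illustrative only, not code from this
series - the check would live wherever num is updated):

	/*
	 * max_guc_ids: size of the LRC descriptor pool, fixed at init.
	 * num_guc_ids: current allocator ceiling, tunable at runtime.
	 * The allocator only hands out ids below num_guc_ids, and num
	 * must never exceed max.
	 */
	GEM_BUG_ON(guc->num_guc_ids > guc->max_guc_ids);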

Matt
 
> John.
> 
> >   	/**
> >   	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> >   	 */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 46158d996bf6..8235e49bb347 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -344,7 +344,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> >   {
> >   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > -	GEM_BUG_ON(index >= GUC_MAX_LRC_DESCRIPTORS);
> > +	GEM_BUG_ON(index >= guc->max_guc_ids);
> >   	return &base[index];
> >   }
> > @@ -353,7 +353,7 @@ static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
> >   {
> >   	struct intel_context *ce = xa_load(&guc->context_lookup, id);
> > -	GEM_BUG_ON(id >= GUC_MAX_LRC_DESCRIPTORS);
> > +	GEM_BUG_ON(id >= guc->max_guc_ids);
> >   	return ce;
> >   }
> > @@ -363,8 +363,7 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
> >   	u32 size;
> >   	int ret;
> > -	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
> > -			  GUC_MAX_LRC_DESCRIPTORS);
> > +	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
> >   	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
> >   					     (void **)&guc->lrc_desc_pool_vaddr);
> >   	if (ret)
> > @@ -1193,7 +1192,7 @@ static void guc_submit_request(struct i915_request *rq)
> >   static int new_guc_id(struct intel_guc *guc)
> >   {
> >   	return ida_simple_get(&guc->guc_ids, 0,
> > -			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
> > +			      guc->num_guc_ids, GFP_KERNEL |
> >   			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> >   }
> > @@ -2704,6 +2703,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
> >   void intel_guc_submission_init_early(struct intel_guc *guc)
> >   {
> > +	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> > +	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> >   	guc->submission_supported = __guc_submission_supported(guc);
> >   	guc->submission_selected = __guc_submission_selected(guc);
> >   }
> > @@ -2713,7 +2714,7 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> >   {
> >   	struct intel_context *ce;
> > -	if (unlikely(desc_idx >= GUC_MAX_LRC_DESCRIPTORS)) {
> > +	if (unlikely(desc_idx >= guc->max_guc_ids)) {
> >   		drm_err(&guc_to_gt(guc)->i915->drm,
> >   			"Invalid desc_idx %u", desc_idx);
> >   		return NULL;
> > @@ -3063,6 +3064,8 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> >   	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
> >   		   atomic_read(&guc->outstanding_submission_g2h));
> > +	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
> > +	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
> >   	drm_printf(p, "GuC tasklet count: %u\n\n",
> >   		   atomic_read(&sched_engine->tasklet.count));
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 03/27] drm/i915/guc: Connect the number of guc_ids to debugfs
  2021-09-09 22:16   ` John Harrison
@ 2021-09-10  0:16     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-10  0:16 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Thu, Sep 09, 2021 at 03:16:22PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > For testing purposes it may make sense to reduce the number of guc_ids
> > available to be allocated. Add debugfs support for setting the number of
> > guc_ids.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> It 'may make sense'? Is there an actual IGT/selftest test that uses this and
> needs it in order to not take several hours to run or something? It seems
> wrong to be adding a debugfs interface for something that would only be used
> for testing by internal developers, who can quite easily just edit a define
> in the code to simplify their testing.
> 

I had an IGT (gem_exec_parallel) that toggled this to hit certain conditions
that are otherwise very hard to hit (guc_id exhaustion), but that was before
we punted this problem to the user. Now we likely don't need that test and it
is probably ok to just drop this patch.
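
For reference, the path that test wanted to exercise is roughly (a sketch
paraphrasing assign_guc_id() from this series):

	/* with num_guc_ids clamped low via debugfs, exhaustion is easy */
	int ret = new_guc_id(guc);	/* fails once all ids are in use */

	if (unlikely(ret < 0))
		ret = steal_guc_id(guc);	/* reclaim an unpinned guc_id */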

Matt

> John.
> 
> 
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    | 31 +++++++++++++++++++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  3 +-
> >   2 files changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > index 887c8c8f35db..b88d343ee432 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > @@ -71,12 +71,43 @@ static bool intel_eval_slpc_support(void *data)
> >   	return intel_guc_slpc_is_used(guc);
> >   }
> > +static int guc_num_id_get(void *data, u64 *val)
> > +{
> > +	struct intel_guc *guc = data;
> > +
> > +	if (!intel_guc_submission_is_used(guc))
> > +		return -ENODEV;
> > +
> > +	*val = guc->num_guc_ids;
> > +
> > +	return 0;
> > +}
> > +
> > +static int guc_num_id_set(void *data, u64 val)
> > +{
> > +	struct intel_guc *guc = data;
> > +
> > +	if (!intel_guc_submission_is_used(guc))
> > +		return -ENODEV;
> > +
> > +	if (val > guc->max_guc_ids)
> > +		val = guc->max_guc_ids;
> > +	else if (val < 256)
> > +		val = 256;
> > +
> > +	guc->num_guc_ids = val;
> > +
> > +	return 0;
> > +}
> > +DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
> > +
> >   void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
> >   {
> >   	static const struct debugfs_gt_file files[] = {
> >   		{ "guc_info", &guc_info_fops, NULL },
> >   		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
> >   		{ "guc_slpc_info", &guc_slpc_info_fops, &intel_eval_slpc_support},
> > +		{ "guc_num_id", &guc_num_id_fops, NULL },
> >   	};
> >   	if (!intel_guc_is_supported(guc))
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 8235e49bb347..68742b612692 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -2716,7 +2716,8 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> >   	if (unlikely(desc_idx >= guc->max_guc_ids)) {
> >   		drm_err(&guc_to_gt(guc)->i915->drm,
> > -			"Invalid desc_idx %u", desc_idx);
> > +			"Invalid desc_idx %u, max %u",
> > +			desc_idx, guc->max_guc_ids);
> >   		return NULL;
> >   	}
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 04/27] drm/i915/guc: Take GT PM ref when deregistering context
  2021-09-09 22:28   ` John Harrison
@ 2021-09-10  0:21     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-10  0:21 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Thu, Sep 09, 2021 at 03:28:51PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Taking a PM reference to prevent intel_gt_wait_for_idle from
> > short-circuiting while a context deregister H2G is in flight.
> > 
> > FIXME: Move locking / structure changes into different patch
> This split needs to be done. It would also be helpful to have a more

Can do the split in the next rev.

> detailed explanation of what is going on and why the change is necessary. Is
> this a generic problem and the code currently in the tree is broken? Or is
> it something specific to parallel submission and isn't actually a problem
> until later in the series?
>

This is a generic problem - we should always have a PM reference when a user
context has submission enabled (a follow-on patch) or any G2H is in flight,
to avoid falsely reporting the GPU as idle.
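
In sketch form, using the helpers from this series (roughly what
guc_lrc_desc_unpin() and the deregister-done G2H handler end up doing):

	/* take a GT PM ref before the deregister H2G goes out */
	__intel_gt_pm_get(gt);
	set_context_destroyed(ce);
	deregister_context(ce, ce->guc_id.id, true);

	/*
	 * ...and drop it only when the G2H confirmation arrives, so
	 * intel_gt_pm_wait_for_idle() cannot complete while the
	 * deregister is still in flight
	 */
	intel_gt_pm_put_async(guc_to_gt(guc));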

Matt 

> John.
> 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |  13 +-
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  13 ++
> >   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  46 ++--
> >   .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  13 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 212 +++++++++++-------
> >   8 files changed, 199 insertions(+), 106 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index adfe49b53b1b..c8595da64ad8 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >   	ce->guc_id.id = GUC_INVALID_LRC_ID;
> >   	INIT_LIST_HEAD(&ce->guc_id.link);
> > +	INIT_LIST_HEAD(&ce->destroyed_link);
> > +
> >   	/*
> >   	 * Initialize fence to be complete as this is expected to be complete
> >   	 * unless there is a pending schedule disable outstanding.
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 80bbdc7810f6..fd338a30617e 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -190,22 +190,29 @@ struct intel_context {
> >   		/**
> >   		 * @id: unique handle which is used to communicate information
> >   		 * with the GuC about this context, protected by
> > -		 * guc->contexts_lock
> > +		 * guc->submission_state.lock
> >   		 */
> >   		u16 id;
> >   		/**
> >   		 * @ref: the number of references to the guc_id, when
> >   		 * transitioning in and out of zero protected by
> > -		 * guc->contexts_lock
> > +		 * guc->submission_state.lock
> >   		 */
> >   		atomic_t ref;
> >   		/**
> >   		 * @link: in guc->guc_id_list when the guc_id has no refs but is
> > -		 * still valid, protected by guc->contexts_lock
> > +		 * still valid, protected by guc->submission_state.lock
> >   		 */
> >   		struct list_head link;
> >   	} guc_id;
> > +	/**
> > +	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
> > +	 * list when context is pending to be destroyed (deregistered with the
> > +	 * GuC), protected by guc->submission_state.lock
> > +	 */
> > +	struct list_head destroyed_link;
> > +
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> >   	/**
> >   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > index 70ea46d6cfb0..17a5028ea177 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > @@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> >   	return intel_wakeref_is_active(&engine->wakeref);
> >   }
> > +static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
> > +{
> > +	__intel_wakeref_get(&engine->wakeref);
> > +}
> > +
> >   static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
> >   {
> >   	intel_wakeref_get(&engine->wakeref);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > index d0588d8aaa44..a17bf0d4592b 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > @@ -41,6 +41,19 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
> >   	intel_wakeref_put_async(&gt->wakeref);
> >   }
> > +#define with_intel_gt_pm(gt, tmp) \
> > +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > +	     intel_gt_pm_put(gt), tmp = 0)
> > +#define with_intel_gt_pm_async(gt, tmp) \
> > +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > +	     intel_gt_pm_put_async(gt), tmp = 0)
> > +#define with_intel_gt_pm_if_awake(gt, tmp) \
> > +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
> > +	     intel_gt_pm_put(gt), tmp = 0)
> > +#define with_intel_gt_pm_if_awake_async(gt, tmp) \
> > +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
> > +	     intel_gt_pm_put_async(gt), tmp = 0)
> > +
> >   static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
> >   {
> >   	return intel_wakeref_wait_for_idle(&gt->wakeref);
> > diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > index 8ff582222aff..ba10bd374cee 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > @@ -142,6 +142,7 @@ enum intel_guc_action {
> >   	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
> >   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
> >   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
> > +	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
> >   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> >   	INTEL_GUC_ACTION_LIMIT
> >   };
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 6fd2719d1b75..7358883f1540 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -53,21 +53,37 @@ struct intel_guc {
> >   		void (*disable)(struct intel_guc *guc);
> >   	} interrupts;
> > -	/**
> > -	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
> > -	 * ce->guc_id.ref when transitioning in and out of zero
> > -	 */
> > -	spinlock_t contexts_lock;
> > -	/** @guc_ids: used to allocate new guc_ids */
> > -	struct ida guc_ids;
> > -	/** @num_guc_ids: number of guc_ids that can be used */
> > -	u32 num_guc_ids;
> > -	/** @max_guc_ids: max number of guc_ids that can be used */
> > -	u32 max_guc_ids;
> > -	/**
> > -	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> > -	 */
> > -	struct list_head guc_id_list;
> > +	struct {
> > +		/**
> > +		 * @lock: protects everything in submission_state, ce->guc_id,
> > +		 * and ce->destroyed_link
> > +		 */
> > +		spinlock_t lock;
> > +		/**
> > +		 * @guc_ids: used to allocate new guc_ids
> > +		 */
> > +		struct ida guc_ids;
> > +		/** @num_guc_ids: number of guc_ids that can be used */
> > +		u32 num_guc_ids;
> > +		/** @max_guc_ids: max number of guc_ids that can be used */
> > +		u32 max_guc_ids;
> > +		/**
> > +		 * @guc_id_list: list of intel_context with valid guc_ids but no
> > +		 * refs
> > +		 */
> > +		struct list_head guc_id_list;
> > +		/**
> > +		 * @destroyed_contexts: list of contexts waiting to be destroyed
> > +		 * (deregistered with the GuC)
> > +		 */
> > +		struct list_head destroyed_contexts;
> > +		/**
> > +		 * @destroyed_worker: worker to deregister contexts, need as we
> > +		 * need to take a GT PM reference and can't from destroy
> > +		 * function as it might be in an atomic context (no sleeping)
> > +		 */
> > +		struct work_struct destroyed_worker;
> > +	} submission_state;
> >   	bool submission_supported;
> >   	bool submission_selected;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > index b88d343ee432..27655460ee84 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > @@ -78,7 +78,7 @@ static int guc_num_id_get(void *data, u64 *val)
> >   	if (!intel_guc_submission_is_used(guc))
> >   		return -ENODEV;
> > -	*val = guc->num_guc_ids;
> > +	*val = guc->submission_state.num_guc_ids;
> >   	return 0;
> >   }
> > @@ -86,16 +86,21 @@ static int guc_num_id_get(void *data, u64 *val)
> >   static int guc_num_id_set(void *data, u64 val)
> >   {
> >   	struct intel_guc *guc = data;
> > +	unsigned long flags;
> >   	if (!intel_guc_submission_is_used(guc))
> >   		return -ENODEV;
> > -	if (val > guc->max_guc_ids)
> > -		val = guc->max_guc_ids;
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +
> > +	if (val > guc->submission_state.max_guc_ids)
> > +		val = guc->submission_state.max_guc_ids;
> >   	else if (val < 256)
> >   		val = 256;
> > -	guc->num_guc_ids = val;
> > +	guc->submission_state.num_guc_ids = val;
> > +
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   	return 0;
> >   }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 68742b612692..f835e06e5f9f 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -86,9 +86,9 @@
> >    * submitting at a time. Currently only 1 sched_engine used for all of GuC
> >    * submission but that could change in the future.
> >    *
> > - * guc->contexts_lock
> > - * Protects guc_id allocation. Global lock i.e. Only 1 context that uses GuC
> > - * submission can hold this at a time.
> > + * guc->submission_state.lock
> > + * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
> > + * list.
> >    *
> >    * ce->guc_state.lock
> >    * Protects everything under ce->guc_state. Ensures that a context is in the
> > @@ -100,7 +100,7 @@
> >    *
> >    * Lock ordering rules:
> >    * sched_engine->lock -> ce->guc_state.lock
> > - * guc->contexts_lock -> ce->guc_state.lock
> > + * guc->submission_state.lock -> ce->guc_state.lock
> >    *
> >    * Reset races:
> >    * When a GPU full reset is triggered it is assumed that some G2H responses to
> > @@ -344,7 +344,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> >   {
> >   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > -	GEM_BUG_ON(index >= guc->max_guc_ids);
> > +	GEM_BUG_ON(index >= guc->submission_state.max_guc_ids);
> >   	return &base[index];
> >   }
> > @@ -353,7 +353,7 @@ static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
> >   {
> >   	struct intel_context *ce = xa_load(&guc->context_lookup, id);
> > -	GEM_BUG_ON(id >= guc->max_guc_ids);
> > +	GEM_BUG_ON(id >= guc->submission_state.max_guc_ids);
> >   	return ce;
> >   }
> > @@ -363,7 +363,8 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
> >   	u32 size;
> >   	int ret;
> > -	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
> > +	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
> > +			  guc->submission_state.max_guc_ids);
> >   	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
> >   					     (void **)&guc->lrc_desc_pool_vaddr);
> >   	if (ret)
> > @@ -711,6 +712,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   			if (deregister)
> >   				guc_signal_context_fence(ce);
> >   			if (destroyed) {
> > +				intel_gt_pm_put_async(guc_to_gt(guc));
> >   				release_guc_id(guc, ce);
> >   				__guc_context_destroy(ce);
> >   			}
> > @@ -789,6 +791,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
> >   	spin_unlock_irqrestore(&sched_engine->lock, flags);
> >   }
> > +static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> > +
> >   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   {
> >   	if (unlikely(!guc_submission_initialized(guc))) {
> > @@ -805,6 +809,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
> >   	flush_work(&guc->ct.requests.worker);
> > +	guc_flush_destroyed_contexts(guc);
> >   	scrub_guc_desc_for_outstanding_g2h(guc);
> >   }
> > @@ -1102,6 +1107,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> >   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
> >   }
> > +static void destroyed_worker_func(struct work_struct *w);
> > +
> >   /*
> >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> >    * at firmware loading time.
> > @@ -1124,9 +1131,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
> > -	spin_lock_init(&guc->contexts_lock);
> > -	INIT_LIST_HEAD(&guc->guc_id_list);
> > -	ida_init(&guc->guc_ids);
> > +	spin_lock_init(&guc->submission_state.lock);
> > +	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> > +	ida_init(&guc->submission_state.guc_ids);
> > +	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> > +	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
> >   	return 0;
> >   }
> > @@ -1137,6 +1146,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> >   		return;
> >   	guc_lrc_desc_pool_destroy(guc);
> > +	guc_flush_destroyed_contexts(guc);
> >   	i915_sched_engine_put(guc->sched_engine);
> >   }
> > @@ -1191,15 +1201,16 @@ static void guc_submit_request(struct i915_request *rq)
> >   static int new_guc_id(struct intel_guc *guc)
> >   {
> > -	return ida_simple_get(&guc->guc_ids, 0,
> > -			      guc->num_guc_ids, GFP_KERNEL |
> > +	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> > +			      guc->submission_state.num_guc_ids, GFP_KERNEL |
> >   			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> >   }
> >   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> >   	if (!context_guc_id_invalid(ce)) {
> > -		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
> > +		ida_simple_remove(&guc->submission_state.guc_ids,
> > +				  ce->guc_id.id);
> >   		reset_lrc_desc(guc, ce->guc_id.id);
> >   		set_context_guc_id_invalid(ce);
> >   	}
> > @@ -1211,9 +1222,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> >   	unsigned long flags;
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	__release_guc_id(guc, ce);
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> >   static int steal_guc_id(struct intel_guc *guc)
> > @@ -1221,10 +1232,10 @@ static int steal_guc_id(struct intel_guc *guc)
> >   	struct intel_context *ce;
> >   	int guc_id;
> > -	lockdep_assert_held(&guc->contexts_lock);
> > +	lockdep_assert_held(&guc->submission_state.lock);
> > -	if (!list_empty(&guc->guc_id_list)) {
> > -		ce = list_first_entry(&guc->guc_id_list,
> > +	if (!list_empty(&guc->submission_state.guc_id_list)) {
> > +		ce = list_first_entry(&guc->submission_state.guc_id_list,
> >   				      struct intel_context,
> >   				      guc_id.link);
> > @@ -1249,7 +1260,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
> >   {
> >   	int ret;
> > -	lockdep_assert_held(&guc->contexts_lock);
> > +	lockdep_assert_held(&guc->submission_state.lock);
> >   	ret = new_guc_id(guc);
> >   	if (unlikely(ret < 0)) {
> > @@ -1271,7 +1282,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> >   try_again:
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	might_lock(&ce->guc_state.lock);
> > @@ -1286,7 +1297,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	atomic_inc(&ce->guc_id.ref);
> >   out_unlock:
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   	/*
> >   	 * -EAGAIN indicates no guc_id are available, let's retire any
> > @@ -1322,11 +1333,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	if (unlikely(context_guc_id_invalid(ce)))
> >   		return;
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
> >   	    !atomic_read(&ce->guc_id.ref))
> > -		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +		list_add_tail(&ce->guc_id.link,
> > +			      &guc->submission_state.guc_id_list);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> >   static int __guc_action_register_context(struct intel_guc *guc,
> > @@ -1841,11 +1853,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
> >   static void guc_lrc_desc_unpin(struct intel_context *ce)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	unsigned long flags;
> > +	bool disabled;
> > +	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
> >   	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
> >   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
> >   	GEM_BUG_ON(context_enabled(ce));
> > +	/* Seal race with Reset */
> > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +	disabled = submission_disabled(guc);
> > +	if (likely(!disabled)) {
> > +		__intel_gt_pm_get(gt);
> > +		set_context_destroyed(ce);
> > +		clr_context_registered(ce);
> > +	}
> > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	if (unlikely(disabled)) {
> > +		release_guc_id(guc, ce);
> > +		__guc_context_destroy(ce);
> > +		return;
> > +	}
> > +
> >   	deregister_context(ce, ce->guc_id.id, true);
> >   }
> > @@ -1873,78 +1904,88 @@ static void __guc_context_destroy(struct intel_context *ce)
> >   	}
> >   }
> > +static void guc_flush_destroyed_contexts(struct intel_guc *guc)
> > +{
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +
> > +	GEM_BUG_ON(!submission_disabled(guc) &&
> > +		   guc_submission_initialized(guc));
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	list_for_each_entry_safe(ce, cn,
> > +				 &guc->submission_state.destroyed_contexts,
> > +				 destroyed_link) {
> > +		list_del_init(&ce->destroyed_link);
> > +		__release_guc_id(guc, ce);
> > +		__guc_context_destroy(ce);
> > +	}
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +}
> > +
> > +static void deregister_destroyed_contexts(struct intel_guc *guc)
> > +{
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	list_for_each_entry_safe(ce, cn,
> > +				 &guc->submission_state.destroyed_contexts,
> > +				 destroyed_link) {
> > +		list_del_init(&ce->destroyed_link);
> > +		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +		guc_lrc_desc_unpin(ce);
> > +		spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	}
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +}
> > +
> > +static void destroyed_worker_func(struct work_struct *w)
> > +{
> > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +					     submission_state.destroyed_worker);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	int tmp;
> > +
> > +	with_intel_gt_pm(gt, tmp)
> > +		deregister_destroyed_contexts(guc);
> > +}
> > +
> >   static void guc_context_destroy(struct kref *kref)
> >   {
> >   	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> > -	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > -	intel_wakeref_t wakeref;
> >   	unsigned long flags;
> > -	bool disabled;
> > +	bool destroy;
> >   	/*
> >   	 * If the guc_id is invalid this context has been stolen and we can free
> >   	 * it immediately. Also can be freed immediately if the context is not
> >   	 * registered with the GuC or the GuC is in the middle of a reset.
> >   	 */
> > -	if (context_guc_id_invalid(ce)) {
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	} else if (submission_disabled(guc) ||
> > -		   !lrc_desc_registered(guc, ce->guc_id.id)) {
> > -		release_guc_id(guc, ce);
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	}
> > -
> > -	/*
> > -	 * We have to acquire the context spinlock and check guc_id again, if it
> > -	 * is valid it hasn't been stolen and needs to be deregistered. We
> > -	 * delete this context from the list of unpinned guc_id available to
> > -	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
> > -	 * returns indicating this context has been deregistered the guc_id is
> > -	 * returned to the pool of available guc_id.
> > -	 */
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > -	if (context_guc_id_invalid(ce)) {
> > -		spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	}
> > -
> > -	if (!list_empty(&ce->guc_id.link))
> > -		list_del_init(&ce->guc_id.link);
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > -
> > -	/* Seal race with Reset */
> > -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > -	disabled = submission_disabled(guc);
> > -	if (likely(!disabled)) {
> > -		set_context_destroyed(ce);
> > -		clr_context_registered(ce);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
> > +		!lrc_desc_registered(guc, ce->guc_id.id);
> > +	if (likely(!destroy)) {
> > +		if (!list_empty(&ce->guc_id.link))
> > +			list_del_init(&ce->guc_id.link);
> > +		list_add_tail(&ce->destroyed_link,
> > +			      &guc->submission_state.destroyed_contexts);
> > +	} else {
> > +		__release_guc_id(guc, ce);
> >   	}
> > -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > -	if (unlikely(disabled)) {
> > -		release_guc_id(guc, ce);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +	if (unlikely(destroy)) {
> >   		__guc_context_destroy(ce);
> >   		return;
> >   	}
> >   	/*
> > -	 * We defer GuC context deregistration until the context is destroyed
> > -	 * in order to save on CTBs. With this optimization ideally we only need
> > -	 * 1 CTB to register the context during the first pin and 1 CTB to
> > -	 * deregister the context when the context is destroyed. Without this
> > -	 * optimization, a CTB would be needed every pin & unpin.
> > -	 *
> > -	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
> > -	 * from context_free_worker when runtime wakeref is not held.
> > -	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
> > -	 * in H2G CTB to deregister the context. A future patch may defer this
> > -	 * H2G CTB if the runtime wakeref is zero.
> > +	 * We use a worker to issue the H2G to deregister the context as we can
> > +	 * take the GT PM for the first time which isn't allowed from an atomic
> > +	 * context.
> >   	 */
> > -	with_intel_runtime_pm(runtime_pm, wakeref)
> > -		guc_lrc_desc_unpin(ce);
> > +	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
> >   }
> >   static int guc_context_alloc(struct intel_context *ce)
> > @@ -2703,8 +2744,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
> >   void intel_guc_submission_init_early(struct intel_guc *guc)
> >   {
> > -	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> > -	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> > +	guc->submission_state.max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> > +	guc->submission_state.num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> >   	guc->submission_supported = __guc_submission_supported(guc);
> >   	guc->submission_selected = __guc_submission_selected(guc);
> >   }
> > @@ -2714,10 +2755,10 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> >   {
> >   	struct intel_context *ce;
> > -	if (unlikely(desc_idx >= guc->max_guc_ids)) {
> > +	if (unlikely(desc_idx >= guc->submission_state.max_guc_ids)) {
> >   		drm_err(&guc_to_gt(guc)->i915->drm,
> >   			"Invalid desc_idx %u, max %u",
> > -			desc_idx, guc->max_guc_ids);
> > +			desc_idx, guc->submission_state.max_guc_ids);
> >   		return NULL;
> >   	}
> > @@ -2771,6 +2812,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
> >   		intel_context_put(ce);
> >   	} else if (context_destroyed(ce)) {
> >   		/* Context has been destroyed */
> > +		intel_gt_pm_put_async(guc_to_gt(guc));
> >   		release_guc_id(guc, ce);
> >   		__guc_context_destroy(ce);
> >   	}
> > @@ -3065,8 +3107,10 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> >   	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
> >   		   atomic_read(&guc->outstanding_submission_g2h));
> > -	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
> > -	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
> > +	drm_printf(p, "GuC Number GuC IDs: %u\n",
> > +		   guc->submission_state.num_guc_ids);
> > +	drm_printf(p, "GuC Max GuC IDs: %u\n",
> > +		   guc->submission_state.max_guc_ids);
> >   	drm_printf(p, "GuC tasklet count: %u\n\n",
> >   		   atomic_read(&sched_engine->tasklet.count));
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 05/27] drm/i915: Add GT PM unpark worker
  2021-09-09 22:36   ` John Harrison
@ 2021-09-10  0:34     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-10  0:34 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Thu, Sep 09, 2021 at 03:36:44PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Sometimes it is desirable to queue work up for later if the GT PM isn't
> > held and run that work on the next GT PM unpark.
> What is the reason for doing this? Why is it important? Why not just take
> the GT PM at the time the work is requested?
>

This was a suggestion from a long time back: don't take the GT PM (waking
the GPU) if it is idle. I believe Daniele suggested this a couple of years
ago / we still have this FIXME comment in DII.
 
> > 
> > Implemented with a list in the GT of all pending work, workqueues in
> > the list, a callback to add a workqueue to the list, and finally a
> > wakeref post_get callback that iterates / drains the list + queues the
> > workqueues.
> > 
> > First user of this is deregistration of GuC contexts.
> This statement should be in the first paragraph but needs to be more
> detailed - why is it necessary to add all this extra complexity rather than
> just taking the GT PM?
> 
>

This is just an optimization - don't take the GT PM when deregistering a
context if the GT is idle. If you think it is too complex we can
delete this but IMO it really isn't any more complex than the previous
patch as either way we need a list + worker. Also once we have this
framework in place we might be able to find other users of this.
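
Usage-wise the API is small; a sketch (my_worker_fn and client are
placeholders, not code from the patch):

	static void my_worker_fn(struct work_struct *w)
	{
		/* runs in process context, so taking GT PM is allowed */
	}

	/* in the client's init path, client->work is a
	 * struct intel_gt_pm_unpark_work */
	intel_gt_pm_unpark_work_init(&client->work, my_worker_fn);

	/*
	 * later: queues the worker immediately if the GT is awake,
	 * otherwise defers it until the next unpark
	 */
	intel_gt_pm_unpark_work_add(gt, &client->work);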
 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/Makefile                 |  1 +
> >   drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  8 ++++
> >   .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c | 35 ++++++++++++++++
> >   .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h | 40 +++++++++++++++++++
> >   drivers/gpu/drm/i915/gt/intel_gt_types.h      | 10 +++++
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  8 ++--
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++--
> >   drivers/gpu/drm/i915/intel_wakeref.c          |  5 +++
> >   drivers/gpu/drm/i915/intel_wakeref.h          |  1 +
> >   10 files changed, 119 insertions(+), 7 deletions(-)
> >   create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> >   create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> > 
> > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > index 642a5b5a1b81..579bdc069f25 100644
> > --- a/drivers/gpu/drm/i915/Makefile
> > +++ b/drivers/gpu/drm/i915/Makefile
> > @@ -103,6 +103,7 @@ gt-y += \
> >   	gt/intel_gt_clock_utils.o \
> >   	gt/intel_gt_irq.o \
> >   	gt/intel_gt_pm.o \
> > +	gt/intel_gt_pm_unpark_work.o \
> >   	gt/intel_gt_pm_irq.o \
> >   	gt/intel_gt_requests.o \
> >   	gt/intel_gtt.o \
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> > index 62d40c986642..7e690e74baa2 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> > @@ -29,6 +29,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
> >   	spin_lock_init(&gt->irq_lock);
> > +	spin_lock_init(&gt->pm_unpark_work_lock);
> > +	INIT_LIST_HEAD(&gt->pm_unpark_work_list);
> > +
> >   	INIT_LIST_HEAD(&gt->closed_vma);
> >   	spin_lock_init(&gt->closed_lock);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > index dea8e2479897..564c11a3748b 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > @@ -90,6 +90,13 @@ static int __gt_unpark(struct intel_wakeref *wf)
> >   	return 0;
> >   }
> > +static void __gt_unpark_work_queue(struct intel_wakeref *wf)
> > +{
> > +	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> > +
> > +	intel_gt_pm_unpark_work_queue(gt);
> > +}
> > +
> >   static int __gt_park(struct intel_wakeref *wf)
> >   {
> >   	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> > @@ -118,6 +125,7 @@ static int __gt_park(struct intel_wakeref *wf)
> >   static const struct intel_wakeref_ops wf_ops = {
> >   	.get = __gt_unpark,
> > +	.post_get = __gt_unpark_work_queue,
> >   	.put = __gt_park,
> >   };
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> > new file mode 100644
> > index 000000000000..23162dbd0c35
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> > @@ -0,0 +1,35 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2021 Intel Corporation
> > + */
> > +
> > +#include "i915_drv.h"
> > +#include "intel_runtime_pm.h"
> > +#include "intel_gt_pm.h"
> > +
> > +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt)
> > +{
> > +	struct intel_gt_pm_unpark_work *work, *next;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> > +	list_for_each_entry_safe(work, next,
> > +				 &gt->pm_unpark_work_list, link) {
> > +		list_del_init(&work->link);
> > +		queue_work(system_unbound_wq, &work->worker);
> > +	}
> > +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> > +}
> > +
> > +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> > +				 struct intel_gt_pm_unpark_work *work)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> > +	if (intel_gt_pm_is_awake(gt))
> > +		queue_work(system_unbound_wq, &work->worker);
> > +	else if (list_empty(&work->link))
> > +		list_add_tail(&work->link, &gt->pm_unpark_work_list);
> > +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> > +}
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> > new file mode 100644
> > index 000000000000..eaf1dc313aa2
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> > @@ -0,0 +1,40 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2021 Intel Corporation
> > + */
> > +
> > +#ifndef INTEL_GT_PM_UNPARK_WORK_H
> > +#define INTEL_GT_PM_UNPARK_WORK_H
> > +
> > +#include <linux/list.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct intel_gt;
> > +
> > +/**
> > + * struct intel_gt_pm_unpark_work - work to be scheduled when GT unparked
> > + */
> > +struct intel_gt_pm_unpark_work {
> > +	/**
> > +	 * @link: link into gt->pm_unpark_work_list of workers that need to be
> > +	 * scheduled when GT is unparked, protected by gt->pm_unpark_work_lock
> > +	 */
> > +	struct list_head link;
> > +	/** @worker: will be scheduled when GT unparked */
> > +	struct work_struct worker;
> > +};
> > +
> > +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt);
> > +
> > +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> > +				 struct intel_gt_pm_unpark_work *work);
> > +
> > +static inline void
> > +intel_gt_pm_unpark_work_init(struct intel_gt_pm_unpark_work *work,
> > +			     work_func_t fn)
> > +{
> > +	INIT_LIST_HEAD(&work->link);
> > +	INIT_WORK(&work->worker, fn);
> > +}
> > +
> > +#endif /* INTEL_GT_PM_UNPARK_WORK_H */
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> > index a81e21bf1bd1..4480312f0add 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> > @@ -96,6 +96,16 @@ struct intel_gt {
> >   	struct intel_wakeref wakeref;
> >   	atomic_t user_wakeref;
> > +	/**
> > +	 * @pm_unpark_work_list: list of delayed work to be scheduled when GT is
> > +	 * unparked, protected by pm_unpark_work_lock
> > +	 */
> > +	struct list_head pm_unpark_work_list;
> > +	/**
> > +	 * @pm_unpark_work_lock: protects pm_unpark_work_list
> > +	 */
> > +	spinlock_t pm_unpark_work_lock;
> > +
> >   	struct list_head closed_vma;
> >   	spinlock_t closed_lock; /* guards the list of closed_vma */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 7358883f1540..023953e77553 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -19,6 +19,7 @@
> >   #include "intel_uc_fw.h"
> >   #include "i915_utils.h"
> >   #include "i915_vma.h"
> > +#include "gt/intel_gt_pm_unpark_work.h"
> >   struct __guc_ads_blob;
> > @@ -78,11 +79,12 @@ struct intel_guc {
> >   		 */
> >   		struct list_head destroyed_contexts;
> >   		/**
> > -		 * @destroyed_worker: worker to deregister contexts, need as we
> > +		 * @destroyed_worker: Worker to deregister contexts, need as we
> >   		 * need to take a GT PM reference and can't from destroy
> > -		 * function as it might be in an atomic context (no sleeping)
> > +		 * function as it might be in an atomic context (no sleeping).
> These corrections should be squashed into the previous patch that added
> these comments in the first place.
> 

Yep.

Matt

> John.
> 
> 
> > +		 * Worker only issues deregister when GT is unparked.
> >   		 */
> > -		struct work_struct destroyed_worker;
> > +		struct intel_gt_pm_unpark_work destroyed_worker;
> >   	} submission_state;
> >   	bool submission_supported;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index f835e06e5f9f..dbf919801de2 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1135,7 +1135,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> >   	ida_init(&guc->submission_state.guc_ids);
> >   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> > -	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
> > +	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
> > +				     destroyed_worker_func);
> >   	return 0;
> >   }
> > @@ -1942,13 +1943,18 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
> >   static void destroyed_worker_func(struct work_struct *w)
> >   {
> > -	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +	struct intel_gt_pm_unpark_work *destroyed_worker =
> > +		container_of(w, struct intel_gt_pm_unpark_work, worker);
> > +	struct intel_guc *guc = container_of(destroyed_worker, struct intel_guc,
> >   					     submission_state.destroyed_worker);
> >   	struct intel_gt *gt = guc_to_gt(guc);
> >   	int tmp;
> > -	with_intel_gt_pm(gt, tmp)
> > +	with_intel_gt_pm_if_awake(gt, tmp)
> >   		deregister_destroyed_contexts(guc);
> > +
> > +	if (!list_empty(&guc->submission_state.destroyed_contexts))
> > +		intel_gt_pm_unpark_work_add(gt, destroyed_worker);
> >   }
> >   static void guc_context_destroy(struct kref *kref)
> > @@ -1985,7 +1991,8 @@ static void guc_context_destroy(struct kref *kref)
> >   	 * take the GT PM for the first time which isn't allowed from an atomic
> >   	 * context.
> >   	 */
> > -	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
> > +	intel_gt_pm_unpark_work_add(guc_to_gt(guc),
> > +				    &guc->submission_state.destroyed_worker);
> >   }
> >   static int guc_context_alloc(struct intel_context *ce)
> > diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
> > index dfd87d082218..282fc4f312e3 100644
> > --- a/drivers/gpu/drm/i915/intel_wakeref.c
> > +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> > @@ -24,6 +24,8 @@ static void rpm_put(struct intel_wakeref *wf)
> >   int __intel_wakeref_get_first(struct intel_wakeref *wf)
> >   {
> > +	bool do_post = false;
> > +
> >   	/*
> >   	 * Treat get/put as different subclasses, as we may need to run
> >   	 * the put callback from under the shrinker and do not want to
> > @@ -44,8 +46,11 @@ int __intel_wakeref_get_first(struct intel_wakeref *wf)
> >   		}
> >   		smp_mb__before_atomic(); /* release wf->count */
> > +		do_post = true;
> >   	}
> >   	atomic_inc(&wf->count);
> > +	if (do_post && wf->ops->post_get)
> > +		wf->ops->post_get(wf);
> >   	mutex_unlock(&wf->mutex);
> >   	INTEL_WAKEREF_BUG_ON(atomic_read(&wf->count) <= 0);
> > diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> > index 545c8f277c46..ef7e6a698e8a 100644
> > --- a/drivers/gpu/drm/i915/intel_wakeref.h
> > +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> > @@ -30,6 +30,7 @@ typedef depot_stack_handle_t intel_wakeref_t;
> >   struct intel_wakeref_ops {
> >   	int (*get)(struct intel_wakeref *wf);
> > +	void (*post_get)(struct intel_wakeref *wf);
> >   	int (*put)(struct intel_wakeref *wf);
> >   };
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 06/27] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-09-09 22:46   ` John Harrison
@ 2021-09-10  0:41     ` Matthew Brost
  2021-09-13 22:26       ` John Harrison
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-10  0:41 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Thu, Sep 09, 2021 at 03:46:43PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > circuiting while scheduling of a user context could be enabled.
> As with the earlier PM patch, this needs more explanation of what the
> problem is and why it is only now a problem.
> 
> 

Same explanation, will add here.

> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add might_lock annotations to pin / unpin function
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |  3 ++
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 15 ++++++++
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
> >   drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
> >   5 files changed, 73 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index c8595da64ad8..508cfe5770c0 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> >   	if (err)
> >   		goto err_post_unpin;
> > +	intel_engine_pm_might_get(ce->engine);
> > +
> >   	if (unlikely(intel_context_is_closed(ce))) {
> >   		err = -ENOENT;
> >   		goto err_unlock;
> > @@ -313,6 +315,7 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub)
> >   		return;
> >   	CE_TRACE(ce, "unpin\n");
> > +	intel_engine_pm_might_put(ce->engine);
> >   	ce->ops->unpin(ce);
> >   	ce->ops->post_unpin(ce);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > index 17a5028ea177..3fe2ae1bcc26 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > @@ -9,6 +9,7 @@
> >   #include "i915_request.h"
> >   #include "intel_engine_types.h"
> >   #include "intel_wakeref.h"
> > +#include "intel_gt_pm.h"
> >   static inline bool
> >   intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> > @@ -31,6 +32,13 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
> >   	return intel_wakeref_get_if_active(&engine->wakeref);
> >   }
> > +static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
> > +{
> > +	if (!intel_engine_is_virtual(engine))
> > +		intel_wakeref_might_get(&engine->wakeref);
> Why doesn't this need to iterate through the physical engines of the virtual
> engine?
> 

Yea, technically it should. This is just an annotation though, to check
if we do something horribly wrong in our code. If we use any physical
engine in our stack this annotation should pop and we can fix it. I just
don't see what making this 100% correct for virtual engines buys us. If
you want I can fix this, but I think the more complex we make this
annotation the less likely it is to just get compiled out with lockdep
off, which is what we are aiming for.
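
That said, a virtual-engine-aware version would look something like the
sketch below (hypothetical, not what the patch does: it walks the
physical siblings via the virtual engine's mask):

  static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
  {
  	if (intel_engine_is_virtual(engine)) {
  		struct intel_engine_cs *sibling;
  		intel_engine_mask_t tmp, mask = engine->mask;

  		/* Annotate each physical engine backing the virtual one */
  		for_each_engine_masked(sibling, engine->gt, mask, tmp)
  			intel_wakeref_might_get(&sibling->wakeref);
  	} else {
  		intel_wakeref_might_get(&engine->wakeref);
  	}
  	intel_gt_pm_might_get(engine->gt);
  }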

Matt

> John.
> 
> > +	intel_gt_pm_might_get(engine->gt);
> > +}
> > +
> >   static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
> >   {
> >   	intel_wakeref_put(&engine->wakeref);
> > @@ -52,6 +60,13 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
> >   	intel_wakeref_unlock_wait(&engine->wakeref);
> >   }
> > +static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
> > +{
> > +	if (!intel_engine_is_virtual(engine))
> > +		intel_wakeref_might_put(&engine->wakeref);
> > +	intel_gt_pm_might_put(engine->gt);
> > +}
> > +
> >   static inline struct i915_request *
> >   intel_engine_create_kernel_request(struct intel_engine_cs *engine)
> >   {
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > index a17bf0d4592b..3c173033ce23 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > @@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
> >   	return intel_wakeref_get_if_active(&gt->wakeref);
> >   }
> > +static inline void intel_gt_pm_might_get(struct intel_gt *gt)
> > +{
> > +	intel_wakeref_might_get(&gt->wakeref);
> > +}
> > +
> >   static inline void intel_gt_pm_put(struct intel_gt *gt)
> >   {
> >   	intel_wakeref_put(&gt->wakeref);
> > @@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
> >   	intel_wakeref_put_async(&gt->wakeref);
> >   }
> > +static inline void intel_gt_pm_might_put(struct intel_gt *gt)
> > +{
> > +	intel_wakeref_might_put(&gt->wakeref);
> > +}
> > +
> >   #define with_intel_gt_pm(gt, tmp) \
> >   	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> >   	     intel_gt_pm_put(gt), tmp = 0)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index dbf919801de2..e0eed70f9b92 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1550,7 +1550,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
> >   static int guc_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> > -	return __guc_context_pin(ce, ce->engine, vaddr);
> > +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > +
> > +	if (likely(!ret && !intel_context_is_barrier(ce)))
> > +		intel_engine_pm_get(ce->engine);
> > +
> > +	return ret;
> >   }
> >   static void guc_context_unpin(struct intel_context *ce)
> > @@ -1559,6 +1564,9 @@ static void guc_context_unpin(struct intel_context *ce)
> >   	unpin_guc_id(guc, ce);
> >   	lrc_unpin(ce);
> > +
> > +	if (likely(!intel_context_is_barrier(ce)))
> > +		intel_engine_pm_put_async(ce->engine);
> >   }
> >   static void guc_context_post_unpin(struct intel_context *ce)
> > @@ -2328,8 +2336,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
> >   static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> >   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > +	int ret = __guc_context_pin(ce, engine, vaddr);
> > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > +
> > +	if (likely(!ret))
> > +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > +			intel_engine_pm_get(engine);
> > -	return __guc_context_pin(ce, engine, vaddr);
> > +	return ret;
> > +}
> > +
> > +static void guc_virtual_context_unpin(struct intel_context *ce)
> > +{
> > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > +	struct intel_engine_cs *engine;
> > +	struct intel_guc *guc = ce_to_guc(ce);
> > +
> > +	GEM_BUG_ON(context_enabled(ce));
> > +	GEM_BUG_ON(intel_context_is_barrier(ce));
> > +
> > +	unpin_guc_id(guc, ce);
> > +	lrc_unpin(ce);
> > +
> > +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > +		intel_engine_pm_put_async(engine);
> >   }
> >   static void guc_virtual_context_enter(struct intel_context *ce)
> > @@ -2366,7 +2396,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> >   	.pre_pin = guc_virtual_context_pre_pin,
> >   	.pin = guc_virtual_context_pin,
> > -	.unpin = guc_context_unpin,
> > +	.unpin = guc_virtual_context_unpin,
> >   	.post_unpin = guc_context_post_unpin,
> >   	.ban = guc_context_ban,
> > diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> > index ef7e6a698e8a..dd530ae028e0 100644
> > --- a/drivers/gpu/drm/i915/intel_wakeref.h
> > +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> > @@ -124,6 +124,12 @@ enum {
> >   	__INTEL_WAKEREF_PUT_LAST_BIT__
> >   };
> > +static inline void
> > +intel_wakeref_might_get(struct intel_wakeref *wf)
> > +{
> > +	might_lock(&wf->mutex);
> > +}
> > +
> >   /**
> >    * intel_wakeref_put_flags: Release the wakeref
> >    * @wf: the wakeref
> > @@ -171,6 +177,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
> >   			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
> >   }
> > +static inline void
> > +intel_wakeref_might_put(struct intel_wakeref *wf)
> > +{
> > +	might_lock(&wf->mutex);
> > +}
> > +
> >   /**
> >    * intel_wakeref_lock: Lock the wakeref (mutex)
> >    * @wf: the wakeref
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 05/27] drm/i915: Add GT PM unpark worker
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
  (?)
@ 2021-09-10  8:36   ` Tvrtko Ursulin
  2021-09-10 20:09     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-10  8:36 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu


On 20/08/2021 23:44, Matthew Brost wrote:
> Sometimes it is desirable to queue work up for later if the GT PM isn't
> held, and run that work on the next GT PM unpark.

Sounds maybe plausible, but it depends on how much work can happen on
unpark and whether it can have too much of a negative impact on latency
for interactive loads. Or, from the reverse angle, why wouldn't the work
be done on parking?

Also, what kind of mechanism do you have for dealing with too much stuff
being put on this list? Can there be pressure which triggers (or would
need to trigger) these deregistrations to happen at runtime (no
park/unpark transitions)?

> Implemented with a list in the GT of all pending work items, a callback
> to add a work item to the list, and finally a wakeref post_get callback
> that drains the list and queues the work items.
> 
> First user of this is deregistration of GuC contexts.

Does first imply there are more incoming?

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/Makefile                 |  1 +
>   drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
>   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  8 ++++
>   .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c | 35 ++++++++++++++++
>   .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h | 40 +++++++++++++++++++
>   drivers/gpu/drm/i915/gt/intel_gt_types.h      | 10 +++++
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  8 ++--
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++--
>   drivers/gpu/drm/i915/intel_wakeref.c          |  5 +++
>   drivers/gpu/drm/i915/intel_wakeref.h          |  1 +
>   10 files changed, 119 insertions(+), 7 deletions(-)
>   create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
>   create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> 
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 642a5b5a1b81..579bdc069f25 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -103,6 +103,7 @@ gt-y += \
>   	gt/intel_gt_clock_utils.o \
>   	gt/intel_gt_irq.o \
>   	gt/intel_gt_pm.o \
> +	gt/intel_gt_pm_unpark_work.o \
>   	gt/intel_gt_pm_irq.o \
>   	gt/intel_gt_requests.o \
>   	gt/intel_gtt.o \
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> index 62d40c986642..7e690e74baa2 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> @@ -29,6 +29,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
>   
>   	spin_lock_init(&gt->irq_lock);
>   
> +	spin_lock_init(&gt->pm_unpark_work_lock);
> +	INIT_LIST_HEAD(&gt->pm_unpark_work_list);
> +
>   	INIT_LIST_HEAD(&gt->closed_vma);
>   	spin_lock_init(&gt->closed_lock);
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index dea8e2479897..564c11a3748b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -90,6 +90,13 @@ static int __gt_unpark(struct intel_wakeref *wf)
>   	return 0;
>   }
>   
> +static void __gt_unpark_work_queue(struct intel_wakeref *wf)
> +{
> +	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> +
> +	intel_gt_pm_unpark_work_queue(gt);
> +}
> +
>   static int __gt_park(struct intel_wakeref *wf)
>   {
>   	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> @@ -118,6 +125,7 @@ static int __gt_park(struct intel_wakeref *wf)
>   
>   static const struct intel_wakeref_ops wf_ops = {
>   	.get = __gt_unpark,
> +	.post_get = __gt_unpark_work_queue,
>   	.put = __gt_park,
>   };
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> new file mode 100644
> index 000000000000..23162dbd0c35
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> @@ -0,0 +1,35 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021 Intel Corporation
> + */
> +
> +#include "i915_drv.h"
> +#include "intel_runtime_pm.h"
> +#include "intel_gt_pm.h"
> +
> +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt)
> +{
> +	struct intel_gt_pm_unpark_work *work, *next;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> +	list_for_each_entry_safe(work, next,
> +				 &gt->pm_unpark_work_list, link) {
> +		list_del_init(&work->link);
> +		queue_work(system_unbound_wq, &work->worker);
> +	}
> +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> +}
> +
> +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> +				 struct intel_gt_pm_unpark_work *work)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> +	if (intel_gt_pm_is_awake(gt))
> +		queue_work(system_unbound_wq, &work->worker);
> +	else if (list_empty(&work->link))

What's the list_empty check for, something can race by design?

> +		list_add_tail(&work->link, &gt->pm_unpark_work_list);
> +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> +}
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> new file mode 100644
> index 000000000000..eaf1dc313aa2
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2021 Intel Corporation
> + */
> +
> +#ifndef INTEL_GT_PM_UNPARK_WORK_H
> +#define INTEL_GT_PM_UNPARK_WORK_H
> +
> +#include <linux/list.h>
> +#include <linux/workqueue.h>
> +
> +struct intel_gt;
> +
> +/**
> + * struct intel_gt_pm_unpark_work - work to be scheduled when GT unparked
> + */
> +struct intel_gt_pm_unpark_work {
> +	/**
> +	 * @link: link into gt->pm_unpark_work_list of workers that need to be
> +	 * scheduled when GT is unparked, protected by gt->pm_unpark_work_lock
> +	 */
> +	struct list_head link;
> +	/** @worker: will be scheduled when GT unparked */
> +	struct work_struct worker;
> +};
> +
> +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt);
> +
> +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> +				 struct intel_gt_pm_unpark_work *work);
> +
> +static inline void
> +intel_gt_pm_unpark_work_init(struct intel_gt_pm_unpark_work *work,
> +			     work_func_t fn)
> +{
> +	INIT_LIST_HEAD(&work->link);
> +	INIT_WORK(&work->worker, fn);
> +}
> +
> +#endif /* INTEL_GT_PM_UNPARK_WORK_H */
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> index a81e21bf1bd1..4480312f0add 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> @@ -96,6 +96,16 @@ struct intel_gt {
>   	struct intel_wakeref wakeref;
>   	atomic_t user_wakeref;
>   
> +	/**
> +	 * @pm_unpark_work_list: list of delayed work to be scheduled when GT is
> +	 * unparked, protected by pm_unpark_work_lock
> +	 */
> +	struct list_head pm_unpark_work_list;
> +	/**
> +	 * @pm_unpark_work_lock: protects pm_unpark_work_list
> +	 */
> +	spinlock_t pm_unpark_work_lock;
> +
>   	struct list_head closed_vma;
>   	spinlock_t closed_lock; /* guards the list of closed_vma */
>   
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 7358883f1540..023953e77553 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -19,6 +19,7 @@
>   #include "intel_uc_fw.h"
>   #include "i915_utils.h"
>   #include "i915_vma.h"
> +#include "gt/intel_gt_pm_unpark_work.h"
>   
>   struct __guc_ads_blob;
>   
> @@ -78,11 +79,12 @@ struct intel_guc {
>   		 */
>   		struct list_head destroyed_contexts;
>   		/**
> -		 * @destroyed_worker: worker to deregister contexts, need as we
> +		 * @destroyed_worker: Worker to deregister contexts, need as we
>   		 * need to take a GT PM reference and can't from destroy
> -		 * function as it might be in an atomic context (no sleeping)
> +		 * function as it might be in an atomic context (no sleeping).
> +		 * Worker only issues deregister when GT is unparked.
>   		 */
> -		struct work_struct destroyed_worker;
> +		struct intel_gt_pm_unpark_work destroyed_worker;
>   	} submission_state;
>   
>   	bool submission_supported;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f835e06e5f9f..dbf919801de2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1135,7 +1135,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
>   	ida_init(&guc->submission_state.guc_ids);
>   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> -	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
> +	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
> +				     destroyed_worker_func);
>   
>   	return 0;
>   }
> @@ -1942,13 +1943,18 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
>   
>   static void destroyed_worker_func(struct work_struct *w)
>   {
> -	struct intel_guc *guc = container_of(w, struct intel_guc,
> +	struct intel_gt_pm_unpark_work *destroyed_worker =
> +		container_of(w, struct intel_gt_pm_unpark_work, worker);
> +	struct intel_guc *guc = container_of(destroyed_worker, struct intel_guc,
>   					     submission_state.destroyed_worker);
>   	struct intel_gt *gt = guc_to_gt(guc);
>   	int tmp;
>   
> -	with_intel_gt_pm(gt, tmp)
> +	with_intel_gt_pm_if_awake(gt, tmp)
>   		deregister_destroyed_contexts(guc);
> +
> +	if (!list_empty(&guc->submission_state.destroyed_contexts))
> +		intel_gt_pm_unpark_work_add(gt, destroyed_worker);

This is the worker itself, right?

There's a "if awake" here followed by another "if awake" inside 
intel_gt_pm_unpark_work_add which raises questions.

Second question is what's the list_empty for - why is the state of the 
list itself relevant to a single worker deciding whether to re-add 
itself to it or not? And is there a lock protecting this list?

Overall it feels questionable to have unpark work which apparently can
race with subsequent parking. Presumably you cannot have it run
synchronously on unpark due to execution context issues?

>   }
>   
>   static void guc_context_destroy(struct kref *kref)
> @@ -1985,7 +1991,8 @@ static void guc_context_destroy(struct kref *kref)
>   	 * take the GT PM for the first time which isn't allowed from an atomic
>   	 * context.
>   	 */
> -	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
> +	intel_gt_pm_unpark_work_add(guc_to_gt(guc),
> +				    &guc->submission_state.destroyed_worker);
>   }
>   
>   static int guc_context_alloc(struct intel_context *ce)
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
> index dfd87d082218..282fc4f312e3 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.c
> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> @@ -24,6 +24,8 @@ static void rpm_put(struct intel_wakeref *wf)
>   
>   int __intel_wakeref_get_first(struct intel_wakeref *wf)
>   {
> +	bool do_post = false;
> +
>   	/*
>   	 * Treat get/put as different subclasses, as we may need to run
>   	 * the put callback from under the shrinker and do not want to
> @@ -44,8 +46,11 @@ int __intel_wakeref_get_first(struct intel_wakeref *wf)
>   		}
>   
>   		smp_mb__before_atomic(); /* release wf->count */
> +		do_post = true;
>   	}
>   	atomic_inc(&wf->count);
> +	if (do_post && wf->ops->post_get)
> +		wf->ops->post_get(wf);

Do you want this hook under the wf->mutex, and why?

Regards,

Tvrtko

>   	mutex_unlock(&wf->mutex);
>   
>   	INTEL_WAKEREF_BUG_ON(atomic_read(&wf->count) <= 0);
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> index 545c8f277c46..ef7e6a698e8a 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.h
> +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> @@ -30,6 +30,7 @@ typedef depot_stack_handle_t intel_wakeref_t;
>   
>   struct intel_wakeref_ops {
>   	int (*get)(struct intel_wakeref *wf);
> +	void (*post_get)(struct intel_wakeref *wf);
>   	int (*put)(struct intel_wakeref *wf);
>   };
>   
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-10 11:12   ` Tvrtko Ursulin
  2021-09-10 19:49     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-10 11:12 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu


On 20/08/2021 23:44, Matthew Brost wrote:
> Add logical engine mapping. This is required for split-frame, as
> workloads need to be placed on engines in a logically contiguous manner.
> 
> v2:
>   (Daniel Vetter)
>    - Add kernel doc for new fields
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
>   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  5 ++
>   .../drm/i915/gt/intel_execlists_submission.c  |  1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
>   5 files changed, 60 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 0d9105a31d84..4d790f9a65dd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
>   	GEM_DEBUG_WARN_ON(iir);
>   }
>   
> -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
> +			      u8 logical_instance)
>   {
>   	const struct engine_info *info = &intel_engines[id];
>   	struct drm_i915_private *i915 = gt->i915;
> @@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
>   
>   	engine->class = info->class;
>   	engine->instance = info->instance;
> +	engine->logical_mask = BIT(logical_instance);
>   	__sprint_engine_name(engine);
>   
>   	engine->props.heartbeat_interval_ms =
> @@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
>   	return info->engine_mask;
>   }
>   
> +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
> +				 u8 class, const u8 *map, u8 num_instances)
> +{
> +	int i, j;
> +	u8 current_logical_id = 0;
> +
> +	for (j = 0; j < num_instances; ++j) {
> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> +			if (!HAS_ENGINE(gt, i) ||
> +			    intel_engines[i].class != class)
> +				continue;
> +
> +			if (intel_engines[i].instance == map[j]) {
> +				logical_ids[intel_engines[i].instance] =
> +					current_logical_id++;
> +				break;
> +			}
> +		}
> +	}
> +}
> +
> +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
> +{
> +	int i;
> +	u8 map[MAX_ENGINE_INSTANCE + 1];
> +
> +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
> +		map[i] = i;

What's the point of the map array since it is 1:1 with instance?

> +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
> +}
> +
>   /**
>    * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
>    * @gt: pointer to struct intel_gt
> @@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>   	struct drm_i915_private *i915 = gt->i915;
>   	const unsigned int engine_mask = init_engine_mask(gt);
>   	unsigned int mask = 0;
> -	unsigned int i;
> +	unsigned int i, class;
> +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
>   	int err;
>   
>   	drm_WARN_ON(&i915->drm, engine_mask == 0);
> @@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>   	if (i915_inject_probe_failure(i915))
>   		return -ENODEV;
>   
> -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
> -		if (!HAS_ENGINE(gt, i))
> -			continue;
> +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
> +		setup_logical_ids(gt, logical_ids, class);
>   
> -		err = intel_engine_setup(gt, i);
> -		if (err)
> -			goto cleanup;
> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> +			u8 instance = intel_engines[i].instance;
> +
> +			if (intel_engines[i].class != class ||
> +			    !HAS_ENGINE(gt, i))
> +				continue;
>   
> -		mask |= BIT(i);
> +			err = intel_engine_setup(gt, i,
> +						 logical_ids[instance]);
> +			if (err)
> +				goto cleanup;
> +
> +			mask |= BIT(i);

I still think there is a less clunky way to set this up, in less code and
more readable at the same time. Like do it in two passes so you can
iterate the gt->engine_class[] array instead of having to implement a skip
condition (both on class and HAS_ENGINE at two places) and also avoid
walking the flat intel_engines array repeatedly.

> +		}
>   	}
>   
>   	/*
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index ed91bcff20eb..fddf35546b58 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -266,6 +266,11 @@ struct intel_engine_cs {
>   	unsigned int guc_id;
>   
>   	intel_engine_mask_t mask;
> +	/**
> +	 * @logical_mask: logical mask of engine, reported to user space via
> +	 * query IOCTL and used to communicate with the GuC in logical space
> +	 */
> +	intel_engine_mask_t logical_mask;

You could prefix the new field with uabi_ to match the existing scheme
and to signify to anyone who might be touching it in the future that it
should not be changed.

Also, I think the comment should explain what logical space is, i.e. how
the numbering works.

Not sure the part about GuC needs to be in the comment since uapi is 
supposed to be backend agnostic.

Regards,

Tvrtko

>   
>   	u8 class;
>   	u8 instance;
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index cafb0608ffb4..813a6de01382 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -3875,6 +3875,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>   
>   		ve->siblings[ve->num_siblings++] = sibling;
>   		ve->base.mask |= sibling->mask;
> +		ve->base.logical_mask |= sibling->logical_mask;
>   
>   		/*
>   		 * All physical engines must be compatible for their emission
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 6926919bcac6..9f5f43a16182 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
>   	for_each_engine(engine, gt, id) {
>   		u8 guc_class = engine_class_to_guc_class(engine->class);
>   
> -		system_info->mapping_table[guc_class][engine->instance] =
> +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
>   			engine->instance;
>   	}
>   }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index e0eed70f9b92..ffafbac7335e 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1401,23 +1401,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
>   	return __guc_action_deregister_context(guc, guc_id, loop);
>   }
>   
> -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
> -{
> -	switch (class) {
> -	case RENDER_CLASS:
> -		return mask >> RCS0;
> -	case VIDEO_ENHANCEMENT_CLASS:
> -		return mask >> VECS0;
> -	case VIDEO_DECODE_CLASS:
> -		return mask >> VCS0;
> -	case COPY_ENGINE_CLASS:
> -		return mask >> BCS0;
> -	default:
> -		MISSING_CASE(class);
> -		return 0;
> -	}
> -}
> -
>   static void guc_context_policy_init(struct intel_engine_cs *engine,
>   				    struct guc_lrc_desc *desc)
>   {
> @@ -1459,8 +1442,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   
>   	desc = __get_lrc_desc(guc, desc_idx);
>   	desc->engine_class = engine_class_to_guc_class(engine->class);
> -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> -						      engine->mask);
> +	desc->engine_submit_mask = engine->logical_mask;
>   	desc->hw_context_desc = ce->lrc.lrca;
>   	desc->priority = ce->guc_state.prio;
>   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> @@ -3260,6 +3242,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>   		}
>   
>   		ve->base.mask |= sibling->mask;
> +		ve->base.logical_mask |= sibling->logical_mask;
>   
>   		if (n != 0 && ve->base.class != sibling->class) {
>   			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-10 11:25   ` Tvrtko Ursulin
  2021-09-10 20:49     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-10 11:25 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu


On 20/08/2021 23:44, Matthew Brost wrote:
> For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
> mid BB. To safely enable preemption at the BB boundary, a handshake
> between the parent and child is needed. This is implemented via custom
> emit_bb_start & emit_fini_breadcrumb functions and enabled by default
> if a context is configured via the set parallel extension.

FWIW I think it's wrong to hardcode the requirements of a particular
hardware generation's fixed media pipeline into the uapi. IMO the better
solution was when the concept of parallel submission was decoupled from
the no-preemption-mid-batch preambles. Otherwise we might as well call the
extension I915_CONTEXT_ENGINES_EXT_MEDIA_SPLIT_FRAME_SUBMIT or something.

Regards,

Tvrtko
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 283 +++++++++++++++++-
>   4 files changed, 287 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 5615be32879c..2de62649e275 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -561,7 +561,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
>   	GEM_BUG_ON(intel_context_is_child(child));
>   	GEM_BUG_ON(intel_context_is_parent(child));
>   
> -	parent->guc_number_children++;
> +	child->guc_child_index = parent->guc_number_children++;
>   	list_add_tail(&child->guc_child_link,
>   		      &parent->guc_child_list);
>   	child->parent = parent;
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 713d85b0b364..727f91e7f7c2 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -246,6 +246,9 @@ struct intel_context {
>   		/** @guc_number_children: number of children if parent */
>   		u8 guc_number_children;
>   
> +		/** @guc_child_index: index into guc_child_list if child */
> +		u8 guc_child_index;
> +
>   		/**
>   		 * @parent_page: page in context used by parent for work queue,
>   		 * work queue descriptor
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 6cd26dc060d1..9f61cfa5566a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -188,7 +188,7 @@ struct guc_process_desc {
>   	u32 wq_status;
>   	u32 engine_presence;
>   	u32 priority;
> -	u32 reserved[30];
> +	u32 reserved[36];
>   } __packed;
>   
>   #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 91330525330d..1a18f99bf12a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -11,6 +11,7 @@
>   #include "gt/intel_context.h"
>   #include "gt/intel_engine_pm.h"
>   #include "gt/intel_engine_heartbeat.h"
> +#include "gt/intel_gpu_commands.h"
>   #include "gt/intel_gt.h"
>   #include "gt/intel_gt_irq.h"
>   #include "gt/intel_gt_pm.h"
> @@ -366,10 +367,14 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
>   
>   /*
>    * When using multi-lrc submission an extra page in the context state is
> - * reserved for the process descriptor and work queue.
> + * reserved for the process descriptor, work queue, and preempt BB boundary
> > + * handshake between the parent + children contexts.
>    *
>    * The layout of this page is below:
>    * 0						guc_process_desc
> + * + sizeof(struct guc_process_desc)		child go
> + * + CACHELINE_BYTES				child join ...
> + * + CACHELINE_BYTES ...
>    * ...						unused
>    * PAGE_SIZE / 2				work queue start
>    * ...						work queue
> @@ -1785,6 +1790,30 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
>   	return __guc_action_deregister_context(guc, guc_id, loop);
>   }
>   
> +static inline void clear_children_join_go_memory(struct intel_context *ce)
> +{
> +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> +	u8 i;
> +
> +	for (i = 0; i < ce->guc_number_children + 1; ++i)
> +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
> +}
> +
> +static inline u32 get_children_go_value(struct intel_context *ce)
> +{
> +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> +
> +	return mem[0];
> +}
> +
> +static inline u32 get_children_join_value(struct intel_context *ce,
> +					  u8 child_index)
> +{
> +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> +
> +	return mem[(child_index + 1) * (CACHELINE_BYTES / sizeof(u32))];
> +}
> +
>   static void guc_context_policy_init(struct intel_engine_cs *engine,
>   				    struct guc_lrc_desc *desc)
>   {
> @@ -1867,6 +1896,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>   			guc_context_policy_init(engine, desc);
>   		}
> +
> +		clear_children_join_go_memory(ce);
>   	}
>   
>   	/*
> @@ -2943,6 +2974,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
>   	.get_sibling = guc_virtual_get_sibling,
>   };
>   
> +/*
> + * The below override of the breadcrumbs is enabled when the user configures a
> + * context for parallel submission (multi-lrc, parent-child).
> + *
> + * The overridden breadcrumbs implements an algorithm which allows the GuC to
> + * safely preempt all the hw contexts configured for parallel submission
> + * between each BB. The contract between the i915 and GuC is if the parent
> + * context can be preempted, all the children can be preempted, and the GuC will
> + * always try to preempt the parent before the children. A handshake between the
> + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
> + * creating a window to preempt between each set of BBs.
> + */
> +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						     u64 offset, u32 len,
> +						     const unsigned int flags);
> +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> +						    u64 offset, u32 len,
> +						    const unsigned int flags);
> +static u32 *
> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						 u32 *cs);
> +static u32 *
> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> +						u32 *cs);
> +
>   static struct intel_context *
>   guc_create_parallel(struct intel_engine_cs **engines,
>   		    unsigned int num_siblings,
> @@ -2978,6 +3034,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
>   		}
>   	}
>   
> +	parent->engine->emit_bb_start =
> +		emit_bb_start_parent_no_preempt_mid_batch;
> +	parent->engine->emit_fini_breadcrumb =
> +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
> +	parent->engine->emit_fini_breadcrumb_dw =
> +		12 + 4 * parent->guc_number_children;
> +	for_each_child(parent, ce) {
> +		ce->engine->emit_bb_start =
> +			emit_bb_start_child_no_preempt_mid_batch;
> +		ce->engine->emit_fini_breadcrumb =
> +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
> +		ce->engine->emit_fini_breadcrumb_dw = 16;
> +	}
> +
>   	kfree(siblings);
>   	return parent;
>   
> @@ -3362,6 +3432,204 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
>   	guc->submission_selected = __guc_submission_selected(guc);
>   }
>   
> +static inline u32 get_children_go_addr(struct intel_context *ce)
> +{
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	return i915_ggtt_offset(ce->state) +
> +		__get_process_desc_offset(ce) +
> +		sizeof(struct guc_process_desc);
> +}
> +
> +static inline u32 get_children_join_addr(struct intel_context *ce,
> +					 u8 child_index)
> +{
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
> +}
> +
> +#define PARENT_GO_BB			1
> +#define PARENT_GO_FINI_BREADCRUMB	0
> +#define CHILD_GO_BB			1
> +#define CHILD_GO_FINI_BREADCRUMB	0
> +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						     u64 offset, u32 len,
> +						     const unsigned int flags)
> +{
> +	struct intel_context *ce = rq->context;
> +	u32 *cs;
> +	u8 i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	cs = intel_ring_begin(rq, 10 + 4 * ce->guc_number_children);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> > +	/* Wait on children */
> +	for (i = 0; i < ce->guc_number_children; ++i) {
> +		*cs++ = (MI_SEMAPHORE_WAIT |
> +			 MI_SEMAPHORE_GLOBAL_GTT |
> +			 MI_SEMAPHORE_POLL |
> +			 MI_SEMAPHORE_SAD_EQ_SDD);
> +		*cs++ = PARENT_GO_BB;
> +		*cs++ = get_children_join_addr(ce, i);
> +		*cs++ = 0;
> +	}
> +
> +	/* Turn off preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> +	*cs++ = MI_NOOP;
> +
> +	/* Tell children go */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  CHILD_GO_BB,
> +				  get_children_go_addr(ce),
> +				  0);
> +
> +	/* Jump to batch */
> +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> +	*cs++ = lower_32_bits(offset);
> +	*cs++ = upper_32_bits(offset);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> +						    u64 offset, u32 len,
> +						    const unsigned int flags)
> +{
> +	struct intel_context *ce = rq->context;
> +	u32 *cs;
> +
> +	GEM_BUG_ON(!intel_context_is_child(ce));
> +
> +	cs = intel_ring_begin(rq, 12);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	/* Signal parent */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  PARENT_GO_BB,
> +				  get_children_join_addr(ce->parent,
> +							 ce->guc_child_index),
> +				  0);
> +
> > +	/* Wait on parent for go */
> +	*cs++ = (MI_SEMAPHORE_WAIT |
> +		 MI_SEMAPHORE_GLOBAL_GTT |
> +		 MI_SEMAPHORE_POLL |
> +		 MI_SEMAPHORE_SAD_EQ_SDD);
> +	*cs++ = CHILD_GO_BB;
> +	*cs++ = get_children_go_addr(ce->parent);
> +	*cs++ = 0;
> +
> +	/* Turn off preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> +
> +	/* Jump to batch */
> +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> +	*cs++ = lower_32_bits(offset);
> +	*cs++ = upper_32_bits(offset);
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						 u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +	u8 i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	/* Wait on children */
> +	for (i = 0; i < ce->guc_number_children; ++i) {
> +		*cs++ = (MI_SEMAPHORE_WAIT |
> +			 MI_SEMAPHORE_GLOBAL_GTT |
> +			 MI_SEMAPHORE_POLL |
> +			 MI_SEMAPHORE_SAD_EQ_SDD);
> +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
> +		*cs++ = get_children_join_addr(ce, i);
> +		*cs++ = 0;
> +	}
> +
> +	/* Turn on preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> +	*cs++ = MI_NOOP;
> +
> +	/* Tell children go */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  CHILD_GO_FINI_BREADCRUMB,
> +				  get_children_go_addr(ce),
> +				  0);
> +
> +	/* Emit fini breadcrumb */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  rq->fence.seqno,
> +				  i915_request_active_timeline(rq)->hwsp_offset,
> +				  0);
> +
> +	/* User interrupt */
> +	*cs++ = MI_USER_INTERRUPT;
> +	*cs++ = MI_NOOP;
> +
> +	rq->tail = intel_ring_offset(rq, cs);
> +
> +	return cs;
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +
> +	GEM_BUG_ON(!intel_context_is_child(ce));
> +
> +	/* Turn on preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> +	*cs++ = MI_NOOP;
> +
> +	/* Signal parent */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  PARENT_GO_FINI_BREADCRUMB,
> +				  get_children_join_addr(ce->parent,
> +							 ce->guc_child_index),
> +				  0);
> +
> > +	/* Wait on parent for go */
> +	*cs++ = (MI_SEMAPHORE_WAIT |
> +		 MI_SEMAPHORE_GLOBAL_GTT |
> +		 MI_SEMAPHORE_POLL |
> +		 MI_SEMAPHORE_SAD_EQ_SDD);
> +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
> +	*cs++ = get_children_go_addr(ce->parent);
> +	*cs++ = 0;
> +
> +	/* Emit fini breadcrumb */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  rq->fence.seqno,
> +				  i915_request_active_timeline(rq)->hwsp_offset,
> +				  0);
> +
> +	/* User interrupt */
> +	*cs++ = MI_USER_INTERRUPT;
> +	*cs++ = MI_NOOP;
> +
> +	rq->tail = intel_ring_offset(rq, cs);
> +
> +	return cs;
> +}
> +
>   static struct intel_context *
>   g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>   {
> @@ -3807,6 +4075,19 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   			drm_printf(p, "\t\tWQI Status: %u\n\n",
>   				   READ_ONCE(desc->wq_status));
>   
> +			drm_printf(p, "\t\tNumber Children: %u\n\n",
> +				   ce->guc_number_children);
> +			if (ce->engine->emit_bb_start ==
> +			    emit_bb_start_parent_no_preempt_mid_batch) {
> +				u8 i;
> +
> +				drm_printf(p, "\t\tChildren Go: %u\n\n",
> +					   get_children_go_value(ce));
> +				for (i = 0; i < ce->guc_number_children; ++i)
> +					drm_printf(p, "\t\tChildren Join: %u\n",
> +						   get_children_join_value(ce, i));
> +			}
> +
>   			for_each_child(ce, child)
>   				guc_log_context(p, child);
>   		}
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-09-10 11:12   ` Tvrtko Ursulin
@ 2021-09-10 19:49     ` Matthew Brost
  2021-09-13  9:24       ` Tvrtko Ursulin
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-10 19:49 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Fri, Sep 10, 2021 at 12:12:42PM +0100, Tvrtko Ursulin wrote:
> 
> On 20/08/2021 23:44, Matthew Brost wrote:
> > Add logical engine mapping. This is required for split-frame, as
> > workloads need to be placed on engines in a logically contiguous manner.
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add kernel doc for new fields
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
> >   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  5 ++
> >   .../drm/i915/gt/intel_execlists_submission.c  |  1 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
> >   5 files changed, 60 insertions(+), 29 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > index 0d9105a31d84..4d790f9a65dd 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
> >   	GEM_DEBUG_WARN_ON(iir);
> >   }
> > -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> > +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
> > +			      u8 logical_instance)
> >   {
> >   	const struct engine_info *info = &intel_engines[id];
> >   	struct drm_i915_private *i915 = gt->i915;
> > @@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> >   	engine->class = info->class;
> >   	engine->instance = info->instance;
> > +	engine->logical_mask = BIT(logical_instance);
> >   	__sprint_engine_name(engine);
> >   	engine->props.heartbeat_interval_ms =
> > @@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
> >   	return info->engine_mask;
> >   }
> > +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
> > +				 u8 class, const u8 *map, u8 num_instances)
> > +{
> > +	int i, j;
> > +	u8 current_logical_id = 0;
> > +
> > +	for (j = 0; j < num_instances; ++j) {
> > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > +			if (!HAS_ENGINE(gt, i) ||
> > +			    intel_engines[i].class != class)
> > +				continue;
> > +
> > +			if (intel_engines[i].instance == map[j]) {
> > +				logical_ids[intel_engines[i].instance] =
> > +					current_logical_id++;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
> > +{
> > +	int i;
> > +	u8 map[MAX_ENGINE_INSTANCE + 1];
> > +
> > +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
> > +		map[i] = i;
> 
> What's the point of the map array since it is 1:1 with instance?
> 

Future products do not have a 1 to 1 mapping and that mapping can change
based on fusing, e.g. XeHP SDV.

Also technically ICL / TGL / ADL physical instance 2 maps to logical
instance 1.
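
For example (illustrative numbers, not from a real fusing table), on a
part where the video class has physical instances 0 and 2 present,
populate_logical_ids() ends up with:

  logical_ids[0] = 0;	/* physical VCS0 -> logical instance 0 */
  logical_ids[2] = 1;	/* physical VCS2 -> logical instance 1 */

so the logical numbering stays contiguous even though the physical
numbering has a hole.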

> > +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
> > +}
> > +
> >   /**
> >    * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
> >    * @gt: pointer to struct intel_gt
> > @@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> >   	struct drm_i915_private *i915 = gt->i915;
> >   	const unsigned int engine_mask = init_engine_mask(gt);
> >   	unsigned int mask = 0;
> > -	unsigned int i;
> > +	unsigned int i, class;
> > +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
> >   	int err;
> >   	drm_WARN_ON(&i915->drm, engine_mask == 0);
> > @@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> >   	if (i915_inject_probe_failure(i915))
> >   		return -ENODEV;
> > -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
> > -		if (!HAS_ENGINE(gt, i))
> > -			continue;
> > +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
> > +		setup_logical_ids(gt, logical_ids, class);
> > -		err = intel_engine_setup(gt, i);
> > -		if (err)
> > -			goto cleanup;
> > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > +			u8 instance = intel_engines[i].instance;
> > +
> > +			if (intel_engines[i].class != class ||
> > +			    !HAS_ENGINE(gt, i))
> > +				continue;
> > -		mask |= BIT(i);
> > +			err = intel_engine_setup(gt, i,
> > +						 logical_ids[instance]);
> > +			if (err)
> > +				goto cleanup;
> > +
> > +			mask |= BIT(i);
> 
> I still think there is a less clunky way to set this up, in less code and
> more readable at the same time. Like do it in two passes so you can iterate
> the gt->engine_class[] array instead of having to implement a skip condition
> (both on class and HAS_ENGINE in two places) and also avoid walking the flat
> intel_engines array repeatedly.
>

Kind of a bikeshed, arguing about a pretty simple loop structure, don't you
think? I personally like the way it is laid out.

Pseudo code for your suggestion?
 
> > +		}
> >   	}
> >   	/*
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > index ed91bcff20eb..fddf35546b58 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > @@ -266,6 +266,11 @@ struct intel_engine_cs {
> >   	unsigned int guc_id;
> >   	intel_engine_mask_t mask;
> > +	/**
> > +	 * @logical_mask: logical mask of engine, reported to user space via
> > +	 * query IOCTL and used to communicate with the GuC in logical space
> > +	 */
> > +	intel_engine_mask_t logical_mask;
> 
> You could prefix the new field with uabi_ to match the existing scheme and
> to signify to anyone who might be touching it in the future it should not be
> changed.

This is kinda uabi, but also kinda isn't. We do report a logical
instance via IOCTL but it also really is tied to the GuC backend as we
can only communicate with the GuC in logical space. IMO we should leave
it as is.

> 
> Also, I think comment should explain what is logical space ie. how the
> numbering works.
>

Don't I already do that? I suppose I could add something like:

The logical mask within engine class must be contiguous across all
instances.
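
So the full comment would read roughly:

	/**
	 * @logical_mask: logical mask of engine, reported to user space via
	 * query IOCTL and used to communicate with the GuC in logical space.
	 * The logical mask within an engine class must be contiguous across
	 * all instances.
	 */
	intel_engine_mask_t logical_mask;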

> Not sure the part about GuC needs to be in the comment since uapi is
> supposed to be backend agnostic.
> 

Communicating with the GuC in logical space is a pretty key point here.
The communication itself is backend specific, but how our hardware works
(e.g. split frame workloads must be placed on logically contiguous
instances) is not. Mentioning the GuC requirement here makes sense to me
for completeness.

Matt

> Regards,
> 
> Tvrtko
> 
> >   	u8 class;
> >   	u8 instance;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > index cafb0608ffb4..813a6de01382 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > @@ -3875,6 +3875,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> >   		ve->siblings[ve->num_siblings++] = sibling;
> >   		ve->base.mask |= sibling->mask;
> > +		ve->base.logical_mask |= sibling->logical_mask;
> >   		/*
> >   		 * All physical engines must be compatible for their emission
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > index 6926919bcac6..9f5f43a16182 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
> >   	for_each_engine(engine, gt, id) {
> >   		u8 guc_class = engine_class_to_guc_class(engine->class);
> > -		system_info->mapping_table[guc_class][engine->instance] =
> > +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
> >   			engine->instance;
> >   	}
> >   }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index e0eed70f9b92..ffafbac7335e 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1401,23 +1401,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> >   	return __guc_action_deregister_context(guc, guc_id, loop);
> >   }
> > -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
> > -{
> > -	switch (class) {
> > -	case RENDER_CLASS:
> > -		return mask >> RCS0;
> > -	case VIDEO_ENHANCEMENT_CLASS:
> > -		return mask >> VECS0;
> > -	case VIDEO_DECODE_CLASS:
> > -		return mask >> VCS0;
> > -	case COPY_ENGINE_CLASS:
> > -		return mask >> BCS0;
> > -	default:
> > -		MISSING_CASE(class);
> > -		return 0;
> > -	}
> > -}
> > -
> >   static void guc_context_policy_init(struct intel_engine_cs *engine,
> >   				    struct guc_lrc_desc *desc)
> >   {
> > @@ -1459,8 +1442,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	desc = __get_lrc_desc(guc, desc_idx);
> >   	desc->engine_class = engine_class_to_guc_class(engine->class);
> > -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> > -						      engine->mask);
> > +	desc->engine_submit_mask = engine->logical_mask;
> >   	desc->hw_context_desc = ce->lrc.lrca;
> >   	desc->priority = ce->guc_state.prio;
> >   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > @@ -3260,6 +3242,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> >   		}
> >   		ve->base.mask |= sibling->mask;
> > +		ve->base.logical_mask |= sibling->logical_mask;
> >   		if (n != 0 && ve->base.class != sibling->class) {
> >   			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
> > 


* Re: [Intel-gfx] [PATCH 05/27] drm/i915: Add GT PM unpark worker
  2021-09-10  8:36   ` Tvrtko Ursulin
@ 2021-09-10 20:09     ` Matthew Brost
  2021-09-13 10:33       ` Tvrtko Ursulin
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-10 20:09 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Fri, Sep 10, 2021 at 09:36:17AM +0100, Tvrtko Ursulin wrote:
> 
> On 20/08/2021 23:44, Matthew Brost wrote:
> > Sometimes it is desirable to queue work up for later if the GT PM isn't
> > held and run that work on next GT PM unpark.
> 
> Sounds maybe plausible, but it depends on how much work can happen on unpark
> and whether it can have too much of a negative impact on latency for
> interactive loads? Or from a reverse angle, why the work wouldn't be done on

All it does is add an interface to kick a work queue on unpark, i.e.
all the actual work is done async in the work queue so it shouldn't
add any latency.
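
Usage sketch, based only on the API added in this patch - the heavy
lifting lives in the callback, which runs later in process context:

static void deferred_func(struct work_struct *w)
{
	/* real work, runs on system_unbound_wq after unpark */
}

static struct intel_gt_pm_unpark_work deferred;

static void example(struct intel_gt *gt)
{
	intel_gt_pm_unpark_work_init(&deferred, deferred_func);

	/* queued immediately if the GT is awake, else on the next unpark */
	intel_gt_pm_unpark_work_add(gt, &deferred);
}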

> parking?
> 
> Also, what kind of mechanism do you have for dealing with too much stuff
> being put on this list? Can there be pressure which triggers (or would need to

No limits on pressure. See above, I don't think this is a concern.

> trigger) these deregistrations to happen at runtime (no park/unpark
> transitions)?
>
> > Implemented with a list in the GT of all pending work, workqueues in
> > the list, a callback to add a workqueue to the list, and finally a
> > wakeref post_get callback that iterates / drains the list + queues the
> > workqueues.
> > 
> > First user of this is deregistration of GuC contexts.
> 
> Does first imply there are more incoming?
>

Haven't found another user yet but this is a generic mechanism so we can
add more in the future if other use cases arise.
 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/Makefile                 |  1 +
> >   drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  8 ++++
> >   .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c | 35 ++++++++++++++++
> >   .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h | 40 +++++++++++++++++++
> >   drivers/gpu/drm/i915/gt/intel_gt_types.h      | 10 +++++
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  8 ++--
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++--
> >   drivers/gpu/drm/i915/intel_wakeref.c          |  5 +++
> >   drivers/gpu/drm/i915/intel_wakeref.h          |  1 +
> >   10 files changed, 119 insertions(+), 7 deletions(-)
> >   create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> >   create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> > 
> > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > index 642a5b5a1b81..579bdc069f25 100644
> > --- a/drivers/gpu/drm/i915/Makefile
> > +++ b/drivers/gpu/drm/i915/Makefile
> > @@ -103,6 +103,7 @@ gt-y += \
> >   	gt/intel_gt_clock_utils.o \
> >   	gt/intel_gt_irq.o \
> >   	gt/intel_gt_pm.o \
> > +	gt/intel_gt_pm_unpark_work.o \
> >   	gt/intel_gt_pm_irq.o \
> >   	gt/intel_gt_requests.o \
> >   	gt/intel_gtt.o \
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> > index 62d40c986642..7e690e74baa2 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> > @@ -29,6 +29,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
> >   	spin_lock_init(&gt->irq_lock);
> > +	spin_lock_init(&gt->pm_unpark_work_lock);
> > +	INIT_LIST_HEAD(&gt->pm_unpark_work_list);
> > +
> >   	INIT_LIST_HEAD(&gt->closed_vma);
> >   	spin_lock_init(&gt->closed_lock);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > index dea8e2479897..564c11a3748b 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > @@ -90,6 +90,13 @@ static int __gt_unpark(struct intel_wakeref *wf)
> >   	return 0;
> >   }
> > +static void __gt_unpark_work_queue(struct intel_wakeref *wf)
> > +{
> > +	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> > +
> > +	intel_gt_pm_unpark_work_queue(gt);
> > +}
> > +
> >   static int __gt_park(struct intel_wakeref *wf)
> >   {
> >   	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> > @@ -118,6 +125,7 @@ static int __gt_park(struct intel_wakeref *wf)
> >   static const struct intel_wakeref_ops wf_ops = {
> >   	.get = __gt_unpark,
> > +	.post_get = __gt_unpark_work_queue,
> >   	.put = __gt_park,
> >   };
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> > new file mode 100644
> > index 000000000000..23162dbd0c35
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> > @@ -0,0 +1,35 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2021 Intel Corporation
> > + */
> > +
> > +#include "i915_drv.h"
> > +#include "intel_runtime_pm.h"
> > +#include "intel_gt_pm.h"
> > +
> > +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt)
> > +{
> > +	struct intel_gt_pm_unpark_work *work, *next;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> > +	list_for_each_entry_safe(work, next,
> > +				 &gt->pm_unpark_work_list, link) {
> > +		list_del_init(&work->link);
> > +		queue_work(system_unbound_wq, &work->worker);
> > +	}
> > +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> > +}
> > +
> > +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> > +				 struct intel_gt_pm_unpark_work *work)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> > +	if (intel_gt_pm_is_awake(gt))
> > +		queue_work(system_unbound_wq, &work->worker);
> > +	else if (list_empty(&work->link))
> 
> What's the list_empty check for, something can race by design?
> 

This function is allowed to be called twice, e.g. two contexts can be in
the process of being deregistered while the GT PM is idle, but only the
first call results in the worker being added to the list of workers to
be kicked on unpark.
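
Sketch of the semantics with two racing callers while the GT is parked,
both serialized on gt->pm_unpark_work_lock:

/*
 * CPU0: intel_gt_pm_unpark_work_add(gt, w)
 *         GT parked, list_empty(&w->link) -> w added to pm_unpark_work_list
 * CPU1: intel_gt_pm_unpark_work_add(gt, w)
 *         GT parked, !list_empty(&w->link) -> no-op, w already on the list
 *
 * Next unpark: w->worker is queued exactly once.
 */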

> > +		list_add_tail(&work->link, &gt->pm_unpark_work_list);
> > +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> > +}
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> > new file mode 100644
> > index 000000000000..eaf1dc313aa2
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> > @@ -0,0 +1,40 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2021 Intel Corporation
> > + */
> > +
> > +#ifndef INTEL_GT_PM_UNPARK_WORK_H
> > +#define INTEL_GT_PM_UNPARK_WORK_H
> > +
> > +#include <linux/list.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct intel_gt;
> > +
> > +/**
> > + * struct intel_gt_pm_unpark_work - work to be scheduled when GT unparked
> > + */
> > +struct intel_gt_pm_unpark_work {
> > +	/**
> > +	 * @link: link into gt->pm_unpark_work_list of workers that need to be
> > +	 * scheduled when GT is unparked, protected by gt->pm_unpark_work_lock
> > +	 */
> > +	struct list_head link;
> > +	/** @worker: will be scheduled when GT unparked */
> > +	struct work_struct worker;
> > +};
> > +
> > +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt);
> > +
> > +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> > +				 struct intel_gt_pm_unpark_work *work);
> > +
> > +static inline void
> > +intel_gt_pm_unpark_work_init(struct intel_gt_pm_unpark_work *work,
> > +			     work_func_t fn)
> > +{
> > +	INIT_LIST_HEAD(&work->link);
> > +	INIT_WORK(&work->worker, fn);
> > +}
> > +
> > +#endif /* INTEL_GT_PM_UNPARK_WORK_H */
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> > index a81e21bf1bd1..4480312f0add 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> > @@ -96,6 +96,16 @@ struct intel_gt {
> >   	struct intel_wakeref wakeref;
> >   	atomic_t user_wakeref;
> > +	/**
> > +	 * @pm_unpark_work_list: list of delayed work to be scheduled when GT is
> > +	 * unparked, protected by pm_unpark_work_lock
> > +	 */
> > +	struct list_head pm_unpark_work_list;
> > +	/**
> > +	 * @pm_unpark_work_lock: protects pm_unpark_work_list
> > +	 */
> > +	spinlock_t pm_unpark_work_lock;
> > +
> >   	struct list_head closed_vma;
> >   	spinlock_t closed_lock; /* guards the list of closed_vma */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 7358883f1540..023953e77553 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -19,6 +19,7 @@
> >   #include "intel_uc_fw.h"
> >   #include "i915_utils.h"
> >   #include "i915_vma.h"
> > +#include "gt/intel_gt_pm_unpark_work.h"
> >   struct __guc_ads_blob;
> > @@ -78,11 +79,12 @@ struct intel_guc {
> >   		 */
> >   		struct list_head destroyed_contexts;
> >   		/**
> > -		 * @destroyed_worker: worker to deregister contexts, need as we
> > +		 * @destroyed_worker: Worker to deregister contexts, need as we
> >   		 * need to take a GT PM reference and can't from destroy
> > -		 * function as it might be in an atomic context (no sleeping)
> > +		 * function as it might be in an atomic context (no sleeping).
> > +		 * Worker only issues deregister when GT is unparked.
> >   		 */
> > -		struct work_struct destroyed_worker;
> > +		struct intel_gt_pm_unpark_work destroyed_worker;
> >   	} submission_state;
> >   	bool submission_supported;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index f835e06e5f9f..dbf919801de2 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1135,7 +1135,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> >   	ida_init(&guc->submission_state.guc_ids);
> >   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> > -	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
> > +	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
> > +				     destroyed_worker_func);
> >   	return 0;
> >   }
> > @@ -1942,13 +1943,18 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
> >   static void destroyed_worker_func(struct work_struct *w)
> >   {
> > -	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +	struct intel_gt_pm_unpark_work *destroyed_worker =
> > +		container_of(w, struct intel_gt_pm_unpark_work, worker);
> > +	struct intel_guc *guc = container_of(destroyed_worker, struct intel_guc,
> >   					     submission_state.destroyed_worker);
> >   	struct intel_gt *gt = guc_to_gt(guc);
> >   	int tmp;
> > -	with_intel_gt_pm(gt, tmp)
> > +	with_intel_gt_pm_if_awake(gt, tmp)
> >   		deregister_destroyed_contexts(guc);
> > +
> > +	if (!list_empty(&guc->submission_state.destroyed_contexts))
> > +		intel_gt_pm_unpark_work_add(gt, destroyed_worker);
> 
> This is the worker itself, right?
> 

Yes.

> There's an "if awake" here followed by another "if awake" inside
> intel_gt_pm_unpark_work_add which raises questions.
>

Even in the worker we only deregister the context if the GT is awake.
 
> Second question is what's the list_empty for - why is the state of the list
> itself relevant to a single worker deciding whether to re-add itself to it
> or not? And is there a lock protecting this list?
>

Yes we have locking - pm_unpark_work_lock. It basically allows 2
threads to call intel_gt_pm_unpark_work_add while the GT is idle - in
that case only the first call adds the worker to the list of workers to
kick when unparked (same explanation as above).
 
> On the whole it feels questionable to have unpark work which apparently
> can race with subsequent parking. Presumably you cannot have it run sync on
> unpark due to execution context issues?
> 

No race, just allowed to be called twice.

> >   }
> >   static void guc_context_destroy(struct kref *kref)
> > @@ -1985,7 +1991,8 @@ static void guc_context_destroy(struct kref *kref)
> >   	 * take the GT PM for the first time which isn't allowed from an atomic
> >   	 * context.
> >   	 */
> > -	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
> > +	intel_gt_pm_unpark_work_add(guc_to_gt(guc),
> > +				    &guc->submission_state.destroyed_worker);
> >   }
> >   static int guc_context_alloc(struct intel_context *ce)
> > diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
> > index dfd87d082218..282fc4f312e3 100644
> > --- a/drivers/gpu/drm/i915/intel_wakeref.c
> > +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> > @@ -24,6 +24,8 @@ static void rpm_put(struct intel_wakeref *wf)
> >   int __intel_wakeref_get_first(struct intel_wakeref *wf)
> >   {
> > +	bool do_post = false;
> > +
> >   	/*
> >   	 * Treat get/put as different subclasses, as we may need to run
> >   	 * the put callback from under the shrinker and do not want to
> > @@ -44,8 +46,11 @@ int __intel_wakeref_get_first(struct intel_wakeref *wf)
> >   		}
> >   		smp_mb__before_atomic(); /* release wf->count */
> > +		do_post = true;
> >   	}
> >   	atomic_inc(&wf->count);
> > +	if (do_post && wf->ops->post_get)
> > +		wf->ops->post_get(wf);
> 
> You want this hook under the wf->mutex - and why?
> 

I didn't really think about this but everything else is under the mutex
so I included this under the mutex too. In this case the post_get op
could likely be outside the mutex but I think it is harmless to keep it
under the mutex. For future safety, I think it should stay there.
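
For reference, the first-get path in __intel_wakeref_get_first() with
this patch applied is roughly (error handling elided):

	mutex_lock(&wf->mutex);
	wf->ops->get(wf);		/* unpark */
	atomic_inc(&wf->count);
	wf->ops->post_get(wf);		/* still under wf->mutex */
	mutex_unlock(&wf->mutex);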

Matt

> Regards,
> 
> Tvrtko
> 
> >   	mutex_unlock(&wf->mutex);
> >   	INTEL_WAKEREF_BUG_ON(atomic_read(&wf->count) <= 0);
> > diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> > index 545c8f277c46..ef7e6a698e8a 100644
> > --- a/drivers/gpu/drm/i915/intel_wakeref.h
> > +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> > @@ -30,6 +30,7 @@ typedef depot_stack_handle_t intel_wakeref_t;
> >   struct intel_wakeref_ops {
> >   	int (*get)(struct intel_wakeref *wf);
> > +	void (*post_get)(struct intel_wakeref *wf);
> >   	int (*put)(struct intel_wakeref *wf);
> >   };
> > 


* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-09-10 11:25   ` Tvrtko Ursulin
@ 2021-09-10 20:49     ` Matthew Brost
  2021-09-13 10:52       ` Tvrtko Ursulin
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-10 20:49 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Fri, Sep 10, 2021 at 12:25:43PM +0100, Tvrtko Ursulin wrote:
> 
> On 20/08/2021 23:44, Matthew Brost wrote:
> > For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
> > mid BB. To safely enable preemption at the BB boundary, a handshake
> > between the parent and child is needed. This is implemented via custom
> > emit_bb_start & emit_fini_breadcrumb functions and enabled by default
> > if a context is configured for parallel submission via the set parallel
> > extension.
> 
> FWIW I think it's wrong to hardcode the requirements of a particular
> hardware generation's fixed media pipeline into the uapi. IMO the better
> solution was when the concept of parallel submission was decoupled from the
> no preemption mid batch preambles. Otherwise might as well call the extension
> I915_CONTEXT_ENGINES_EXT_MEDIA_SPLIT_FRAME_SUBMIT or something.
> 

I don't disagree but this is where we landed per Daniel Vetter's feedback -
default to what our current hardware supports and extend it later for
newer hardware / requirements as needed.

Matt

> Regards,
> 
> Tvrtko
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 283 +++++++++++++++++-
> >   4 files changed, 287 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 5615be32879c..2de62649e275 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -561,7 +561,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> >   	GEM_BUG_ON(intel_context_is_child(child));
> >   	GEM_BUG_ON(intel_context_is_parent(child));
> > -	parent->guc_number_children++;
> > +	child->guc_child_index = parent->guc_number_children++;
> >   	list_add_tail(&child->guc_child_link,
> >   		      &parent->guc_child_list);
> >   	child->parent = parent;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 713d85b0b364..727f91e7f7c2 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -246,6 +246,9 @@ struct intel_context {
> >   		/** @guc_number_children: number of children if parent */
> >   		u8 guc_number_children;
> > +		/** @guc_child_index: index into guc_child_list if child */
> > +		u8 guc_child_index;
> > +
> >   		/**
> >   		 * @parent_page: page in context used by parent for work queue,
> >   		 * work queue descriptor
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > index 6cd26dc060d1..9f61cfa5566a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > @@ -188,7 +188,7 @@ struct guc_process_desc {
> >   	u32 wq_status;
> >   	u32 engine_presence;
> >   	u32 priority;
> > -	u32 reserved[30];
> > +	u32 reserved[36];
> >   } __packed;
> >   #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 91330525330d..1a18f99bf12a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -11,6 +11,7 @@
> >   #include "gt/intel_context.h"
> >   #include "gt/intel_engine_pm.h"
> >   #include "gt/intel_engine_heartbeat.h"
> > +#include "gt/intel_gpu_commands.h"
> >   #include "gt/intel_gt.h"
> >   #include "gt/intel_gt_irq.h"
> >   #include "gt/intel_gt_pm.h"
> > @@ -366,10 +367,14 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
> >   /*
> >    * When using multi-lrc submission an extra page in the context state is
> > - * reserved for the process descriptor and work queue.
> > + * reserved for the process descriptor, work queue, and preempt BB boundary
> > + * handshake between the parent + children contexts.
> >    *
> >    * The layout of this page is below:
> >    * 0						guc_process_desc
> > + * + sizeof(struct guc_process_desc)		child go
> > + * + CACHELINE_BYTES				child join ...
> > + * + CACHELINE_BYTES ...
> >    * ...						unused
> >    * PAGE_SIZE / 2				work queue start
> >    * ...						work queue
> > @@ -1785,6 +1790,30 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> >   	return __guc_action_deregister_context(guc, guc_id, loop);
> >   }
> > +static inline void clear_children_join_go_memory(struct intel_context *ce)
> > +{
> > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > +	u8 i;
> > +
> > +	for (i = 0; i < ce->guc_number_children + 1; ++i)
> > +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
> > +}
> > +
> > +static inline u32 get_children_go_value(struct intel_context *ce)
> > +{
> > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > +
> > +	return mem[0];
> > +}
> > +
> > +static inline u32 get_children_join_value(struct intel_context *ce,
> > +					  u8 child_index)
> > +{
> > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > +
> > +	return mem[(child_index + 1) * (CACHELINE_BYTES / sizeof(u32))];
> > +}
> > +
> >   static void guc_context_policy_init(struct intel_engine_cs *engine,
> >   				    struct guc_lrc_desc *desc)
> >   {
> > @@ -1867,6 +1896,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> >   			guc_context_policy_init(engine, desc);
> >   		}
> > +
> > +		clear_children_join_go_memory(ce);
> >   	}
> >   	/*
> > @@ -2943,6 +2974,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
> >   	.get_sibling = guc_virtual_get_sibling,
> >   };
> > +/*
> > + * The below override of the breadcrumbs is enabled when the user configures a
> > + * context for parallel submission (multi-lrc, parent-child).
> > + *
> > + * The overridden breadcrumbs implement an algorithm which allows the GuC to
> > + * safely preempt all the hw contexts configured for parallel submission
> > + * between each BB. The contract between the i915 and GuC is if the parent
> > + * context can be preempted, all the children can be preempted, and the GuC will
> > + * always try to preempt the parent before the children. A handshake between the
> > + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
> > + * creating a window to preempt between each set of BBs.
> > + */
> > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						     u64 offset, u32 len,
> > +						     const unsigned int flags);
> > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						    u64 offset, u32 len,
> > +						    const unsigned int flags);
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs);
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						u32 *cs);
> > +
> >   static struct intel_context *
> >   guc_create_parallel(struct intel_engine_cs **engines,
> >   		    unsigned int num_siblings,
> > @@ -2978,6 +3034,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
> >   		}
> >   	}
> > +	parent->engine->emit_bb_start =
> > +		emit_bb_start_parent_no_preempt_mid_batch;
> > +	parent->engine->emit_fini_breadcrumb =
> > +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
> > +	parent->engine->emit_fini_breadcrumb_dw =
> > +		12 + 4 * parent->guc_number_children;
> > +	for_each_child(parent, ce) {
> > +		ce->engine->emit_bb_start =
> > +			emit_bb_start_child_no_preempt_mid_batch;
> > +		ce->engine->emit_fini_breadcrumb =
> > +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
> > +		ce->engine->emit_fini_breadcrumb_dw = 16;
> > +	}
> > +
> >   	kfree(siblings);
> >   	return parent;
> > @@ -3362,6 +3432,204 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
> >   	guc->submission_selected = __guc_submission_selected(guc);
> >   }
> > +static inline u32 get_children_go_addr(struct intel_context *ce)
> > +{
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	return i915_ggtt_offset(ce->state) +
> > +		__get_process_desc_offset(ce) +
> > +		sizeof(struct guc_process_desc);
> > +}
> > +
> > +static inline u32 get_children_join_addr(struct intel_context *ce,
> > +					 u8 child_index)
> > +{
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
> > +}
> > +
> > +#define PARENT_GO_BB			1
> > +#define PARENT_GO_FINI_BREADCRUMB	0
> > +#define CHILD_GO_BB			1
> > +#define CHILD_GO_FINI_BREADCRUMB	0
> > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						     u64 offset, u32 len,
> > +						     const unsigned int flags)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u32 *cs;
> > +	u8 i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	cs = intel_ring_begin(rq, 10 + 4 * ce->guc_number_children);
> > +	if (IS_ERR(cs))
> > +		return PTR_ERR(cs);
> > +
> > +	/* Wait on children */
> > +	for (i = 0; i < ce->guc_number_children; ++i) {
> > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > +			 MI_SEMAPHORE_POLL |
> > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > +		*cs++ = PARENT_GO_BB;
> > +		*cs++ = get_children_join_addr(ce, i);
> > +		*cs++ = 0;
> > +	}
> > +
> > +	/* Turn off preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Tell children go */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  CHILD_GO_BB,
> > +				  get_children_go_addr(ce),
> > +				  0);
> > +
> > +	/* Jump to batch */
> > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > +	*cs++ = lower_32_bits(offset);
> > +	*cs++ = upper_32_bits(offset);
> > +	*cs++ = MI_NOOP;
> > +
> > +	intel_ring_advance(rq, cs);
> > +
> > +	return 0;
> > +}
> > +
> > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						    u64 offset, u32 len,
> > +						    const unsigned int flags)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u32 *cs;
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	cs = intel_ring_begin(rq, 12);
> > +	if (IS_ERR(cs))
> > +		return PTR_ERR(cs);
> > +
> > +	/* Signal parent */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  PARENT_GO_BB,
> > +				  get_children_join_addr(ce->parent,
> > +							 ce->guc_child_index),
> > +				  0);
> > +
> > +	/* Wait on parent for go */
> > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > +		 MI_SEMAPHORE_POLL |
> > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > +	*cs++ = CHILD_GO_BB;
> > +	*cs++ = get_children_go_addr(ce->parent);
> > +	*cs++ = 0;
> > +
> > +	/* Turn off preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > +
> > +	/* Jump to batch */
> > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > +	*cs++ = lower_32_bits(offset);
> > +	*cs++ = upper_32_bits(offset);
> > +
> > +	intel_ring_advance(rq, cs);
> > +
> > +	return 0;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u8 i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	/* Wait on children */
> > +	for (i = 0; i < ce->guc_number_children; ++i) {
> > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > +			 MI_SEMAPHORE_POLL |
> > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
> > +		*cs++ = get_children_join_addr(ce, i);
> > +		*cs++ = 0;
> > +	}
> > +
> > +	/* Turn on preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Tell children go */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  CHILD_GO_FINI_BREADCRUMB,
> > +				  get_children_go_addr(ce),
> > +				  0);
> > +
> > +	/* Emit fini breadcrumb */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  rq->fence.seqno,
> > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > +				  0);
> > +
> > +	/* User interrupt */
> > +	*cs++ = MI_USER_INTERRUPT;
> > +	*cs++ = MI_NOOP;
> > +
> > +	rq->tail = intel_ring_offset(rq, cs);
> > +
> > +	return cs;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	/* Turn on preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Signal parent */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  PARENT_GO_FINI_BREADCRUMB,
> > +				  get_children_join_addr(ce->parent,
> > +							 ce->guc_child_index),
> > +				  0);
> > +
> > +	/* Wait on parent for go */
> > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > +		 MI_SEMAPHORE_POLL |
> > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
> > +	*cs++ = get_children_go_addr(ce->parent);
> > +	*cs++ = 0;
> > +
> > +	/* Emit fini breadcrumb */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  rq->fence.seqno,
> > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > +				  0);
> > +
> > +	/* User interrupt */
> > +	*cs++ = MI_USER_INTERRUPT;
> > +	*cs++ = MI_NOOP;
> > +
> > +	rq->tail = intel_ring_offset(rq, cs);
> > +
> > +	return cs;
> > +}
> > +
> >   static struct intel_context *
> >   g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> >   {
> > @@ -3807,6 +4075,19 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   			drm_printf(p, "\t\tWQI Status: %u\n\n",
> >   				   READ_ONCE(desc->wq_status));
> > +			drm_printf(p, "\t\tNumber Children: %u\n\n",
> > +				   ce->guc_number_children);
> > +			if (ce->engine->emit_bb_start ==
> > +			    emit_bb_start_parent_no_preempt_mid_batch) {
> > +				u8 i;
> > +
> > +				drm_printf(p, "\t\tChildren Go: %u\n\n",
> > +					   get_children_go_value(ce));
> > +				for (i = 0; i < ce->guc_number_children; ++i)
> > +					drm_printf(p, "\t\tChildren Join: %u\n",
> > +						   get_children_join_value(ce, i));
> > +			}
> > +
> >   			for_each_child(ce, child)
> >   				guc_log_context(p, child);
> >   		}
> > 


* Re: [Intel-gfx] [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-09-10 19:49     ` Matthew Brost
@ 2021-09-13  9:24       ` Tvrtko Ursulin
  2021-09-13 16:50         ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-13  9:24 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu


On 10/09/2021 20:49, Matthew Brost wrote:
> On Fri, Sep 10, 2021 at 12:12:42PM +0100, Tvrtko Ursulin wrote:
>>
>> On 20/08/2021 23:44, Matthew Brost wrote:
>>> Add logical engine mapping. This is required for split-frame, as
>>> workloads need to be placed on engines in a logically contiguous manner.
>>>
>>> v2:
>>>    (Daniel Vetter)
>>>     - Add kernel doc for new fields
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
>>>    drivers/gpu/drm/i915/gt/intel_engine_types.h  |  5 ++
>>>    .../drm/i915/gt/intel_execlists_submission.c  |  1 +
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
>>>    5 files changed, 60 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>> index 0d9105a31d84..4d790f9a65dd 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>> @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
>>>    	GEM_DEBUG_WARN_ON(iir);
>>>    }
>>> -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
>>> +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
>>> +			      u8 logical_instance)
>>>    {
>>>    	const struct engine_info *info = &intel_engines[id];
>>>    	struct drm_i915_private *i915 = gt->i915;
>>> @@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
>>>    	engine->class = info->class;
>>>    	engine->instance = info->instance;
>>> +	engine->logical_mask = BIT(logical_instance);
>>>    	__sprint_engine_name(engine);
>>>    	engine->props.heartbeat_interval_ms =
>>> @@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
>>>    	return info->engine_mask;
>>>    }
>>> +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
>>> +				 u8 class, const u8 *map, u8 num_instances)
>>> +{
>>> +	int i, j;
>>> +	u8 current_logical_id = 0;
>>> +
>>> +	for (j = 0; j < num_instances; ++j) {
>>> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
>>> +			if (!HAS_ENGINE(gt, i) ||
>>> +			    intel_engines[i].class != class)
>>> +				continue;
>>> +
>>> +			if (intel_engines[i].instance == map[j]) {
>>> +				logical_ids[intel_engines[i].instance] =
>>> +					current_logical_id++;
>>> +				break;
>>> +			}
>>> +		}
>>> +	}
>>> +}
>>> +
>>> +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
>>> +{
>>> +	int i;
>>> +	u8 map[MAX_ENGINE_INSTANCE + 1];
>>> +
>>> +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
>>> +		map[i] = i;
>>
>> What's the point of the map array since it is 1:1 with instance?
>>
> 
> Future products do not have a 1 to 1 mapping and that mapping can change
> based on fusing, e.g. XeHP SDV.
> 
> Also technically ICL / TGL / ADL physical instance 2 maps to logical
> instance 1.

I don't follow the argument. All I can see is that "map[i] = i" always
holds in the proposed code, which is then used to check "instance ==
map[instance]". So I'd suggest removing this array from the code until
there is a need for it.
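
E.g. something like this, untested and assuming intel_engines[] lists
instances in ascending order within a class:

static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
{
	u8 current_logical_id = 0;
	int i;

	for (i = 0; i < ARRAY_SIZE(intel_engines); ++i)
		if (HAS_ENGINE(gt, i) && intel_engines[i].class == class)
			logical_ids[intel_engines[i].instance] =
				current_logical_id++;
}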

>>> +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
>>> +}
>>> +
>>>    /**
>>>     * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
>>>     * @gt: pointer to struct intel_gt
>>> @@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>>>    	struct drm_i915_private *i915 = gt->i915;
>>>    	const unsigned int engine_mask = init_engine_mask(gt);
>>>    	unsigned int mask = 0;
>>> -	unsigned int i;
>>> +	unsigned int i, class;
>>> +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
>>>    	int err;
>>>    	drm_WARN_ON(&i915->drm, engine_mask == 0);
>>> @@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>>>    	if (i915_inject_probe_failure(i915))
>>>    		return -ENODEV;
>>> -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
>>> -		if (!HAS_ENGINE(gt, i))
>>> -			continue;
>>> +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
>>> +		setup_logical_ids(gt, logical_ids, class);
>>> -		err = intel_engine_setup(gt, i);
>>> -		if (err)
>>> -			goto cleanup;
>>> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
>>> +			u8 instance = intel_engines[i].instance;
>>> +
>>> +			if (intel_engines[i].class != class ||
>>> +			    !HAS_ENGINE(gt, i))
>>> +				continue;
>>> -		mask |= BIT(i);
>>> +			err = intel_engine_setup(gt, i,
>>> +						 logical_ids[instance]);
>>> +			if (err)
>>> +				goto cleanup;
>>> +
>>> +			mask |= BIT(i);
>>
>> I still think there is a less clunky way to set this up, in less code and
>> more readable at the same time. Like do it in two passes so you can iterate
>> the gt->engine_class[] array instead of having to implement a skip condition
>> (both on class and HAS_ENGINE in two places) and also avoid walking the flat
>> intel_engines array repeatedly.
>>
> 
> Kind of a bikeshed, arguing about a pretty simple loop structure, don't you
> think? I personally like the way it is laid out.
> 
> Pseudo code for your suggestion?

Leave the existing setup loop as is and add an additional "for engine
class" walk after it. That way you can walk the already set up
gt->engine_class[] array, so you wouldn't need to skip wrong classes or
have HAS_ENGINE checks when walking the flat intel_engines[] array
several times. The same applies to the helper which counts logical
instances per class.
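
Pseudo code, roughly:

	/* First pass unchanged from today, no logical ids yet. */
	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
		if (!HAS_ENGINE(gt, i))
			continue;

		err = intel_engine_setup(gt, i);
		if (err)
			goto cleanup;

		mask |= BIT(i);
	}

	/* Second pass walks what was just set up, per class. */
	for (class = 0; class <= MAX_ENGINE_CLASS; class++) {
		u8 logical = 0;

		for (i = 0; i <= MAX_ENGINE_INSTANCE; i++) {
			struct intel_engine_cs *engine =
				gt->engine_class[class][i];

			if (engine)
				engine->logical_mask = BIT(logical++);
		}
	}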

>   
>>> +		}
>>>    	}
>>>    	/*
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>> index ed91bcff20eb..fddf35546b58 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>> @@ -266,6 +266,11 @@ struct intel_engine_cs {
>>>    	unsigned int guc_id;
>>>    	intel_engine_mask_t mask;
>>> +	/**
>>> +	 * @logical_mask: logical mask of engine, reported to user space via
>>> +	 * query IOCTL and used to communicate with the GuC in logical space
>>> +	 */
>>> +	intel_engine_mask_t logical_mask;
>>
>> You could prefix the new field with uabi_ to match the existing scheme and
>> to signify to anyone who might be touching it in the future it should not be
>> changed.
> 
> This is kinda uabi, but also kinda isn't. We do report a logical
> instance via IOCTL but it also really is tied to the GuC backend as we
> can only communicate with the GuC in logical space. IMO we should leave
> it as is.

Perhaps it would be best to call the new field uabi_logical_instance so 
it's clear it is reported in the query directly and do the BIT() 
transformation in the GuC backend?
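
I.e. something like this (sketch, field and query plumbing names
approximate):

	/* intel_engine_types.h */
	u8 uabi_logical_instance;

	/* engine info query */
	info.logical_instance = engine->uabi_logical_instance;

	/* guc_lrc_desc_pin() */
	desc->engine_submit_mask = BIT(engine->uabi_logical_instance);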

> 
>>
>> Also, I think comment should explain what is logical space ie. how the
>> numbering works.
>>
> 
> Don't I already do that? I suppose I could add something like:

Where is it? Can't see it in the uapi kerneldoc AFAICS (for the new 
query) or here.

> 
> The logical mask within engine class must be contiguous across all
> instances.

Best not to introduce the mask here for the first time. Just explain
what logical numbering is, in terms of how engines are enumerated in
order of physical instances while skipping the fused off ones. The
kerneldoc for the new query is, I think, the right place.
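
Something along these lines perhaps:

	/*
	 * Logical instances are assigned within an engine class by walking
	 * the physical instances in ascending order and skipping any which
	 * are fused off, so the logical numbering is contiguous and starts
	 * from zero.
	 */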

>> Not sure the part about GuC needs to be in the comment since uapi is
>> supposed to be backend agnostic.
>>
> 
> Communicating with the GuC in logical space is a pretty key point here.
> The communication itself is backend specific, but how our hardware works
> (e.g. split frame workloads must be placed on logically contiguous
> instances) is not. Mentioning the GuC requirement here makes sense to me
> for completeness.

Yeah, might be - I was thinking more about the new query. The query
definitely is backend agnostic, but yes, it is fine to say in the
comment here that the new field is used both for the query and for
communicating with the GuC.

Regards,

Tvrtko

> 
> Matt
> 
>> Regards,
>>
>> Tvrtko
>>
>>>    	u8 class;
>>>    	u8 instance;
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>> index cafb0608ffb4..813a6de01382 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>> @@ -3875,6 +3875,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>>>    		ve->siblings[ve->num_siblings++] = sibling;
>>>    		ve->base.mask |= sibling->mask;
>>> +		ve->base.logical_mask |= sibling->logical_mask;
>>>    		/*
>>>    		 * All physical engines must be compatible for their emission
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>> index 6926919bcac6..9f5f43a16182 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>> @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
>>>    	for_each_engine(engine, gt, id) {
>>>    		u8 guc_class = engine_class_to_guc_class(engine->class);
>>> -		system_info->mapping_table[guc_class][engine->instance] =
>>> +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
>>>    			engine->instance;
>>>    	}
>>>    }
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index e0eed70f9b92..ffafbac7335e 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -1401,23 +1401,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
>>>    	return __guc_action_deregister_context(guc, guc_id, loop);
>>>    }
>>> -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
>>> -{
>>> -	switch (class) {
>>> -	case RENDER_CLASS:
>>> -		return mask >> RCS0;
>>> -	case VIDEO_ENHANCEMENT_CLASS:
>>> -		return mask >> VECS0;
>>> -	case VIDEO_DECODE_CLASS:
>>> -		return mask >> VCS0;
>>> -	case COPY_ENGINE_CLASS:
>>> -		return mask >> BCS0;
>>> -	default:
>>> -		MISSING_CASE(class);
>>> -		return 0;
>>> -	}
>>> -}
>>> -
>>>    static void guc_context_policy_init(struct intel_engine_cs *engine,
>>>    				    struct guc_lrc_desc *desc)
>>>    {
>>> @@ -1459,8 +1442,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>>>    	desc = __get_lrc_desc(guc, desc_idx);
>>>    	desc->engine_class = engine_class_to_guc_class(engine->class);
>>> -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
>>> -						      engine->mask);
>>> +	desc->engine_submit_mask = engine->logical_mask;
>>>    	desc->hw_context_desc = ce->lrc.lrca;
>>>    	desc->priority = ce->guc_state.prio;
>>>    	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>>> @@ -3260,6 +3242,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>>>    		}
>>>    		ve->base.mask |= sibling->mask;
>>> +		ve->base.logical_mask |= sibling->logical_mask;
>>>    		if (n != 0 && ve->base.class != sibling->class) {
>>>    			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
>>>


* Re: [Intel-gfx] [PATCH 04/27] drm/i915/guc: Take GT PM ref when deregistering context
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
@ 2021-09-13  9:55   ` Tvrtko Ursulin
  2021-09-13 17:12     ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-13  9:55 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu


On 20/08/2021 23:44, Matthew Brost wrote:
> Taking a PM reference to prevent intel_gt_wait_for_idle from short
> circuiting while a deregister context H2G is in flight.
> 
> FIXME: Move locking / structure changes into different patch
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  13 +-
>   drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
>   drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  13 ++
>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  46 ++--
>   .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  13 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 212 +++++++++++-------
>   8 files changed, 199 insertions(+), 106 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index adfe49b53b1b..c8595da64ad8 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   	ce->guc_id.id = GUC_INVALID_LRC_ID;
>   	INIT_LIST_HEAD(&ce->guc_id.link);
>   
> +	INIT_LIST_HEAD(&ce->destroyed_link);
> +
>   	/*
>   	 * Initialize fence to be complete as this is expected to be complete
>   	 * unless there is a pending schedule disable outstanding.
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 80bbdc7810f6..fd338a30617e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -190,22 +190,29 @@ struct intel_context {
>   		/**
>   		 * @id: unique handle which is used to communicate information
>   		 * with the GuC about this context, protected by
> -		 * guc->contexts_lock
> +		 * guc->submission_state.lock
>   		 */
>   		u16 id;
>   		/**
>   		 * @ref: the number of references to the guc_id, when
>   		 * transitioning in and out of zero protected by
> -		 * guc->contexts_lock
> +		 * guc->submission_state.lock
>   		 */
>   		atomic_t ref;
>   		/**
>   		 * @link: in guc->guc_id_list when the guc_id has no refs but is
> -		 * still valid, protected by guc->contexts_lock
> +		 * still valid, protected by guc->submission_state.lock
>   		 */
>   		struct list_head link;
>   	} guc_id;
>   
> +	/**
> +	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
> +	 * list when context is pending to be destroyed (deregistered with the
> +	 * GuC), protected by guc->submission_state.lock
> +	 */
> +	struct list_head destroyed_link;
> +
>   #ifdef CONFIG_DRM_I915_SELFTEST
>   	/**
>   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> index 70ea46d6cfb0..17a5028ea177 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> @@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
>   	return intel_wakeref_is_active(&engine->wakeref);
>   }
>   
> +static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
> +{
> +	__intel_wakeref_get(&engine->wakeref);
> +}
> +
>   static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
>   {
>   	intel_wakeref_get(&engine->wakeref);
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> index d0588d8aaa44..a17bf0d4592b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> @@ -41,6 +41,19 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
>   	intel_wakeref_put_async(&gt->wakeref);
>   }
>   
> +#define with_intel_gt_pm(gt, tmp) \
> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> +	     intel_gt_pm_put(gt), tmp = 0)
> +#define with_intel_gt_pm_async(gt, tmp) \
> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> +	     intel_gt_pm_put_async(gt), tmp = 0)
> +#define with_intel_gt_pm_if_awake(gt, tmp) \
> +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
> +	     intel_gt_pm_put(gt), tmp = 0)
> +#define with_intel_gt_pm_if_awake_async(gt, tmp) \
> +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
> +	     intel_gt_pm_put_async(gt), tmp = 0)
> +
>   static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
>   {
>   	return intel_wakeref_wait_for_idle(&gt->wakeref);
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index 8ff582222aff..ba10bd374cee 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -142,6 +142,7 @@ enum intel_guc_action {
>   	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
>   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
> +	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
>   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>   	INTEL_GUC_ACTION_LIMIT
>   };
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 6fd2719d1b75..7358883f1540 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -53,21 +53,37 @@ struct intel_guc {
>   		void (*disable)(struct intel_guc *guc);
>   	} interrupts;
>   
> -	/**
> -	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
> -	 * ce->guc_id.ref when transitioning in and out of zero
> -	 */
> -	spinlock_t contexts_lock;
> -	/** @guc_ids: used to allocate new guc_ids */
> -	struct ida guc_ids;
> -	/** @num_guc_ids: number of guc_ids that can be used */
> -	u32 num_guc_ids;
> -	/** @max_guc_ids: max number of guc_ids that can be used */
> -	u32 max_guc_ids;
> -	/**
> -	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> -	 */
> -	struct list_head guc_id_list;
> +	struct {
> +		/**
> +		 * @lock: protects everything in submission_state, ce->guc_id,
> +		 * and ce->destroyed_link
> +		 */
> +		spinlock_t lock;
> +		/**
> +		 * @guc_ids: used to allocate new guc_ids
> +		 */
> +		struct ida guc_ids;
> +		/** @num_guc_ids: number of guc_ids that can be used */
> +		u32 num_guc_ids;
> +		/** @max_guc_ids: max number of guc_ids that can be used */
> +		u32 max_guc_ids;
> +		/**
> +		 * @guc_id_list: list of intel_context with valid guc_ids but no
> +		 * refs
> +		 */
> +		struct list_head guc_id_list;
> +		/**
> +		 * @destroyed_contexts: list of contexts waiting to be destroyed
> +		 * (deregistered with the GuC)
> +		 */
> +		struct list_head destroyed_contexts;
> +		/**
> +		 * @destroyed_worker: worker to deregister contexts, need as we
> +		 * need to take a GT PM reference and can't from destroy
> +		 * function as it might be in an atomic context (no sleeping)
> +		 */
> +		struct work_struct destroyed_worker;
> +	} submission_state;
>   
>   	bool submission_supported;
>   	bool submission_selected;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> index b88d343ee432..27655460ee84 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> @@ -78,7 +78,7 @@ static int guc_num_id_get(void *data, u64 *val)
>   	if (!intel_guc_submission_is_used(guc))
>   		return -ENODEV;
>   
> -	*val = guc->num_guc_ids;
> +	*val = guc->submission_state.num_guc_ids;
>   
>   	return 0;
>   }
> @@ -86,16 +86,21 @@ static int guc_num_id_get(void *data, u64 *val)
>   static int guc_num_id_set(void *data, u64 val)
>   {
>   	struct intel_guc *guc = data;
> +	unsigned long flags;
>   
>   	if (!intel_guc_submission_is_used(guc))
>   		return -ENODEV;
>   
> -	if (val > guc->max_guc_ids)
> -		val = guc->max_guc_ids;
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +
> +	if (val > guc->submission_state.max_guc_ids)
> +		val = guc->submission_state.max_guc_ids;
>   	else if (val < 256)
>   		val = 256;
>   
> -	guc->num_guc_ids = val;
> +	guc->submission_state.num_guc_ids = val;
> +
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 68742b612692..f835e06e5f9f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -86,9 +86,9 @@
>    * submitting at a time. Currently only 1 sched_engine used for all of GuC
>    * submission but that could change in the future.
>    *
> - * guc->contexts_lock
> - * Protects guc_id allocation. Global lock i.e. Only 1 context that uses GuC
> - * submission can hold this at a time.
> + * guc->submission_state.lock
> + * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
> + * list.
>    *
>    * ce->guc_state.lock
>    * Protects everything under ce->guc_state. Ensures that a context is in the
> @@ -100,7 +100,7 @@
>    *
>    * Lock ordering rules:
>    * sched_engine->lock -> ce->guc_state.lock
> - * guc->contexts_lock -> ce->guc_state.lock
> + * guc->submission_state.lock -> ce->guc_state.lock
>    *
>    * Reset races:
>    * When a GPU full reset is triggered it is assumed that some G2H responses to
> @@ -344,7 +344,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>   {
>   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
>   
> -	GEM_BUG_ON(index >= guc->max_guc_ids);
> +	GEM_BUG_ON(index >= guc->submission_state.max_guc_ids);
>   
>   	return &base[index];
>   }
> @@ -353,7 +353,7 @@ static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
>   {
>   	struct intel_context *ce = xa_load(&guc->context_lookup, id);
>   
> -	GEM_BUG_ON(id >= guc->max_guc_ids);
> +	GEM_BUG_ON(id >= guc->submission_state.max_guc_ids);
>   
>   	return ce;
>   }
> @@ -363,7 +363,8 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
>   	u32 size;
>   	int ret;
>   
> -	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
> +	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
> +			  guc->submission_state.max_guc_ids);
>   	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
>   					     (void **)&guc->lrc_desc_pool_vaddr);
>   	if (ret)
> @@ -711,6 +712,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>   			if (deregister)
>   				guc_signal_context_fence(ce);
>   			if (destroyed) {
> +				intel_gt_pm_put_async(guc_to_gt(guc));
>   				release_guc_id(guc, ce);
>   				__guc_context_destroy(ce);
>   			}
> @@ -789,6 +791,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> +static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> +
>   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   {
>   	if (unlikely(!guc_submission_initialized(guc))) {
> @@ -805,6 +809,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
>   
>   	flush_work(&guc->ct.requests.worker);
> +	guc_flush_destroyed_contexts(guc);
>   
>   	scrub_guc_desc_for_outstanding_g2h(guc);
>   }
> @@ -1102,6 +1107,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>   }
>   
> +static void destroyed_worker_func(struct work_struct *w);
> +
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
>    * at firmware loading time.
> @@ -1124,9 +1131,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   
>   	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
>   
> -	spin_lock_init(&guc->contexts_lock);
> -	INIT_LIST_HEAD(&guc->guc_id_list);
> -	ida_init(&guc->guc_ids);
> +	spin_lock_init(&guc->submission_state.lock);
> +	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> +	ida_init(&guc->submission_state.guc_ids);
> +	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> +	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
>   
>   	return 0;
>   }
> @@ -1137,6 +1146,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>   		return;
>   
>   	guc_lrc_desc_pool_destroy(guc);
> +	guc_flush_destroyed_contexts(guc);
>   	i915_sched_engine_put(guc->sched_engine);
>   }
>   
> @@ -1191,15 +1201,16 @@ static void guc_submit_request(struct i915_request *rq)
>   
>   static int new_guc_id(struct intel_guc *guc)
>   {
> -	return ida_simple_get(&guc->guc_ids, 0,
> -			      guc->num_guc_ids, GFP_KERNEL |
> +	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> +			      guc->submission_state.num_guc_ids, GFP_KERNEL |
>   			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>   }
>   
>   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	if (!context_guc_id_invalid(ce)) {
> -		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
> +		ida_simple_remove(&guc->submission_state.guc_ids,
> +				  ce->guc_id.id);
>   		reset_lrc_desc(guc, ce->guc_id.id);
>   		set_context_guc_id_invalid(ce);
>   	}
> @@ -1211,9 +1222,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	unsigned long flags;
>   
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   	__release_guc_id(guc, ce);
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
>   static int steal_guc_id(struct intel_guc *guc)
> @@ -1221,10 +1232,10 @@ static int steal_guc_id(struct intel_guc *guc)
>   	struct intel_context *ce;
>   	int guc_id;
>   
> -	lockdep_assert_held(&guc->contexts_lock);
> +	lockdep_assert_held(&guc->submission_state.lock);
>   
> -	if (!list_empty(&guc->guc_id_list)) {
> -		ce = list_first_entry(&guc->guc_id_list,
> +	if (!list_empty(&guc->submission_state.guc_id_list)) {
> +		ce = list_first_entry(&guc->submission_state.guc_id_list,
>   				      struct intel_context,
>   				      guc_id.link);
>   
> @@ -1249,7 +1260,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
>   {
>   	int ret;
>   
> -	lockdep_assert_held(&guc->contexts_lock);
> +	lockdep_assert_held(&guc->submission_state.lock);
>   
>   	ret = new_guc_id(guc);
>   	if (unlikely(ret < 0)) {
> @@ -1271,7 +1282,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
>   
>   try_again:
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   
>   	might_lock(&ce->guc_state.lock);
>   
> @@ -1286,7 +1297,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	atomic_inc(&ce->guc_id.ref);
>   
>   out_unlock:
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
>   	/*
>   	 * -EAGAIN indicates no guc_id are available, let's retire any
> @@ -1322,11 +1333,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	if (unlikely(context_guc_id_invalid(ce)))
>   		return;
>   
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>   	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
>   	    !atomic_read(&ce->guc_id.ref))
> -		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> +		list_add_tail(&ce->guc_id.link,
> +			      &guc->submission_state.guc_id_list);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
>   static int __guc_action_register_context(struct intel_guc *guc,
> @@ -1841,11 +1853,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   static void guc_lrc_desc_unpin(struct intel_context *ce)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	unsigned long flags;
> +	bool disabled;
>   
> +	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
>   	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
>   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
>   	GEM_BUG_ON(context_enabled(ce));
>   
> +	/* Seal race with Reset */
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	disabled = submission_disabled(guc);
> +	if (likely(!disabled)) {
> +		__intel_gt_pm_get(gt);
> +		set_context_destroyed(ce);
> +		clr_context_registered(ce);
> +	}
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	if (unlikely(disabled)) {
> +		release_guc_id(guc, ce);
> +		__guc_context_destroy(ce);
> +		return;
> +	}
> +
>   	deregister_context(ce, ce->guc_id.id, true);
>   }
>   
> @@ -1873,78 +1904,88 @@ static void __guc_context_destroy(struct intel_context *ce)
>   	}
>   }
>   
> +static void guc_flush_destroyed_contexts(struct intel_guc *guc)
> +{
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	GEM_BUG_ON(!submission_disabled(guc) &&
> +		   guc_submission_initialized(guc));
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	list_for_each_entry_safe(ce, cn,
> +				 &guc->submission_state.destroyed_contexts,
> +				 destroyed_link) {
> +		list_del_init(&ce->destroyed_link);
> +		__release_guc_id(guc, ce);
> +		__guc_context_destroy(ce);
> +	}
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +}
> +
> +static void deregister_destroyed_contexts(struct intel_guc *guc)
> +{
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	list_for_each_entry_safe(ce, cn,
> +				 &guc->submission_state.destroyed_contexts,
> +				 destroyed_link) {
> +		list_del_init(&ce->destroyed_link);
> +		spin_unlock_irqrestore(&guc->submission_state.lock, flags);

Lock drop and re-acquire is not safe - list_for_each_entry_safe only 
deals with the list_del_init above. It does not help if someone else 
modifies the list in the window while the lock is not held, in which 
case things will go BOOM!

Not sure in what different ways you use this list, but perhaps you could 
employ a lockless singly linked list for this, in which case you could 
atomically unlink all elements. Or just change this to not do lock 
dropping, if you can safely use the same list_head to put the contexts 
on a local list or something.
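
For instance, a completely untested sketch of the local list variant, 
re-using destroyed_link and the functions from this patch:

static void deregister_destroyed_contexts(struct intel_guc *guc)
{
	struct intel_context *ce, *cn;
	unsigned long flags;
	LIST_HEAD(destroyed);

	/* Atomically unlink everything pending while holding the lock... */
	spin_lock_irqsave(&guc->submission_state.lock, flags);
	list_splice_init(&guc->submission_state.destroyed_contexts,
			 &destroyed);
	spin_unlock_irqrestore(&guc->submission_state.lock, flags);

	/* ...then nothing can modify the local list behind our back. */
	list_for_each_entry_safe(ce, cn, &destroyed, destroyed_link) {
		list_del_init(&ce->destroyed_link);
		guc_lrc_desc_unpin(ce);
	}
}

The remaining question would be whether anything can re-add a context to 
the global list while the local one is being drained, which you would 
know better than me.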

Regards,

Tvrtko

> +		guc_lrc_desc_unpin(ce);
> +		spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	}
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +}
> +
> +static void destroyed_worker_func(struct work_struct *w)
> +{
> +	struct intel_guc *guc = container_of(w, struct intel_guc,
> +					     submission_state.destroyed_worker);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	int tmp;
> +
> +	with_intel_gt_pm(gt, tmp)
> +		deregister_destroyed_contexts(guc);
> +}
> +
>   static void guc_context_destroy(struct kref *kref)
>   {
>   	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> -	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
>   	struct intel_guc *guc = ce_to_guc(ce);
> -	intel_wakeref_t wakeref;
>   	unsigned long flags;
> -	bool disabled;
> +	bool destroy;
>   
>   	/*
>   	 * If the guc_id is invalid this context has been stolen and we can free
>   	 * it immediately. Also can be freed immediately if the context is not
>   	 * registered with the GuC or the GuC is in the middle of a reset.
>   	 */
> -	if (context_guc_id_invalid(ce)) {
> -		__guc_context_destroy(ce);
> -		return;
> -	} else if (submission_disabled(guc) ||
> -		   !lrc_desc_registered(guc, ce->guc_id.id)) {
> -		release_guc_id(guc, ce);
> -		__guc_context_destroy(ce);
> -		return;
> -	}
> -
> -	/*
> -	 * We have to acquire the context spinlock and check guc_id again, if it
> -	 * is valid it hasn't been stolen and needs to be deregistered. We
> -	 * delete this context from the list of unpinned guc_id available to
> -	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
> -	 * returns indicating this context has been deregistered the guc_id is
> -	 * returned to the pool of available guc_id.
> -	 */
> -	spin_lock_irqsave(&guc->contexts_lock, flags);
> -	if (context_guc_id_invalid(ce)) {
> -		spin_unlock_irqrestore(&guc->contexts_lock, flags);
> -		__guc_context_destroy(ce);
> -		return;
> -	}
> -
> -	if (!list_empty(&ce->guc_id.link))
> -		list_del_init(&ce->guc_id.link);
> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> -
> -	/* Seal race with Reset */
> -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> -	disabled = submission_disabled(guc);
> -	if (likely(!disabled)) {
> -		set_context_destroyed(ce);
> -		clr_context_registered(ce);
> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> +	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
> +		!lrc_desc_registered(guc, ce->guc_id.id);
> +	if (likely(!destroy)) {
> +		if (!list_empty(&ce->guc_id.link))
> +			list_del_init(&ce->guc_id.link);
> +		list_add_tail(&ce->destroyed_link,
> +			      &guc->submission_state.destroyed_contexts);
> +	} else {
> +		__release_guc_id(guc, ce);
>   	}
> -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> -	if (unlikely(disabled)) {
> -		release_guc_id(guc, ce);
> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +	if (unlikely(destroy)) {
>   		__guc_context_destroy(ce);
>   		return;
>   	}
>   
>   	/*
> -	 * We defer GuC context deregistration until the context is destroyed
> -	 * in order to save on CTBs. With this optimization ideally we only need
> -	 * 1 CTB to register the context during the first pin and 1 CTB to
> -	 * deregister the context when the context is destroyed. Without this
> -	 * optimization, a CTB would be needed every pin & unpin.
> -	 *
> -	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
> -	 * from context_free_worker when runtime wakeref is not held.
> -	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
> -	 * in H2G CTB to deregister the context. A future patch may defer this
> -	 * H2G CTB if the runtime wakeref is zero.
> +	 * We use a worker to issue the H2G to deregister the context as we can
> +	 * take the GT PM for the first time which isn't allowed from an atomic
> +	 * context.
>   	 */
> -	with_intel_runtime_pm(runtime_pm, wakeref)
> -		guc_lrc_desc_unpin(ce);
> +	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
>   }
>   
>   static int guc_context_alloc(struct intel_context *ce)
> @@ -2703,8 +2744,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
>   
>   void intel_guc_submission_init_early(struct intel_guc *guc)
>   {
> -	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> -	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> +	guc->submission_state.max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> +	guc->submission_state.num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
>   	guc->submission_supported = __guc_submission_supported(guc);
>   	guc->submission_selected = __guc_submission_selected(guc);
>   }
> @@ -2714,10 +2755,10 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>   {
>   	struct intel_context *ce;
>   
> -	if (unlikely(desc_idx >= guc->max_guc_ids)) {
> +	if (unlikely(desc_idx >= guc->submission_state.max_guc_ids)) {
>   		drm_err(&guc_to_gt(guc)->i915->drm,
>   			"Invalid desc_idx %u, max %u",
> -			desc_idx, guc->max_guc_ids);
> +			desc_idx, guc->submission_state.max_guc_ids);
>   		return NULL;
>   	}
>   
> @@ -2771,6 +2812,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
>   		intel_context_put(ce);
>   	} else if (context_destroyed(ce)) {
>   		/* Context has been destroyed */
> +		intel_gt_pm_put_async(guc_to_gt(guc));
>   		release_guc_id(guc, ce);
>   		__guc_context_destroy(ce);
>   	}
> @@ -3065,8 +3107,10 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>   
>   	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
>   		   atomic_read(&guc->outstanding_submission_g2h));
> -	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
> -	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
> +	drm_printf(p, "GuC Number GuC IDs: %u\n",
> +		   guc->submission_state.num_guc_ids);
> +	drm_printf(p, "GuC Max GuC IDs: %u\n",
> +		   guc->submission_state.max_guc_ids);
>   	drm_printf(p, "GuC tasklet count: %u\n\n",
>   		   atomic_read(&sched_engine->tasklet.count));
>   
> 


* Re: [Intel-gfx] [PATCH 05/27] drm/i915: Add GT PM unpark worker
  2021-09-10 20:09     ` Matthew Brost
@ 2021-09-13 10:33       ` Tvrtko Ursulin
  2021-09-13 17:20         ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-13 10:33 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu


On 10/09/2021 21:09, Matthew Brost wrote:
> On Fri, Sep 10, 2021 at 09:36:17AM +0100, Tvrtko Ursulin wrote:
>>
>> On 20/08/2021 23:44, Matthew Brost wrote:
>>> Sometimes it is desirable to queue work up for later if the GT PM isn't
>>> held and run that work on next GT PM unpark.
>>
>> Sounds maybe plausible, but it depends how much work can happen on unpark
>> and whether it can have too much of a negative impact on latency for
>> interactive loads? Or from a reverse angle, why the work wouldn't be done on
> 
> All it does is add an interface to kick a work queue on unpark, i.e.
> all the actual work is done async in the work queue so it shouldn't
> add any latency.
> 
>> parking?
>>
>> Also what kind of mechanism for dealing with too much stuff being put on
>> this list you have? Can there be pressure which triggers (or would need to
> 
> No limits on pressure. See above, I don't think this is a concern.

On unpark it has the potential to send an unbounded number of actions 
for the GuC to process, which will be competing, in GuC internal 
processing power, with the user action which caused the unpark. That 
logically does feel like it can have an effect on initial latency. Why 
do you think it cannot?

Why the work wouldn't be done on parking?

With this scheme couldn't we end up with a situation where the worker 
keeps missing the GT unparked state and so keeps piling items onto the 
deregistration list? Can you run out of ids like that (which is related 
to my question of how pressure is handled here)?

Unpark
Register context
Submit work
Retire
Schedule context deregister
Park

Worker runs
GT parked
Work put on a list

Unpark
Schedule deregistration worker
Register new context
Submit work
Retire
Schedule context deregister
Park

Worker runs (let's say there was CPU pressure)
GT already parked
  -> deregistration queue now has two contexts on it

... repeat until disaster ...

Unless I have misunderstood the logic.

>> trigger) these deregistrations to happen at runtime (no park/unpark
>> transitions)?
>>
>>> Implemented with a list in the GT of all pending work, workqueues in
>>> the list, a callback to add a workqueue to the list, and finally a
>>> wakeref post_get callback that iterates / drains the list + queues the
>>> workqueues.
>>>
>>> First user of this is deregistration of GuC contexts.
>>
>> Does first imply there are more incoming?
>>
> 
> Haven't found another user yet but this is a generic mechanism so we can
> add more in the future if other use cases arise.

My feeling is it would be best to leave it for later.

>   
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/Makefile                 |  1 +
>>>    drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
>>>    drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  8 ++++
>>>    .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c | 35 ++++++++++++++++
>>>    .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h | 40 +++++++++++++++++++
>>>    drivers/gpu/drm/i915/gt/intel_gt_types.h      | 10 +++++
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  8 ++--
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++--
>>>    drivers/gpu/drm/i915/intel_wakeref.c          |  5 +++
>>>    drivers/gpu/drm/i915/intel_wakeref.h          |  1 +
>>>    10 files changed, 119 insertions(+), 7 deletions(-)
>>>    create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
>>>    create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
>>>
>>> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
>>> index 642a5b5a1b81..579bdc069f25 100644
>>> --- a/drivers/gpu/drm/i915/Makefile
>>> +++ b/drivers/gpu/drm/i915/Makefile
>>> @@ -103,6 +103,7 @@ gt-y += \
>>>    	gt/intel_gt_clock_utils.o \
>>>    	gt/intel_gt_irq.o \
>>>    	gt/intel_gt_pm.o \
>>> +	gt/intel_gt_pm_unpark_work.o \
>>>    	gt/intel_gt_pm_irq.o \
>>>    	gt/intel_gt_requests.o \
>>>    	gt/intel_gtt.o \
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
>>> index 62d40c986642..7e690e74baa2 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
>>> @@ -29,6 +29,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
>>>    	spin_lock_init(&gt->irq_lock);
>>> +	spin_lock_init(&gt->pm_unpark_work_lock);
>>> +	INIT_LIST_HEAD(&gt->pm_unpark_work_list);
>>> +
>>>    	INIT_LIST_HEAD(&gt->closed_vma);
>>>    	spin_lock_init(&gt->closed_lock);
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> index dea8e2479897..564c11a3748b 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> @@ -90,6 +90,13 @@ static int __gt_unpark(struct intel_wakeref *wf)
>>>    	return 0;
>>>    }
>>> +static void __gt_unpark_work_queue(struct intel_wakeref *wf)
>>> +{
>>> +	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
>>> +
>>> +	intel_gt_pm_unpark_work_queue(gt);
>>> +}
>>> +
>>>    static int __gt_park(struct intel_wakeref *wf)
>>>    {
>>>    	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
>>> @@ -118,6 +125,7 @@ static int __gt_park(struct intel_wakeref *wf)
>>>    static const struct intel_wakeref_ops wf_ops = {
>>>    	.get = __gt_unpark,
>>> +	.post_get = __gt_unpark_work_queue,
>>>    	.put = __gt_park,
>>>    };
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
>>> new file mode 100644
>>> index 000000000000..23162dbd0c35
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
>>> @@ -0,0 +1,35 @@
>>> +// SPDX-License-Identifier: MIT
>>> +/*
>>> + * Copyright © 2021 Intel Corporation
>>> + */
>>> +
>>> +#include "i915_drv.h"
>>> +#include "intel_runtime_pm.h"
>>> +#include "intel_gt_pm.h"
>>> +
>>> +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt)
>>> +{
>>> +	struct intel_gt_pm_unpark_work *work, *next;
>>> +	unsigned long flags;
>>> +
>>> +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
>>> +	list_for_each_entry_safe(work, next,
>>> +				 &gt->pm_unpark_work_list, link) {
>>> +		list_del_init(&work->link);
>>> +		queue_work(system_unbound_wq, &work->worker);
>>> +	}
>>> +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
>>> +}
>>> +
>>> +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
>>> +				 struct intel_gt_pm_unpark_work *work)
>>> +{
>>> +	unsigned long flags;
>>> +
>>> +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
>>> +	if (intel_gt_pm_is_awake(gt))
>>> +		queue_work(system_unbound_wq, &work->worker);
>>> +	else if (list_empty(&work->link))
>>
>> What's the list_empty check for, something can race by design?
>>
> 
> This function is allowed to be called twice, e.g. two contexts can try
> to be deregistered while the GT PM is idle, but only the first context
> results in the worker being added to the list of workers to be kicked on
> unpark.

Hmm okay.

> 
>>> +		list_add_tail(&work->link, &gt->pm_unpark_work_list);
>>> +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
>>> +}
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
>>> new file mode 100644
>>> index 000000000000..eaf1dc313aa2
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
>>> @@ -0,0 +1,40 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2021 Intel Corporation
>>> + */
>>> +
>>> +#ifndef INTEL_GT_PM_UNPARK_WORK_H
>>> +#define INTEL_GT_PM_UNPARK_WORK_H
>>> +
>>> +#include <linux/list.h>
>>> +#include <linux/workqueue.h>
>>> +
>>> +struct intel_gt;
>>> +
>>> +/**
>>> + * struct intel_gt_pm_unpark_work - work to be scheduled when GT unparked
>>> + */
>>> +struct intel_gt_pm_unpark_work {
>>> +	/**
>>> +	 * @link: link into gt->pm_unpark_work_list of workers that need to be
>>> +	 * scheduled when GT is unpark, protected by gt->pm_unpark_work_lock
>>> +	 */
>>> +	struct list_head link;
>>> +	/** @worker: will be scheduled when GT unparked */
>>> +	struct work_struct worker;
>>> +};
>>> +
>>> +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt);
>>> +
>>> +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
>>> +				 struct intel_gt_pm_unpark_work *work);
>>> +
>>> +static inline void
>>> +intel_gt_pm_unpark_work_init(struct intel_gt_pm_unpark_work *work,
>>> +			     work_func_t fn)
>>> +{
>>> +	INIT_LIST_HEAD(&work->link);
>>> +	INIT_WORK(&work->worker, fn);
>>> +}
>>> +
>>> +#endif /* INTEL_GT_PM_UNPARK_WORK_H */
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
>>> index a81e21bf1bd1..4480312f0add 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
>>> @@ -96,6 +96,16 @@ struct intel_gt {
>>>    	struct intel_wakeref wakeref;
>>>    	atomic_t user_wakeref;
>>> +	/**
>>> +	 * @pm_unpark_work_list: list of delayed work to scheduled which GT is
>>> +	 * unparked, protected by pm_unpark_work_lock
>>> +	 */
>>> +	struct list_head pm_unpark_work_list;
>>> +	/**
>>> +	 * @pm_unpark_work_lock: protects pm_unpark_work_list
>>> +	 */
>>> +	spinlock_t pm_unpark_work_lock;
>>> +
>>>    	struct list_head closed_vma;
>>>    	spinlock_t closed_lock; /* guards the list of closed_vma */
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> index 7358883f1540..023953e77553 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> @@ -19,6 +19,7 @@
>>>    #include "intel_uc_fw.h"
>>>    #include "i915_utils.h"
>>>    #include "i915_vma.h"
>>> +#include "gt/intel_gt_pm_unpark_work.h"
>>>    struct __guc_ads_blob;
>>> @@ -78,11 +79,12 @@ struct intel_guc {
>>>    		 */
>>>    		struct list_head destroyed_contexts;
>>>    		/**
>>> -		 * @destroyed_worker: worker to deregister contexts, need as we
>>> +		 * @destroyed_worker: Worker to deregister contexts, need as we
>>>    		 * need to take a GT PM reference and can't from destroy
>>> -		 * function as it might be in an atomic context (no sleeping)
>>> +		 * function as it might be in an atomic context (no sleeping).
>>> +		 * Worker only issues deregister when GT is unparked.
>>>    		 */
>>> -		struct work_struct destroyed_worker;
>>> +		struct intel_gt_pm_unpark_work destroyed_worker;
>>>    	} submission_state;
>>>    	bool submission_supported;
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index f835e06e5f9f..dbf919801de2 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -1135,7 +1135,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>>    	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
>>>    	ida_init(&guc->submission_state.guc_ids);
>>>    	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
>>> -	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
>>> +	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
>>> +				     destroyed_worker_func);
>>>    	return 0;
>>>    }
>>> @@ -1942,13 +1943,18 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
>>>    static void destroyed_worker_func(struct work_struct *w)
>>>    {
>>> -	struct intel_guc *guc = container_of(w, struct intel_guc,
>>> +	struct intel_gt_pm_unpark_work *destroyed_worker =
>>> +		container_of(w, struct intel_gt_pm_unpark_work, worker);
>>> +	struct intel_guc *guc = container_of(destroyed_worker, struct intel_guc,
>>>    					     submission_state.destroyed_worker);
>>>    	struct intel_gt *gt = guc_to_gt(guc);
>>>    	int tmp;
>>> -	with_intel_gt_pm(gt, tmp)
>>> +	with_intel_gt_pm_if_awake(gt, tmp)
>>>    		deregister_destroyed_contexts(guc);
>>> +
>>> +	if (!list_empty(&guc->submission_state.destroyed_contexts))
>>> +		intel_gt_pm_unpark_work_add(gt, destroyed_worker);
>>
>> This is the worker itself, right?
>>
> 
> Yes.
> 
>> There's an "if awake" here followed by another "if awake" inside
>> intel_gt_pm_unpark_work_add which raises questions.
>>
> 
> Even in the worker we only deregister the context if the GT is awake.

Yeah but the two "if awake" checks following one another are the 
confusing part since they are not atomic, coupled with the list empty 
check I mean. So two external async conditions can influence the flow, 
one of which is even checked twice.

The worker can decide not to deregister, but by the time 
intel_gt_pm_unpark_work_add runs the GT may have become unparked so the 
worker gets queued. When it runs the GT is parked again and so it can 
bounce in a loop forever.

>   
>> Second question is what's the list_empty for - why is the state of the list
>> itself relevant to a single worker deciding whether to re-add itself to it
>> or not? And is there a lock protecting this list?
>>
> 
> Yes we have locking - pm_unpark_work_lock. It basically allows two
> threads to call intel_gt_pm_unpark_work_add while the GT is idle - in
> that case only the first call adds the worker to the list of workers to
> kick when unparked (same explanation as above).
>   
>> Overall it feels questionable to have unpark work which apparently can
>> race with subsequent parking. Presumably you cannot have it run sync on
>> unpark due to execution context issues?
>>
> 
> No race, just allowed to be called twice.

But the list_empty check is not under any locks so I don't follow.

> 
>>>    }
>>>    static void guc_context_destroy(struct kref *kref)
>>> @@ -1985,7 +1991,8 @@ static void guc_context_destroy(struct kref *kref)
>>>    	 * take the GT PM for the first time which isn't allowed from an atomic
>>>    	 * context.
>>>    	 */
>>> -	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
>>> +	intel_gt_pm_unpark_work_add(guc_to_gt(guc),
>>> +				    &guc->submission_state.destroyed_worker);
>>>    }
>>>    static int guc_context_alloc(struct intel_context *ce)
>>> diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
>>> index dfd87d082218..282fc4f312e3 100644
>>> --- a/drivers/gpu/drm/i915/intel_wakeref.c
>>> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
>>> @@ -24,6 +24,8 @@ static void rpm_put(struct intel_wakeref *wf)
>>>    int __intel_wakeref_get_first(struct intel_wakeref *wf)
>>>    {
>>> +	bool do_post = false;
>>> +
>>>    	/*
>>>    	 * Treat get/put as different subclasses, as we may need to run
>>>    	 * the put callback from under the shrinker and do not want to
>>> @@ -44,8 +46,11 @@ int __intel_wakeref_get_first(struct intel_wakeref *wf)
>>>    		}
>>>    		smp_mb__before_atomic(); /* release wf->count */
>>> +		do_post = true;
>>>    	}
>>>    	atomic_inc(&wf->count);
>>> +	if (do_post && wf->ops->post_get)
>>> +		wf->ops->post_get(wf);
>>
>> You want this hook under the wf->mutex and why?
>>
> 
> I didn't really think about this but everything else is under the mutex
> so I included this under the mutex too. In this case the post_get op
> could likely be outside the mutex but I think it is harmless to keep it
> under the mutex. For future safety, I think it should stay under the
> mutex.

If it is under the mutex then what is the point of two hooks? The only 
difference is get sees wf->count == 0, while post_get sees it as one.

No wait... you have put post_get outside the 0->1 transition check so it 
is potentially called more than once. You probably do not want this. But 
could you just do this from the existing hook is the main question.
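
I.e. an untested sketch of what I mean, against __gt_unpark from this 
patch:

static int __gt_unpark(struct intel_wakeref *wf)
{
	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);

	/* ... existing unpark sequence ... */

	/*
	 * Kick the deferred work from the existing hook. Caveat: this
	 * runs before wf->count is incremented, so a worker which runs
	 * immediately and calls intel_gt_pm_get_if_awake() could still
	 * see the GT as parked - maybe that is why you added post_get?
	 */
	intel_gt_pm_unpark_work_queue(gt);

	return 0;
}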

Regards,

Tvrtko

> 
> Matt
> 
>> Regards,
>>
>> Tvrtko
>>
>>>    	mutex_unlock(&wf->mutex);
>>>    	INTEL_WAKEREF_BUG_ON(atomic_read(&wf->count) <= 0);
>>> diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
>>> index 545c8f277c46..ef7e6a698e8a 100644
>>> --- a/drivers/gpu/drm/i915/intel_wakeref.h
>>> +++ b/drivers/gpu/drm/i915/intel_wakeref.h
>>> @@ -30,6 +30,7 @@ typedef depot_stack_handle_t intel_wakeref_t;
>>>    struct intel_wakeref_ops {
>>>    	int (*get)(struct intel_wakeref *wf);
>>> +	void (*post_get)(struct intel_wakeref *wf);
>>>    	int (*put)(struct intel_wakeref *wf);
>>>    };
>>>


* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-09-10 20:49     ` Matthew Brost
@ 2021-09-13 10:52       ` Tvrtko Ursulin
  0 siblings, 0 replies; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-13 10:52 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu


On 10/09/2021 21:49, Matthew Brost wrote:
> On Fri, Sep 10, 2021 at 12:25:43PM +0100, Tvrtko Ursulin wrote:
>>
>> On 20/08/2021 23:44, Matthew Brost wrote:
>>> For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
>>> mid BB. To safely enable preemption at the BB boundary, a handshake
>>> between the parent and child is needed. This is implemented via custom
>>> emit_bb_start & emit_fini_breadcrumb functions and enabled by default
>>> if a context is configured via the set parallel extension.
>>
>> FWIW I think it's wrong to hardcode the requirements of a particular
>> hardware generation's fixed media pipeline into the uapi. IMO the better
>> solution was when the concept of parallel submission was decoupled from
>> the no preemption mid batch preambles. Otherwise we might as well call the
>> extension I915_CONTEXT_ENGINES_EXT_MEDIA_SPLIT_FRAME_SUBMIT or something.
>>
> 
> I don't disagree but this is where we landed per Daniel Vetter's feedback -
> default to what our current hardware supports and extend it later to
> newer hardware / requirements as needed.

I think this only re-affirms my argument - no point giving the extension 
a generic name if it is so tightly coupled to a specific use case. But I 
wrote FWIW so whatever.

It will be mighty awkward if compute multi-lrc needs to specify a flag 
to allow preemption, when turning off preemption is otherwise not 
something we allow unprivileged clients to do. So it will be a kind of 
opt-out from unfriendly/dangerous default behaviour instead of 
explicitly requesting it.

Regards,

Tvrtko


* Re: [Intel-gfx] [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-09-13  9:24       ` Tvrtko Ursulin
@ 2021-09-13 16:50         ` Matthew Brost
  2021-09-14  8:34           ` Tvrtko Ursulin
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-13 16:50 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 13, 2021 at 10:24:43AM +0100, Tvrtko Ursulin wrote:
> 
> On 10/09/2021 20:49, Matthew Brost wrote:
> > On Fri, Sep 10, 2021 at 12:12:42PM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 20/08/2021 23:44, Matthew Brost wrote:
> > > > Add logical engine mapping. This is required for split-frame, as
> > > > workloads need to be placed on engines in a logically contiguous manner.
> > > > 
> > > > v2:
> > > >    (Daniel Vetter)
> > > >     - Add kernel doc for new fields
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
> > > >    drivers/gpu/drm/i915/gt/intel_engine_types.h  |  5 ++
> > > >    .../drm/i915/gt/intel_execlists_submission.c  |  1 +
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
> > > >    5 files changed, 60 insertions(+), 29 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > > index 0d9105a31d84..4d790f9a65dd 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > > @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
> > > >    	GEM_DEBUG_WARN_ON(iir);
> > > >    }
> > > > -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> > > > +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
> > > > +			      u8 logical_instance)
> > > >    {
> > > >    	const struct engine_info *info = &intel_engines[id];
> > > >    	struct drm_i915_private *i915 = gt->i915;
> > > > @@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> > > >    	engine->class = info->class;
> > > >    	engine->instance = info->instance;
> > > > +	engine->logical_mask = BIT(logical_instance);
> > > >    	__sprint_engine_name(engine);
> > > >    	engine->props.heartbeat_interval_ms =
> > > > @@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
> > > >    	return info->engine_mask;
> > > >    }
> > > > +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
> > > > +				 u8 class, const u8 *map, u8 num_instances)
> > > > +{
> > > > +	int i, j;
> > > > +	u8 current_logical_id = 0;
> > > > +
> > > > +	for (j = 0; j < num_instances; ++j) {
> > > > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > > > +			if (!HAS_ENGINE(gt, i) ||
> > > > +			    intel_engines[i].class != class)
> > > > +				continue;
> > > > +
> > > > +			if (intel_engines[i].instance == map[j]) {
> > > > +				logical_ids[intel_engines[i].instance] =
> > > > +					current_logical_id++;
> > > > +				break;
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
> > > > +{
> > > > +	int i;
> > > > +	u8 map[MAX_ENGINE_INSTANCE + 1];
> > > > +
> > > > +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
> > > > +		map[i] = i;
> > > 
> > > What's the point of the map array since it is 1:1 with instance?
> > > 
> > 
> > Future products do not have a 1 to 1 mapping and that mapping can change
> > based on fusing, e.g. XeHP SDV.
> > 
> > Also technically ICL / TGL / ADL physical instance 2 maps to logical
> > instance 1.
> 
> I don't follow the argument. All I can see is that "map[i] = i" always in
> the proposed code, which is then used to check "instance == map[instance]".
> So I'd suggest to remove this array from the code until there is a need for
> it.
> 

Ok, this logic is slightly confusing and makes more sense once we have
non-standard mappings. Yes, map is set up as a 1 to 1 mapping by default,
with the value in map[i] being a physical instance. populate_logical_ids
walks the map, finds each physical instance that is present, and assigns
each found instance a new logical id, increasing by 1 each time.

e.g. If the map is set up 0-N and only physical instances 0 / 2 are
present, they will get logical ids 0 / 1 respectively.

This algorithm works for non-standard mappings with fused parts too. e.g.
on XeHP SDV the map is { 0, 2, 4, 6, 1, 3, 5, 7 }, and if any of the
physical instances can't be found due to fusing the logical mapping is
still correct per the bspec.

This array is absolutely needed for multi-lrc submission to work, even
on ICL / TGL / ADL as the GuC only supports logically contiguous engine
instances.
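
If it helps, here is a standalone sketch of the same search outside the 
driver (the fusing pattern below is made up for illustration):

#include <stdio.h>

#define NUM_INSTANCES 8

int main(void)
{
	/* Physical instance order from the bspec (XeHP SDV example) */
	const unsigned char map[NUM_INSTANCES] = { 0, 2, 4, 6, 1, 3, 5, 7 };
	/* Pretend physical instances 0, 2 and 5 survived fusing */
	const int present[NUM_INSTANCES] = { 1, 0, 1, 0, 0, 1, 0, 0 };
	unsigned char logical_ids[NUM_INSTANCES] = { 0 };
	unsigned char current_logical_id = 0;
	int j;

	/* Walk the map in order, handing out logical ids to present engines */
	for (j = 0; j < NUM_INSTANCES; ++j)
		if (present[map[j]])
			logical_ids[map[j]] = current_logical_id++;

	for (j = 0; j < NUM_INSTANCES; ++j)
		if (present[j])
			printf("physical %d -> logical %d\n", j, logical_ids[j]);

	return 0;
}

This prints physical 0 -> logical 0, physical 2 -> logical 1, physical 5 
-> logical 2, i.e. contiguous logical ids handed out in bspec map order 
regardless of which physical instances are fused off.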

> > > > +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
> > > > +}
> > > > +
> > > >    /**
> > > >     * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
> > > >     * @gt: pointer to struct intel_gt
> > > > @@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> > > >    	struct drm_i915_private *i915 = gt->i915;
> > > >    	const unsigned int engine_mask = init_engine_mask(gt);
> > > >    	unsigned int mask = 0;
> > > > -	unsigned int i;
> > > > +	unsigned int i, class;
> > > > +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
> > > >    	int err;
> > > >    	drm_WARN_ON(&i915->drm, engine_mask == 0);
> > > > @@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> > > >    	if (i915_inject_probe_failure(i915))
> > > >    		return -ENODEV;
> > > > -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
> > > > -		if (!HAS_ENGINE(gt, i))
> > > > -			continue;
> > > > +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
> > > > +		setup_logical_ids(gt, logical_ids, class);
> > > > -		err = intel_engine_setup(gt, i);
> > > > -		if (err)
> > > > -			goto cleanup;
> > > > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > > > +			u8 instance = intel_engines[i].instance;
> > > > +
> > > > +			if (intel_engines[i].class != class ||
> > > > +			    !HAS_ENGINE(gt, i))
> > > > +				continue;
> > > > -		mask |= BIT(i);
> > > > +			err = intel_engine_setup(gt, i,
> > > > +						 logical_ids[instance]);
> > > > +			if (err)
> > > > +				goto cleanup;
> > > > +
> > > > +			mask |= BIT(i);
> > > 
> > > I still this there is a less clunky way to set this up in less code and more
> > > readable at the same time. Like do it in two passes so you can iterate
> > > gt->engine_class[] array instead of having to implement a skip condition
> > > (both on class and HAS_ENGINE at two places) and also avoid walking the flat
> > > intel_engines array recursively.
> > > 
> > 
> > Kinda a bikeshed arguing about a pretty simple loop structure, don't you
> > think? I personally like the way it laid out.
> > 
> > Pseudo code for your suggestion?
> 
> Leave the existing setup loop as is and add an additional "for engine class"
> walk after it. That way you can walk the already set up gt->engine_class[]
> array and so wouldn't need to skip wrong classes and have HAS_ENGINE checks
> when walking the flat intel_engines[] array several times. It also applies
> to the helper which counts logical instances per class.
>

Ok, I think I see what you are getting at. Again IMO this is a total
bikeshed as this is a one time setup step where we really should only
care whether the loop works, not whether it is optimized or looks the
way a certain person wants. I can change this if you really insist but
again IMO discussing this is a waste of energy.
 
> > > > +		}
> > > >    	}
> > > >    	/*
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > > index ed91bcff20eb..fddf35546b58 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > > @@ -266,6 +266,11 @@ struct intel_engine_cs {
> > > >    	unsigned int guc_id;
> > > >    	intel_engine_mask_t mask;
> > > > +	/**
> > > > +	 * @logical_mask: logical mask of engine, reported to user space via
> > > > +	 * query IOCTL and used to communicate with the GuC in logical space
> > > > +	 */
> > > > +	intel_engine_mask_t logical_mask;
> > > 
> > > You could prefix the new field with uabi_ to match the existing scheme and
> > > to signify to anyone who might be touching it in the future it should not be
> > > changed.
> > 
> > This is kinda uabi, but also kinda isn't. We do report a logical
> > instance via IOCTL but it also really is tied to the GuC backend as we
> > can only communicate with the GuC in logical space. IMO we should leave
> > it as is.
> 
> Perhaps it would be best to call the new field uabi_logical_instance so it's
> clear it is reported in the query directly and do the BIT() transformation
> in the GuC backend?
>

Virtual engines can have multiple bits set in this mask, so this is
used for both the query on physical engines via UABI and submission in
the GuC backend.

> > 
> > > 
> > > Also, I think comment should explain what is logical space ie. how the
> > > numbering works.
> > > 
> > 
> > Don't I already do that? I suppose I could add something like:
> 
> Where is it? Can't see it in the uapi kerneldoc AFAICS (for the new query)
> or here.
>

	/**
	 * @logical_mask: logical mask of engine, reported to user space via
	 * query IOCTL and used to communicate with the GuC in logical space
	 */
 
> > 
> > The logical mask within engine class must be contiguous across all
> > instances.
> 
Best not to start by mentioning the mask. Just explain what logical
numbering is in terms of how engines are enumerated in order of
physical instances but skipping the fused-off ones. The kerneldoc for
the new query is, I think, the right place.
>

Maybe I can add:

The logical mapping is defined on a per-part basis in the bspec and can
vary based on the part's fusing.
 
> > > Not sure the part about GuC needs to be in the comment since uapi is
> > > supposed to be backend agnostic.
> > > 
> > 
> > Communicating with the GuC in logical space is a pretty key point here.
> > The communication with the GuC in logical space is backend specific but
> > how our hardware works (e.g. split frame workloads must be placed on
> > logically contiguous engines) is not. Mentioning the GuC requirement
> > here makes sense to me for completeness.
> 
> Yeah might be, I was thinking more about the new query. Query definitely is
> backend agnostic but yes it is fine to say in the comment here the new field
> is used both for the query and for communicating with GuC.
>

Sounds good, will make it clear it is used for the query and for
communicating with the GuC.
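
e.g. something like this (rough sketch, exact wording TBD):

	/**
	 * @logical_mask: logical mask of engine, reported to user space via
	 * query IOCTL and used to communicate with the GuC in logical space.
	 * The logical instance of a physical engine is its position, in order
	 * of physical instance, among the non-fused-off engines of its class;
	 * the mapping is defined per part in the bspec and can vary with
	 * fusing.
	 */
	intel_engine_mask_t logical_mask;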

Matt
 
> Regards,
> 
> Tvrtko
> 
> > 
> > Matt
> > 
> > > Regards,
> > > 
> > > Tvrtko
> > > 
> > > >    	u8 class;
> > > >    	u8 instance;
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > index cafb0608ffb4..813a6de01382 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > @@ -3875,6 +3875,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > > >    		ve->siblings[ve->num_siblings++] = sibling;
> > > >    		ve->base.mask |= sibling->mask;
> > > > +		ve->base.logical_mask |= sibling->logical_mask;
> > > >    		/*
> > > >    		 * All physical engines must be compatible for their emission
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > > index 6926919bcac6..9f5f43a16182 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > > @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
> > > >    	for_each_engine(engine, gt, id) {
> > > >    		u8 guc_class = engine_class_to_guc_class(engine->class);
> > > > -		system_info->mapping_table[guc_class][engine->instance] =
> > > > +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
> > > >    			engine->instance;
> > > >    	}
> > > >    }
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index e0eed70f9b92..ffafbac7335e 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -1401,23 +1401,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> > > >    	return __guc_action_deregister_context(guc, guc_id, loop);
> > > >    }
> > > > -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
> > > > -{
> > > > -	switch (class) {
> > > > -	case RENDER_CLASS:
> > > > -		return mask >> RCS0;
> > > > -	case VIDEO_ENHANCEMENT_CLASS:
> > > > -		return mask >> VECS0;
> > > > -	case VIDEO_DECODE_CLASS:
> > > > -		return mask >> VCS0;
> > > > -	case COPY_ENGINE_CLASS:
> > > > -		return mask >> BCS0;
> > > > -	default:
> > > > -		MISSING_CASE(class);
> > > > -		return 0;
> > > > -	}
> > > > -}
> > > > -
> > > >    static void guc_context_policy_init(struct intel_engine_cs *engine,
> > > >    				    struct guc_lrc_desc *desc)
> > > >    {
> > > > @@ -1459,8 +1442,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > >    	desc = __get_lrc_desc(guc, desc_idx);
> > > >    	desc->engine_class = engine_class_to_guc_class(engine->class);
> > > > -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> > > > -						      engine->mask);
> > > > +	desc->engine_submit_mask = engine->logical_mask;
> > > >    	desc->hw_context_desc = ce->lrc.lrca;
> > > >    	desc->priority = ce->guc_state.prio;
> > > >    	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > > > @@ -3260,6 +3242,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > > >    		}
> > > >    		ve->base.mask |= sibling->mask;
> > > > +		ve->base.logical_mask |= sibling->logical_mask;
> > > >    		if (n != 0 && ve->base.class != sibling->class) {
> > > >    			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
> > > > 


* Re: [Intel-gfx] [PATCH 07/27] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-09-09 22:51   ` John Harrison
@ 2021-09-13 16:54     ` Matthew Brost
  2021-09-13 22:38       ` John Harrison
  2021-09-13 16:55     ` Matthew Brost
  1 sibling, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-13 16:54 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Thu, Sep 09, 2021 at 03:51:27PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Calling switch_to_kernel_context isn't needed if the engine PM reference
> > is taken while all contexts are pinned. By not calling
> > switch_to_kernel_context we save on issuing a request to the engine.
> I thought the intention of the switch_to_kernel was to ensure that the GPU
> is not touching any user context and is basically idle. That is not a valid
> assumption with an external scheduler such as GuC. So why is the description
> above only mentioning PM references? What is the connection between the PM
> ref and the switch_to_kernel?
> 
> Also, the comment in the code does not mention anything about PM references,
> it just says 'not necessary with GuC' but no explanation at all.
> 

Yea, this needs to be explained better. How about this?

Calling switch_to_kernel_context isn't needed if the engine PM reference
is taken while all user contexts have scheduling enabled. Once scheduling
is disabled on all user contexts the GuC is guaranteed not to touch any
user context state, which is effectively the same as pointing to a kernel
context.

Matt

> 
> > v2:
> >   (Daniel Vetter)
> >    - Add FIXME comment about pushing switch_to_kernel_context to backend
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 9 +++++++++
> >   1 file changed, 9 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > index 1f07ac4e0672..11fee66daf60 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > @@ -162,6 +162,15 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
> >   	unsigned long flags;
> >   	bool result = true;
> > +	/*
> > +	 * No need to switch_to_kernel_context if GuC submission
> > +	 *
> > +	 * FIXME: This execlists specific backend behavior in generic code, this
> "This execlists" -> "This is execlist"
> 
> "this should be" -> "it should be"
> 
> John.
> 
> > +	 * should be pushed to the backend.
> > +	 */
> > +	if (intel_engine_uses_guc(engine))
> > +		return true;
> > +
> >   	/* GPU is pointing to the void, as good as in the kernel context. */
> >   	if (intel_gt_is_wedged(engine->gt))
> >   		return true;
> 


* Re: [Intel-gfx] [PATCH 07/27] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-09-09 22:51   ` John Harrison
  2021-09-13 16:54     ` Matthew Brost
@ 2021-09-13 16:55     ` Matthew Brost
  1 sibling, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-13 16:55 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Thu, Sep 09, 2021 at 03:51:27PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Calling switch_to_kernel_context isn't needed if the engine PM reference
> > is taken while all contexts are pinned. By not calling
> > switch_to_kernel_context we save on issuing a request to the engine.
> I thought the intention of the switch_to_kernel was to ensure that the GPU
> is not touching any user context and is basically idle. That is not a valid
> assumption with an external scheduler such as GuC. So why is the description
> above only mentioning PM references? What is the connection between the PM
> ref and the switch_to_kernel?
> 
> Also, the comment in the code does not mention anything about PM references,
> it just says 'not necessary with GuC' but no explanation at all.
> 
> 
> > v2:
> >   (Daniel Vetter)
> >    - Add FIXME comment about pushing switch_to_kernel_context to backend
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 9 +++++++++
> >   1 file changed, 9 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > index 1f07ac4e0672..11fee66daf60 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > @@ -162,6 +162,15 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
> >   	unsigned long flags;
> >   	bool result = true;
> > +	/*
> > +	 * No need to switch_to_kernel_context if GuC submission
> > +	 *
> > +	 * FIXME: This execlists specific backend behavior in generic code, this
> "This execlists" -> "This is execlist"
> 
> "this should be" -> "it should be"
> 

Missed this. Will fix in next rev.

Matt

> John.
> 
> > +	 * should be pushed to the backend.
> > +	 */
> > +	if (intel_engine_uses_guc(engine))
> > +		return true;
> > +
> >   	/* GPU is pointing to the void, as good as in the kernel context. */
> >   	if (intel_gt_is_wedged(engine->gt))
> >   		return true;
> 


* Re: [Intel-gfx] [PATCH 04/27] drm/i915/guc: Take GT PM ref when deregistering context
  2021-09-13  9:55   ` Tvrtko Ursulin
@ 2021-09-13 17:12     ` Matthew Brost
  2021-09-14  8:41       ` Tvrtko Ursulin
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-13 17:12 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 13, 2021 at 10:55:59AM +0100, Tvrtko Ursulin wrote:
> 
> On 20/08/2021 23:44, Matthew Brost wrote:
> > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > circuiting while a deregister context H2G is in flight.
> > 
> > FIXME: Move locking / structure changes into different patch
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |  13 +-
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  13 ++
> >   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  46 ++--
> >   .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  13 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 212 +++++++++++-------
> >   8 files changed, 199 insertions(+), 106 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index adfe49b53b1b..c8595da64ad8 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >   	ce->guc_id.id = GUC_INVALID_LRC_ID;
> >   	INIT_LIST_HEAD(&ce->guc_id.link);
> > +	INIT_LIST_HEAD(&ce->destroyed_link);
> > +
> >   	/*
> >   	 * Initialize fence to be complete as this is expected to be complete
> >   	 * unless there is a pending schedule disable outstanding.
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 80bbdc7810f6..fd338a30617e 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -190,22 +190,29 @@ struct intel_context {
> >   		/**
> >   		 * @id: unique handle which is used to communicate information
> >   		 * with the GuC about this context, protected by
> > -		 * guc->contexts_lock
> > +		 * guc->submission_state.lock
> >   		 */
> >   		u16 id;
> >   		/**
> >   		 * @ref: the number of references to the guc_id, when
> >   		 * transitioning in and out of zero protected by
> > -		 * guc->contexts_lock
> > +		 * guc->submission_state.lock
> >   		 */
> >   		atomic_t ref;
> >   		/**
> >   		 * @link: in guc->guc_id_list when the guc_id has no refs but is
> > -		 * still valid, protected by guc->contexts_lock
> > +		 * still valid, protected by guc->submission_state.lock
> >   		 */
> >   		struct list_head link;
> >   	} guc_id;
> > +	/**
> > +	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
> > +	 * list when context is pending to be destroyed (deregistered with the
> > +	 * GuC), protected by guc->submission_state.lock
> > +	 */
> > +	struct list_head destroyed_link;
> > +
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> >   	/**
> >   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > index 70ea46d6cfb0..17a5028ea177 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > @@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> >   	return intel_wakeref_is_active(&engine->wakeref);
> >   }
> > +static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
> > +{
> > +	__intel_wakeref_get(&engine->wakeref);
> > +}
> > +
> >   static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
> >   {
> >   	intel_wakeref_get(&engine->wakeref);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > index d0588d8aaa44..a17bf0d4592b 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > @@ -41,6 +41,19 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
> >   	intel_wakeref_put_async(&gt->wakeref);
> >   }
> > +#define with_intel_gt_pm(gt, tmp) \
> > +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > +	     intel_gt_pm_put(gt), tmp = 0)
> > +#define with_intel_gt_pm_async(gt, tmp) \
> > +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > +	     intel_gt_pm_put_async(gt), tmp = 0)
> > +#define with_intel_gt_pm_if_awake(gt, tmp) \
> > +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
> > +	     intel_gt_pm_put(gt), tmp = 0)
> > +#define with_intel_gt_pm_if_awake_async(gt, tmp) \
> > +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
> > +	     intel_gt_pm_put_async(gt), tmp = 0)
> > +
> >   static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
> >   {
> >   	return intel_wakeref_wait_for_idle(&gt->wakeref);
> > diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > index 8ff582222aff..ba10bd374cee 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > @@ -142,6 +142,7 @@ enum intel_guc_action {
> >   	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
> >   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
> >   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
> > +	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
> >   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> >   	INTEL_GUC_ACTION_LIMIT
> >   };
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 6fd2719d1b75..7358883f1540 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -53,21 +53,37 @@ struct intel_guc {
> >   		void (*disable)(struct intel_guc *guc);
> >   	} interrupts;
> > -	/**
> > -	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
> > -	 * ce->guc_id.ref when transitioning in and out of zero
> > -	 */
> > -	spinlock_t contexts_lock;
> > -	/** @guc_ids: used to allocate new guc_ids */
> > -	struct ida guc_ids;
> > -	/** @num_guc_ids: number of guc_ids that can be used */
> > -	u32 num_guc_ids;
> > -	/** @max_guc_ids: max number of guc_ids that can be used */
> > -	u32 max_guc_ids;
> > -	/**
> > -	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
> > -	 */
> > -	struct list_head guc_id_list;
> > +	struct {
> > +		/**
> > +		 * @lock: protects everything in submission_state, ce->guc_id,
> > +		 * and ce->destroyed_link
> > +		 */
> > +		spinlock_t lock;
> > +		/**
> > +		 * @guc_ids: used to allocate new guc_ids
> > +		 */
> > +		struct ida guc_ids;
> > +		/** @num_guc_ids: number of guc_ids that can be used */
> > +		u32 num_guc_ids;
> > +		/** @max_guc_ids: max number of guc_ids that can be used */
> > +		u32 max_guc_ids;
> > +		/**
> > +		 * @guc_id_list: list of intel_context with valid guc_ids but no
> > +		 * refs
> > +		 */
> > +		struct list_head guc_id_list;
> > +		/**
> > +		 * @destroyed_contexts: list of contexts waiting to be destroyed
> > +		 * (deregistered with the GuC)
> > +		 */
> > +		struct list_head destroyed_contexts;
> > +		/**
> > +		 * @destroyed_worker: worker to deregister contexts, need as we
> > +		 * need to take a GT PM reference and can't from destroy
> > +		 * function as it might be in an atomic context (no sleeping)
> > +		 */
> > +		struct work_struct destroyed_worker;
> > +	} submission_state;
> >   	bool submission_supported;
> >   	bool submission_selected;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > index b88d343ee432..27655460ee84 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > @@ -78,7 +78,7 @@ static int guc_num_id_get(void *data, u64 *val)
> >   	if (!intel_guc_submission_is_used(guc))
> >   		return -ENODEV;
> > -	*val = guc->num_guc_ids;
> > +	*val = guc->submission_state.num_guc_ids;
> >   	return 0;
> >   }
> > @@ -86,16 +86,21 @@ static int guc_num_id_get(void *data, u64 *val)
> >   static int guc_num_id_set(void *data, u64 val)
> >   {
> >   	struct intel_guc *guc = data;
> > +	unsigned long flags;
> >   	if (!intel_guc_submission_is_used(guc))
> >   		return -ENODEV;
> > -	if (val > guc->max_guc_ids)
> > -		val = guc->max_guc_ids;
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +
> > +	if (val > guc->submission_state.max_guc_ids)
> > +		val = guc->submission_state.max_guc_ids;
> >   	else if (val < 256)
> >   		val = 256;
> > -	guc->num_guc_ids = val;
> > +	guc->submission_state.num_guc_ids = val;
> > +
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   	return 0;
> >   }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 68742b612692..f835e06e5f9f 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -86,9 +86,9 @@
> >    * submitting at a time. Currently only 1 sched_engine used for all of GuC
> >    * submission but that could change in the future.
> >    *
> > - * guc->contexts_lock
> > - * Protects guc_id allocation. Global lock i.e. Only 1 context that uses GuC
> > - * submission can hold this at a time.
> > + * guc->submission_state.lock
> > + * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
> > + * list.
> >    *
> >    * ce->guc_state.lock
> >    * Protects everything under ce->guc_state. Ensures that a context is in the
> > @@ -100,7 +100,7 @@
> >    *
> >    * Lock ordering rules:
> >    * sched_engine->lock -> ce->guc_state.lock
> > - * guc->contexts_lock -> ce->guc_state.lock
> > + * guc->submission_state.lock -> ce->guc_state.lock
> >    *
> >    * Reset races:
> >    * When a GPU full reset is triggered it is assumed that some G2H responses to
> > @@ -344,7 +344,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> >   {
> >   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > -	GEM_BUG_ON(index >= guc->max_guc_ids);
> > +	GEM_BUG_ON(index >= guc->submission_state.max_guc_ids);
> >   	return &base[index];
> >   }
> > @@ -353,7 +353,7 @@ static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
> >   {
> >   	struct intel_context *ce = xa_load(&guc->context_lookup, id);
> > -	GEM_BUG_ON(id >= guc->max_guc_ids);
> > +	GEM_BUG_ON(id >= guc->submission_state.max_guc_ids);
> >   	return ce;
> >   }
> > @@ -363,7 +363,8 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
> >   	u32 size;
> >   	int ret;
> > -	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
> > +	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
> > +			  guc->submission_state.max_guc_ids);
> >   	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
> >   					     (void **)&guc->lrc_desc_pool_vaddr);
> >   	if (ret)
> > @@ -711,6 +712,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >   			if (deregister)
> >   				guc_signal_context_fence(ce);
> >   			if (destroyed) {
> > +				intel_gt_pm_put_async(guc_to_gt(guc));
> >   				release_guc_id(guc, ce);
> >   				__guc_context_destroy(ce);
> >   			}
> > @@ -789,6 +791,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
> >   	spin_unlock_irqrestore(&sched_engine->lock, flags);
> >   }
> > +static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> > +
> >   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   {
> >   	if (unlikely(!guc_submission_initialized(guc))) {
> > @@ -805,6 +809,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >   	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
> >   	flush_work(&guc->ct.requests.worker);
> > +	guc_flush_destroyed_contexts(guc);
> >   	scrub_guc_desc_for_outstanding_g2h(guc);
> >   }
> > @@ -1102,6 +1107,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> >   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
> >   }
> > +static void destroyed_worker_func(struct work_struct *w);
> > +
> >   /*
> >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> >    * at firmware loading time.
> > @@ -1124,9 +1131,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
> > -	spin_lock_init(&guc->contexts_lock);
> > -	INIT_LIST_HEAD(&guc->guc_id_list);
> > -	ida_init(&guc->guc_ids);
> > +	spin_lock_init(&guc->submission_state.lock);
> > +	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> > +	ida_init(&guc->submission_state.guc_ids);
> > +	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> > +	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
> >   	return 0;
> >   }
> > @@ -1137,6 +1146,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> >   		return;
> >   	guc_lrc_desc_pool_destroy(guc);
> > +	guc_flush_destroyed_contexts(guc);
> >   	i915_sched_engine_put(guc->sched_engine);
> >   }
> > @@ -1191,15 +1201,16 @@ static void guc_submit_request(struct i915_request *rq)
> >   static int new_guc_id(struct intel_guc *guc)
> >   {
> > -	return ida_simple_get(&guc->guc_ids, 0,
> > -			      guc->num_guc_ids, GFP_KERNEL |
> > +	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> > +			      guc->submission_state.num_guc_ids, GFP_KERNEL |
> >   			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> >   }
> >   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> >   	if (!context_guc_id_invalid(ce)) {
> > -		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
> > +		ida_simple_remove(&guc->submission_state.guc_ids,
> > +				  ce->guc_id.id);
> >   		reset_lrc_desc(guc, ce->guc_id.id);
> >   		set_context_guc_id_invalid(ce);
> >   	}
> > @@ -1211,9 +1222,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> >   	unsigned long flags;
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	__release_guc_id(guc, ce);
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> >   static int steal_guc_id(struct intel_guc *guc)
> > @@ -1221,10 +1232,10 @@ static int steal_guc_id(struct intel_guc *guc)
> >   	struct intel_context *ce;
> >   	int guc_id;
> > -	lockdep_assert_held(&guc->contexts_lock);
> > +	lockdep_assert_held(&guc->submission_state.lock);
> > -	if (!list_empty(&guc->guc_id_list)) {
> > -		ce = list_first_entry(&guc->guc_id_list,
> > +	if (!list_empty(&guc->submission_state.guc_id_list)) {
> > +		ce = list_first_entry(&guc->submission_state.guc_id_list,
> >   				      struct intel_context,
> >   				      guc_id.link);
> > @@ -1249,7 +1260,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
> >   {
> >   	int ret;
> > -	lockdep_assert_held(&guc->contexts_lock);
> > +	lockdep_assert_held(&guc->submission_state.lock);
> >   	ret = new_guc_id(guc);
> >   	if (unlikely(ret < 0)) {
> > @@ -1271,7 +1282,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> >   try_again:
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	might_lock(&ce->guc_state.lock);
> > @@ -1286,7 +1297,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	atomic_inc(&ce->guc_id.ref);
> >   out_unlock:
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   	/*
> >   	 * -EAGAIN indicates no guc_id are available, let's retire any
> > @@ -1322,11 +1333,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	if (unlikely(context_guc_id_invalid(ce)))
> >   		return;
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> >   	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
> >   	    !atomic_read(&ce->guc_id.ref))
> > -		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > +		list_add_tail(&ce->guc_id.link,
> > +			      &guc->submission_state.guc_id_list);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> >   static int __guc_action_register_context(struct intel_guc *guc,
> > @@ -1841,11 +1853,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
> >   static void guc_lrc_desc_unpin(struct intel_context *ce)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	unsigned long flags;
> > +	bool disabled;
> > +	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
> >   	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
> >   	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
> >   	GEM_BUG_ON(context_enabled(ce));
> > +	/* Seal race with Reset */
> > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +	disabled = submission_disabled(guc);
> > +	if (likely(!disabled)) {
> > +		__intel_gt_pm_get(gt);
> > +		set_context_destroyed(ce);
> > +		clr_context_registered(ce);
> > +	}
> > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +	if (unlikely(disabled)) {
> > +		release_guc_id(guc, ce);
> > +		__guc_context_destroy(ce);
> > +		return;
> > +	}
> > +
> >   	deregister_context(ce, ce->guc_id.id, true);
> >   }
> > @@ -1873,78 +1904,88 @@ static void __guc_context_destroy(struct intel_context *ce)
> >   	}
> >   }
> > +static void guc_flush_destroyed_contexts(struct intel_guc *guc)
> > +{
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +
> > +	GEM_BUG_ON(!submission_disabled(guc) &&
> > +		   guc_submission_initialized(guc));
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	list_for_each_entry_safe(ce, cn,
> > +				 &guc->submission_state.destroyed_contexts,
> > +				 destroyed_link) {
> > +		list_del_init(&ce->destroyed_link);
> > +		__release_guc_id(guc, ce);
> > +		__guc_context_destroy(ce);
> > +	}
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +}
> > +
> > +static void deregister_destroyed_contexts(struct intel_guc *guc)
> > +{
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	list_for_each_entry_safe(ce, cn,
> > +				 &guc->submission_state.destroyed_contexts,
> > +				 destroyed_link) {
> > +		list_del_init(&ce->destroyed_link);
> > +		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> 
> Lock drop and re-acquire is not safe - list_for_each_entry_safe only deals
> with the list_del_init above. It does not help if someone else modifies the
> list in the window while the lock is not held, in which case things will go
> BOOM!
> 

Good catch, but is this true? We only ever add via list_add_tail, and if
I'm reading the list_for_each_entry_safe macro correctly that is safe to
do, as 'n = list_next_entry(n, member)' will just evaluate to the new
tail on each iteration, right?

Reference to macro below:
#define list_for_each_entry_safe(pos, n, head, member)			\
	for (pos = list_first_entry(head, typeof(*pos), member),	\
		n = list_next_entry(pos, member);			\
	     &pos->member != (head); 					\
	     pos = n, n = list_next_entry(n, member))
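
Though thinking about it more, that only covers adds and our own
list_del_init - a concurrent delete in the unlocked window could still
free the cached next entry. Rough sketch of the hazard, assuming
guc_flush_destroyed_contexts can race with the worker here:

/*
 *  worker (lock dropped)             guc_flush_destroyed_contexts()
 *  -----------------------------     ------------------------------
 *  n = list_next_entry(pos, ...)
 *  spin_unlock_irqrestore(...)
 *                                    spin_lock_irqsave(...)
 *                                    list_del_init(&n->destroyed_link)
 *                                    __guc_context_destroy(n)  <- frees n
 *                                    spin_unlock_irqrestore(...)
 *  guc_lrc_desc_unpin(pos)
 *  spin_lock_irqsave(...)
 *  pos = n  <- use-after-free
 */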

Regardless, it is probably safer to just not drop the lock.

> Not sure in what different ways you use this list, but perhaps you could
> employ a lockless single linked list for this. In which case you could
> atomically unlink all elements. Or just change this to not do lock dropping,
> if you can safely use the same list_head to put on a local list or
> something.
> 

I think both options work - not sure why I drop the lock here. I'll play
around with this and fix in the next rev.
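
e.g. The local-list option could be as simple as something like this
(untested sketch, assuming guc_lrc_desc_unpin remains safe to call
without submission_state.lock held):

static void deregister_destroyed_contexts(struct intel_guc *guc)
{
	struct intel_context *ce, *cn;
	unsigned long flags;
	LIST_HEAD(destroyed);

	/* Atomically move all pending contexts onto a local list... */
	spin_lock_irqsave(&guc->submission_state.lock, flags);
	list_splice_init(&guc->submission_state.destroyed_contexts,
			 &destroyed);
	spin_unlock_irqrestore(&guc->submission_state.lock, flags);

	/* ...then issue the deregister H2Gs without the lock held. */
	list_for_each_entry_safe(ce, cn, &destroyed, destroyed_link) {
		list_del_init(&ce->destroyed_link);
		guc_lrc_desc_unpin(ce);
	}
}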

Matt

> Regards,
> 
> Tvrtko
> 
> > +		guc_lrc_desc_unpin(ce);
> > +		spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	}
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +}
> > +
> > +static void destroyed_worker_func(struct work_struct *w)
> > +{
> > +	struct intel_guc *guc = container_of(w, struct intel_guc,
> > +					     submission_state.destroyed_worker);
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	int tmp;
> > +
> > +	with_intel_gt_pm(gt, tmp)
> > +		deregister_destroyed_contexts(guc);
> > +}
> > +
> >   static void guc_context_destroy(struct kref *kref)
> >   {
> >   	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> > -	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > -	intel_wakeref_t wakeref;
> >   	unsigned long flags;
> > -	bool disabled;
> > +	bool destroy;
> >   	/*
> >   	 * If the guc_id is invalid this context has been stolen and we can free
> >   	 * it immediately. Also can be freed immediately if the context is not
> >   	 * registered with the GuC or the GuC is in the middle of a reset.
> >   	 */
> > -	if (context_guc_id_invalid(ce)) {
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	} else if (submission_disabled(guc) ||
> > -		   !lrc_desc_registered(guc, ce->guc_id.id)) {
> > -		release_guc_id(guc, ce);
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	}
> > -
> > -	/*
> > -	 * We have to acquire the context spinlock and check guc_id again, if it
> > -	 * is valid it hasn't been stolen and needs to be deregistered. We
> > -	 * delete this context from the list of unpinned guc_id available to
> > -	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
> > -	 * returns indicating this context has been deregistered the guc_id is
> > -	 * returned to the pool of available guc_id.
> > -	 */
> > -	spin_lock_irqsave(&guc->contexts_lock, flags);
> > -	if (context_guc_id_invalid(ce)) {
> > -		spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > -		__guc_context_destroy(ce);
> > -		return;
> > -	}
> > -
> > -	if (!list_empty(&ce->guc_id.link))
> > -		list_del_init(&ce->guc_id.link);
> > -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > -
> > -	/* Seal race with Reset */
> > -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > -	disabled = submission_disabled(guc);
> > -	if (likely(!disabled)) {
> > -		set_context_destroyed(ce);
> > -		clr_context_registered(ce);
> > +	spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
> > +		!lrc_desc_registered(guc, ce->guc_id.id);
> > +	if (likely(!destroy)) {
> > +		if (!list_empty(&ce->guc_id.link))
> > +			list_del_init(&ce->guc_id.link);
> > +		list_add_tail(&ce->destroyed_link,
> > +			      &guc->submission_state.destroyed_contexts);
> > +	} else {
> > +		__release_guc_id(guc, ce);
> >   	}
> > -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > -	if (unlikely(disabled)) {
> > -		release_guc_id(guc, ce);
> > +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +	if (unlikely(destroy)) {
> >   		__guc_context_destroy(ce);
> >   		return;
> >   	}
> >   	/*
> > -	 * We defer GuC context deregistration until the context is destroyed
> > -	 * in order to save on CTBs. With this optimization ideally we only need
> > -	 * 1 CTB to register the context during the first pin and 1 CTB to
> > -	 * deregister the context when the context is destroyed. Without this
> > -	 * optimization, a CTB would be needed every pin & unpin.
> > -	 *
> > -	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
> > -	 * from context_free_worker when runtime wakeref is not held.
> > -	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
> > -	 * in H2G CTB to deregister the context. A future patch may defer this
> > -	 * H2G CTB if the runtime wakeref is zero.
> > +	 * We use a worker to issue the H2G to deregister the context as we can
> > +	 * take the GT PM for the first time which isn't allowed from an atomic
> > +	 * context.
> >   	 */
> > -	with_intel_runtime_pm(runtime_pm, wakeref)
> > -		guc_lrc_desc_unpin(ce);
> > +	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
> >   }
> >   static int guc_context_alloc(struct intel_context *ce)
> > @@ -2703,8 +2744,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
> >   void intel_guc_submission_init_early(struct intel_guc *guc)
> >   {
> > -	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> > -	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> > +	guc->submission_state.max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> > +	guc->submission_state.num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
> >   	guc->submission_supported = __guc_submission_supported(guc);
> >   	guc->submission_selected = __guc_submission_selected(guc);
> >   }
> > @@ -2714,10 +2755,10 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> >   {
> >   	struct intel_context *ce;
> > -	if (unlikely(desc_idx >= guc->max_guc_ids)) {
> > +	if (unlikely(desc_idx >= guc->submission_state.max_guc_ids)) {
> >   		drm_err(&guc_to_gt(guc)->i915->drm,
> >   			"Invalid desc_idx %u, max %u",
> > -			desc_idx, guc->max_guc_ids);
> > +			desc_idx, guc->submission_state.max_guc_ids);
> >   		return NULL;
> >   	}
> > @@ -2771,6 +2812,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
> >   		intel_context_put(ce);
> >   	} else if (context_destroyed(ce)) {
> >   		/* Context has been destroyed */
> > +		intel_gt_pm_put_async(guc_to_gt(guc));
> >   		release_guc_id(guc, ce);
> >   		__guc_context_destroy(ce);
> >   	}
> > @@ -3065,8 +3107,10 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> >   	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
> >   		   atomic_read(&guc->outstanding_submission_g2h));
> > -	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
> > -	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
> > +	drm_printf(p, "GuC Number GuC IDs: %u\n",
> > +		   guc->submission_state.num_guc_ids);
> > +	drm_printf(p, "GuC Max GuC IDs: %u\n",
> > +		   guc->submission_state.max_guc_ids);
> >   	drm_printf(p, "GuC tasklet count: %u\n\n",
> >   		   atomic_read(&sched_engine->tasklet.count));
> > 


* Re: [Intel-gfx] [PATCH 05/27] drm/i915: Add GT PM unpark worker
  2021-09-13 10:33       ` Tvrtko Ursulin
@ 2021-09-13 17:20         ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-13 17:20 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 13, 2021 at 11:33:46AM +0100, Tvrtko Ursulin wrote:
> 
> On 10/09/2021 21:09, Matthew Brost wrote:
> > On Fri, Sep 10, 2021 at 09:36:17AM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 20/08/2021 23:44, Matthew Brost wrote:
> > > > Sometimes it is desirable to queue work up for later if the GT PM isn't
> > > > held and run that work on next GT PM unpark.
> > > 
> > > Sounds maybe plausible, but it depends how much work can happen on unpark
> > > and whether it can have too much of a negative impact on latency for
> > > interactive loads? Or from a reverse angle, why the work wouldn't be done on
> > 
> > All it does is add an interface to kick a work queue on unpark, i.e.
> > all the actual work is done async in the work queue so it shouldn't
> > add any latency.
> > 
> > > parking?
> > > 
> > > Also, what kind of mechanism do you have for dealing with too much stuff
> > > being put on this list? Can there be pressure which triggers (or would need to
> > 
> > No limits on pressure. See above, I don't think this is a concern.
> 
> On unpark it has the potential to send an unbounded number of actions for the
> GuC to process. These will be competing, for GuC internal processing power,
> with the user action which caused the unpark. That logically does feel like
> it can have an effect on initial latency. Why do you think it cannot?
> 

Ah, I see what you mean. Yes, a bunch of deregisters could be sent before
a submission, adding latency. Maybe I'll just drop this whole idea / patch
for now. Not going to respond to individual comments because this will
be dropped.

Matt

> Why wouldn't the work be done on parking?
> 
> With this scheme couldn't we end up with a situation where the worker keeps
> missing the GT unparked state and so keeps piling items on the
> deregistration list? Can you run out of ids like that (which is related to
> my question of how pressure is handled here)?
> 
> Unpark
> Register context
> Submit work
> Retire
> Schedule context deregister
> Park
> 
> Worker runs
> GT parked
> Work put on a list
> 
> Unpark
> Schedule deregistration worker
> Register new context
> Submit work
> Retire
> Schedule context deregister
> Park
> 
> Worker runs (lets say there was CPU pressure)
> GT already parked
>  -> deregistration queue now has two contexts on it
> 
> ... repeat until disaster ...
> 
> Unless I have misunderstood the logic.
> 
> > > trigger) these deregistrations to happen at runtime (no park/unpark
> > > transitions)?
> > > 
> > > > Implemented with a list in the GT of all pending work, workqueues in
> > > > the list, a callback to add a workqueue to the list, and finally a
> > > > wakeref post_get callback that iterates / drains the list + queues the
> > > > workqueues.
> > > > 
> > > > First user of this is deregistration of GuC contexts.
> > > 
> > > Does first imply there are more incoming?
> > > 
> > 
> > Haven't found another user yet but this is a generic mechanism so we can
> > add more in the future if other use cases arise.
> 
> My feeling is it would be best to leave it for later.
> 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/Makefile                 |  1 +
> > > >    drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
> > > >    drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  8 ++++
> > > >    .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c | 35 ++++++++++++++++
> > > >    .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h | 40 +++++++++++++++++++
> > > >    drivers/gpu/drm/i915/gt/intel_gt_types.h      | 10 +++++
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  8 ++--
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 +++++--
> > > >    drivers/gpu/drm/i915/intel_wakeref.c          |  5 +++
> > > >    drivers/gpu/drm/i915/intel_wakeref.h          |  1 +
> > > >    10 files changed, 119 insertions(+), 7 deletions(-)
> > > >    create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> > > >    create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > > > index 642a5b5a1b81..579bdc069f25 100644
> > > > --- a/drivers/gpu/drm/i915/Makefile
> > > > +++ b/drivers/gpu/drm/i915/Makefile
> > > > @@ -103,6 +103,7 @@ gt-y += \
> > > >    	gt/intel_gt_clock_utils.o \
> > > >    	gt/intel_gt_irq.o \
> > > >    	gt/intel_gt_pm.o \
> > > > +	gt/intel_gt_pm_unpark_work.o \
> > > >    	gt/intel_gt_pm_irq.o \
> > > >    	gt/intel_gt_requests.o \
> > > >    	gt/intel_gtt.o \
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> > > > index 62d40c986642..7e690e74baa2 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> > > > @@ -29,6 +29,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
> > > >    	spin_lock_init(&gt->irq_lock);
> > > > +	spin_lock_init(&gt->pm_unpark_work_lock);
> > > > +	INIT_LIST_HEAD(&gt->pm_unpark_work_list);
> > > > +
> > > >    	INIT_LIST_HEAD(&gt->closed_vma);
> > > >    	spin_lock_init(&gt->closed_lock);
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > > > index dea8e2479897..564c11a3748b 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > > > @@ -90,6 +90,13 @@ static int __gt_unpark(struct intel_wakeref *wf)
> > > >    	return 0;
> > > >    }
> > > > +static void __gt_unpark_work_queue(struct intel_wakeref *wf)
> > > > +{
> > > > +	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> > > > +
> > > > +	intel_gt_pm_unpark_work_queue(gt);
> > > > +}
> > > > +
> > > >    static int __gt_park(struct intel_wakeref *wf)
> > > >    {
> > > >    	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
> > > > @@ -118,6 +125,7 @@ static int __gt_park(struct intel_wakeref *wf)
> > > >    static const struct intel_wakeref_ops wf_ops = {
> > > >    	.get = __gt_unpark,
> > > > +	.post_get = __gt_unpark_work_queue,
> > > >    	.put = __gt_park,
> > > >    };
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> > > > new file mode 100644
> > > > index 000000000000..23162dbd0c35
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
> > > > @@ -0,0 +1,35 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2021 Intel Corporation
> > > > + */
> > > > +
> > > > +#include "i915_drv.h"
> > > > +#include "intel_runtime_pm.h"
> > > > +#include "intel_gt_pm.h"
> > > > +
> > > > +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt)
> > > > +{
> > > > +	struct intel_gt_pm_unpark_work *work, *next;
> > > > +	unsigned long flags;
> > > > +
> > > > +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> > > > +	list_for_each_entry_safe(work, next,
> > > > +				 &gt->pm_unpark_work_list, link) {
> > > > +		list_del_init(&work->link);
> > > > +		queue_work(system_unbound_wq, &work->worker);
> > > > +	}
> > > > +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> > > > +}
> > > > +
> > > > +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> > > > +				 struct intel_gt_pm_unpark_work *work)
> > > > +{
> > > > +	unsigned long flags;
> > > > +
> > > > +	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
> > > > +	if (intel_gt_pm_is_awake(gt))
> > > > +		queue_work(system_unbound_wq, &work->worker);
> > > > +	else if (list_empty(&work->link))
> > > 
> > > What's the list_empty check for, something can race by design?
> > > 
> > 
> > This function is allowed to be called twice, e.g. two contexts can be
> > queued for deregistration while the GT PM is idle, but only the first
> > results in the worker being added to the list of workers to be kicked on
> > unpark.
> 
> Hmm okay.
> 
> > 
> > > > +		list_add_tail(&work->link, &gt->pm_unpark_work_list);
> > > > +	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
> > > > +}
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> > > > new file mode 100644
> > > > index 000000000000..eaf1dc313aa2
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
> > > > @@ -0,0 +1,40 @@
> > > > +/* SPDX-License-Identifier: MIT */
> > > > +/*
> > > > + * Copyright © 2021 Intel Corporation
> > > > + */
> > > > +
> > > > +#ifndef INTEL_GT_PM_UNPARK_WORK_H
> > > > +#define INTEL_GT_PM_UNPARK_WORK_H
> > > > +
> > > > +#include <linux/list.h>
> > > > +#include <linux/workqueue.h>
> > > > +
> > > > +struct intel_gt;
> > > > +
> > > > +/**
> > > > + * struct intel_gt_pm_unpark_work - work to be scheduled when GT unparked
> > > > + */
> > > > +struct intel_gt_pm_unpark_work {
> > > > +	/**
> > > > +	 * @link: link into gt->pm_unpark_work_list of workers that need to be
> > > > +	 * scheduled when GT is unpark, protected by gt->pm_unpark_work_lock
> > > > +	 */
> > > > +	struct list_head link;
> > > > +	/** @worker: will be scheduled when GT unparked */
> > > > +	struct work_struct worker;
> > > > +};
> > > > +
> > > > +void intel_gt_pm_unpark_work_queue(struct intel_gt *gt);
> > > > +
> > > > +void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
> > > > +				 struct intel_gt_pm_unpark_work *work);
> > > > +
> > > > +static inline void
> > > > +intel_gt_pm_unpark_work_init(struct intel_gt_pm_unpark_work *work,
> > > > +			     work_func_t fn)
> > > > +{
> > > > +	INIT_LIST_HEAD(&work->link);
> > > > +	INIT_WORK(&work->worker, fn);
> > > > +}
> > > > +
> > > > +#endif /* INTEL_GT_PM_UNPARK_WORK_H */
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> > > > index a81e21bf1bd1..4480312f0add 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> > > > @@ -96,6 +96,16 @@ struct intel_gt {
> > > >    	struct intel_wakeref wakeref;
> > > >    	atomic_t user_wakeref;
> > > > +	/**
> > > > +	 * @pm_unpark_work_list: list of delayed work to scheduled which GT is
> > > > +	 * unparked, protected by pm_unpark_work_lock
> > > > +	 */
> > > > +	struct list_head pm_unpark_work_list;
> > > > +	/**
> > > > +	 * @pm_unpark_work_lock: protects pm_unpark_work_list
> > > > +	 */
> > > > +	spinlock_t pm_unpark_work_lock;
> > > > +
> > > >    	struct list_head closed_vma;
> > > >    	spinlock_t closed_lock; /* guards the list of closed_vma */
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > index 7358883f1540..023953e77553 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > @@ -19,6 +19,7 @@
> > > >    #include "intel_uc_fw.h"
> > > >    #include "i915_utils.h"
> > > >    #include "i915_vma.h"
> > > > +#include "gt/intel_gt_pm_unpark_work.h"
> > > >    struct __guc_ads_blob;
> > > > @@ -78,11 +79,12 @@ struct intel_guc {
> > > >    		 */
> > > >    		struct list_head destroyed_contexts;
> > > >    		/**
> > > > -		 * @destroyed_worker: worker to deregister contexts, need as we
> > > > +		 * @destroyed_worker: Worker to deregister contexts, need as we
> > > >    		 * need to take a GT PM reference and can't from destroy
> > > > -		 * function as it might be in an atomic context (no sleeping)
> > > > +		 * function as it might be in an atomic context (no sleeping).
> > > > +		 * Worker only issues deregister when GT is unparked.
> > > >    		 */
> > > > -		struct work_struct destroyed_worker;
> > > > +		struct intel_gt_pm_unpark_work destroyed_worker;
> > > >    	} submission_state;
> > > >    	bool submission_supported;
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index f835e06e5f9f..dbf919801de2 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -1135,7 +1135,8 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > > >    	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
> > > >    	ida_init(&guc->submission_state.guc_ids);
> > > >    	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> > > > -	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
> > > > +	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
> > > > +				     destroyed_worker_func);
> > > >    	return 0;
> > > >    }
> > > > @@ -1942,13 +1943,18 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
> > > >    static void destroyed_worker_func(struct work_struct *w)
> > > >    {
> > > > -	struct intel_guc *guc = container_of(w, struct intel_guc,
> > > > +	struct intel_gt_pm_unpark_work *destroyed_worker =
> > > > +		container_of(w, struct intel_gt_pm_unpark_work, worker);
> > > > +	struct intel_guc *guc = container_of(destroyed_worker, struct intel_guc,
> > > >    					     submission_state.destroyed_worker);
> > > >    	struct intel_gt *gt = guc_to_gt(guc);
> > > >    	int tmp;
> > > > -	with_intel_gt_pm(gt, tmp)
> > > > +	with_intel_gt_pm_if_awake(gt, tmp)
> > > >    		deregister_destroyed_contexts(guc);
> > > > +
> > > > +	if (!list_empty(&guc->submission_state.destroyed_contexts))
> > > > +		intel_gt_pm_unpark_work_add(gt, destroyed_worker);
> > > 
> > > This is the worker itself, right?
> > > 
> > 
> > Yes.
> > 
> > > There's an "if awake" here followed by another "if awake" inside
> > > intel_gt_pm_unpark_work_add which raises questions.
> > > 
> > 
> > Even in the worker we only deregister the context if the GT is awake.
> 
> Yeah, but the two "if awake" checks following one another are the confusing
> part since they are not atomic. Coupled with the list empty check, I mean. So
> two external async conditions can influence the flow, one of which is even
> checked twice.
> 
> The worker can decide not to deregister, but by the time
> intel_gt_pm_unpark_work_add runs the GT may have become unparked so the
> worker gets queued. When it runs the GT is parked again, and so it can
> bounce in a loop forever.
> 
> > > Second question is what's the list_empty for - why is the state of the list
> > > itself relevant to a single worker deciding whether to re-add itself to it
> > > or not? And is there a lock protecting this list?
> > > 
> > 
> > Yes, we have locking - pm_unpark_work_lock. Basically it allows 2
> > threads to call intel_gt_pm_unpark_work_add while the GT is idle - in
> > that case only the first call adds the worker to the list of workers to
> > kick when unparked (same explanation as above).
> > > Overall it feels questionable to have unpark work which apparently
> > > can race with subsequent parking. Presumably you cannot have it run sync on
> > > unpark due to execution context issues?
> > > 
> > 
> > No race, just allowed to be called twice.
> 
> But the list_empty check is not under any locks so I don't follow.
> 
> > 
> > > >    }
> > > >    static void guc_context_destroy(struct kref *kref)
> > > > @@ -1985,7 +1991,8 @@ static void guc_context_destroy(struct kref *kref)
> > > >    	 * take the GT PM for the first time which isn't allowed from an atomic
> > > >    	 * context.
> > > >    	 */
> > > > -	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
> > > > +	intel_gt_pm_unpark_work_add(guc_to_gt(guc),
> > > > +				    &guc->submission_state.destroyed_worker);
> > > >    }
> > > >    static int guc_context_alloc(struct intel_context *ce)
> > > > diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
> > > > index dfd87d082218..282fc4f312e3 100644
> > > > --- a/drivers/gpu/drm/i915/intel_wakeref.c
> > > > +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> > > > @@ -24,6 +24,8 @@ static void rpm_put(struct intel_wakeref *wf)
> > > >    int __intel_wakeref_get_first(struct intel_wakeref *wf)
> > > >    {
> > > > +	bool do_post = false;
> > > > +
> > > >    	/*
> > > >    	 * Treat get/put as different subclasses, as we may need to run
> > > >    	 * the put callback from under the shrinker and do not want to
> > > > @@ -44,8 +46,11 @@ int __intel_wakeref_get_first(struct intel_wakeref *wf)
> > > >    		}
> > > >    		smp_mb__before_atomic(); /* release wf->count */
> > > > +		do_post = true;
> > > >    	}
> > > >    	atomic_inc(&wf->count);
> > > > +	if (do_post && wf->ops->post_get)
> > > > +		wf->ops->post_get(wf);
> > > 
> > > You want this hook under the wf->mutex and why?
> > > 
> > 
> > I didn't really think about this but everything else is under the mutex
> > so I included this under the mutex too. In this case the post_get op
> > could likely be outside the mutex but I think it is harmless to keep it
> > under the mutex. For future safety, I think it should stay under the
> > mutex.
> 
> If it is under the mutex then what is the point of two hooks? The only
> difference is get sees wf->count == 0, while post_get sees it as one.
> 
> No wait.. you have put post_get outside the 0->1 transition check so it is
> potentially called more than once. You probably do not want this. But the
> main question is: could you just do this from the existing hook?
> 
> Regards,
> 
> Tvrtko
> 
> > 
> > Matt
> > 
> > > Regards,
> > > 
> > > Tvrtko
> > > 
> > > >    	mutex_unlock(&wf->mutex);
> > > >    	INTEL_WAKEREF_BUG_ON(atomic_read(&wf->count) <= 0);
> > > > diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> > > > index 545c8f277c46..ef7e6a698e8a 100644
> > > > --- a/drivers/gpu/drm/i915/intel_wakeref.h
> > > > +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> > > > @@ -30,6 +30,7 @@ typedef depot_stack_handle_t intel_wakeref_t;
> > > >    struct intel_wakeref_ops {
> > > >    	int (*get)(struct intel_wakeref *wf);
> > > > +	void (*post_get)(struct intel_wakeref *wf);
> > > >    	int (*put)(struct intel_wakeref *wf);
> > > >    };
> > > > 


* Re: [Intel-gfx] [PATCH 06/27] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-09-10  0:41     ` Matthew Brost
@ 2021-09-13 22:26       ` John Harrison
  2021-09-14  1:12         ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-13 22:26 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On 9/9/2021 17:41, Matthew Brost wrote:
> On Thu, Sep 09, 2021 at 03:46:43PM -0700, John Harrison wrote:
>> On 8/20/2021 15:44, Matthew Brost wrote:
>>> Taking a PM reference to prevent intel_gt_wait_for_idle from short
>>> circuiting while a scheduling of user context could be enabled.
>> As with earlier PM patch, needs more explanation of what the problem is and
>> why it is only now a problem.
>>
>>
> Same explanation, will add it here.
>
>>> v2:
>>>    (Daniel Vetter)
>>>     - Add might_lock annotations to pin / unpin function
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_context.c       |  3 ++
>>>    drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 15 ++++++++
>>>    drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
>>>    drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
>>>    5 files changed, 73 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
>>> index c8595da64ad8..508cfe5770c0 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>>> @@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
>>>    	if (err)
>>>    		goto err_post_unpin;
>>> +	intel_engine_pm_might_get(ce->engine);
>>> +
>>>    	if (unlikely(intel_context_is_closed(ce))) {
>>>    		err = -ENOENT;
>>>    		goto err_unlock;
>>> @@ -313,6 +315,7 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub)
>>>    		return;
>>>    	CE_TRACE(ce, "unpin\n");
>>> +	intel_engine_pm_might_put(ce->engine);
>>>    	ce->ops->unpin(ce);
>>>    	ce->ops->post_unpin(ce);
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
>>> index 17a5028ea177..3fe2ae1bcc26 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
>>> @@ -9,6 +9,7 @@
>>>    #include "i915_request.h"
>>>    #include "intel_engine_types.h"
>>>    #include "intel_wakeref.h"
>>> +#include "intel_gt_pm.h"
>>>    static inline bool
>>>    intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
>>> @@ -31,6 +32,13 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
>>>    	return intel_wakeref_get_if_active(&engine->wakeref);
>>>    }
>>> +static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
>>> +{
>>> +	if (!intel_engine_is_virtual(engine))
>>> +		intel_wakeref_might_get(&engine->wakeref);
>> Why doesn't this need to iterate through the physical engines of the virtual
>> engine?
>>
> Yea, technically it should. This is just an annotation though, to check
> if we do something horribly wrong in our code. If we use any physical
> engine in our stack this annotation should pop and we can fix it. I just
> don't see what making this 100% correct for virtual engines buys us. If
> you want I can fix this, but I'm thinking the more complex we make this
> annotation the less likely it is to just get compiled out with lockdep
> off, which is what we are aiming for.
But if the annotation is missing a fundamental lock then it is surely 
not actually going to do any good? Not sure if you need to iterate over 
all child engines + parent but it seems like you should be calling 
might_lock() on at least one engine's mutex to feed the lockdep annotation.
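
Something along these lines, maybe (untested sketch, assuming the virtual
engine's mask covers all of its physical siblings):

static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
{
	if (!intel_engine_is_virtual(engine)) {
		intel_wakeref_might_get(&engine->wakeref);
	} else {
		struct intel_gt *gt = engine->gt;
		struct intel_engine_cs *tengine;
		intel_engine_mask_t tmp, mask = engine->mask;

		/* Annotate each physical engine backing the virtual one */
		for_each_engine_masked(tengine, gt, mask, tmp)
			intel_wakeref_might_get(&tengine->wakeref);
	}
	intel_gt_pm_might_get(engine->gt);
}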

John.

> Matt
>
>> John.
>>
>>> +	intel_gt_pm_might_get(engine->gt);
>>> +}
>>> +
>>>    static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
>>>    {
>>>    	intel_wakeref_put(&engine->wakeref);
>>> @@ -52,6 +60,13 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
>>>    	intel_wakeref_unlock_wait(&engine->wakeref);
>>>    }
>>> +static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
>>> +{
>>> +	if (!intel_engine_is_virtual(engine))
>>> +		intel_wakeref_might_put(&engine->wakeref);
>>> +	intel_gt_pm_might_put(engine->gt);
>>> +}
>>> +
>>>    static inline struct i915_request *
>>>    intel_engine_create_kernel_request(struct intel_engine_cs *engine)
>>>    {
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
>>> index a17bf0d4592b..3c173033ce23 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
>>> @@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
>>>    	return intel_wakeref_get_if_active(&gt->wakeref);
>>>    }
>>> +static inline void intel_gt_pm_might_get(struct intel_gt *gt)
>>> +{
>>> +	intel_wakeref_might_get(&gt->wakeref);
>>> +}
>>> +
>>>    static inline void intel_gt_pm_put(struct intel_gt *gt)
>>>    {
>>>    	intel_wakeref_put(&gt->wakeref);
>>> @@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
>>>    	intel_wakeref_put_async(&gt->wakeref);
>>>    }
>>> +static inline void intel_gt_pm_might_put(struct intel_gt *gt)
>>> +{
>>> +	intel_wakeref_might_put(&gt->wakeref);
>>> +}
>>> +
>>>    #define with_intel_gt_pm(gt, tmp) \
>>>    	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
>>>    	     intel_gt_pm_put(gt), tmp = 0)
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index dbf919801de2..e0eed70f9b92 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -1550,7 +1550,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
>>>    static int guc_context_pin(struct intel_context *ce, void *vaddr)
>>>    {
>>> -	return __guc_context_pin(ce, ce->engine, vaddr);
>>> +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
>>> +
>>> +	if (likely(!ret && !intel_context_is_barrier(ce)))
>>> +		intel_engine_pm_get(ce->engine);
>>> +
>>> +	return ret;
>>>    }
>>>    static void guc_context_unpin(struct intel_context *ce)
>>> @@ -1559,6 +1564,9 @@ static void guc_context_unpin(struct intel_context *ce)
>>>    	unpin_guc_id(guc, ce);
>>>    	lrc_unpin(ce);
>>> +
>>> +	if (likely(!intel_context_is_barrier(ce)))
>>> +		intel_engine_pm_put_async(ce->engine);
>>>    }
>>>    static void guc_context_post_unpin(struct intel_context *ce)
>>> @@ -2328,8 +2336,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
>>>    static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
>>>    {
>>>    	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
>>> +	int ret = __guc_context_pin(ce, engine, vaddr);
>>> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
>>> +
>>> +	if (likely(!ret))
>>> +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
>>> +			intel_engine_pm_get(engine);
>>> -	return __guc_context_pin(ce, engine, vaddr);
>>> +	return ret;
>>> +}
>>> +
>>> +static void guc_virtual_context_unpin(struct intel_context *ce)
>>> +{
>>> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
>>> +	struct intel_engine_cs *engine;
>>> +	struct intel_guc *guc = ce_to_guc(ce);
>>> +
>>> +	GEM_BUG_ON(context_enabled(ce));
>>> +	GEM_BUG_ON(intel_context_is_barrier(ce));
>>> +
>>> +	unpin_guc_id(guc, ce);
>>> +	lrc_unpin(ce);
>>> +
>>> +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
>>> +		intel_engine_pm_put_async(engine);
>>>    }
>>>    static void guc_virtual_context_enter(struct intel_context *ce)
>>> @@ -2366,7 +2396,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
>>>    	.pre_pin = guc_virtual_context_pre_pin,
>>>    	.pin = guc_virtual_context_pin,
>>> -	.unpin = guc_context_unpin,
>>> +	.unpin = guc_virtual_context_unpin,
>>>    	.post_unpin = guc_context_post_unpin,
>>>    	.ban = guc_context_ban,
>>> diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
>>> index ef7e6a698e8a..dd530ae028e0 100644
>>> --- a/drivers/gpu/drm/i915/intel_wakeref.h
>>> +++ b/drivers/gpu/drm/i915/intel_wakeref.h
>>> @@ -124,6 +124,12 @@ enum {
>>>    	__INTEL_WAKEREF_PUT_LAST_BIT__
>>>    };
>>> +static inline void
>>> +intel_wakeref_might_get(struct intel_wakeref *wf)
>>> +{
>>> +	might_lock(&wf->mutex);
>>> +}
>>> +
>>>    /**
>>>     * intel_wakeref_put_flags: Release the wakeref
>>>     * @wf: the wakeref
>>> @@ -171,6 +177,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
>>>    			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
>>>    }
>>> +static inline void
>>> +intel_wakeref_might_put(struct intel_wakeref *wf)
>>> +{
>>> +	might_lock(&wf->mutex);
>>> +}
>>> +
>>>    /**
>>>     * intel_wakeref_lock: Lock the wakeref (mutex)
>>>     * @wf: the wakeref



* Re: [Intel-gfx] [PATCH 07/27] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-09-13 16:54     ` Matthew Brost
@ 2021-09-13 22:38       ` John Harrison
  2021-09-14  5:02         ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-13 22:38 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu


On 9/13/2021 09:54, Matthew Brost wrote:
> On Thu, Sep 09, 2021 at 03:51:27PM -0700, John Harrison wrote:
>> On 8/20/2021 15:44, Matthew Brost wrote:
>>> Calling switch_to_kernel_context isn't needed if the engine PM reference
>>> is taken while all contexts are pinned. By not calling
>>> switch_to_kernel_context we save on issuing a request to the engine.
>> I thought the intention of the switch_to_kernel_context was to ensure that
>> the GPU is not touching any user context and is basically idle. That is not
>> a valid assumption with an external scheduler such as GuC. So why is the
>> description above only mentioning PM references? What is the connection
>> between the PM ref and the switch_to_kernel_context?
>>
>> Also, the comment in the code does not mention anything about PM references,
>> it just says 'not necessary with GuC' but no explanation at all.
>>
> Yea, this needs to be explained better. How about this?
>
> Calling switch_to_kernel_context isn't needed if the engine PM reference
> is taken while all user contexts have scheduling enabled. Once scheduling
> is disabled on all user contexts the GuC is guaranteed to not touch any
> user context state, which is effectively the same as pointing to a kernel
> context.
>
> Matt
I'm still not seeing how the PM reference is involved?

Also, IMHO the focus is wrong in the above text. The fundamental 
requirement is to ensure the hardware is idle. Execlist achieves this 
by switching to a safe context. GuC achieves it by disabling scheduling. 
Indeed, switching to a 'safe' context really has no effect with GuC 
submission. So 'effectively the same as pointing to a kernel context' is 
an incorrect description. I would go with something like:

    "This is execlist specific behaviour intended to ensure the GPU is
    idle by switching to a known 'safe' context. With GuC submission,
    the same idle guarantee is achieved by other means (disabling
    scheduling). Further, switching to a 'safe' context has no effect
    with GuC submission as the scheduler can just switch back again.
    FIXME: Move this backend scheduler specific behaviour into the
    scheduler backend."


John.


>
>>> v2:
>>>    (Daniel Vetter)
>>>     - Add FIXME comment about pushing switch_to_kernel_context to backend
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_engine_pm.c | 9 +++++++++
>>>    1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>> index 1f07ac4e0672..11fee66daf60 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>> @@ -162,6 +162,15 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>>>    	unsigned long flags;
>>>    	bool result = true;
>>> +	/*
>>> +	 * No need to switch_to_kernel_context if GuC submission
>>> +	 *
>>> +	 * FIXME: This execlists specific backend behavior in generic code, this
>> "This execlists" -> "This is execlist"
>>
>> "this should be" -> "it should be"
>>
>> John.
>>
>>> +	 * should be pushed to the backend.
>>> +	 */
>>> +	if (intel_engine_uses_guc(engine))
>>> +		return true;
>>> +
>>>    	/* GPU is pointing to the void, as good as in the kernel context. */
>>>    	if (intel_gt_is_wedged(engine->gt))
>>>    		return true;



* Re: [Intel-gfx] [PATCH 09/27] drm/i915: Expose logical engine instance to user
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
@ 2021-09-13 23:06   ` John Harrison
  2021-09-14  1:08     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-13 23:06 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Expose logical engine instance to user via query engine info IOCTL. This
> is required for split-frame workloads as these need to be placed on
> engines in a logically contiguous order. The logical mapping can change
> based on fusing. Rather than having the user have knowledge of the fusing
> we simply expose the logical mapping with the existing query engine
> info IOCTL.
>
> IGT: https://patchwork.freedesktop.org/patch/445637/?series=92854&rev=1
> media UMD: link coming soon
>
> v2:
>   (Daniel Vetter)
>    - Add IGT link, placeholder for media UMD
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_query.c | 2 ++
>   include/uapi/drm/i915_drm.h       | 8 +++++++-
>   2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
> index e49da36c62fb..8a72923fbdba 100644
> --- a/drivers/gpu/drm/i915/i915_query.c
> +++ b/drivers/gpu/drm/i915/i915_query.c
> @@ -124,7 +124,9 @@ query_engine_info(struct drm_i915_private *i915,
>   	for_each_uabi_engine(engine, i915) {
>   		info.engine.engine_class = engine->uabi_class;
>   		info.engine.engine_instance = engine->uabi_instance;
> +		info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;
>   		info.capabilities = engine->uabi_capabilities;
> +		info.logical_instance = ilog2(engine->logical_mask);
>   
>   		if (copy_to_user(info_ptr, &info, sizeof(info)))
>   			return -EFAULT;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index bde5860b3686..b1248a67b4f8 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -2726,14 +2726,20 @@ struct drm_i915_engine_info {
>   
>   	/** @flags: Engine flags. */
>   	__u64 flags;
> +#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
>   
>   	/** @capabilities: Capabilities of this engine. */
>   	__u64 capabilities;
>   #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
>   #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
>   
> +	/** @logical_instance: Logical instance of engine */
> +	__u16 logical_instance;
> +
>   	/** @rsvd1: Reserved fields. */
> -	__u64 rsvd1[4];
> +	__u16 rsvd1[3];
> +	/** @rsvd2: Reserved fields. */
> +	__u64 rsvd2[3];
>   };
>   
>   /**
Any idea why the padding? Would be useful if the comment said 'this 
structure must be at least/exactly X bytes in size / a multiple of X 
bytes in size because ...' or whatever.

However, not really anything to do with this patch as such, so either way:
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 10/27] drm/i915/guc: Introduce context parent-child relationship
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-13 23:19   ` John Harrison
  2021-09-14  1:18     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-13 23:19 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Introduce context parent-child relationship. Once this relationship is
> created all pinning / unpinning operations are directed to the parent
> context. The parent context is responsible for pinning all of its
> children and itself.
>
> This is a precursor to the full GuC multi-lrc implementation but aligns
> to how the GuC multi-lrc interface is defined - a single H2G is used to
> register / deregister all of the contexts simultaneously.
>
> Subsequent patches in the series will implement the pinning / unpinning
> operations for parent / child contexts.
>
> v2:
>   (Daniel Vetter)
>    - Add kernel doc, add wrapper to access parent to ensure safety
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       | 29 ++++++++++++++
>   drivers/gpu/drm/i915/gt/intel_context.h       | 39 +++++++++++++++++++
>   drivers/gpu/drm/i915/gt/intel_context_types.h | 23 +++++++++++
>   3 files changed, 91 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 508cfe5770c0..00d1aee6d199 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -404,6 +404,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   
>   	INIT_LIST_HEAD(&ce->destroyed_link);
>   
No need for this blank line?

> +	INIT_LIST_HEAD(&ce->guc_child_list);
> +
>   	/*
>   	 * Initialize fence to be complete as this is expected to be complete
>   	 * unless there is a pending schedule disable outstanding.
> @@ -418,10 +420,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>   
>   void intel_context_fini(struct intel_context *ce)
>   {
> +	struct intel_context *child, *next;
> +
>   	if (ce->timeline)
>   		intel_timeline_put(ce->timeline);
>   	i915_vm_put(ce->vm);
>   
> +	/* Need to put the creation ref for the children */
> +	if (intel_context_is_parent(ce))
> +		for_each_child_safe(ce, child, next)
> +			intel_context_put(child);
> +
>   	mutex_destroy(&ce->pin_mutex);
>   	i915_active_fini(&ce->active);
>   }
> @@ -537,6 +546,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>   	return active;
>   }
>   
> +void intel_context_bind_parent_child(struct intel_context *parent,
> +				     struct intel_context *child)
> +{
> +	/*
> +	 * Caller's responsibility to validate that this function is used
> +	 * correctly but we use GEM_BUG_ON here to ensure that they do.
> +	 */
> +	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
> +	GEM_BUG_ON(intel_context_is_pinned(parent));
> +	GEM_BUG_ON(intel_context_is_child(parent));
> +	GEM_BUG_ON(intel_context_is_pinned(child));
> +	GEM_BUG_ON(intel_context_is_child(child));
> +	GEM_BUG_ON(intel_context_is_parent(child));
> +
> +	parent->guc_number_children++;
> +	list_add_tail(&child->guc_child_link,
> +		      &parent->guc_child_list);
> +	child->parent = parent;
> +}
> +
>   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>   #include "selftest_context.c"
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index c41098950746..c2985822ab74 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -44,6 +44,45 @@ void intel_context_free(struct intel_context *ce);
>   int intel_context_reconfigure_sseu(struct intel_context *ce,
>   				   const struct intel_sseu sseu);
>   
> +static inline bool intel_context_is_child(struct intel_context *ce)
> +{
> +	return !!ce->parent;
> +}
> +
> +static inline bool intel_context_is_parent(struct intel_context *ce)
> +{
> +	return !!ce->guc_number_children;
> +}
> +
> +static inline bool intel_context_is_pinned(struct intel_context *ce);
No point declaring 'static inline' if there is no function body?

> +
> +static inline struct intel_context *
> +intel_context_to_parent(struct intel_context *ce)
> +{
> +        if (intel_context_is_child(ce)) {
> +		/*
> +		 * The parent holds ref count to the child so it is always safe
> +		 * for the parent to access the child, but the child has pointer
has pointer -> has a pointer

> +		 * to the parent without a ref. To ensure this is safe the child
> +		 * should only access the parent pointer while the parent is
> +		 * pinned.
> +		 */
> +                GEM_BUG_ON(!intel_context_is_pinned(ce->parent));
> +
> +                return ce->parent;
> +        } else {
> +                return ce;
> +        }
> +}
> +
> +void intel_context_bind_parent_child(struct intel_context *parent,
> +				     struct intel_context *child);
> +
> +#define for_each_child(parent, ce)\
> +	list_for_each_entry(ce, &(parent)->guc_child_list, guc_child_link)
> +#define for_each_child_safe(parent, ce, cn)\
> +	list_for_each_entry_safe(ce, cn, &(parent)->guc_child_list, guc_child_link)
Do these macros not need some kind of intel_context prefix? Or at least 
be 'for_each_guc_child' given the naming of the list/link fields? But 
maybe not if the guc_ is dropped from the variable names - see below.

> +
>   /**
>    * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
>    * @ce - the context
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index fd338a30617e..0fafc178cf2c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -213,6 +213,29 @@ struct intel_context {
>   	 */
>   	struct list_head destroyed_link;
>   
> +	/** anonymous struct for parent / children only members */
> +	struct {
> +		union {
> +			/**
> +			 * @guc_child_list: parent's list of children
> +			 * contexts, no protection as immutable after context
> +			 * creation
> +			 */
> +			struct list_head guc_child_list;
> +			/**
> +			 * @guc_child_link: child's link into parent's list of
> +			 * children
> +			 */
> +			struct list_head guc_child_link;
> +		};
> +
> +		/** @parent: pointer to parent if child */
> +		struct intel_context *parent;
> +
> +		/** @guc_number_children: number of children if parent */
> +		u8 guc_number_children;
These are not really GuC specific fields? The parent/child thing might 
only be necessary for GuC submission (although can you say it won't be 
required by any future backend, such as the DRM scheduler?) but it is a 
context level concept. None of the files changed in this patch are GuC 
specific. So no need for 'guc_' prefix? Alternatively, if it all really 
is completely GuC specific then the 'parent' field should also have the 
prefix? Or even just name the outer struct 'guc_family' or something and 
drop the prefixes from all the inner members.

John.

> +	};
> +
>   #ifdef CONFIG_DRM_I915_SELFTEST
>   	/**
>   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 09/27] drm/i915: Expose logical engine instance to user
  2021-09-13 23:06   ` John Harrison
@ 2021-09-14  1:08     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-14  1:08 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 13, 2021 at 04:06:33PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Expose logical engine instance to user via query engine info IOCTL. This
> > is required for split-frame workloads as these need to be placed on
> > engines in a logically contiguous order. The logical mapping can change
> > based on fusing. Rather than requiring the user to have knowledge of the
> > fusing, we simply expose the logical mapping with the existing query engine
> > info IOCTL.
> > 
> > IGT: https://patchwork.freedesktop.org/patch/445637/?series=92854&rev=1
> > media UMD: link coming soon
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add IGT link, placeholder for media UMD
> > 
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_query.c | 2 ++
> >   include/uapi/drm/i915_drm.h       | 8 +++++++-
> >   2 files changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
> > index e49da36c62fb..8a72923fbdba 100644
> > --- a/drivers/gpu/drm/i915/i915_query.c
> > +++ b/drivers/gpu/drm/i915/i915_query.c
> > @@ -124,7 +124,9 @@ query_engine_info(struct drm_i915_private *i915,
> >   	for_each_uabi_engine(engine, i915) {
> >   		info.engine.engine_class = engine->uabi_class;
> >   		info.engine.engine_instance = engine->uabi_instance;
> > +		info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;
> >   		info.capabilities = engine->uabi_capabilities;
> > +		info.logical_instance = ilog2(engine->logical_mask);
> >   		if (copy_to_user(info_ptr, &info, sizeof(info)))
> >   			return -EFAULT;
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index bde5860b3686..b1248a67b4f8 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -2726,14 +2726,20 @@ struct drm_i915_engine_info {
> >   	/** @flags: Engine flags. */
> >   	__u64 flags;
> > +#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
> >   	/** @capabilities: Capabilities of this engine. */
> >   	__u64 capabilities;
> >   #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
> >   #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
> > +	/** @logical_instance: Logical instance of engine */
> > +	__u16 logical_instance;
> > +
> >   	/** @rsvd1: Reserved fields. */
> > -	__u64 rsvd1[4];
> > +	__u16 rsvd1[3];
> > +	/** @rsvd2: Reserved fields. */
> > +	__u64 rsvd2[3];
> >   };
> >   /**
> Any idea why the padding? Would be useful if the comment said 'this
> structure must be at least/exactly X bytes in size / a multiple of X bytes
> in size because ...' or whatever.
> 

I think this is pretty standard for a UABI interface - add a bunch of
padding bits in case you need them in the future (i.e. this patch is an
example of that).
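
For what it's worth, the numbers here work out so the struct size is
unchanged: the new __u16 logical_instance plus __u16 rsvd1[3] plus
__u64 rsvd2[3] is 2 + 6 + 24 = 32 bytes, exactly the size of the
__u64 rsvd1[4] it replaces.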

Matt 

> However, not really anything to do with this patch as such, so either way:
> Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 06/27] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-09-13 22:26       ` John Harrison
@ 2021-09-14  1:12         ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-14  1:12 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 13, 2021 at 03:26:29PM -0700, John Harrison wrote:
> On 9/9/2021 17:41, Matthew Brost wrote:
> > On Thu, Sep 09, 2021 at 03:46:43PM -0700, John Harrison wrote:
> > > On 8/20/2021 15:44, Matthew Brost wrote:
> > > > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > > > circuiting while scheduling of a user context could be enabled.
> > > As with earlier PM patch, needs more explanation of what the problem is and
> > > why it is only now a problem.
> > > 
> > > 
> > Same explanation, will add here.
> > 
> > > > v2:
> > > >    (Daniel Vetter)
> > > >     - Add might_lock annotations to pin / unpin function
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/intel_context.c       |  3 ++
> > > >    drivers/gpu/drm/i915/gt/intel_engine_pm.h     | 15 ++++++++
> > > >    drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 10 ++++++
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
> > > >    drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++++
> > > >    5 files changed, 73 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > index c8595da64ad8..508cfe5770c0 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > @@ -240,6 +240,8 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >    	if (err)
> > > >    		goto err_post_unpin;
> > > > +	intel_engine_pm_might_get(ce->engine);
> > > > +
> > > >    	if (unlikely(intel_context_is_closed(ce))) {
> > > >    		err = -ENOENT;
> > > >    		goto err_unlock;
> > > > @@ -313,6 +315,7 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub)
> > > >    		return;
> > > >    	CE_TRACE(ce, "unpin\n");
> > > > +	intel_engine_pm_might_put(ce->engine);
> > > >    	ce->ops->unpin(ce);
> > > >    	ce->ops->post_unpin(ce);
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > > > index 17a5028ea177..3fe2ae1bcc26 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
> > > > @@ -9,6 +9,7 @@
> > > >    #include "i915_request.h"
> > > >    #include "intel_engine_types.h"
> > > >    #include "intel_wakeref.h"
> > > > +#include "intel_gt_pm.h"
> > > >    static inline bool
> > > >    intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
> > > > @@ -31,6 +32,13 @@ static inline bool intel_engine_pm_get_if_awake(struct intel_engine_cs *engine)
> > > >    	return intel_wakeref_get_if_active(&engine->wakeref);
> > > >    }
> > > > +static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
> > > > +{
> > > > +	if (!intel_engine_is_virtual(engine))
> > > > +		intel_wakeref_might_get(&engine->wakeref);
> > > Why doesn't this need to iterate through the physical engines of the virtual
> > > engine?
> > > 
> > Yea, technically it should. This is just an annotation though to check
> > if we do something horribly wrong in our code. If we use any physical
> > engine in our stack this annotation should pop and we can fix it. I just
> > don't see what making this 100% correct for virtual engines buys us. If
> > you want I can fix this, but I'm thinking the more complex we make this
> > annotation, the less likely it is to just get compiled out with lockdep
> > off, which is what we are aiming for.
> But if the annotation is missing a fundamental lock then it is surely not
> actually going to do any good? Not sure if you need to iterate over all
> child engines + parent but it seems like you should be calling might_lock()
> on at least one engine's mutex to feed the lockdep annotation.
> 

These are all inline functions so the compiler should be able to compile
this out if lockdep is disabled. I'll fix this in the next rev.
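
Something like this (an untested sketch) should cover the virtual case
while still compiling out when lockdep is disabled:

	static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
	{
		if (!intel_engine_is_virtual(engine)) {
			intel_wakeref_might_get(&engine->wakeref);
		} else {
			struct intel_gt *gt = engine->gt;
			struct intel_engine_cs *tengine;
			intel_engine_mask_t tmp, mask = engine->mask;

			/* Annotate every physical engine backing the veng */
			for_each_engine_masked(tengine, gt, mask, tmp)
				intel_wakeref_might_get(&tengine->wakeref);
		}
		intel_gt_pm_might_get(engine->gt);
	}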

Matt 

> John.
> 
> > Matt
> > 
> > > John.
> > > 
> > > > +	intel_gt_pm_might_get(engine->gt);
> > > > +}
> > > > +
> > > >    static inline void intel_engine_pm_put(struct intel_engine_cs *engine)
> > > >    {
> > > >    	intel_wakeref_put(&engine->wakeref);
> > > > @@ -52,6 +60,13 @@ static inline void intel_engine_pm_flush(struct intel_engine_cs *engine)
> > > >    	intel_wakeref_unlock_wait(&engine->wakeref);
> > > >    }
> > > > +static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
> > > > +{
> > > > +	if (!intel_engine_is_virtual(engine))
> > > > +		intel_wakeref_might_put(&engine->wakeref);
> > > > +	intel_gt_pm_might_put(engine->gt);
> > > > +}
> > > > +
> > > >    static inline struct i915_request *
> > > >    intel_engine_create_kernel_request(struct intel_engine_cs *engine)
> > > >    {
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > > > index a17bf0d4592b..3c173033ce23 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> > > > @@ -31,6 +31,11 @@ static inline bool intel_gt_pm_get_if_awake(struct intel_gt *gt)
> > > >    	return intel_wakeref_get_if_active(&gt->wakeref);
> > > >    }
> > > > +static inline void intel_gt_pm_might_get(struct intel_gt *gt)
> > > > +{
> > > > +	intel_wakeref_might_get(&gt->wakeref);
> > > > +}
> > > > +
> > > >    static inline void intel_gt_pm_put(struct intel_gt *gt)
> > > >    {
> > > >    	intel_wakeref_put(&gt->wakeref);
> > > > @@ -41,6 +46,11 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
> > > >    	intel_wakeref_put_async(&gt->wakeref);
> > > >    }
> > > > +static inline void intel_gt_pm_might_put(struct intel_gt *gt)
> > > > +{
> > > > +	intel_wakeref_might_put(&gt->wakeref);
> > > > +}
> > > > +
> > > >    #define with_intel_gt_pm(gt, tmp) \
> > > >    	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
> > > >    	     intel_gt_pm_put(gt), tmp = 0)
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index dbf919801de2..e0eed70f9b92 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -1550,7 +1550,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
> > > >    static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > > >    {
> > > > -	return __guc_context_pin(ce, ce->engine, vaddr);
> > > > +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > > +
> > > > +	if (likely(!ret && !intel_context_is_barrier(ce)))
> > > > +		intel_engine_pm_get(ce->engine);
> > > > +
> > > > +	return ret;
> > > >    }
> > > >    static void guc_context_unpin(struct intel_context *ce)
> > > > @@ -1559,6 +1564,9 @@ static void guc_context_unpin(struct intel_context *ce)
> > > >    	unpin_guc_id(guc, ce);
> > > >    	lrc_unpin(ce);
> > > > +
> > > > +	if (likely(!intel_context_is_barrier(ce)))
> > > > +		intel_engine_pm_put_async(ce->engine);
> > > >    }
> > > >    static void guc_context_post_unpin(struct intel_context *ce)
> > > > @@ -2328,8 +2336,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > > >    static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > >    {
> > > >    	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > +	int ret = __guc_context_pin(ce, engine, vaddr);
> > > > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > > +
> > > > +	if (likely(!ret))
> > > > +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > +			intel_engine_pm_get(engine);
> > > > -	return __guc_context_pin(ce, engine, vaddr);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static void guc_virtual_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > > +	struct intel_engine_cs *engine;
> > > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > > +
> > > > +	GEM_BUG_ON(context_enabled(ce));
> > > > +	GEM_BUG_ON(intel_context_is_barrier(ce));
> > > > +
> > > > +	unpin_guc_id(guc, ce);
> > > > +	lrc_unpin(ce);
> > > > +
> > > > +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > +		intel_engine_pm_put_async(engine);
> > > >    }
> > > >    static void guc_virtual_context_enter(struct intel_context *ce)
> > > > @@ -2366,7 +2396,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> > > >    	.pre_pin = guc_virtual_context_pre_pin,
> > > >    	.pin = guc_virtual_context_pin,
> > > > -	.unpin = guc_context_unpin,
> > > > +	.unpin = guc_virtual_context_unpin,
> > > >    	.post_unpin = guc_context_post_unpin,
> > > >    	.ban = guc_context_ban,
> > > > diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> > > > index ef7e6a698e8a..dd530ae028e0 100644
> > > > --- a/drivers/gpu/drm/i915/intel_wakeref.h
> > > > +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> > > > @@ -124,6 +124,12 @@ enum {
> > > >    	__INTEL_WAKEREF_PUT_LAST_BIT__
> > > >    };
> > > > +static inline void
> > > > +intel_wakeref_might_get(struct intel_wakeref *wf)
> > > > +{
> > > > +	might_lock(&wf->mutex);
> > > > +}
> > > > +
> > > >    /**
> > > >     * intel_wakeref_put_flags: Release the wakeref
> > > >     * @wf: the wakeref
> > > > @@ -171,6 +177,12 @@ intel_wakeref_put_delay(struct intel_wakeref *wf, unsigned long delay)
> > > >    			    FIELD_PREP(INTEL_WAKEREF_PUT_DELAY, delay));
> > > >    }
> > > > +static inline void
> > > > +intel_wakeref_might_put(struct intel_wakeref *wf)
> > > > +{
> > > > +	might_lock(&wf->mutex);
> > > > +}
> > > > +
> > > >    /**
> > > >     * intel_wakeref_lock: Lock the wakeref (mutex)
> > > >     * @wf: the wakeref
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 10/27] drm/i915/guc: Introduce context parent-child relationship
  2021-09-13 23:19   ` John Harrison
@ 2021-09-14  1:18     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-14  1:18 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 13, 2021 at 04:19:00PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Introduce context parent-child relationship. Once this relationship is
> > created all pinning / unpinning operations are directed to the parent
> > context. The parent context is responsible for pinning all of its
> > children and itself.
> > 
> > This is a precursor to the full GuC multi-lrc implementation but aligns
> > to how the GuC multi-lrc interface is defined - a single H2G is used to
> > register / deregister all of the contexts simultaneously.
> > 
> > Subsequent patches in the series will implement the pinning / unpinning
> > operations for parent / child contexts.
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add kernel doc, add wrapper to access parent to ensure safety
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       | 29 ++++++++++++++
> >   drivers/gpu/drm/i915/gt/intel_context.h       | 39 +++++++++++++++++++
> >   drivers/gpu/drm/i915/gt/intel_context_types.h | 23 +++++++++++
> >   3 files changed, 91 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 508cfe5770c0..00d1aee6d199 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -404,6 +404,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >   	INIT_LIST_HEAD(&ce->destroyed_link);
> No need for this blank line?
> 

I guess, but typically I try to put blank lines between each different
set of variables (e.g. a lock and a list would sit next to each other if
the lock protects the list, while two different lists would have a blank
line between them, as in this case here).

> > +	INIT_LIST_HEAD(&ce->guc_child_list);
> > +
> >   	/*
> >   	 * Initialize fence to be complete as this is expected to be complete
> >   	 * unless there is a pending schedule disable outstanding.
> > @@ -418,10 +420,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >   void intel_context_fini(struct intel_context *ce)
> >   {
> > +	struct intel_context *child, *next;
> > +
> >   	if (ce->timeline)
> >   		intel_timeline_put(ce->timeline);
> >   	i915_vm_put(ce->vm);
> > +	/* Need to put the creation ref for the children */
> > +	if (intel_context_is_parent(ce))
> > +		for_each_child_safe(ce, child, next)
> > +			intel_context_put(child);
> > +
> >   	mutex_destroy(&ce->pin_mutex);
> >   	i915_active_fini(&ce->active);
> >   }
> > @@ -537,6 +546,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
> >   	return active;
> >   }
> > +void intel_context_bind_parent_child(struct intel_context *parent,
> > +				     struct intel_context *child)
> > +{
> > +	/*
> > +	 * Caller's responsibility to validate that this function is used
> > +	 * correctly but we use GEM_BUG_ON here to ensure that they do.
> > +	 */
> > +	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
> > +	GEM_BUG_ON(intel_context_is_pinned(parent));
> > +	GEM_BUG_ON(intel_context_is_child(parent));
> > +	GEM_BUG_ON(intel_context_is_pinned(child));
> > +	GEM_BUG_ON(intel_context_is_child(child));
> > +	GEM_BUG_ON(intel_context_is_parent(child));
> > +
> > +	parent->guc_number_children++;
> > +	list_add_tail(&child->guc_child_link,
> > +		      &parent->guc_child_list);
> > +	child->parent = parent;
> > +}
> > +
> >   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> >   #include "selftest_context.c"
> >   #endif
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index c41098950746..c2985822ab74 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -44,6 +44,45 @@ void intel_context_free(struct intel_context *ce);
> >   int intel_context_reconfigure_sseu(struct intel_context *ce,
> >   				   const struct intel_sseu sseu);
> > +static inline bool intel_context_is_child(struct intel_context *ce)
> > +{
> > +	return !!ce->parent;
> > +}
> > +
> > +static inline bool intel_context_is_parent(struct intel_context *ce)
> > +{
> > +	return !!ce->guc_number_children;
> > +}
> > +
> > +static inline bool intel_context_is_pinned(struct intel_context *ce);
> No point declaring 'static inline' if there is no function body?
> 

Forward declaration for the function below.

> > +
> > +static inline struct intel_context *
> > +intel_context_to_parent(struct intel_context *ce)
> > +{
> > +        if (intel_context_is_child(ce)) {
> > +		/*
> > +		 * The parent holds ref count to the child so it is always safe
> > +		 * for the parent to access the child, but the child has pointer
> has pointer -> has a pointer
>

Yep.
 
> > +		 * to the parent without a ref. To ensure this is safe the child
> > +		 * should only access the parent pointer while the parent is
> > +		 * pinned.
> > +		 */
> > +                GEM_BUG_ON(!intel_context_is_pinned(ce->parent));
> > +
> > +                return ce->parent;
> > +        } else {
> > +                return ce;
> > +        }
> > +}
> > +
> > +void intel_context_bind_parent_child(struct intel_context *parent,
> > +				     struct intel_context *child);
> > +
> > +#define for_each_child(parent, ce)\
> > +	list_for_each_entry(ce, &(parent)->guc_child_list, guc_child_link)
> > +#define for_each_child_safe(parent, ce, cn)\
> > +	list_for_each_entry_safe(ce, cn, &(parent)->guc_child_list, guc_child_link)
> Do these macros not need some kind of intel_context prefix? Or at least be
> 'for_each_guc_child' given the naming of the list/link fields? But maybe not
> if the guc_ is dropped from the variable names - see below.
> 

I like the names. Yes, the guc_* prefix should be dropped because these
are used in execlists too. I can drop those prefixes in the next rev.
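
i.e. something like this (a sketch, exact names TBD):

	#define for_each_child(parent, ce) \
		list_for_each_entry(ce, &(parent)->child_list, child_link)
	#define for_each_child_safe(parent, ce, cn) \
		list_for_each_entry_safe(ce, cn, &(parent)->child_list, \
					 child_link)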

> > +
> >   /**
> >    * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
> >    * @ce - the context
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index fd338a30617e..0fafc178cf2c 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -213,6 +213,29 @@ struct intel_context {
> >   	 */
> >   	struct list_head destroyed_link;
> > +	/** anonymous struct for parent / children only members */
> > +	struct {
> > +		union {
> > +			/**
> > +			 * @guc_child_list: parent's list of children
> > +			 * contexts, no protection as immutable after context
> > +			 * creation
> > +			 */
> > +			struct list_head guc_child_list;
> > +			/**
> > +			 * @guc_child_link: child's link into parent's list of
> > +			 * children
> > +			 */
> > +			struct list_head guc_child_link;
> > +		};
> > +
> > +		/** @parent: pointer to parent if child */
> > +		struct intel_context *parent;
> > +
> > +		/** @guc_number_children: number of children if parent */
> > +		u8 guc_number_children;
> These are not really GuC specific fields? The parent/child thing might
> only be necessary for GuC submission (although can you say it won't be
> required by any future backend, such as the DRM scheduler?) but it is a
> context level concept. None of the files changed in this patch are GuC
> specific. So no need for 'guc_' prefix? Alternatively, if it all really is
> completely GuC specific then the 'parent' field should also have the prefix?
> Or even just name the outer struct 'guc_family' or something and drop the
> prefixes from all the inner members.
> 

Yep, will drop. Originally only used for GuC submission but now a
generic concept.

Matt

> John.
> 
> > +	};
> > +
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> >   	/**
> >   	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 07/27] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-09-13 22:38       ` John Harrison
@ 2021-09-14  5:02         ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-14  5:02 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 13, 2021 at 03:38:44PM -0700, John Harrison wrote:
> On 9/13/2021 09:54, Matthew Brost wrote:
> 
>     On Thu, Sep 09, 2021 at 03:51:27PM -0700, John Harrison wrote:
> 
>         On 8/20/2021 15:44, Matthew Brost wrote:
> 
>             Calling switch_to_kernel_context isn't needed if the engine PM reference
>             is taken while all contexts are pinned. By not calling
>             switch_to_kernel_context we save on issuing a request to the engine.
> 
>         I thought the intention of switch_to_kernel_context was to ensure that the GPU
>         is not touching any user context and is basically idle. That is not a valid
>         assumption with an external scheduler such as GuC. So why is the description
>         above only mentioning PM references? What is the connection between the PM
>         ref and the switch_to_kernel?
> 
>         Also, the comment in the code does not mention anything about PM references,
>         it just says 'not necessary with GuC' but no explanation at all.
> 
> 
>     Yea, this needs to be explained better. How about this?
> 
>     Calling switch_to_kernel_context isn't needed if the engine PM reference
>     is taken while all user contexts have scheduling enabled. Once scheduling
>     is disabled on all user contexts the GuC is guaranteed to not touch any
>     user context state, which is effectively the same as pointing to a kernel
>     context.
> 
>     Matt
> 
> I'm still not seeing how the PM reference is involved?
> 

We shouldn't trap into the GT PM park code while a user context has
scheduling enabled, as the park code may have side effects we don't want
to execute while that is still the case. I guess that isn't explained
very well.

> Also, IMHO the focus is wrong in the above text. The fundamental requirement is
> to ensure the hardware is idle. Execlist achieves this by switching to a safe
> context. GuC achieves it by disabling scheduling. Indeed, switching to a 'safe'
> context really has no effect with GuC submission. So 'effectively the same as
> pointing to a kernel context' is an incorrect description. I would go with
> something like:
> 
>     "This is execlist specific behaviour intended to ensure the GPU is idle by
>     switching to a known 'safe' context. With GuC submission, the same idle
>     guarantee is achieved by other means (disabling scheduling). Further,
>     switching to a 'safe' context has no effect with GuC submission as the
>     scheduler can just switch back again.
>     FIXME: Move this backend scheduler specific behaviour into the scheduler
>     backend."
>

That is worded better. Will pull into the next rev.
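
i.e. the hunk would then read something like:

	/*
	 * This is execlist specific behaviour intended to ensure the GPU
	 * is idle by switching to a known 'safe' context. With GuC
	 * submission, the same idle guarantee is achieved by other means
	 * (disabling scheduling). Further, switching to a 'safe' context
	 * has no effect with GuC submission as the scheduler can just
	 * switch back again.
	 *
	 * FIXME: Move this backend scheduler specific behaviour into the
	 * scheduler backend.
	 */
	if (intel_engine_uses_guc(engine))
		return true;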

Matt
 
> 
> John.
> 
> 
> 
> 
> 
>             v2:
>               (Daniel Vetter)
>                - Add FIXME comment about pushing switch_to_kernel_context to backend
> 
>             Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>             Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>             ---
>               drivers/gpu/drm/i915/gt/intel_engine_pm.c | 9 +++++++++
>               1 file changed, 9 insertions(+)
> 
>             diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>             index 1f07ac4e0672..11fee66daf60 100644
>             --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>             +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>             @@ -162,6 +162,15 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>                     unsigned long flags;
>                     bool result = true;
>             +       /*
>             +        * No need to switch_to_kernel_context if GuC submission
>             +        *
>             +        * FIXME: This execlists specific backend behavior in generic code, this
> 
>         "This execlists" -> "This is execlist"
> 
>         "this should be" -> "it should be"
> 
>         John.
> 
> 
>             +        * should be pushed to the backend.
>             +        */
>             +       if (intel_engine_uses_guc(engine))
>             +               return true;
>             +
>                     /* GPU is pointing to the void, as good as in the kernel context. */
>                     if (intel_gt_is_wedged(engine->gt))
>                             return true;
> 
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-09-13 16:50         ` Matthew Brost
@ 2021-09-14  8:34           ` Tvrtko Ursulin
  2021-09-14 18:04             ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-14  8:34 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu


On 13/09/2021 17:50, Matthew Brost wrote:
> On Mon, Sep 13, 2021 at 10:24:43AM +0100, Tvrtko Ursulin wrote:
>>
>> On 10/09/2021 20:49, Matthew Brost wrote:
>>> On Fri, Sep 10, 2021 at 12:12:42PM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 20/08/2021 23:44, Matthew Brost wrote:
>>>>> Add logical engine mapping. This is required for split-frame, as
>>>>> workloads need to be placed on engines in a logically contiguous manner.
>>>>>
>>>>> v2:
>>>>>     (Daniel Vetter)
>>>>>      - Add kernel doc for new fields
>>>>>
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
>>>>>     drivers/gpu/drm/i915/gt/intel_engine_types.h  |  5 ++
>>>>>     .../drm/i915/gt/intel_execlists_submission.c  |  1 +
>>>>>     drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
>>>>>     .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
>>>>>     5 files changed, 60 insertions(+), 29 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>>>> index 0d9105a31d84..4d790f9a65dd 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>>>> @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
>>>>>     	GEM_DEBUG_WARN_ON(iir);
>>>>>     }
>>>>> -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
>>>>> +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
>>>>> +			      u8 logical_instance)
>>>>>     {
>>>>>     	const struct engine_info *info = &intel_engines[id];
>>>>>     	struct drm_i915_private *i915 = gt->i915;
>>>>> @@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
>>>>>     	engine->class = info->class;
>>>>>     	engine->instance = info->instance;
>>>>> +	engine->logical_mask = BIT(logical_instance);
>>>>>     	__sprint_engine_name(engine);
>>>>>     	engine->props.heartbeat_interval_ms =
>>>>> @@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
>>>>>     	return info->engine_mask;
>>>>>     }
>>>>> +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
>>>>> +				 u8 class, const u8 *map, u8 num_instances)
>>>>> +{
>>>>> +	int i, j;
>>>>> +	u8 current_logical_id = 0;
>>>>> +
>>>>> +	for (j = 0; j < num_instances; ++j) {
>>>>> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
>>>>> +			if (!HAS_ENGINE(gt, i) ||
>>>>> +			    intel_engines[i].class != class)
>>>>> +				continue;
>>>>> +
>>>>> +			if (intel_engines[i].instance == map[j]) {
>>>>> +				logical_ids[intel_engines[i].instance] =
>>>>> +					current_logical_id++;
>>>>> +				break;
>>>>> +			}
>>>>> +		}
>>>>> +	}
>>>>> +}
>>>>> +
>>>>> +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
>>>>> +{
>>>>> +	int i;
>>>>> +	u8 map[MAX_ENGINE_INSTANCE + 1];
>>>>> +
>>>>> +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
>>>>> +		map[i] = i;
>>>>
>>>> What's the point of the map array since it is 1:1 with instance?
>>>>
>>>
>>> Future products do not have a 1 to 1 mapping and that mapping can change
>>> based on fusing, e.g. XeHP SDV.
>>>
>>> Also technically ICL / TGL / ADL physical instance 2 maps to logical
>>> instance 1.
>>
>> I don't follow the argument. All I can see is that "map[i] = i" always holds
>> in the proposed code, which is then used to check "instance == map[instance]".
>> So I'd suggest to remove this array from the code until there is a need for
>> it.
>>
> 
> Ok, this logic is slightly confusing and makes more sense once we have
> non-standard mappings. Yes, map is set up as a 1 to 1 mapping by default
> with the value in map[i] being a physical instance. populate_logical_ids
> searches the map, finding all physical instances present in the map and
> assigning each found instance a new logical id, increasing by 1 each
> time.
> 
> e.g. If the map is set up 0-N and only physical instances 0 / 2 are
> present they will get logical mappings 0 / 1 respectively.
> 
> This algorithm works for non-standard mappings too /w fused parts. e.g.
> on XeHP SDV the map is: { 0, 2, 4, 6, 1, 3, 5, 7 } and if any of the
> physical instances can't be found due to fusing the logical mapping is
> still correct per the bspec.
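> 
> e.g. With the XeHP SDV map above, if physical instances 2 and 5 are
> fused off, populate_logical_ids finds the remaining instances in map
> order { 0, 4, 6, 1, 3, 7 } and assigns logical ids 0-5: physical 0 ->
> logical 0, 4 -> 1, 6 -> 2, 1 -> 3, 3 -> 4 and 7 -> 5.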
> 
> This array is absolutely needed for multi-lrc submission to work, even
> on ICL / TGL / ADL as the GuC only supports logically contiguous engine
> instances.

No idea how an array fixed at "map[i] = i" can be absolutely needed when 
you can just write it as "i". Sometimes it is okay to lay some 
groundwork for future platforms but in this case to me it's just 
obfuscation which should be added later, when it is required.

>>>>> +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
>>>>> +}
>>>>> +
>>>>>     /**
>>>>>      * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
>>>>>      * @gt: pointer to struct intel_gt
>>>>> @@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>>>>>     	struct drm_i915_private *i915 = gt->i915;
>>>>>     	const unsigned int engine_mask = init_engine_mask(gt);
>>>>>     	unsigned int mask = 0;
>>>>> -	unsigned int i;
>>>>> +	unsigned int i, class;
>>>>> +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
>>>>>     	int err;
>>>>>     	drm_WARN_ON(&i915->drm, engine_mask == 0);
>>>>> @@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>>>>>     	if (i915_inject_probe_failure(i915))
>>>>>     		return -ENODEV;
>>>>> -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
>>>>> -		if (!HAS_ENGINE(gt, i))
>>>>> -			continue;
>>>>> +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
>>>>> +		setup_logical_ids(gt, logical_ids, class);
>>>>> -		err = intel_engine_setup(gt, i);
>>>>> -		if (err)
>>>>> -			goto cleanup;
>>>>> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
>>>>> +			u8 instance = intel_engines[i].instance;
>>>>> +
>>>>> +			if (intel_engines[i].class != class ||
>>>>> +			    !HAS_ENGINE(gt, i))
>>>>> +				continue;
>>>>> -		mask |= BIT(i);
>>>>> +			err = intel_engine_setup(gt, i,
>>>>> +						 logical_ids[instance]);
>>>>> +			if (err)
>>>>> +				goto cleanup;
>>>>> +
>>>>> +			mask |= BIT(i);
>>>>
>>>> I still think there is a less clunky way to set this up in less code and more
>>>> readable at the same time. Like do it in two passes so you can iterate
>>>> gt->engine_class[] array instead of having to implement a skip condition
>>>> (both on class and HAS_ENGINE at two places) and also avoid walking the flat
>>>> intel_engines array recursively.
>>>>
>>>
>>> Kinda a bikeshed arguing about a pretty simple loop structure, don't you
>>> think? I personally like the way it laid out.
>>>
>>> Pseudo code for your suggestion?
>>
>> Leave the existing setup loop as is and add an additional "for engine class"
>> walk after it. That way you can walk the already set up gt->engine_class[]
>> array, so you wouldn't need to skip wrong classes or add HAS_ENGINE checks when
>> walking the flat intel_engines[] array several times. It also applies to the
>> helper which counts logical instances per class.
>>
> 
> Ok, I think I see what you are getting at. Again IMO this is a total
> bikeshed as this is a one-time setup step where we really should only
> care whether the loop works or not, rather than whether it is optimized /
> looks the way a certain person wants. I can change this if you really
> insist but again IMO discussing this is a total waste of energy.

It should be such a no brainer to go with the simpler and less invasive 
change that I honestly don't understand what the big deal is. Here is 
my pseudo code one more time and that will be the last from me on the topic.

Today we have:

for_each intel_engines: // intel_engines is a flat list of all engines
	intel_engine_setup()

You propose to change it to:

for_each engine_class:
    for 0..max_global_engine_instance:
       for_each intel_engines:
          skip engine not present
          skip class not matching

          count logical instance

    for_each intel_engines:
       skip engine not present
       skip wrong class

       intel_engine_setup()


I propose:

// Leave as is:

for_each intel_engines:
    intel_engine_setup()

// Add:

for_each engine_class:
    logical = 0
    for_each gt->engine_class[class]:
       skip engine not present

       engine->logical_instance = logical++


When code which actually needs a perturbed "map" arrives you add that 
into this second loop.
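
Or as actual code, roughly (an untested sketch using the existing
gt->engine_class[][] array):

	for (class = 0; class <= MAX_ENGINE_CLASS; ++class) {
		u8 logical = 0;

		for (i = 0; i <= MAX_ENGINE_INSTANCE; ++i) {
			struct intel_engine_cs *engine =
				gt->engine_class[class][i];

			/* Skip not present / fused off engines */
			if (!engine)
				continue;

			engine->logical_mask = BIT(logical++);
		}
	}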

>   
>>>>> +		}
>>>>>     	}
>>>>>     	/*
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>>>> index ed91bcff20eb..fddf35546b58 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>>>> @@ -266,6 +266,11 @@ struct intel_engine_cs {
>>>>>     	unsigned int guc_id;
>>>>>     	intel_engine_mask_t mask;
>>>>> +	/**
>>>>> +	 * @logical_mask: logical mask of engine, reported to user space via
>>>>> +	 * query IOCTL and used to communicate with the GuC in logical space
>>>>> +	 */
>>>>> +	intel_engine_mask_t logical_mask;
>>>>
>>>> You could prefix the new field with uabi_ to match the existing scheme and
>>>> to signify to anyone who might be touching it in the future it should not be
>>>> changed.
>>>
>>> This is kinda uabi, but also kinda isn't. We do report a logical
>>> instance via IOCTL but it also really is tied to the GuC backend as we
>>> can only communicate with the GuC in logical space. IMO we should leave as
>>> is.
>>
>> Perhaps it would be best to call the new field uabi_logical_instance so it's
>> clear it is reported in the query directly and do the BIT() transformation
>> in the GuC backend?
>>
> 
> Virtual engines can have a multiple bits set in this mask, so this is
> used for both the query on physical engines via UABI and submission in
> the GuC backend.

You could add both fields if that would help. I just think it is 
preferable to keep the existing convention of the uabi_ prefix for the 
fields which i915 refactors must not change willy-nilly.

> 
>>>
>>>>
>>>> Also, I think comment should explain what is logical space ie. how the
>>>> numbering works.
>>>>
>>>
>>> Don't I already do that? I suppose I could add something like:
>>
>> Where is it? Can't see it in the uapi kerneldoc AFAICS (for the new query)
>> or here.
>>
> 
> 	/**
> 	 * @logical_mask: logical mask of engine, reported to user space via
> 	 * query IOCTL and used to communicate with the GuC in logical space
> 	 */
>   
>>>
>>> The logical mask within engine class must be contiguous across all
>>> instances.
>>
>> Best not to start mentioning the mask for the first time. Just explain what
>> logical numbering is in terms of how engines are enumerated in order of
>> physical instances but skipping the fused off ones. The kerneldoc for the
>> new query is, I think, the right place.
>>
> 
> Maybe I can add:
> 
> The logical mapping is defined on a per-part basis in the bspec and can
> vary based on the part's fusing.

Sounds good, I think that would be useful. But in the uapi kerneldoc.

Perhaps a cross-link to/from the kernel doc which talks about frame 
split to explain consecutive logical instances have to be used. That 
would tie the new query with that uapi in the narrative.

Regards,

Tvrtko

>>>> Not sure the part about GuC needs to be in the comment since uapi is
>>>> supposed to be backend agnostic.
>>>>
>>>
>>> Communicating with the GuC in logical space is a pretty key point here.
>>> The communication with the GuC in logical space is backend specific but
>>> how our hardware works (e.g. split frame workloads must be placed
>>> logically contiguously) is not. Mentioning the GuC requirement here makes
>>> sense to me for completeness.
>>
>> Yeah might be, I was thinking more about the new query. Query definitely is
>> backend agnostic but yes it is fine to say in the comment here the new field
>> is used both for the query and for communicating with GuC.
>>
> 
> Sounds good, will make it clear it is used for the query and for
> communicating with the GuC.
> 
> Matt
>   
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Matt
>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>>     	u8 class;
>>>>>     	u8 instance;
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>>>> index cafb0608ffb4..813a6de01382 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>>>> @@ -3875,6 +3875,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>>>>>     		ve->siblings[ve->num_siblings++] = sibling;
>>>>>     		ve->base.mask |= sibling->mask;
>>>>> +		ve->base.logical_mask |= sibling->logical_mask;
>>>>>     		/*
>>>>>     		 * All physical engines must be compatible for their emission
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>>>> index 6926919bcac6..9f5f43a16182 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>>>> @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
>>>>>     	for_each_engine(engine, gt, id) {
>>>>>     		u8 guc_class = engine_class_to_guc_class(engine->class);
>>>>> -		system_info->mapping_table[guc_class][engine->instance] =
>>>>> +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
>>>>>     			engine->instance;
>>>>>     	}
>>>>>     }
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> index e0eed70f9b92..ffafbac7335e 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> @@ -1401,23 +1401,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
>>>>>     	return __guc_action_deregister_context(guc, guc_id, loop);
>>>>>     }
>>>>> -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
>>>>> -{
>>>>> -	switch (class) {
>>>>> -	case RENDER_CLASS:
>>>>> -		return mask >> RCS0;
>>>>> -	case VIDEO_ENHANCEMENT_CLASS:
>>>>> -		return mask >> VECS0;
>>>>> -	case VIDEO_DECODE_CLASS:
>>>>> -		return mask >> VCS0;
>>>>> -	case COPY_ENGINE_CLASS:
>>>>> -		return mask >> BCS0;
>>>>> -	default:
>>>>> -		MISSING_CASE(class);
>>>>> -		return 0;
>>>>> -	}
>>>>> -}
>>>>> -
>>>>>     static void guc_context_policy_init(struct intel_engine_cs *engine,
>>>>>     				    struct guc_lrc_desc *desc)
>>>>>     {
>>>>> @@ -1459,8 +1442,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>>>>>     	desc = __get_lrc_desc(guc, desc_idx);
>>>>>     	desc->engine_class = engine_class_to_guc_class(engine->class);
>>>>> -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
>>>>> -						      engine->mask);
>>>>> +	desc->engine_submit_mask = engine->logical_mask;
>>>>>     	desc->hw_context_desc = ce->lrc.lrca;
>>>>>     	desc->priority = ce->guc_state.prio;
>>>>>     	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>>>>> @@ -3260,6 +3242,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>>>>>     		}
>>>>>     		ve->base.mask |= sibling->mask;
>>>>> +		ve->base.logical_mask |= sibling->logical_mask;
>>>>>     		if (n != 0 && ve->base.class != sibling->class) {
>>>>>     			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
>>>>>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 04/27] drm/i915/guc: Take GT PM ref when deregistering context
  2021-09-13 17:12     ` Matthew Brost
@ 2021-09-14  8:41       ` Tvrtko Ursulin
  0 siblings, 0 replies; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-14  8:41 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu


On 13/09/2021 18:12, Matthew Brost wrote:
> On Mon, Sep 13, 2021 at 10:55:59AM +0100, Tvrtko Ursulin wrote:
>>
>> On 20/08/2021 23:44, Matthew Brost wrote:
>>> Taking a PM reference to prevent intel_gt_wait_for_idle from short
>>> circuiting while a deregister context H2G is in flight.
>>>
>>> FIXME: Move locking / structure changes into different patch
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
>>>    drivers/gpu/drm/i915/gt/intel_context_types.h |  13 +-
>>>    drivers/gpu/drm/i915/gt/intel_engine_pm.h     |   5 +
>>>    drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  13 ++
>>>    .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  46 ++--
>>>    .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  13 +-
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 212 +++++++++++-------
>>>    8 files changed, 199 insertions(+), 106 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
>>> index adfe49b53b1b..c8595da64ad8 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>>> @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>>>    	ce->guc_id.id = GUC_INVALID_LRC_ID;
>>>    	INIT_LIST_HEAD(&ce->guc_id.link);
>>> +	INIT_LIST_HEAD(&ce->destroyed_link);
>>> +
>>>    	/*
>>>    	 * Initialize fence to be complete as this is expected to be complete
>>>    	 * unless there is a pending schedule disable outstanding.
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> index 80bbdc7810f6..fd338a30617e 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> @@ -190,22 +190,29 @@ struct intel_context {
>>>    		/**
>>>    		 * @id: unique handle which is used to communicate information
>>>    		 * with the GuC about this context, protected by
>>> -		 * guc->contexts_lock
>>> +		 * guc->submission_state.lock
>>>    		 */
>>>    		u16 id;
>>>    		/**
>>>    		 * @ref: the number of references to the guc_id, when
>>>    		 * transitioning in and out of zero protected by
>>> -		 * guc->contexts_lock
>>> +		 * guc->submission_state.lock
>>>    		 */
>>>    		atomic_t ref;
>>>    		/**
>>>    		 * @link: in guc->guc_id_list when the guc_id has no refs but is
>>> -		 * still valid, protected by guc->contexts_lock
>>> +		 * still valid, protected by guc->submission_state.lock
>>>    		 */
>>>    		struct list_head link;
>>>    	} guc_id;
>>> +	/**
>>> +	 * @destroyed_link: link in guc->submission_state.destroyed_contexts, in
>>> +	 * list when context is pending to be destroyed (deregistered with the
>>> +	 * GuC), protected by guc->submission_state.lock
>>> +	 */
>>> +	struct list_head destroyed_link;
>>> +
>>>    #ifdef CONFIG_DRM_I915_SELFTEST
>>>    	/**
>>>    	 * @drop_schedule_enable: Force drop of schedule enable G2H for selftest
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
>>> index 70ea46d6cfb0..17a5028ea177 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
>>> @@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
>>>    	return intel_wakeref_is_active(&engine->wakeref);
>>>    }
>>> +static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
>>> +{
>>> +	__intel_wakeref_get(&engine->wakeref);
>>> +}
>>> +
>>>    static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
>>>    {
>>>    	intel_wakeref_get(&engine->wakeref);
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
>>> index d0588d8aaa44..a17bf0d4592b 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
>>> @@ -41,6 +41,19 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
>>>    	intel_wakeref_put_async(&gt->wakeref);
>>>    }
>>> +#define with_intel_gt_pm(gt, tmp) \
>>> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
>>> +	     intel_gt_pm_put(gt), tmp = 0)
>>> +#define with_intel_gt_pm_async(gt, tmp) \
>>> +	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
>>> +	     intel_gt_pm_put_async(gt), tmp = 0)
>>> +#define with_intel_gt_pm_if_awake(gt, tmp) \
>>> +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
>>> +	     intel_gt_pm_put(gt), tmp = 0)
>>> +#define with_intel_gt_pm_if_awake_async(gt, tmp) \
>>> +	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
>>> +	     intel_gt_pm_put_async(gt), tmp = 0)
>>> +
>>>    static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
>>>    {
>>>    	return intel_wakeref_wait_for_idle(&gt->wakeref);
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>> index 8ff582222aff..ba10bd374cee 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>> @@ -142,6 +142,7 @@ enum intel_guc_action {
>>>    	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
>>>    	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>>>    	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>>> +	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
>>>    	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>>>    	INTEL_GUC_ACTION_LIMIT
>>>    };
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> index 6fd2719d1b75..7358883f1540 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> @@ -53,21 +53,37 @@ struct intel_guc {
>>>    		void (*disable)(struct intel_guc *guc);
>>>    	} interrupts;
>>> -	/**
>>> -	 * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and
>>> -	 * ce->guc_id.ref when transitioning in and out of zero
>>> -	 */
>>> -	spinlock_t contexts_lock;
>>> -	/** @guc_ids: used to allocate new guc_ids */
>>> -	struct ida guc_ids;
>>> -	/** @num_guc_ids: number of guc_ids that can be used */
>>> -	u32 num_guc_ids;
>>> -	/** @max_guc_ids: max number of guc_ids that can be used */
>>> -	u32 max_guc_ids;
>>> -	/**
>>> -	 * @guc_id_list: list of intel_context with valid guc_ids but no refs
>>> -	 */
>>> -	struct list_head guc_id_list;
>>> +	struct {
>>> +		/**
>>> +		 * @lock: protects everything in submission_state, ce->guc_id,
>>> +		 * and ce->destroyed_link
>>> +		 */
>>> +		spinlock_t lock;
>>> +		/**
>>> +		 * @guc_ids: used to allocate new guc_ids
>>> +		 */
>>> +		struct ida guc_ids;
>>> +		/** @num_guc_ids: number of guc_ids that can be used */
>>> +		u32 num_guc_ids;
>>> +		/** @max_guc_ids: max number of guc_ids that can be used */
>>> +		u32 max_guc_ids;
>>> +		/**
>>> +		 * @guc_id_list: list of intel_context with valid guc_ids but no
>>> +		 * refs
>>> +		 */
>>> +		struct list_head guc_id_list;
>>> +		/**
>>> +		 * @destroyed_contexts: list of contexts waiting to be destroyed
>>> +		 * (deregistered with the GuC)
>>> +		 */
>>> +		struct list_head destroyed_contexts;
>>> +		/**
>>> +		 * @destroyed_worker: worker to deregister contexts; needed
>>> +		 * because we must take a GT PM reference and can't do so from
>>> +		 * the destroy function as it might be in an atomic context
>>> +		 * (no sleeping)
>>> +		 */
>>> +		struct work_struct destroyed_worker;
>>> +	} submission_state;
>>>    	bool submission_supported;
>>>    	bool submission_selected;
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
>>> index b88d343ee432..27655460ee84 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
>>> @@ -78,7 +78,7 @@ static int guc_num_id_get(void *data, u64 *val)
>>>    	if (!intel_guc_submission_is_used(guc))
>>>    		return -ENODEV;
>>> -	*val = guc->num_guc_ids;
>>> +	*val = guc->submission_state.num_guc_ids;
>>>    	return 0;
>>>    }
>>> @@ -86,16 +86,21 @@ static int guc_num_id_get(void *data, u64 *val)
>>>    static int guc_num_id_set(void *data, u64 val)
>>>    {
>>>    	struct intel_guc *guc = data;
>>> +	unsigned long flags;
>>>    	if (!intel_guc_submission_is_used(guc))
>>>    		return -ENODEV;
>>> -	if (val > guc->max_guc_ids)
>>> -		val = guc->max_guc_ids;
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>> +
>>> +	if (val > guc->submission_state.max_guc_ids)
>>> +		val = guc->submission_state.max_guc_ids;
>>>    	else if (val < 256)
>>>    		val = 256;
>>> -	guc->num_guc_ids = val;
>>> +	guc->submission_state.num_guc_ids = val;
>>> +
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    	return 0;
>>>    }
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 68742b612692..f835e06e5f9f 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -86,9 +86,9 @@
>>>     * submitting at a time. Currently only 1 sched_engine used for all of GuC
>>>     * submission but that could change in the future.
>>>     *
>>> - * guc->contexts_lock
>>> - * Protects guc_id allocation. Global lock i.e. Only 1 context that uses GuC
>>> - * submission can hold this at a time.
>>> + * guc->submission_state.lock
>>> + * Global lock for GuC submission state. Protects guc_ids and destroyed contexts
>>> + * list.
>>>     *
>>>     * ce->guc_state.lock
>>>     * Protects everything under ce->guc_state. Ensures that a context is in the
>>> @@ -100,7 +100,7 @@
>>>     *
>>>     * Lock ordering rules:
>>>     * sched_engine->lock -> ce->guc_state.lock
>>> - * guc->contexts_lock -> ce->guc_state.lock
>>> + * guc->submission_state.lock -> ce->guc_state.lock
>>>     *
>>>     * Reset races:
>>>     * When a GPU full reset is triggered it is assumed that some G2H responses to
>>> @@ -344,7 +344,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>>>    {
>>>    	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
>>> -	GEM_BUG_ON(index >= guc->max_guc_ids);
>>> +	GEM_BUG_ON(index >= guc->submission_state.max_guc_ids);
>>>    	return &base[index];
>>>    }
>>> @@ -353,7 +353,7 @@ static struct intel_context *__get_context(struct intel_guc *guc, u32 id)
>>>    {
>>>    	struct intel_context *ce = xa_load(&guc->context_lookup, id);
>>> -	GEM_BUG_ON(id >= guc->max_guc_ids);
>>> +	GEM_BUG_ON(id >= guc->submission_state.max_guc_ids);
>>>    	return ce;
>>>    }
>>> @@ -363,7 +363,8 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
>>>    	u32 size;
>>>    	int ret;
>>> -	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
>>> +	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
>>> +			  guc->submission_state.max_guc_ids);
>>>    	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
>>>    					     (void **)&guc->lrc_desc_pool_vaddr);
>>>    	if (ret)
>>> @@ -711,6 +712,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>>>    			if (deregister)
>>>    				guc_signal_context_fence(ce);
>>>    			if (destroyed) {
>>> +				intel_gt_pm_put_async(guc_to_gt(guc));
>>>    				release_guc_id(guc, ce);
>>>    				__guc_context_destroy(ce);
>>>    			}
>>> @@ -789,6 +791,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
>>>    	spin_unlock_irqrestore(&sched_engine->lock, flags);
>>>    }
>>> +static void guc_flush_destroyed_contexts(struct intel_guc *guc);
>>> +
>>>    void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>>>    {
>>>    	if (unlikely(!guc_submission_initialized(guc))) {
>>> @@ -805,6 +809,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>>>    	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
>>>    	flush_work(&guc->ct.requests.worker);
>>> +	guc_flush_destroyed_contexts(guc);
>>>    	scrub_guc_desc_for_outstanding_g2h(guc);
>>>    }
>>> @@ -1102,6 +1107,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>>>    	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>>>    }
>>> +static void destroyed_worker_func(struct work_struct *w);
>>> +
>>>    /*
>>>     * Set up the memory resources to be shared with the GuC (via the GGTT)
>>>     * at firmware loading time.
>>> @@ -1124,9 +1131,11 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>>    	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
>>> -	spin_lock_init(&guc->contexts_lock);
>>> -	INIT_LIST_HEAD(&guc->guc_id_list);
>>> -	ida_init(&guc->guc_ids);
>>> +	spin_lock_init(&guc->submission_state.lock);
>>> +	INIT_LIST_HEAD(&guc->submission_state.guc_id_list);
>>> +	ida_init(&guc->submission_state.guc_ids);
>>> +	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
>>> +	INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
>>>    	return 0;
>>>    }
>>> @@ -1137,6 +1146,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>>>    		return;
>>>    	guc_lrc_desc_pool_destroy(guc);
>>> +	guc_flush_destroyed_contexts(guc);
>>>    	i915_sched_engine_put(guc->sched_engine);
>>>    }
>>> @@ -1191,15 +1201,16 @@ static void guc_submit_request(struct i915_request *rq)
>>>    static int new_guc_id(struct intel_guc *guc)
>>>    {
>>> -	return ida_simple_get(&guc->guc_ids, 0,
>>> -			      guc->num_guc_ids, GFP_KERNEL |
>>> +	return ida_simple_get(&guc->submission_state.guc_ids, 0,
>>> +			      guc->submission_state.num_guc_ids, GFP_KERNEL |
>>>    			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>>>    }
>>>    static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    {
>>>    	if (!context_guc_id_invalid(ce)) {
>>> -		ida_simple_remove(&guc->guc_ids, ce->guc_id.id);
>>> +		ida_simple_remove(&guc->submission_state.guc_ids,
>>> +				  ce->guc_id.id);
>>>    		reset_lrc_desc(guc, ce->guc_id.id);
>>>    		set_context_guc_id_invalid(ce);
>>>    	}
>>> @@ -1211,9 +1222,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    {
>>>    	unsigned long flags;
>>> -	spin_lock_irqsave(&guc->contexts_lock, flags);
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>>    	__release_guc_id(guc, ce);
>>> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    }
>>>    static int steal_guc_id(struct intel_guc *guc)
>>> @@ -1221,10 +1232,10 @@ static int steal_guc_id(struct intel_guc *guc)
>>>    	struct intel_context *ce;
>>>    	int guc_id;
>>> -	lockdep_assert_held(&guc->contexts_lock);
>>> +	lockdep_assert_held(&guc->submission_state.lock);
>>> -	if (!list_empty(&guc->guc_id_list)) {
>>> -		ce = list_first_entry(&guc->guc_id_list,
>>> +	if (!list_empty(&guc->submission_state.guc_id_list)) {
>>> +		ce = list_first_entry(&guc->submission_state.guc_id_list,
>>>    				      struct intel_context,
>>>    				      guc_id.link);
>>> @@ -1249,7 +1260,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
>>>    {
>>>    	int ret;
>>> -	lockdep_assert_held(&guc->contexts_lock);
>>> +	lockdep_assert_held(&guc->submission_state.lock);
>>>    	ret = new_guc_id(guc);
>>>    	if (unlikely(ret < 0)) {
>>> @@ -1271,7 +1282,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
>>>    try_again:
>>> -	spin_lock_irqsave(&guc->contexts_lock, flags);
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>>    	might_lock(&ce->guc_state.lock);
>>> @@ -1286,7 +1297,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	atomic_inc(&ce->guc_id.ref);
>>>    out_unlock:
>>> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    	/*
>>>    	 * -EAGAIN indicates no guc_id are available, let's retire any
>>> @@ -1322,11 +1333,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	if (unlikely(context_guc_id_invalid(ce)))
>>>    		return;
>>> -	spin_lock_irqsave(&guc->contexts_lock, flags);
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>>    	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) &&
>>>    	    !atomic_read(&ce->guc_id.ref))
>>> -		list_add_tail(&ce->guc_id.link, &guc->guc_id_list);
>>> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> +		list_add_tail(&ce->guc_id.link,
>>> +			      &guc->submission_state.guc_id_list);
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    }
>>>    static int __guc_action_register_context(struct intel_guc *guc,
>>> @@ -1841,11 +1853,30 @@ static void guc_context_sched_disable(struct intel_context *ce)
>>>    static void guc_lrc_desc_unpin(struct intel_context *ce)
>>>    {
>>>    	struct intel_guc *guc = ce_to_guc(ce);
>>> +	struct intel_gt *gt = guc_to_gt(guc);
>>> +	unsigned long flags;
>>> +	bool disabled;
>>> +	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
>>>    	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
>>>    	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
>>>    	GEM_BUG_ON(context_enabled(ce));
>>> +	/* Seal race with Reset */
>>> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
>>> +	disabled = submission_disabled(guc);
>>> +	if (likely(!disabled)) {
>>> +		__intel_gt_pm_get(gt);
>>> +		set_context_destroyed(ce);
>>> +		clr_context_registered(ce);
>>> +	}
>>> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> +	if (unlikely(disabled)) {
>>> +		release_guc_id(guc, ce);
>>> +		__guc_context_destroy(ce);
>>> +		return;
>>> +	}
>>> +
>>>    	deregister_context(ce, ce->guc_id.id, true);
>>>    }
>>> @@ -1873,78 +1904,88 @@ static void __guc_context_destroy(struct intel_context *ce)
>>>    	}
>>>    }
>>> +static void guc_flush_destroyed_contexts(struct intel_guc *guc)
>>> +{
>>> +	struct intel_context *ce, *cn;
>>> +	unsigned long flags;
>>> +
>>> +	GEM_BUG_ON(!submission_disabled(guc) &&
>>> +		   guc_submission_initialized(guc));
>>> +
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>> +	list_for_each_entry_safe(ce, cn,
>>> +				 &guc->submission_state.destroyed_contexts,
>>> +				 destroyed_link) {
>>> +		list_del_init(&ce->destroyed_link);
>>> +		__release_guc_id(guc, ce);
>>> +		__guc_context_destroy(ce);
>>> +	}
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>> +}
>>> +
>>> +static void deregister_destroyed_contexts(struct intel_guc *guc)
>>> +{
>>> +	struct intel_context *ce, *cn;
>>> +	unsigned long flags;
>>> +
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>> +	list_for_each_entry_safe(ce, cn,
>>> +				 &guc->submission_state.destroyed_contexts,
>>> +				 destroyed_link) {
>>> +		list_del_init(&ce->destroyed_link);
>>> +		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>
>> Lock drop and re-acquire is not safe - list_for_each_entry_safe only deals
>> with the list_del_init above. It does not help if someone else modifies the
>> list in the window while the lock is not held, in which case things will go
>> BOOM!
>>
> 
> Good catch, but is this true? We always add via list_add_tail which, if
> I'm reading the list_for_each_entry_safe macro correctly, is safe to do,
> as 'n = list_next_entry(n, member)' will just evaluate to the new
> tail on each iteration, right?
> 
> Reference to macro below:
> #define list_for_each_entry_safe(pos, n, head, member)			\
> 	for (pos = list_first_entry(head, typeof(*pos), member),	\
> 		n = list_next_entry(pos, member);			\
> 	     &pos->member != (head); 					\
> 	     pos = n, n = list_next_entry(n, member))
> 
> Regardless probably safer to just not drop the lock.

Yes, otherwise a comment is needed to explain, but best to avoid such 
smarts.

Or just use a different list if it fits better. As mentioned, if it is a
very simple producer/consumer relationship which always unlinks all
items, and the list head is not re-used for something else, then you can
consider llist_add/llist_del_all, as sketched below.
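
A minimal sketch of that llist variant, assuming destroyed_link becomes
a struct llist_node and destroyed_contexts a struct llist_head (the
queue_context_destroy() helper name below is hypothetical):

/* Producer side, safe from atomic context: */
static void queue_context_destroy(struct intel_guc *guc,
				  struct intel_context *ce)
{
	/* llist_add() returns true if the list was previously empty */
	if (llist_add(&ce->destroyed_link,
		      &guc->submission_state.destroyed_contexts))
		queue_work(system_unbound_wq,
			   &guc->submission_state.destroyed_worker);
}

/* Consumer side, atomically takes ownership of the whole list: */
static void deregister_destroyed_contexts(struct intel_guc *guc)
{
	struct llist_node *pos, *n;

	llist_for_each_safe(pos, n,
			    llist_del_all(&guc->submission_state.destroyed_contexts)) {
		struct intel_context *ce =
			container_of(pos, struct intel_context,
				     destroyed_link);

		guc_lrc_desc_unpin(ce);
	}
}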

Regards,

Tvrtko

> 
>> Not sure in what different ways you use this list, but perhaps you could
>> employ a lockless single linked list for this. In which case you could
>> atomically unlink all elements. Or just change this to not do lock dropping,
>> if you can safely use the same list_head to put on a local list or
>> something.
>>
> 
> I think both options work - not sure why I drop the lock here. I'll play
> around with this and fix in the next rev.
> 
> Matt
> 
>> Regards,
>>
>> Tvrtko
>>
>>> +		guc_lrc_desc_unpin(ce);
>>> +		spin_lock_irqsave(&guc->submission_state.lock, flags);
>>> +	}
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>> +}
>>> +
>>> +static void destroyed_worker_func(struct work_struct *w)
>>> +{
>>> +	struct intel_guc *guc = container_of(w, struct intel_guc,
>>> +					     submission_state.destroyed_worker);
>>> +	struct intel_gt *gt = guc_to_gt(guc);
>>> +	int tmp;
>>> +
>>> +	with_intel_gt_pm(gt, tmp)
>>> +		deregister_destroyed_contexts(guc);
>>> +}
>>> +
>>>    static void guc_context_destroy(struct kref *kref)
>>>    {
>>>    	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
>>> -	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
>>>    	struct intel_guc *guc = ce_to_guc(ce);
>>> -	intel_wakeref_t wakeref;
>>>    	unsigned long flags;
>>> -	bool disabled;
>>> +	bool destroy;
>>>    	/*
>>>    	 * If the guc_id is invalid this context has been stolen and we can free
>>>    	 * it immediately. Also can be freed immediately if the context is not
>>>    	 * registered with the GuC or the GuC is in the middle of a reset.
>>>    	 */
>>> -	if (context_guc_id_invalid(ce)) {
>>> -		__guc_context_destroy(ce);
>>> -		return;
>>> -	} else if (submission_disabled(guc) ||
>>> -		   !lrc_desc_registered(guc, ce->guc_id.id)) {
>>> -		release_guc_id(guc, ce);
>>> -		__guc_context_destroy(ce);
>>> -		return;
>>> -	}
>>> -
>>> -	/*
>>> -	 * We have to acquire the context spinlock and check guc_id again, if it
>>> -	 * is valid it hasn't been stolen and needs to be deregistered. We
>>> -	 * delete this context from the list of unpinned guc_id available to
>>> -	 * steal to seal a race with guc_lrc_desc_pin(). When the G2H CTB
>>> -	 * returns indicating this context has been deregistered the guc_id is
>>> -	 * returned to the pool of available guc_id.
>>> -	 */
>>> -	spin_lock_irqsave(&guc->contexts_lock, flags);
>>> -	if (context_guc_id_invalid(ce)) {
>>> -		spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> -		__guc_context_destroy(ce);
>>> -		return;
>>> -	}
>>> -
>>> -	if (!list_empty(&ce->guc_id.link))
>>> -		list_del_init(&ce->guc_id.link);
>>> -	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>>> -
>>> -	/* Seal race with Reset */
>>> -	spin_lock_irqsave(&ce->guc_state.lock, flags);
>>> -	disabled = submission_disabled(guc);
>>> -	if (likely(!disabled)) {
>>> -		set_context_destroyed(ce);
>>> -		clr_context_registered(ce);
>>> +	spin_lock_irqsave(&guc->submission_state.lock, flags);
>>> +	destroy = submission_disabled(guc) || context_guc_id_invalid(ce) ||
>>> +		!lrc_desc_registered(guc, ce->guc_id.id);
>>> +	if (likely(!destroy)) {
>>> +		if (!list_empty(&ce->guc_id.link))
>>> +			list_del_init(&ce->guc_id.link);
>>> +		list_add_tail(&ce->destroyed_link,
>>> +			      &guc->submission_state.destroyed_contexts);
>>> +	} else {
>>> +		__release_guc_id(guc, ce);
>>>    	}
>>> -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> -	if (unlikely(disabled)) {
>>> -		release_guc_id(guc, ce);
>>> +	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>> +	if (unlikely(destroy)) {
>>>    		__guc_context_destroy(ce);
>>>    		return;
>>>    	}
>>>    	/*
>>> -	 * We defer GuC context deregistration until the context is destroyed
>>> -	 * in order to save on CTBs. With this optimization ideally we only need
>>> -	 * 1 CTB to register the context during the first pin and 1 CTB to
>>> -	 * deregister the context when the context is destroyed. Without this
>>> -	 * optimization, a CTB would be needed every pin & unpin.
>>> -	 *
>>> -	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
>>> -	 * from context_free_worker when runtime wakeref is not held.
>>> -	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
>>> -	 * in H2G CTB to deregister the context. A future patch may defer this
>>> -	 * H2G CTB if the runtime wakeref is zero.
>>> +	 * We use a worker to issue the H2G to deregister the context as doing
>>> +	 * so may take the GT PM for the first time, which isn't allowed from
>>> +	 * an atomic context.
>>>    	 */
>>> -	with_intel_runtime_pm(runtime_pm, wakeref)
>>> -		guc_lrc_desc_unpin(ce);
>>> +	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
>>>    }
>>>    static int guc_context_alloc(struct intel_context *ce)
>>> @@ -2703,8 +2744,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
>>>    void intel_guc_submission_init_early(struct intel_guc *guc)
>>>    {
>>> -	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
>>> -	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
>>> +	guc->submission_state.max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
>>> +	guc->submission_state.num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
>>>    	guc->submission_supported = __guc_submission_supported(guc);
>>>    	guc->submission_selected = __guc_submission_selected(guc);
>>>    }
>>> @@ -2714,10 +2755,10 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>>>    {
>>>    	struct intel_context *ce;
>>> -	if (unlikely(desc_idx >= guc->max_guc_ids)) {
>>> +	if (unlikely(desc_idx >= guc->submission_state.max_guc_ids)) {
>>>    		drm_err(&guc_to_gt(guc)->i915->drm,
>>>    			"Invalid desc_idx %u, max %u",
>>> -			desc_idx, guc->max_guc_ids);
>>> +			desc_idx, guc->submission_state.max_guc_ids);
>>>    		return NULL;
>>>    	}
>>> @@ -2771,6 +2812,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
>>>    		intel_context_put(ce);
>>>    	} else if (context_destroyed(ce)) {
>>>    		/* Context has been destroyed */
>>> +		intel_gt_pm_put_async(guc_to_gt(guc));
>>>    		release_guc_id(guc, ce);
>>>    		__guc_context_destroy(ce);
>>>    	}
>>> @@ -3065,8 +3107,10 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>>>    	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
>>>    		   atomic_read(&guc->outstanding_submission_g2h));
>>> -	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
>>> -	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
>>> +	drm_printf(p, "GuC Number GuC IDs: %u\n",
>>> +		   guc->submission_state.num_guc_ids);
>>> +	drm_printf(p, "GuC Max GuC IDs: %u\n",
>>> +		   guc->submission_state.max_guc_ids);
>>>    	drm_printf(p, "GuC tasklet count: %u\n\n",
>>>    		   atomic_read(&sched_engine->tasklet.count));
>>>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-09-14  8:34           ` Tvrtko Ursulin
@ 2021-09-14 18:04             ` Matthew Brost
  2021-09-15  8:24               ` Tvrtko Ursulin
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-14 18:04 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Tue, Sep 14, 2021 at 09:34:08AM +0100, Tvrtko Ursulin wrote:
> 
> On 13/09/2021 17:50, Matthew Brost wrote:
> > On Mon, Sep 13, 2021 at 10:24:43AM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 10/09/2021 20:49, Matthew Brost wrote:
> > > > On Fri, Sep 10, 2021 at 12:12:42PM +0100, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 20/08/2021 23:44, Matthew Brost wrote:
> > > > > > Add logical engine mapping. This is required for split-frame, as
> > > > > > workloads need to be placed on engines in a logically contiguous manner.
> > > > > > 
> > > > > > v2:
> > > > > >     (Daniel Vetter)
> > > > > >      - Add kernel doc for new fields
> > > > > > 
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > ---
> > > > > >     drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
> > > > > >     drivers/gpu/drm/i915/gt/intel_engine_types.h  |  5 ++
> > > > > >     .../drm/i915/gt/intel_execlists_submission.c  |  1 +
> > > > > >     drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
> > > > > >     .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
> > > > > >     5 files changed, 60 insertions(+), 29 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > > > > index 0d9105a31d84..4d790f9a65dd 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > > > > @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
> > > > > >     	GEM_DEBUG_WARN_ON(iir);
> > > > > >     }
> > > > > > -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> > > > > > +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
> > > > > > +			      u8 logical_instance)
> > > > > >     {
> > > > > >     	const struct engine_info *info = &intel_engines[id];
> > > > > >     	struct drm_i915_private *i915 = gt->i915;
> > > > > > @@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> > > > > >     	engine->class = info->class;
> > > > > >     	engine->instance = info->instance;
> > > > > > +	engine->logical_mask = BIT(logical_instance);
> > > > > >     	__sprint_engine_name(engine);
> > > > > >     	engine->props.heartbeat_interval_ms =
> > > > > > @@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
> > > > > >     	return info->engine_mask;
> > > > > >     }
> > > > > > +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
> > > > > > +				 u8 class, const u8 *map, u8 num_instances)
> > > > > > +{
> > > > > > +	int i, j;
> > > > > > +	u8 current_logical_id = 0;
> > > > > > +
> > > > > > +	for (j = 0; j < num_instances; ++j) {
> > > > > > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > > > > > +			if (!HAS_ENGINE(gt, i) ||
> > > > > > +			    intel_engines[i].class != class)
> > > > > > +				continue;
> > > > > > +
> > > > > > +			if (intel_engines[i].instance == map[j]) {
> > > > > > +				logical_ids[intel_engines[i].instance] =
> > > > > > +					current_logical_id++;
> > > > > > +				break;
> > > > > > +			}
> > > > > > +		}
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
> > > > > > +{
> > > > > > +	int i;
> > > > > > +	u8 map[MAX_ENGINE_INSTANCE + 1];
> > > > > > +
> > > > > > +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
> > > > > > +		map[i] = i;
> > > > > 
> > > > > What's the point of the map array since it is 1:1 with instance?
> > > > > 
> > > > 
> > > > Future products do not have a 1 to 1 mapping and that mapping can change
> > > > based on fusing, e.g. XeHP SDV.
> > > > 
> > > > Also technically ICL / TGL / ADL physical instance 2 maps to logical
> > > > instance 1.
> > > 
> > > I don't follow the argument. All I can see is that "map[i] = i" always in
> > > the proposed code, which is then used to check "instance == map[instance]".
> > > So I'd suggest to remove this array from the code until there is a need for
> > > it.
> > > 
> > 
> > Ok, this logic is slightly confusing and makes more sense once we have
> > non-standard mappings. Yes, the map is set up as a 1 to 1 mapping by default
> > with the value in map[i] being a physical instance. Populate_logical_ids
> > searches the map finding all physical instances present in the map
> > assigning each found instance a new logical id increasing by 1 each
> > time.
> > 
> > e.g. If the map is setup 0-N and only physical instance 0 / 2 are
> > present they will get logical mapping 0 / 1 respectively.
> > 
> > This algorithm works for non-standard mappings too with fused parts. e.g.
> > on XeHP SDV the map is: { 0, 2, 4, 6, 1, 3, 5, 7 } and if any of the
> > physical instances can't be found due to fusing the logical mapping is
> > still correct per the bspec.
> > 
> > This array is absolutely needed for multi-lrc submission to work, even
> > on ICL / TGL / ADL as the GuC only supports logically contiguous engine
> > instances.
> 
> No idea how an array fixed at "map[i] = i" can be absolutely needed when you
> can just write it like "i". Sometimes it is okay to lay some groundwork for

You can't write "i", that is the point. The map is a search array: if an
entry is present, assign and increase the logical id. That is how a map
of 0, 1, 2 with physical instances 0, 2 present results in a logical
mapping of 0, 1. This is the algorithm we use for all parts, albeit some
parts have different maps (e.g. XeHP SDV, PVC, etc.) compared to here,
where we use the default map of sequential numbers.
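
To make that concrete, a distilled sketch of what populate_logical_ids()
does (illustration only - the assign_logical_ids() name and the
present[] array are mine; present[] stands in for the HAS_ENGINE() and
class checks in the real code):

static void assign_logical_ids(u8 *logical_ids, const bool *present,
			       const u8 *map, int num_instances)
{
	u8 current_logical_id = 0;
	int j;

	/* Walk the map in bspec order, numbering only present instances */
	for (j = 0; j < num_instances; ++j)
		if (present[map[j]])
			logical_ids[map[j]] = current_logical_id++;
}

With the XeHP SDV map { 0, 2, 4, 6, 1, 3, 5, 7 } and only physical
instances 0, 1 and 6 present, this assigns logical 0 to physical 0,
logical 1 to physical 6 and logical 2 to physical 1 - contiguous in
bspec order despite the fusing.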

> future platforms but in this case to me it's just obfuscation which should
> be added later, when it is required.

Same algorithm, see above. This should land as-is, as it doesn't make sense
to hack a different algorithm only to replace it with the correct
algorithm later.

> 
> > > > > > +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
> > > > > > +}
> > > > > > +
> > > > > >     /**
> > > > > >      * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
> > > > > >      * @gt: pointer to struct intel_gt
> > > > > > @@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> > > > > >     	struct drm_i915_private *i915 = gt->i915;
> > > > > >     	const unsigned int engine_mask = init_engine_mask(gt);
> > > > > >     	unsigned int mask = 0;
> > > > > > -	unsigned int i;
> > > > > > +	unsigned int i, class;
> > > > > > +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
> > > > > >     	int err;
> > > > > >     	drm_WARN_ON(&i915->drm, engine_mask == 0);
> > > > > > @@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> > > > > >     	if (i915_inject_probe_failure(i915))
> > > > > >     		return -ENODEV;
> > > > > > -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
> > > > > > -		if (!HAS_ENGINE(gt, i))
> > > > > > -			continue;
> > > > > > +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
> > > > > > +		setup_logical_ids(gt, logical_ids, class);
> > > > > > -		err = intel_engine_setup(gt, i);
> > > > > > -		if (err)
> > > > > > -			goto cleanup;
> > > > > > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > > > > > +			u8 instance = intel_engines[i].instance;
> > > > > > +
> > > > > > +			if (intel_engines[i].class != class ||
> > > > > > +			    !HAS_ENGINE(gt, i))
> > > > > > +				continue;
> > > > > > -		mask |= BIT(i);
> > > > > > +			err = intel_engine_setup(gt, i,
> > > > > > +						 logical_ids[instance]);
> > > > > > +			if (err)
> > > > > > +				goto cleanup;
> > > > > > +
> > > > > > +			mask |= BIT(i);
> > > > > 
> > > > > I still this there is a less clunky way to set this up in less code and more
> > > > > readable at the same time. Like do it in two passes so you can iterate
> > > > > gt->engine_class[] array instead of having to implement a skip condition
> > > > > (both on class and HAS_ENGINE at two places) and also avoid walking the flat
> > > > > intel_engines array recursively.
> > > > > 
> > > > 
> > > > Kinda a bikeshed arguing about a pretty simple loop structure, don't you
> > > > think? I personally like the way it is laid out.
> > > > 
> > > > Pseudo code for your suggestion?
> > > 
> > > Leave the existing setup loop as is and add an additional "for engine class"
> > > walk after it. That way you can walk already setup gt->engine_class[] array
> > > so wouldn't need to skip wrong classes and have HAS_ENGINE checks when
> > > walking the flat intel_engines[] array several times. It also applies to the
> > > helper which counts logical instances per class.
> > > 
> > 
Ok, I think I see what you are getting at. Again IMO this is a total
bikeshed, as this is a one-time setup step where we really should only
care whether the loop works, not whether it is optimized or looks the
way a certain person wants. I can change this if you really insist but
again IMO discussing this is a total waste of energy.
> 
It should be such a no-brainer to go with the simpler and less invasive
change that I honestly don't understand what the big deal is. Here is my
pseudo code one more time and that will be the last from me on the topic.
> 
> Today we have:
> 
> for_each intel_engines: // intel_engines is a flat list of all engines
> 	intel_engine_setup()
> 
> You propose to change it to:
> 
> for_each engine_class:
>    for 0..max_global_engine_instance:
>       for_each intel_engines:
>          skip engine not present
>          skip class not matching
> 
>          count logical instance
> 
>    for_each intel_engines:
>       skip engine not present
>       skip wrong class
> 
>       intel_engine_setup()
> 
> 
> I propose:
> 
> // Leave as is:
> 
> for_each intel_engines:
>    intel_engine_setup()
> 
> // Add:
> 
> for_each engine_class:
>    logical = 0
>    for_each gt->engine_class[class]:
>       skip engine not present
> 
>       engine->logical_instance = logical++
> 
> 
> When code which actually needs a perturbed "map" arrives you add that into
> this second loop.
>

See above, why introduce an algorithm that doesn't work for future parts
+ future patches that are landing imminently? It makes zero sense whatsoever.
With your proposal we would literally land code just to throw it away a
couple of months from now + break patches we intend to land soon. This
algorithm works and has no reason whatsoever to be optimal as it is a
one-time setup call. I really don't understand why we are still talking
about this paint color.

> > > > > > +		}
> > > > > >     	}
> > > > > >     	/*
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > > > > index ed91bcff20eb..fddf35546b58 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > > > > @@ -266,6 +266,11 @@ struct intel_engine_cs {
> > > > > >     	unsigned int guc_id;
> > > > > >     	intel_engine_mask_t mask;
> > > > > > +	/**
> > > > > > +	 * @logical_mask: logical mask of engine, reported to user space via
> > > > > > +	 * query IOCTL and used to communicate with the GuC in logical space
> > > > > > +	 */
> > > > > > +	intel_engine_mask_t logical_mask;
> > > > > 
> > > > > You could prefix the new field with uabi_ to match the existing scheme and
> > > > > to signify to anyone who might be touching it in the future it should not be
> > > > > changed.
> > > > 
> > > > This is kinda uabi, but also kinda isn't. We do report a logical
> > instance via IOCTL but it also really is tied to the GuC backend as we only
> > > > can communicate with the GuC in logical space. IMO we should leave as
> > > > is.
> > > 
> > > Perhaps it would be best to call the new field uabi_logical_instance so it's
> > > clear it is reported in the query directly and do the BIT() transformation
> > > in the GuC backend?
> > > 
> > 
> > Virtual engines can have multiple bits set in this mask, so this is
> > used for both the query on physical engines via UABI and submission in
> > the GuC backend.
> 
> You could add both fields if that would help. I just think it is preferable
> to keep the existing convention of uabi_ prefix for the fields which i915
> refactors must not change willy-nilly.
>

Sure can add another field to make the separation clear.
 
> > 
> > > > 
> > > > > 
> > > > > Also, I think comment should explain what is logical space ie. how the
> > > > > numbering works.
> > > > > 
> > > > 
> > > > Don't I already do that? I suppose I could add something like:
> > > 
> > > Where is it? Can't see it in the uapi kerneldoc AFAICS (for the new query)
> > > or here.
> > > 
> > 
> > 	/**
> > 	 * @logical_mask: logical mask of engine, reported to user space via
> > 	 * query IOCTL and used to communicate with the GuC in logical space
> > 	 */
> > > > 
> > > > The logical mask within engine class must be contiguous across all
> > > > instances.
> > > 
> > > Best not to start mentioning the mask for the first time. Just explain what
> > > logical numbering is in terms of how engines are enumerated in order of
> > > physical instances but skipping the fused-off ones. The kerneldoc for the
> > > new query is, I think, the right place.
> > > 
> > 
> > Maybe I can add:
> > 
> > The logical mapping is defined on a per-part basis in the bspec and can
> > vary based on the part's fusing.
> 
> Sounds good, I think that would be useful. But in the uapi kerneldoc.
> 
> Perhaps a cross-link to/from the kernel doc which talks about frame split to
> explain consecutive logical instances have to be used. That would tie the
> new query with that uapi in the narrative.
>

Sure, let me see what I can do here.

Matt
 
> Regards,
> 
> Tvrtko
> 
> > > > > Not sure the part about GuC needs to be in the comment since uapi is
> > > > > supposed to be backend agnostic.
> > > > > 
> > > > 
> > > > Communicating with the GuC in logical space is a pretty key point here.
> > > > The communication with the GuC in logical space is backend specific but
> > > > how our hardware works (e.g. split frame workloads must be placed
> > > > logically contiguous) is not. Mentioning the GuC requirement here makes
> > > > sense to me for completeness.
> > > 
> > > Yeah might be, I was thinking more about the new query. Query definitely is
> > > backend agnostic but yes it is fine to say in the comment here the new field
> > > is used both for the query and for communicating with GuC.
> > > 
> > 
> > Sounds good, will make it clear it is used for the query and for
> > communicating with the GuC.
> > 
> > Matt
> > > Regards,
> > > 
> > > Tvrtko
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > > Regards,
> > > > > 
> > > > > Tvrtko
> > > > > 
> > > > > >     	u8 class;
> > > > > >     	u8 instance;
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > > index cafb0608ffb4..813a6de01382 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > > @@ -3875,6 +3875,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > > > > >     		ve->siblings[ve->num_siblings++] = sibling;
> > > > > >     		ve->base.mask |= sibling->mask;
> > > > > > +		ve->base.logical_mask |= sibling->logical_mask;
> > > > > >     		/*
> > > > > >     		 * All physical engines must be compatible for their emission
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > > > > index 6926919bcac6..9f5f43a16182 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > > > > @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
> > > > > >     	for_each_engine(engine, gt, id) {
> > > > > >     		u8 guc_class = engine_class_to_guc_class(engine->class);
> > > > > > -		system_info->mapping_table[guc_class][engine->instance] =
> > > > > > +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
> > > > > >     			engine->instance;
> > > > > >     	}
> > > > > >     }
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > index e0eed70f9b92..ffafbac7335e 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > @@ -1401,23 +1401,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> > > > > >     	return __guc_action_deregister_context(guc, guc_id, loop);
> > > > > >     }
> > > > > > -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
> > > > > > -{
> > > > > > -	switch (class) {
> > > > > > -	case RENDER_CLASS:
> > > > > > -		return mask >> RCS0;
> > > > > > -	case VIDEO_ENHANCEMENT_CLASS:
> > > > > > -		return mask >> VECS0;
> > > > > > -	case VIDEO_DECODE_CLASS:
> > > > > > -		return mask >> VCS0;
> > > > > > -	case COPY_ENGINE_CLASS:
> > > > > > -		return mask >> BCS0;
> > > > > > -	default:
> > > > > > -		MISSING_CASE(class);
> > > > > > -		return 0;
> > > > > > -	}
> > > > > > -}
> > > > > > -
> > > > > >     static void guc_context_policy_init(struct intel_engine_cs *engine,
> > > > > >     				    struct guc_lrc_desc *desc)
> > > > > >     {
> > > > > > @@ -1459,8 +1442,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > > > >     	desc = __get_lrc_desc(guc, desc_idx);
> > > > > >     	desc->engine_class = engine_class_to_guc_class(engine->class);
> > > > > > -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> > > > > > -						      engine->mask);
> > > > > > +	desc->engine_submit_mask = engine->logical_mask;
> > > > > >     	desc->hw_context_desc = ce->lrc.lrca;
> > > > > >     	desc->priority = ce->guc_state.prio;
> > > > > >     	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > > > > > @@ -3260,6 +3242,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > > > > >     		}
> > > > > >     		ve->base.mask |= sibling->mask;
> > > > > > +		ve->base.logical_mask |= sibling->logical_mask;
> > > > > >     		if (n != 0 && ve->base.class != sibling->class) {
> > > > > >     			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
> > > > > > 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-09-14 18:04             ` Matthew Brost
@ 2021-09-15  8:24               ` Tvrtko Ursulin
  2021-09-15 16:58                 ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-15  8:24 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu


On 14/09/2021 19:04, Matthew Brost wrote:
> On Tue, Sep 14, 2021 at 09:34:08AM +0100, Tvrtko Ursulin wrote:
>>

8<

>> Today we have:
>>
>> for_each intel_engines: // intel_engines is a flat list of all engines
>> 	intel_engine_setup()
>>
>> You propose to change it to:
>>
>> for_each engine_class:
>>     for 0..max_global_engine_instance:
>>        for_each intel_engines:
>>           skip engine not present
>>           skip class not matching
>>
>>           count logical instance
>>
>>     for_each intel_engines:
>>        skip engine not present
>>        skip wrong class
>>
>>        intel_engine_setup()
>>
>>
>> I propose:
>>
>> // Leave as is:
>>
>> for_each intel_engines:
>>     intel_engine_setup()
>>
>> // Add:
>>
>> for_each engine_class:
>>     logical = 0
>>     for_each gt->engine_class[class]:
>>        skip engine not present
>>
>>        engine->logical_instance = logical++
>>
>>
>> When code which actually needs a preturbed "map" arrives you add that in to
>> this second loop.
>>
> 
> See above, why introduce an algorithm that doesn't work for future parts
> + future patches that are landing imminently? It makes zero sense whatsoever.
> With your proposal we would literally land code just to throw it away a
> couple of months from now + break patches we intend to land soon. This

It sure works, it just walks the per-class list instead of walking the 
flat list skipping one class at a time.

Just add the map based transformation to the second pass later, when it 
becomes required.

> algorithm works and has no reason whatsoever to be optimal as it is a
> one-time setup call. I really don't understand why we are still talking
> about this paint color.

I don't think bike shedding is an appropriate term when the complaint is 
that the proposed algorithm is needlessly complicated.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-09-15  8:24               ` Tvrtko Ursulin
@ 2021-09-15 16:58                 ` Matthew Brost
  2021-09-16  8:31                   ` Tvrtko Ursulin
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-15 16:58 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Wed, Sep 15, 2021 at 09:24:15AM +0100, Tvrtko Ursulin wrote:
> 
> On 14/09/2021 19:04, Matthew Brost wrote:
> > On Tue, Sep 14, 2021 at 09:34:08AM +0100, Tvrtko Ursulin wrote:
> > > 
> 
> 8<
> 
> > > Today we have:
> > > 
> > > for_each intel_engines: // intel_engines is a flat list of all engines
> > > 	intel_engine_setup()
> > > 
> > > You propose to change it to:
> > > 
> > > for_each engine_class:
> > >     for 0..max_global_engine_instance:
> > >        for_each intel_engines:
> > >           skip engine not present
> > >           skip class not matching
> > > 
> > >           count logical instance
> > > 
> > >     for_each intel_engines:
> > >        skip engine not present
> > >        skip wrong class
> > > 
> > >        intel_engine_setup()
> > > 
> > > 
> > > I propose:
> > > 
> > > // Leave as is:
> > > 
> > > for_each intel_engines:
> > >     intel_engine_setup()
> > > 
> > > // Add:
> > > 
> > > for_each engine_class:
> > >     logical = 0
> > >     for_each gt->engine_class[class]:
> > >        skip engine not present
> > > 
> > >        engine->logical_instance = logical++
> > > 
> > > 
> > > When code which actually needs a preturbed "map" arrives you add that in to
> > > this second loop.
> > > 
> > 
> > See above, why introduce an algorithm that doesn't work for future parts
> > + future patches that are landing imminently? It makes zero sense whatsoever.
> > With your proposal we would literally land code just to throw it away a
> > couple of months from now + break patches we intend to land soon. This
> 
> It sure works, it just walks the per-class list instead of walking the flat
> list skipping one class at a time.
> 
> Just add the map based transformation to the second pass later, when it
> becomes required.
> 

I can flatten the algorithm if that helps alleviate your concerns but,
with that being said, I've played around with this locally and IMO it
makes the code way more ugly. Sure it eliminates some iterations of the
loop but who really cares about that in a one-time setup function?

> > algorithm works and has no reason whatsoever to be optimal as it is a
> > one-time setup call. I really don't understand why we are still talking
> > about this paint color.
> 
> I don't think bike shedding is an appropriate term when the complaint is
> that the proposed algorithm is needlessly complicated.
>

Are you just ignoring the fact that the algorithm (map) is needed in
pending patches? IMO it is more complicated to write throwaway code
when the proper algorithm is already written. If the logical mapping were
as straightforward on all platforms as on the ones currently upstream I
would 100% agree with your suggestion, but it isn't on unembargoed
platforms imminently going upstream. The algorithm I have works for the
current platforms + the pending platforms. IMO it is 100% acceptable to
merge something looking towards a known future.

Matt

> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 12/27] drm/i915/guc: Add multi-lrc context registration
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-15 19:21   ` John Harrison
  2021-09-15 19:31     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-15 19:21 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Add multi-lrc context registration H2G. In addition a workqueue and
> process descriptor are set up during multi-lrc context registration as
> these data structures are needed for multi-lrc submission.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
>   drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 109 +++++++++++++++++-
>   4 files changed, 126 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 0fafc178cf2c..6f567ebeb039 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -232,8 +232,20 @@ struct intel_context {
>   		/** @parent: pointer to parent if child */
>   		struct intel_context *parent;
>   
> +
> +		/** @guc_wqi_head: head pointer in work queue */
> +		u16 guc_wqi_head;
> +		/** @guc_wqi_tail: tail pointer in work queue */
> +		u16 guc_wqi_tail;
> +
These should be in the 'guc_state' sub-struct? Would be good to keep all 
GuC-specific content in one self-contained struct. Especially given the 
other child/parent fields are not going to be guc_ prefixed any more.


>   		/** @guc_number_children: number of children if parent */
>   		u8 guc_number_children;
> +
> +		/**
> +		 * @parent_page: page in context used by parent for work queue,
Maybe 'page in context record'? Otherwise, exactly what 'context' is 
meant here? It isn't the 'struct intel_context'. The context record is 
saved as 'ce->state' / 'ce->lrc_reg_state', yes? Is it possible to link 
to either of those fields? Probably not given that they don't appear to 
have any kerneldoc description :(. Maybe add that in too :).

> +		 * work queue descriptor
Later on, it is described as 'process descriptor and work queue'. It 
would be good to be consistent.

> +		 */
> +		u8 parent_page;
>   	};
>   
>   #ifdef CONFIG_DRM_I915_SELFTEST
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index bb4af4977920..0ddbad4e062a 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -861,6 +861,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
>   		context_size += PAGE_SIZE;
>   	}
>   
> +	if (intel_context_is_parent(ce)) {
> +		ce->parent_page = context_size / PAGE_SIZE;
> +		context_size += PAGE_SIZE;
> +	}
> +
>   	obj = i915_gem_object_create_lmem(engine->i915, context_size, 0);
>   	if (IS_ERR(obj))
>   		obj = i915_gem_object_create_shmem(engine->i915, context_size);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index fa4be13c8854..0e600a3b8f1e 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -52,7 +52,7 @@
>   
>   #define GUC_DOORBELL_INVALID		256
>   
> -#define GUC_WQ_SIZE			(PAGE_SIZE * 2)
> +#define GUC_WQ_SIZE			(PAGE_SIZE / 2)
Is this size actually dictated by the GuC API? Or is it just a driver 
level decision? If the latter, shouldn't this be below instead?

>   
>   /* Work queue item header definitions */
>   #define WQ_STATUS_ACTIVE		1
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 14b24298cdd7..dbcb9ab28a9a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -340,6 +340,39 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
>   	return rb_entry(rb, struct i915_priolist, node);
>   }
>   
> +/*
> + * When using multi-lrc submission an extra page in the context state is
> + * reserved for the process descriptor and work queue.
> + *
> + * The layout of this page is below:
> + * 0						guc_process_desc
> + * ...						unused
> + * PAGE_SIZE / 2				work queue start
> + * ...						work queue
> + * PAGE_SIZE - 1				work queue end
> + */
> +#define WQ_OFFSET	(PAGE_SIZE / 2)
Can this not be derived from GUC_WQ_SIZE given that the two are 
fundamentally linked? E.g. '#define WQ_OFFSET (PAGE_SIZE - 
GUC_WQ_SIZE)'? And maybe have a '#define WQ_TOTAL_SIZE PAGE_SIZE' and 
use that in all of WQ_OFFSET, GUC_WQ_SIZE and the allocation itself in 
intel_lrc.c?

Also, the process descriptor is actually an array of descriptors sized 
by the number of children? Or am I misunderstanding the code below? If 
so, shouldn't there be a 'COMPILE_BUG_ON((MAX_ENGINE_INSTANCE * 
sizeof(descriptor)) < WQ_DESC_SIZE)' where WQ_DESC_SIZE is 
WQ_TOTAL_SIZE - WQ_SIZE?
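
For reference, a quick sketch of that derivation (names as suggested
above, not from the patch; the BUILD_BUG_ON assumes my array-of-
descriptors reading is right, and would have to live inside a function,
e.g. somewhere in the init path):

#define WQ_TOTAL_SIZE	PAGE_SIZE
#define GUC_WQ_SIZE	(WQ_TOTAL_SIZE / 2)
#define WQ_OFFSET	(WQ_TOTAL_SIZE - GUC_WQ_SIZE)

	/* The descriptor region must fit in the space before the WQ. */
	BUILD_BUG_ON((MAX_ENGINE_INSTANCE + 1) *
		     sizeof(struct guc_process_desc) >
		     WQ_TOTAL_SIZE - GUC_WQ_SIZE);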

> +static u32 __get_process_desc_offset(struct intel_context *ce)
> +{
> +	GEM_BUG_ON(!ce->parent_page);
> +
> +	return ce->parent_page * PAGE_SIZE;
> +}
> +
> +static u32 __get_wq_offset(struct intel_context *ce)
> +{
> +	return __get_process_desc_offset(ce) + WQ_OFFSET;
> +}
> +
> +static struct guc_process_desc *
> +__get_process_desc(struct intel_context *ce)
> +{
> +	return (struct guc_process_desc *)
> +		(ce->lrc_reg_state +
> +		 ((__get_process_desc_offset(ce) -
> +		   LRC_STATE_OFFSET) / sizeof(u32)));
Where did the LRC_STATE_OFFSET come from? Is that built into the 
lrc_reg_state pointer itself? That needs to be documented somewhere.
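
For what it's worth, my reading of the math (an assumption, not stated
in the patch): ce->lrc_reg_state points LRC_STATE_OFFSET bytes into the
mapped context state, so offsets relative to the start of the context
object have to subtract it before being applied, roughly:

	/* back up from the register state to the start of the object */
	void *state_base = (void *)ce->lrc_reg_state - LRC_STATE_OFFSET;
	struct guc_process_desc *pdesc =
		state_base + __get_process_desc_offset(ce);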

> +}
> +
>   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>   {
>   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> @@ -1342,6 +1375,30 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
> +static int __guc_action_register_multi_lrc(struct intel_guc *guc,
> +					   struct intel_context *ce,
> +					   u32 guc_id,
> +					   u32 offset,
> +					   bool loop)
> +{
> +	struct intel_context *child;
> +	u32 action[4 + MAX_ENGINE_INSTANCE];
> +	int len = 0;
> +
> +	GEM_BUG_ON(ce->guc_number_children > MAX_ENGINE_INSTANCE);
> +
> +	action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
> +	action[len++] = guc_id;
> +	action[len++] = ce->guc_number_children + 1;
> +	action[len++] = offset;
> +	for_each_child(ce, child) {
> +		offset += sizeof(struct guc_lrc_desc);
> +		action[len++] = offset;
> +	}
> +
> +	return guc_submission_send_busy_loop(guc, action, len, 0, loop);
> +}
> +
>   static int __guc_action_register_context(struct intel_guc *guc,
>   					 u32 guc_id,
>   					 u32 offset,
> @@ -1364,9 +1421,15 @@ static int register_context(struct intel_context *ce, bool loop)
>   		ce->guc_id.id * sizeof(struct guc_lrc_desc);
>   	int ret;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   	trace_intel_context_register(ce);
>   
> -	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
> +	if (intel_context_is_parent(ce))
> +		ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
> +						      offset, loop);
> +	else
> +		ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
> +						    loop);
>   	if (likely(!ret)) {
>   		unsigned long flags;
>   
> @@ -1396,6 +1459,7 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   	trace_intel_context_deregister(ce);
>   
>   	return __guc_action_deregister_context(guc, guc_id, loop);
> @@ -1423,6 +1487,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	struct guc_lrc_desc *desc;
>   	bool context_registered;
>   	intel_wakeref_t wakeref;
> +	struct intel_context *child;
>   	int ret = 0;
>   
>   	GEM_BUG_ON(!engine->mask);
> @@ -1448,6 +1513,42 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>   	guc_context_policy_init(engine, desc);
>   
> +	/*
> +	 * Context is a parent, we need to register a process descriptor
> +	 * describing a work queue and register all child contexts.
Technically, this should say 'If the context is a parent'. Or just move 
it to be inside the if block.

> +	 */
> +	if (intel_context_is_parent(ce)) {
> +		struct guc_process_desc *pdesc;
> +
> +		ce->guc_wqi_tail = 0;
> +		ce->guc_wqi_head = 0;
> +
> +		desc->process_desc = i915_ggtt_offset(ce->state) +
> +			__get_process_desc_offset(ce);
> +		desc->wq_addr = i915_ggtt_offset(ce->state) +
> +			__get_wq_offset(ce);
> +		desc->wq_size = GUC_WQ_SIZE;
> +
> +		pdesc = __get_process_desc(ce);
> +		memset(pdesc, 0, sizeof(*(pdesc)));
> +		pdesc->stage_id = ce->guc_id.id;
> +		pdesc->wq_base_addr = desc->wq_addr;
> +		pdesc->wq_size_bytes = desc->wq_size;
> +		pdesc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
Should this not be inherited from the ce? And same below. Or are we not 
using this priority in that way?

John.

> +		pdesc->wq_status = WQ_STATUS_ACTIVE;
> +
> +		for_each_child(ce, child) {
> +			desc = __get_lrc_desc(guc, child->guc_id.id);
> +
> +			desc->engine_class =
> +				engine_class_to_guc_class(engine->class);
> +			desc->hw_context_desc = child->lrc.lrca;
> +			desc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
> +			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> +			guc_context_policy_init(engine, desc);
> +		}
> +	}
> +
>   	/*
>   	 * The context_lookup xarray is used to determine if the hardware
>   	 * context is currently registered. There are two cases in which it
> @@ -2858,6 +2959,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>   		return NULL;
>   	}
>   
> +	if (unlikely(intel_context_is_child(ce))) {
> +		drm_err(&guc_to_gt(guc)->i915->drm,
> +			"Context is child, desc_idx %u", desc_idx);
> +		return NULL;
> +	}
> +
>   	return ce;
>   }
>   



* Re: [Intel-gfx] [PATCH 13/27] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-15 19:24   ` John Harrison
  2021-09-15 19:34     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-15 19:24 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> In GuC parent-child contexts the parent context controls the scheduling;
> ensure only the parent does the scheduling operations.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 ++++++++++++++-----
>   1 file changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index dbcb9ab28a9a..00d54bb00bfb 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -320,6 +320,12 @@ static void decr_context_committed_requests(struct intel_context *ce)
>   	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
>   }
>   
> +static struct intel_context *
> +request_to_scheduling_context(struct i915_request *rq)
> +{
> +	return intel_context_to_parent(rq->context);
> +}
> +
>   static bool context_guc_id_invalid(struct intel_context *ce)
>   {
>   	return ce->guc_id.id == GUC_INVALID_LRC_ID;
> @@ -1684,6 +1690,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
>   
>   	GEM_BUG_ON(guc_id == GUC_INVALID_LRC_ID);
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   	trace_intel_context_sched_disable(ce);
>   
>   	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action),
> @@ -1898,6 +1905,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
>   	u16 guc_id;
>   	bool enabled;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
>   	    !lrc_desc_registered(guc, ce->guc_id.id)) {
>   		spin_lock_irqsave(&ce->guc_state.lock, flags);
> @@ -2286,6 +2295,8 @@ static void guc_signal_context_fence(struct intel_context *ce)
>   {
>   	unsigned long flags;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   	clr_context_wait_for_deregister_to_register(ce);
>   	__guc_signal_context_fence(ce);
> @@ -2315,7 +2326,7 @@ static void guc_context_init(struct intel_context *ce)
>   
>   static int guc_request_alloc(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	struct intel_guc *guc = ce_to_guc(ce);
>   	unsigned long flags;
>   	int ret;
> @@ -2358,11 +2369,12 @@ static int guc_request_alloc(struct i915_request *rq)
>   	 * exhausted and return -EAGAIN to the user indicating that they can try
>   	 * again in the future.
>   	 *
> -	 * There is no need for a lock here as the timeline mutex ensures at
> -	 * most one context can be executing this code path at once. The
> -	 * guc_id_ref is incremented once for every request in flight and
> -	 * decremented on each retire. When it is zero, a lock around the
> -	 * increment (in pin_guc_id) is needed to seal a race with unpin_guc_id.
> +	 * There is no need for a lock here as the timeline mutex (or
> +	 * parallel_submit mutex in the case of multi-lrc) ensures at most one
> +	 * context can be executing this code path at once. The guc_id_ref is
Isn't that now two? One uni-LRC holding the timeline mutex and one 
multi-LRC holding the parallel submit mutex?

John.

> +	 * incremented once for every request in flight and decremented on each
> +	 * retire. When it is zero, a lock around the increment (in pin_guc_id)
> +	 * is needed to seal a race with unpin_guc_id.
>   	 */
>   	if (atomic_add_unless(&ce->guc_id.ref, 1, 0))
>   		goto out;



* Re: [Intel-gfx] [PATCH 12/27] drm/i915/guc: Add multi-lrc context registration
  2021-09-15 19:21   ` John Harrison
@ 2021-09-15 19:31     ` Matthew Brost
  2021-09-15 20:23       ` John Harrison
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-15 19:31 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Wed, Sep 15, 2021 at 12:21:35PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Add multi-lrc context registration H2G. In addition a workqueue and
> > process descriptor are setup during multi-lrc context registration as
> > these data structures are needed for multi-lrc submission.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
> >   drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 109 +++++++++++++++++-
> >   4 files changed, 126 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 0fafc178cf2c..6f567ebeb039 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -232,8 +232,20 @@ struct intel_context {
> >   		/** @parent: pointer to parent if child */
> >   		struct intel_context *parent;
> > +
> > +		/** @guc_wqi_head: head pointer in work queue */
> > +		u16 guc_wqi_head;
> > +		/** @guc_wqi_tail: tail pointer in work queue */
> > +		u16 guc_wqi_tail;
> > +
> These should be in the 'guc_state' sub-struct? Would be good to keep all GuC
> specific content in one self-contained struct. Especially given the other
> child/parent fields are not going to be guc_ prefixed any more.
> 

Right now I have everything in guc_state protected by guc_state.lock, but
these fields are not protected by this lock. IMO it is better to use a
different sub-structure for the parallel fields (even if anonymous).
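
Something like this is what I have in mind (a sketch only, exact
grouping and naming still open):

	/* parallel submission state, not protected by guc_state.lock */
	struct {
		/** @guc_wqi_head: head pointer in work queue */
		u16 guc_wqi_head;
		/** @guc_wqi_tail: tail pointer in work queue */
		u16 guc_wqi_tail;
	};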

> 
> >   		/** @guc_number_children: number of children if parent */
> >   		u8 guc_number_children;
> > +
> > +		/**
> > +		 * @parent_page: page in context used by parent for work queue,
> Maybe 'page in context record'? Otherwise, exactly what 'context' is meant
> here? It isn't the 'struct intel_context'. The context record is saved as
> 'ce->state' / 'ce->lrc_reg_state', yes? Is it possible to link to either of

It is the page in ce->state / page minus LRC reg offset in
ce->lrc_reg_state. Will update the commit to make that clear.

> those fields? Probably not given that they don't appear to have any kerneldoc
> description :(. Maybe add that in too :).
> 
> > +		 * work queue descriptor
> Later on, it is described as 'process descriptor and work queue'. It would
> be good to be consistent.
>

Yep. Will fix.

> > +		 */
> > +		u8 parent_page;
> >   	};
> >   #ifdef CONFIG_DRM_I915_SELFTEST
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index bb4af4977920..0ddbad4e062a 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -861,6 +861,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
> >   		context_size += PAGE_SIZE;
> >   	}
> > +	if (intel_context_is_parent(ce)) {
> > +		ce->parent_page = context_size / PAGE_SIZE;
> > +		context_size += PAGE_SIZE;
> > +	}
> > +
> >   	obj = i915_gem_object_create_lmem(engine->i915, context_size, 0);
> >   	if (IS_ERR(obj))
> >   		obj = i915_gem_object_create_shmem(engine->i915, context_size);
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > index fa4be13c8854..0e600a3b8f1e 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > @@ -52,7 +52,7 @@
> >   #define GUC_DOORBELL_INVALID		256
> > -#define GUC_WQ_SIZE			(PAGE_SIZE * 2)
> > +#define GUC_WQ_SIZE			(PAGE_SIZE / 2)
> Is this size actually dictated by the GuC API? Or is it just a driver level
> decision? If the latter, shouldn't this be below instead?
>

Driver level decision. What exactly do you mean by below?
 
> >   /* Work queue item header definitions */
> >   #define WQ_STATUS_ACTIVE		1
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 14b24298cdd7..dbcb9ab28a9a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -340,6 +340,39 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
> >   	return rb_entry(rb, struct i915_priolist, node);
> >   }
> > +/*
> > + * When using multi-lrc submission an extra page in the context state is
> > + * reserved for the process descriptor and work queue.
> > + *
> > + * The layout of this page is below:
> > + * 0						guc_process_desc
> > + * ...						unused
> > + * PAGE_SIZE / 2				work queue start
> > + * ...						work queue
> > + * PAGE_SIZE - 1				work queue end
> > + */
> > +#define WQ_OFFSET	(PAGE_SIZE / 2)
> Can this not be derived from GUC_WQ_SIZE given that the two are
> fundamentally linked? E.g. '#define WQ_OFFSET (PAGE_SIZE - GUC_WQ_SIZE)'?

Yes. I like '#define WQ_OFFSET (PAGE_SIZE - GUC_WQ_SIZE)'. Will change.
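
So the pair of defines would end up as (a sketch of the agreed change):

	#define GUC_WQ_SIZE	(PAGE_SIZE / 2)
	#define WQ_OFFSET	(PAGE_SIZE - GUC_WQ_SIZE)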

> And maybe have a '#define WQ_TOTAL_SIZE PAGE_SIZE' and use that in all of
> WQ_OFFSET, GUC_WQ_SIZE and the allocation itself in intel_lrc.c?
> 
> Also, the process descriptor is actually an array of descriptors sized by
the number of children? Or am I misunderstanding the code below? If so,

No, it is a fixed size descriptor.

A later patch in the series uses the space above the process descriptor
for insertion of the preemption points handshake. That does
depend on the number of children. I will add a COMPILE_BUG_ON for that
to ensure everything fits in the memory layout.
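
Roughly this kind of check (a sketch - BUILD_BUG_ON being the actual
kernel macro, and PER_CHILD_HANDSHAKE_SIZE a placeholder name for
however much space the handshake needs per child):

	/* process desc + per-child handshake data must fit below the WQ */
	BUILD_BUG_ON(sizeof(struct guc_process_desc) +
		     MAX_ENGINE_INSTANCE * PER_CHILD_HANDSHAKE_SIZE >
		     WQ_OFFSET);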

> shouldn't there be a 'COMPILE_BUG_ON((MAX_ENGINE_INSTANCE *
> sizeof(descriptor)) < WQ_DESC_SIZE)' where WQ_DESC_SIZE is WQ_TOTAL_SIZE -
> WQ_SIZE?
> 
> > +static u32 __get_process_desc_offset(struct intel_context *ce)
> > +{
> > +	GEM_BUG_ON(!ce->parent_page);
> > +
> > +	return ce->parent_page * PAGE_SIZE;
> > +}
> > +
> > +static u32 __get_wq_offset(struct intel_context *ce)
> > +{
> > +	return __get_process_desc_offset(ce) + WQ_OFFSET;
> > +}
> > +
> > +static struct guc_process_desc *
> > +__get_process_desc(struct intel_context *ce)
> > +{
> > +	return (struct guc_process_desc *)
> > +		(ce->lrc_reg_state +
> > +		 ((__get_process_desc_offset(ce) -
> > +		   LRC_STATE_OFFSET) / sizeof(u32)));
> Where did the LRC_STATE_OFFSET come from? Is that built in to the
> lrc_reg_state pointer itself? That needs to be documented somewhere.
> 

In gt/intel_lrc.c (lrc_pin) ce->lrc_reg_state is assigned to
mmap(ce->state) + LRC_STATE_OFFSET. I can update the kerneldoc for that
field in this patch.
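
Spelling out the arithmetic for reference (this just restates what
__get_process_desc() in the patch does):

	/* ce->lrc_reg_state points LRC_STATE_OFFSET bytes into ce->state */
	u32 byte_offset = __get_process_desc_offset(ce) - LRC_STATE_OFFSET;
	struct guc_process_desc *pdesc =
		(struct guc_process_desc *)(ce->lrc_reg_state +
					    byte_offset / sizeof(u32));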

> > +}
> > +
> >   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> >   {
> >   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > @@ -1342,6 +1375,30 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> > +static int __guc_action_register_multi_lrc(struct intel_guc *guc,
> > +					   struct intel_context *ce,
> > +					   u32 guc_id,
> > +					   u32 offset,
> > +					   bool loop)
> > +{
> > +	struct intel_context *child;
> > +	u32 action[4 + MAX_ENGINE_INSTANCE];
> > +	int len = 0;
> > +
> > +	GEM_BUG_ON(ce->guc_number_children > MAX_ENGINE_INSTANCE);
> > +
> > +	action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
> > +	action[len++] = guc_id;
> > +	action[len++] = ce->guc_number_children + 1;
> > +	action[len++] = offset;
> > +	for_each_child(ce, child) {
> > +		offset += sizeof(struct guc_lrc_desc);
> > +		action[len++] = offset;
> > +	}
> > +
> > +	return guc_submission_send_busy_loop(guc, action, len, 0, loop);
> > +}
> > +
> >   static int __guc_action_register_context(struct intel_guc *guc,
> >   					 u32 guc_id,
> >   					 u32 offset,
> > @@ -1364,9 +1421,15 @@ static int register_context(struct intel_context *ce, bool loop)
> >   		ce->guc_id.id * sizeof(struct guc_lrc_desc);
> >   	int ret;
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	trace_intel_context_register(ce);
> > -	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
> > +	if (intel_context_is_parent(ce))
> > +		ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
> > +						      offset, loop);
> > +	else
> > +		ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
> > +						    loop);
> >   	if (likely(!ret)) {
> >   		unsigned long flags;
> > @@ -1396,6 +1459,7 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	trace_intel_context_deregister(ce);
> >   	return __guc_action_deregister_context(guc, guc_id, loop);
> > @@ -1423,6 +1487,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	struct guc_lrc_desc *desc;
> >   	bool context_registered;
> >   	intel_wakeref_t wakeref;
> > +	struct intel_context *child;
> >   	int ret = 0;
> >   	GEM_BUG_ON(!engine->mask);
> > @@ -1448,6 +1513,42 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> >   	guc_context_policy_init(engine, desc);
> > +	/*
> > +	 * Context is a parent, we need to register a process descriptor
> > +	 * describing a work queue and register all child contexts.
> Technically, this should say 'If the context is a parent'. Or just move it
> to be inside the if block.
> 

I will add the "If the".

> > +	 */
> > +	if (intel_context_is_parent(ce)) {
> > +		struct guc_process_desc *pdesc;
> > +
> > +		ce->guc_wqi_tail = 0;
> > +		ce->guc_wqi_head = 0;
> > +
> > +		desc->process_desc = i915_ggtt_offset(ce->state) +
> > +			__get_process_desc_offset(ce);
> > +		desc->wq_addr = i915_ggtt_offset(ce->state) +
> > +			__get_wq_offset(ce);
> > +		desc->wq_size = GUC_WQ_SIZE;
> > +
> > +		pdesc = __get_process_desc(ce);
> > +		memset(pdesc, 0, sizeof(*(pdesc)));
> > +		pdesc->stage_id = ce->guc_id.id;
> > +		pdesc->wq_base_addr = desc->wq_addr;
> > +		pdesc->wq_size_bytes = desc->wq_size;
> > +		pdesc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
> Should this not be inherited from the ce? And same below. Or are we not
> using this priority in that way?
> 

Honestly I don't think this field is used or maybe doesn't even exist
anymore. I'll check the GuC code and likely delete this or if it is
still present I'll inherit it from the ce.

Matt

> John.
> 
> > +		pdesc->wq_status = WQ_STATUS_ACTIVE;
> > +
> > +		for_each_child(ce, child) {
> > +			desc = __get_lrc_desc(guc, child->guc_id.id);
> > +
> > +			desc->engine_class =
> > +				engine_class_to_guc_class(engine->class);
> > +			desc->hw_context_desc = child->lrc.lrca;
> > +			desc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
> > +			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > +			guc_context_policy_init(engine, desc);
> > +		}
> > +	}
> > +
> >   	/*
> >   	 * The context_lookup xarray is used to determine if the hardware
> >   	 * context is currently registered. There are two cases in which it
> > @@ -2858,6 +2959,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> >   		return NULL;
> >   	}
> > +	if (unlikely(intel_context_is_child(ce))) {
> > +		drm_err(&guc_to_gt(guc)->i915->drm,
> > +			"Context is child, desc_idx %u", desc_idx);
> > +		return NULL;
> > +	}
> > +
> >   	return ce;
> >   }
> 


* Re: [Intel-gfx] [PATCH 13/27] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
  2021-09-15 19:24   ` John Harrison
@ 2021-09-15 19:34     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-15 19:34 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Wed, Sep 15, 2021 at 12:24:41PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > In GuC parent-child contexts the parent context controls the scheduling;
> > ensure only the parent does the scheduling operations.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 ++++++++++++++-----
> >   1 file changed, 18 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index dbcb9ab28a9a..00d54bb00bfb 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -320,6 +320,12 @@ static void decr_context_committed_requests(struct intel_context *ce)
> >   	GEM_BUG_ON(ce->guc_state.number_committed_requests < 0);
> >   }
> > +static struct intel_context *
> > +request_to_scheduling_context(struct i915_request *rq)
> > +{
> > +	return intel_context_to_parent(rq->context);
> > +}
> > +
> >   static bool context_guc_id_invalid(struct intel_context *ce)
> >   {
> >   	return ce->guc_id.id == GUC_INVALID_LRC_ID;
> > @@ -1684,6 +1690,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
> >   	GEM_BUG_ON(guc_id == GUC_INVALID_LRC_ID);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	trace_intel_context_sched_disable(ce);
> >   	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action),
> > @@ -1898,6 +1905,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
> >   	u16 guc_id;
> >   	bool enabled;
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >   	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
> >   	    !lrc_desc_registered(guc, ce->guc_id.id)) {
> >   		spin_lock_irqsave(&ce->guc_state.lock, flags);
> > @@ -2286,6 +2295,8 @@ static void guc_signal_context_fence(struct intel_context *ce)
> >   {
> >   	unsigned long flags;
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >   	clr_context_wait_for_deregister_to_register(ce);
> >   	__guc_signal_context_fence(ce);
> > @@ -2315,7 +2326,7 @@ static void guc_context_init(struct intel_context *ce)
> >   static int guc_request_alloc(struct i915_request *rq)
> >   {
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> >   	struct intel_guc *guc = ce_to_guc(ce);
> >   	unsigned long flags;
> >   	int ret;
> > @@ -2358,11 +2369,12 @@ static int guc_request_alloc(struct i915_request *rq)
> >   	 * exhausted and return -EAGAIN to the user indicating that they can try
> >   	 * again in the future.
> >   	 *
> > -	 * There is no need for a lock here as the timeline mutex ensures at
> > -	 * most one context can be executing this code path at once. The
> > -	 * guc_id_ref is incremented once for every request in flight and
> > -	 * decremented on each retire. When it is zero, a lock around the
> > -	 * increment (in pin_guc_id) is needed to seal a race with unpin_guc_id.
> > +	 * There is no need for a lock here as the timeline mutex (or
> > +	 * parallel_submit mutex in the case of multi-lrc) ensures at most one
> > +	 * context can be executing this code path at once. The guc_id_ref is
> Isn't that now two? One uni-LRC holding the timeline mutex and one multi-LRC
> holding the parallel submit mutex?
> 

This is actually a stale comment and I need to scrub it. The
parallel_submit mutex is gone; now we grab the ce->timeline locks
starting at the parent and then all children in a loop. I think the
original comment is sufficient.
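
For reference, the locking now looks roughly like this (a sketch, not
the exact code in the tree):

	struct intel_context *child;
	int i = 0;

	/* parent's timeline lock first, then each child's in order */
	mutex_lock(&ce->timeline->mutex);
	for_each_child(ce, child)
		mutex_lock_nested(&child->timeline->mutex, ++i);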

Matt

> John.
> 
> > +	 * incremented once for every request in flight and decremented on each
> > +	 * retire. When it is zero, a lock around the increment (in pin_guc_id)
> > +	 * is needed to seal a race with unpin_guc_id.
> >   	 */
> >   	if (atomic_add_unless(&ce->guc_id.ref, 1, 0))
> >   		goto out;
> 


* Re: [Intel-gfx] [PATCH 14/27] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-15 20:04   ` John Harrison
  2021-09-15 20:55     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-15 20:04 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Assign contexts in parent-child relationship consecutive guc_ids. This
> is accomplished by partitioning guc_id space between ones that need to
> be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
> available guc_ids). The consecutive search is implemented via the bitmap
> API.
>
> This is a precursor to the full GuC multi-lrc implementation but aligns
> to how the GuC multi-lrc interface is defined - guc_ids must be consecutive
> when using the GuC multi-lrc interface.
>
> v2:
>   (Daniel Vetter)
>    - Explicitly state why we assign consecutive guc_ids
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 107 +++++++++++++-----
>   2 files changed, 86 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 023953e77553..3f95b1b4f15c 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -61,9 +61,13 @@ struct intel_guc {
>   		 */
>   		spinlock_t lock;
>   		/**
> -		 * @guc_ids: used to allocate new guc_ids
> +		 * @guc_ids: used to allocate new guc_ids, single-lrc
>   		 */
>   		struct ida guc_ids;
> +		/**
> +		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
> +		 */
> +		unsigned long *guc_ids_bitmap;
>   		/** @num_guc_ids: number of guc_ids that can be used */
>   		u32 num_guc_ids;
>   		/** @max_guc_ids: max number of guc_ids that can be used */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 00d54bb00bfb..e9dfd43d29a0 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -125,6 +125,18 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
>   
>   #define GUC_REQUEST_SIZE 64 /* bytes */
>   
> +/*
> + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
> + * per the GuC submission interface. A different allocation algorithm is used
> + * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
The 'hence' clause seems to be attached to the wrong reason. The id 
space is partitioned because of the contiguous vs random requirements of
multi vs single LRC, not because a different allocator is used in one
partition vs the other.

> + * partition the guc_id space. We believe the number of multi-lrc contexts in
> + * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
> + * multi-lrc.
> + */
> +#define NUMBER_MULTI_LRC_GUC_ID(guc) \
> +	((guc)->submission_state.num_guc_ids / 16 > 32 ? \
> +	 (guc)->submission_state.num_guc_ids / 16 : 32)
> +
>   /*
>    * Below is a set of functions which control the GuC scheduling state which
>    * require a lock.
> @@ -1176,6 +1188,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
>   	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
>   				     destroyed_worker_func);
> +	guc->submission_state.guc_ids_bitmap =
> +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> +	if (!guc->submission_state.guc_ids_bitmap)
> +		return -ENOMEM;
>   
>   	return 0;
>   }
> @@ -1188,6 +1204,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>   	guc_lrc_desc_pool_destroy(guc);
>   	guc_flush_destroyed_contexts(guc);
>   	i915_sched_engine_put(guc->sched_engine);
> +	bitmap_free(guc->submission_state.guc_ids_bitmap);
>   }
>   
>   static void queue_request(struct i915_sched_engine *sched_engine,
> @@ -1239,18 +1256,43 @@ static void guc_submit_request(struct i915_request *rq)
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> -static int new_guc_id(struct intel_guc *guc)
> +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
> -	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> -			      guc->submission_state.num_guc_ids, GFP_KERNEL |
> -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +	int ret;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (intel_context_is_parent(ce))
> +		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
> +					      NUMBER_MULTI_LRC_GUC_ID(guc),
> +					      order_base_2(ce->guc_number_children
> +							   + 1));
> +	else
> +		ret = ida_simple_get(&guc->submission_state.guc_ids,
> +				     NUMBER_MULTI_LRC_GUC_ID(guc),
> +				     guc->submission_state.num_guc_ids,
> +				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
> +				     __GFP_NOWARN);
> +	if (unlikely(ret < 0))
> +		return ret;
> +
> +	ce->guc_id.id = ret;
> +	return 0;
>   }
>   
>   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	if (!context_guc_id_invalid(ce)) {
> -		ida_simple_remove(&guc->submission_state.guc_ids,
> -				  ce->guc_id.id);
> +		if (intel_context_is_parent(ce))
> +			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
> +					      ce->guc_id.id,
> +					      order_base_2(ce->guc_number_children
> +							   + 1));
Is there any check against adding/removing children when the guc_ids are 
allocated? Presumably it shouldn't ever happen but if it did then the 
bitmap_release would not match the allocation. Maybe add 
BUG_ON(ce->guc_id) to the parent/child link functions (if it's not there 
already?).
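
E.g. something along these lines in the linking helper (a sketch; the
exact helper name and check are assumptions on my part):

	void intel_context_bind_parent_child(struct intel_context *parent,
					     struct intel_context *child)
	{
		/* topology must be fixed before guc_ids are assigned */
		GEM_BUG_ON(!context_guc_id_invalid(parent));
		GEM_BUG_ON(!context_guc_id_invalid(child));

		/* existing linking code follows */
	}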

> +		else
> +			ida_simple_remove(&guc->submission_state.guc_ids,
> +					  ce->guc_id.id);
>   		reset_lrc_desc(guc, ce->guc_id.id);
>   		set_context_guc_id_invalid(ce);
>   	}
> @@ -1267,49 +1309,60 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   }
>   
> -static int steal_guc_id(struct intel_guc *guc)
> +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
> -	struct intel_context *ce;
> -	int guc_id;
> +	struct intel_context *cn;
Leaving this as 'ce' and calling the input parameter 'ce_in' would have 
made for significantly easier to read diffs!

>   
>   	lockdep_assert_held(&guc->submission_state.lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(intel_context_is_parent(ce));
>   
>   	if (!list_empty(&guc->submission_state.guc_id_list)) {
> -		ce = list_first_entry(&guc->submission_state.guc_id_list,
> +		cn = list_first_entry(&guc->submission_state.guc_id_list,
>   				      struct intel_context,
>   				      guc_id.link);
>   
> -		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> -		GEM_BUG_ON(context_guc_id_invalid(ce));
> -
> -		list_del_init(&ce->guc_id.link);
> -		guc_id = ce->guc_id.id;
> +		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
> +		GEM_BUG_ON(context_guc_id_invalid(cn));
> +		GEM_BUG_ON(intel_context_is_child(cn));
> +		GEM_BUG_ON(intel_context_is_parent(cn));
>   
> -		spin_lock(&ce->guc_state.lock);
As far as I can tell, the only actual change to this function (beyond 
'ce_in->id = id' vs 'return id' and adding anti-family asserts) is that 
this spinlock was dropped. However, I'm not seeing any replacement for 
it or any comment about why the spinlock is no longer necessary.

John.


> -		clr_context_registered(ce);
> -		spin_unlock(&ce->guc_state.lock);
> +		list_del_init(&cn->guc_id.link);
> +		ce->guc_id = cn->guc_id;
> +		clr_context_registered(cn);
> +		set_context_guc_id_invalid(cn);
>   
> -		set_context_guc_id_invalid(ce);
> -		return guc_id;
> +		return 0;
>   	} else {
>   		return -EAGAIN;
>   	}
>   }
>   
> -static int assign_guc_id(struct intel_guc *guc, u16 *out)
> +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   {
>   	int ret;
>   
>   	lockdep_assert_held(&guc->submission_state.lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
> -	ret = new_guc_id(guc);
> +	ret = new_guc_id(guc, ce);
>   	if (unlikely(ret < 0)) {
> -		ret = steal_guc_id(guc);
> +		if (intel_context_is_parent(ce))
> +			return -ENOSPC;
> +
> +		ret = steal_guc_id(guc, ce);
>   		if (ret < 0)
>   			return ret;
>   	}
>   
> -	*out = ret;
> +	if (intel_context_is_parent(ce)) {
> +		struct intel_context *child;
> +		int i = 1;
> +
> +		for_each_child(ce, child)
> +			child->guc_id.id = ce->guc_id.id + i++;
> +	}
> +
>   	return 0;
>   }
>   
> @@ -1327,7 +1380,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	might_lock(&ce->guc_state.lock);
>   
>   	if (context_guc_id_invalid(ce)) {
> -		ret = assign_guc_id(guc, &ce->guc_id.id);
> +		ret = assign_guc_id(guc, ce);
>   		if (ret)
>   			goto out_unlock;
>   		ret = 1;	/* Indidcates newly assigned guc_id */
> @@ -1369,8 +1422,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	unsigned long flags;
>   
>   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
> -	if (unlikely(context_guc_id_invalid(ce)))
> +	if (unlikely(context_guc_id_invalid(ce) ||
> +		     intel_context_is_parent(ce)))
>   		return;
>   
>   	spin_lock_irqsave(&guc->submission_state.lock, flags);



* Re: [Intel-gfx] [PATCH 12/27] drm/i915/guc: Add multi-lrc context registration
  2021-09-15 19:31     ` Matthew Brost
@ 2021-09-15 20:23       ` John Harrison
  2021-09-15 20:33         ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-15 20:23 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On 9/15/2021 12:31, Matthew Brost wrote:
> On Wed, Sep 15, 2021 at 12:21:35PM -0700, John Harrison wrote:
>> On 8/20/2021 15:44, Matthew Brost wrote:
>>> Add multi-lrc context registration H2G. In addition a workqueue and
>>> process descriptor are setup during multi-lrc context registration as
>>> these data structures are needed for multi-lrc submission.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
>>>    drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 109 +++++++++++++++++-
>>>    4 files changed, 126 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> index 0fafc178cf2c..6f567ebeb039 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> @@ -232,8 +232,20 @@ struct intel_context {
>>>    		/** @parent: pointer to parent if child */
>>>    		struct intel_context *parent;
>>> +
>>> +		/** @guc_wqi_head: head pointer in work queue */
>>> +		u16 guc_wqi_head;
>>> +		/** @guc_wqi_tail: tail pointer in work queue */
>>> +		u16 guc_wqi_tail;
>>> +
>> These should be in the 'guc_state' sub-struct? Would be good to keep all GuC
>> specific content in one self-contained struct. Especially given the other
>> child/parent fields are not going to be guc_ prefixed any more.
>>
> Right now I have everything in guc_state protected by guc_state.lock, but
> these fields are not protected by this lock. IMO it is better to use a
> different sub-structure for the parallel fields (even if anonymous).
Hmm, I still think it is bad to be scattering back-end specific fields 
amongst regular fields. The GuC patches include a whole bunch of 
complaints about execlist back-end specific stuff leaking through to the 
higher levels; we really shouldn't be guilty of doing the same with GuC
if at all possible. At the very least, the GuC specific fields should be 
grouped together at the end of the struct rather than inter-mingled.

>
>>>    		/** @guc_number_children: number of children if parent */
>>>    		u8 guc_number_children;
>>> +
>>> +		/**
>>> +		 * @parent_page: page in context used by parent for work queue,
>> Maybe 'page in context record'? Otherwise, exactly what 'context' is meant
>> here? It isn't the 'struct intel_context'. The context record is saved as
>> 'ce->state' / 'ce->lrc_reg_state', yes? Is it possible to link to either of
> It is the page in ce->state / page minus LRC reg offset in
> ce->lrc_reg_state. Will update the commit to make that clear.
>
>> those fields? Probably not given that they don't appear to have any kerneldoc
>> description :(. Maybe add that in too :).
>>
>>> +		 * work queue descriptor
>> Later on, it is described as 'process descriptor and work queue'. It would
>> be good to be consistent.
>>
> Yep. Will fix.
>
>>> +		 */
>>> +		u8 parent_page;
>>>    	};
>>>    #ifdef CONFIG_DRM_I915_SELFTEST
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> index bb4af4977920..0ddbad4e062a 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> @@ -861,6 +861,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
>>>    		context_size += PAGE_SIZE;
>>>    	}
>>> +	if (intel_context_is_parent(ce)) {
>>> +		ce->parent_page = context_size / PAGE_SIZE;
>>> +		context_size += PAGE_SIZE;
>>> +	}
>>> +
>>>    	obj = i915_gem_object_create_lmem(engine->i915, context_size, 0);
>>>    	if (IS_ERR(obj))
>>>    		obj = i915_gem_object_create_shmem(engine->i915, context_size);
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> index fa4be13c8854..0e600a3b8f1e 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> @@ -52,7 +52,7 @@
>>>    #define GUC_DOORBELL_INVALID		256
>>> -#define GUC_WQ_SIZE			(PAGE_SIZE * 2)
>>> +#define GUC_WQ_SIZE			(PAGE_SIZE / 2)
>> Is this size actually dictated by the GuC API? Or is it just a driver level
>> decision? If the latter, shouldn't this be below instead?
>>
> Driver level decision. What exactly do you mean by below?
The next chunk of the patch - where WQ_OFFSET is defined and the whole 
caboodle is described.

>   
>>>    /* Work queue item header definitions */
>>>    #define WQ_STATUS_ACTIVE		1
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 14b24298cdd7..dbcb9ab28a9a 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -340,6 +340,39 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
>>>    	return rb_entry(rb, struct i915_priolist, node);
>>>    }
>>> +/*
>>> + * When using multi-lrc submission an extra page in the context state is
>>> + * reserved for the process descriptor and work queue.
>>> + *
>>> + * The layout of this page is below:
>>> + * 0						guc_process_desc
>>> + * ...						unused
>>> + * PAGE_SIZE / 2				work queue start
>>> + * ...						work queue
>>> + * PAGE_SIZE - 1				work queue end
>>> + */
>>> +#define WQ_OFFSET	(PAGE_SIZE / 2)
>> Can this not be derived from GUC_WQ_SIZE given that the two are
>> fundamentally linked? E.g. '#define WQ_OFFSET (PAGE_SIZE - GUC_WQ_SIZE)'?
> Yes. I like '#define WQ_OFFSET (PAGE_SIZE - GUC_WQ_SIZE)'. Will change.
>
>> And maybe have a '#define WQ_TOTAL_SIZE PAGE_SIZE' and use that in all of
>> WQ_OFFSET, GUC_WQ_SIZE and the allocation itself in intel_lrc.c?
>>
>> Also, the process descriptor is actually an array of descriptors sized by
>> the number of children? Or am I misunderstanding the code below? If so,
> No, it is a fixed size descriptor.
Yeah, I think I was getting confused between pdesc and desc in the code 
below.

I still think it would be a good idea to have everything explicitly
named, with the only mention of PAGE_SIZE being in the 'total size'
definition.

John.


>
> A later patch in the series uses the space above the process descriptor
> for insertion of the preemption points handshake. That does
> depend on the number of children. I will add a COMPILE_BUG_ON for that
> to ensure everything fits in the memory layout.
>
>> shouldn't there be a 'COMPILE_BUG_ON((MAX_ENGINE_INSTANCE *
>> sizeof(descriptor)) < WQ_DESC_SIZE)' where WQ_DESC_SIZE is WQ_TOTAL_SIZE -
>> WQ_SIZE?
>>
>>> +static u32 __get_process_desc_offset(struct intel_context *ce)
>>> +{
>>> +	GEM_BUG_ON(!ce->parent_page);
>>> +
>>> +	return ce->parent_page * PAGE_SIZE;
>>> +}
>>> +
>>> +static u32 __get_wq_offset(struct intel_context *ce)
>>> +{
>>> +	return __get_process_desc_offset(ce) + WQ_OFFSET;
>>> +}
>>> +
>>> +static struct guc_process_desc *
>>> +__get_process_desc(struct intel_context *ce)
>>> +{
>>> +	return (struct guc_process_desc *)
>>> +		(ce->lrc_reg_state +
>>> +		 ((__get_process_desc_offset(ce) -
>>> +		   LRC_STATE_OFFSET) / sizeof(u32)));
>> Where did the LRC_STATE_OFFSET come from? Is that built in to the
>> lrc_reg_state pointer itself? That needs to be documented somewhere.
>>
> In gt/intel_lrc.c (lrc_pin) ce->lrc_reg_state is assigned to
> mmap(ce->state) + LRC_STATE_OFFSET. I can update the kerneldoc for that
> field in this patch.
>
>>> +}
>>> +
>>>    static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>>>    {
>>>    	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
>>> @@ -1342,6 +1375,30 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>>>    	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>    }
>>> +static int __guc_action_register_multi_lrc(struct intel_guc *guc,
>>> +					   struct intel_context *ce,
>>> +					   u32 guc_id,
>>> +					   u32 offset,
>>> +					   bool loop)
>>> +{
>>> +	struct intel_context *child;
>>> +	u32 action[4 + MAX_ENGINE_INSTANCE];
>>> +	int len = 0;
>>> +
>>> +	GEM_BUG_ON(ce->guc_number_children > MAX_ENGINE_INSTANCE);
>>> +
>>> +	action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
>>> +	action[len++] = guc_id;
>>> +	action[len++] = ce->guc_number_children + 1;
>>> +	action[len++] = offset;
>>> +	for_each_child(ce, child) {
>>> +		offset += sizeof(struct guc_lrc_desc);
>>> +		action[len++] = offset;
>>> +	}
>>> +
>>> +	return guc_submission_send_busy_loop(guc, action, len, 0, loop);
>>> +}
>>> +
>>>    static int __guc_action_register_context(struct intel_guc *guc,
>>>    					 u32 guc_id,
>>>    					 u32 offset,
>>> @@ -1364,9 +1421,15 @@ static int register_context(struct intel_context *ce, bool loop)
>>>    		ce->guc_id.id * sizeof(struct guc_lrc_desc);
>>>    	int ret;
>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>>    	trace_intel_context_register(ce);
>>> -	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
>>> +	if (intel_context_is_parent(ce))
>>> +		ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
>>> +						      offset, loop);
>>> +	else
>>> +		ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
>>> +						    loop);
>>>    	if (likely(!ret)) {
>>>    		unsigned long flags;
>>> @@ -1396,6 +1459,7 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
>>>    {
>>>    	struct intel_guc *guc = ce_to_guc(ce);
>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>>    	trace_intel_context_deregister(ce);
>>>    	return __guc_action_deregister_context(guc, guc_id, loop);
>>> @@ -1423,6 +1487,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>>>    	struct guc_lrc_desc *desc;
>>>    	bool context_registered;
>>>    	intel_wakeref_t wakeref;
>>> +	struct intel_context *child;
>>>    	int ret = 0;
>>>    	GEM_BUG_ON(!engine->mask);
>>> @@ -1448,6 +1513,42 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>>>    	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>>>    	guc_context_policy_init(engine, desc);
>>> +	/*
>>> +	 * Context is a parent, we need to register a process descriptor
>>> +	 * describing a work queue and register all child contexts.
>> Technically, this should say 'If the context is a parent'. Or just move it
>> to be inside the if block.
>>
> I will add the "If the".
>
>>> +	 */
>>> +	if (intel_context_is_parent(ce)) {
>>> +		struct guc_process_desc *pdesc;
>>> +
>>> +		ce->guc_wqi_tail = 0;
>>> +		ce->guc_wqi_head = 0;
>>> +
>>> +		desc->process_desc = i915_ggtt_offset(ce->state) +
>>> +			__get_process_desc_offset(ce);
>>> +		desc->wq_addr = i915_ggtt_offset(ce->state) +
>>> +			__get_wq_offset(ce);
>>> +		desc->wq_size = GUC_WQ_SIZE;
>>> +
>>> +		pdesc = __get_process_desc(ce);
>>> +		memset(pdesc, 0, sizeof(*(pdesc)));
>>> +		pdesc->stage_id = ce->guc_id.id;
>>> +		pdesc->wq_base_addr = desc->wq_addr;
>>> +		pdesc->wq_size_bytes = desc->wq_size;
>>> +		pdesc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
>> Should this not be inherited from the ce? And same below. Or are we not
>> using this priority in that way?
>>
> Honestly I don't think this field is used or maybe doesn't even exist
> anymore. I'll check the GuC code and likely delete this or if it is
> > still present I'll inherit it from the ce.
>
> Matt
>
>> John.
>>
>>> +		pdesc->wq_status = WQ_STATUS_ACTIVE;
>>> +
>>> +		for_each_child(ce, child) {
>>> +			desc = __get_lrc_desc(guc, child->guc_id.id);
>>> +
>>> +			desc->engine_class =
>>> +				engine_class_to_guc_class(engine->class);
>>> +			desc->hw_context_desc = child->lrc.lrca;
>>> +			desc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
>>> +			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>>> +			guc_context_policy_init(engine, desc);
>>> +		}
>>> +	}
>>> +
>>>    	/*
>>>    	 * The context_lookup xarray is used to determine if the hardware
>>>    	 * context is currently registered. There are two cases in which it
>>> @@ -2858,6 +2959,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>>>    		return NULL;
>>>    	}
>>> +	if (unlikely(intel_context_is_child(ce))) {
>>> +		drm_err(&guc_to_gt(guc)->i915->drm,
>>> +			"Context is child, desc_idx %u", desc_idx);
>>> +		return NULL;
>>> +	}
>>> +
>>>    	return ce;
>>>    }



* Re: [Intel-gfx] [PATCH 12/27] drm/i915/guc: Add multi-lrc context registration
  2021-09-15 20:23       ` John Harrison
@ 2021-09-15 20:33         ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-15 20:33 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Wed, Sep 15, 2021 at 01:23:19PM -0700, John Harrison wrote:
> On 9/15/2021 12:31, Matthew Brost wrote:
> > On Wed, Sep 15, 2021 at 12:21:35PM -0700, John Harrison wrote:
> > > On 8/20/2021 15:44, Matthew Brost wrote:
> > > > Add multi-lrc context registration H2G. In addition a workqueue and
> > > > process descriptor are setup during multi-lrc context registration as
> > > > these data structures are needed for multi-lrc submission.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/intel_context_types.h |  12 ++
> > > >    drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 109 +++++++++++++++++-
> > > >    4 files changed, 126 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > index 0fafc178cf2c..6f567ebeb039 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > @@ -232,8 +232,20 @@ struct intel_context {
> > > >    		/** @parent: pointer to parent if child */
> > > >    		struct intel_context *parent;
> > > > +
> > > > +		/** @guc_wqi_head: head pointer in work queue */
> > > > +		u16 guc_wqi_head;
> > > > +		/** @guc_wqi_tail: tail pointer in work queue */
> > > > +		u16 guc_wqi_tail;
> > > > +
> > > These should be in the 'guc_state' sub-struct? Would be good to keep all GuC
> > > specific content in one self-contained struct. Especially given the other
> > > child/parent fields are not going to be guc_ prefixed any more.
> > > 
> > Right now I have everything in guc_state protected by guc_state.lock, but
> > these fields are not protected by this lock. IMO it is better to use a
> > different sub-structure for the parallel fields (even if anonymous).
> Hmm, I still think it is bad to be scattering back-end specific fields
> amongst regular fields. The GuC patches include a whole bunch of complaints
> about execlist back-end specific stuff leaking through to the higher levels;
> we really shouldn't be guilty of doing the same with GuC if at all possible.
> At the very least, the GuC specific fields should be grouped together at the
> end of the struct rather than inter-mingled.
> 

How about 2 different sub-structures - parallel (shared) & guc_parallel (GuC specific)?

> > 
> > > >    		/** @guc_number_children: number of children if parent */
> > > >    		u8 guc_number_children;
> > > > +
> > > > +		/**
> > > > +		 * @parent_page: page in context used by parent for work queue,
> > > Maybe 'page in context record'? Otherwise, exactly what 'context' is meant
> > > here? It isn't the 'struct intel_context'. The context record is saved as
> > > 'ce->state' / 'ce->lrc_reg_state', yes? Is it possible to link to either of
> > It is the page in ce->state / page minus LRC reg offset in
> > ce->lrc_reg_state. Will update the commit to make that clear.
> > 
> > > those fields? Probably not given that they don't appear to have any kerneldoc
> > > description :(. Maybe add that in too :).
> > > 
> > > > +		 * work queue descriptor
> > > Later on, it is described as 'process descriptor and work queue'. It would
> > > be good to be consistent.
> > > 
> > Yep. Will fix.
> > 
> > > > +		 */
> > > > +		u8 parent_page;
> > > >    	};
> > > >    #ifdef CONFIG_DRM_I915_SELFTEST
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > index bb4af4977920..0ddbad4e062a 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > @@ -861,6 +861,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
> > > >    		context_size += PAGE_SIZE;
> > > >    	}
> > > > +	if (intel_context_is_parent(ce)) {
> > > > +		ce->parent_page = context_size / PAGE_SIZE;
> > > > +		context_size += PAGE_SIZE;
> > > > +	}
> > > > +
> > > >    	obj = i915_gem_object_create_lmem(engine->i915, context_size, 0);
> > > >    	if (IS_ERR(obj))
> > > >    		obj = i915_gem_object_create_shmem(engine->i915, context_size);
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > index fa4be13c8854..0e600a3b8f1e 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > @@ -52,7 +52,7 @@
> > > >    #define GUC_DOORBELL_INVALID		256
> > > > -#define GUC_WQ_SIZE			(PAGE_SIZE * 2)
> > > > +#define GUC_WQ_SIZE			(PAGE_SIZE / 2)
> > > Is this size actually dictated by the GuC API? Or is it just a driver level
> > > decision? If the latter, shouldn't this be below instead?
> > > 
> > Driver level decision. What exactly do you mean by below?
> The next chunk of the patch - where WQ_OFFSET is defined and the whole
> caboodle is described.
> 
> > > >    /* Work queue item header definitions */
> > > >    #define WQ_STATUS_ACTIVE		1
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 14b24298cdd7..dbcb9ab28a9a 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -340,6 +340,39 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
> > > >    	return rb_entry(rb, struct i915_priolist, node);
> > > >    }
> > > > +/*
> > > > + * When using multi-lrc submission an extra page in the context state is
> > > > + * reserved for the process descriptor and work queue.
> > > > + *
> > > > + * The layout of this page is below:
> > > > + * 0						guc_process_desc
> > > > + * ...						unused
> > > > + * PAGE_SIZE / 2				work queue start
> > > > + * ...						work queue
> > > > + * PAGE_SIZE - 1				work queue end
> > > > + */
> > > > +#define WQ_OFFSET	(PAGE_SIZE / 2)
> > > Can this not be derived from GUC_WQ_SIZE given that the two are
> > > fundamentally linked? E.g. '#define WQ_OFFSET (PAGE_SIZE - GUC_WQ_SIZE)'?
> > Yes. I like '#define WQ_OFFSET (PAGE_SIZE - GUC_WQ_SIZE)'. Will change.
> > 
> > > And maybe have a '#define WQ_TOTAL_SIZE PAGE_SIZE' and use that in all of
> > > WQ_OFFSET, GUC_WQ_SIZE and the allocation itself in intel_lrc.c?
> > > 
> > > Also, the process descriptor is actually an array of descriptors sized by
> > > the number of children? Or am I misunderstanding the code below? If so,
> > No, it is a fixed size descriptor.
> Yeah, I think I was getting confused between pdesc and desc in the code
> below.
> 
> I still think it would be a good idea to have everything explicitly named,
> with the only mention of PAGE_SIZE being in the 'total size' definition.
> 

#define PARENT_SCRATCH_SIZE 	PAGE_SIZE?
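
i.e. (a sketch of the naming, not a final patch):

	#define PARENT_SCRATCH_SIZE	PAGE_SIZE
	#define GUC_WQ_SIZE		(PARENT_SCRATCH_SIZE / 2)
	#define WQ_OFFSET		(PARENT_SCRATCH_SIZE - GUC_WQ_SIZE)

with __lrc_alloc_state() then reserving PARENT_SCRATCH_SIZE instead of
a bare PAGE_SIZE.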

Matt 

> John.
> 
> 
> > 
> > A later patch in the series uses the space above the process descriptor
> > for insertion of the preemption points handshake. That does
> > depend on the number of children. I will add a COMPILE_BUG_ON for that
> > to ensure everything fits in the memory layout.
> > 
> > > shouldn't there be a 'COMPILE_BUG_ON((MAX_ENGINE_INSTANCE *
> > > sizeof(descriptor)) < WQ_DESC_SIZE)' where WQ_DESC_SIZE is WQ_TOTAL_SIZE -
> > > WQ_SIZE?
> > > 
> > > > +static u32 __get_process_desc_offset(struct intel_context *ce)
> > > > +{
> > > > +	GEM_BUG_ON(!ce->parent_page);
> > > > +
> > > > +	return ce->parent_page * PAGE_SIZE;
> > > > +}
> > > > +
> > > > +static u32 __get_wq_offset(struct intel_context *ce)
> > > > +{
> > > > +	return __get_process_desc_offset(ce) + WQ_OFFSET;
> > > > +}
> > > > +
> > > > +static struct guc_process_desc *
> > > > +__get_process_desc(struct intel_context *ce)
> > > > +{
> > > > +	return (struct guc_process_desc *)
> > > > +		(ce->lrc_reg_state +
> > > > +		 ((__get_process_desc_offset(ce) -
> > > > +		   LRC_STATE_OFFSET) / sizeof(u32)));
> > > Where did the LRC_STATE_OFFSET come from? Is that built in to the
> > > lrc_reg_state pointer itself? That needs to be documented somewhere.
> > > 
> > In gt/intel_lrc.c (lrc_pin) ce->lrc_reg_state is assigned to
> > mmap(ce->state) + LRC_STATE_OFFSET. I can update the kerneldoc for that
> > field in this patch.
> > 
> > > > +}
> > > > +
> > > >    static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> > > >    {
> > > >    	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > > > @@ -1342,6 +1375,30 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >    	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > > >    }
> > > > +static int __guc_action_register_multi_lrc(struct intel_guc *guc,
> > > > +					   struct intel_context *ce,
> > > > +					   u32 guc_id,
> > > > +					   u32 offset,
> > > > +					   bool loop)
> > > > +{
> > > > +	struct intel_context *child;
> > > > +	u32 action[4 + MAX_ENGINE_INSTANCE];
> > > > +	int len = 0;
> > > > +
> > > > +	GEM_BUG_ON(ce->guc_number_children > MAX_ENGINE_INSTANCE);
> > > > +
> > > > +	action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
> > > > +	action[len++] = guc_id;
> > > > +	action[len++] = ce->guc_number_children + 1;
> > > > +	action[len++] = offset;
> > > > +	for_each_child(ce, child) {
> > > > +		offset += sizeof(struct guc_lrc_desc);
> > > > +		action[len++] = offset;
> > > > +	}
> > > > +
> > > > +	return guc_submission_send_busy_loop(guc, action, len, 0, loop);
> > > > +}
> > > > +
> > > >    static int __guc_action_register_context(struct intel_guc *guc,
> > > >    					 u32 guc_id,
> > > >    					 u32 offset,
> > > > @@ -1364,9 +1421,15 @@ static int register_context(struct intel_context *ce, bool loop)
> > > >    		ce->guc_id.id * sizeof(struct guc_lrc_desc);
> > > >    	int ret;
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > >    	trace_intel_context_register(ce);
> > > > -	ret = __guc_action_register_context(guc, ce->guc_id.id, offset, loop);
> > > > +	if (intel_context_is_parent(ce))
> > > > +		ret = __guc_action_register_multi_lrc(guc, ce, ce->guc_id.id,
> > > > +						      offset, loop);
> > > > +	else
> > > > +		ret = __guc_action_register_context(guc, ce->guc_id.id, offset,
> > > > +						    loop);
> > > >    	if (likely(!ret)) {
> > > >    		unsigned long flags;
> > > > @@ -1396,6 +1459,7 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> > > >    {
> > > >    	struct intel_guc *guc = ce_to_guc(ce);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > >    	trace_intel_context_deregister(ce);
> > > >    	return __guc_action_deregister_context(guc, guc_id, loop);
> > > > @@ -1423,6 +1487,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > >    	struct guc_lrc_desc *desc;
> > > >    	bool context_registered;
> > > >    	intel_wakeref_t wakeref;
> > > > +	struct intel_context *child;
> > > >    	int ret = 0;
> > > >    	GEM_BUG_ON(!engine->mask);
> > > > @@ -1448,6 +1513,42 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > >    	desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > > >    	guc_context_policy_init(engine, desc);
> > > > +	/*
> > > > +	 * Context is a parent, we need to register a process descriptor
> > > > +	 * describing a work queue and register all child contexts.
> > > Technically, this should say 'If the context is a parent'. Or just move it
> > > to be inside the if block.
> > > 
> > I will add the "If the".
> > 
> > > > +	 */
> > > > +	if (intel_context_is_parent(ce)) {
> > > > +		struct guc_process_desc *pdesc;
> > > > +
> > > > +		ce->guc_wqi_tail = 0;
> > > > +		ce->guc_wqi_head = 0;
> > > > +
> > > > +		desc->process_desc = i915_ggtt_offset(ce->state) +
> > > > +			__get_process_desc_offset(ce);
> > > > +		desc->wq_addr = i915_ggtt_offset(ce->state) +
> > > > +			__get_wq_offset(ce);
> > > > +		desc->wq_size = GUC_WQ_SIZE;
> > > > +
> > > > +		pdesc = __get_process_desc(ce);
> > > > +		memset(pdesc, 0, sizeof(*(pdesc)));
> > > > +		pdesc->stage_id = ce->guc_id.id;
> > > > +		pdesc->wq_base_addr = desc->wq_addr;
> > > > +		pdesc->wq_size_bytes = desc->wq_size;
> > > > +		pdesc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
> > > Should this not be inherited from the ce? And same below. Or are we not
> > > using this priority in that way?
> > > 
> > Honestly I don't think this field is used or maybe doesn't even exist
> > anymore. I'll check the GuC code and likely delete this or if it is
> > still present I'll inherit it from the ce.
> > 
> > Matt
> > 
> > > John.
> > > 
> > > > +		pdesc->wq_status = WQ_STATUS_ACTIVE;
> > > > +
> > > > +		for_each_child(ce, child) {
> > > > +			desc = __get_lrc_desc(guc, child->guc_id.id);
> > > > +
> > > > +			desc->engine_class =
> > > > +				engine_class_to_guc_class(engine->class);
> > > > +			desc->hw_context_desc = child->lrc.lrca;
> > > > +			desc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
> > > > +			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > > > +			guc_context_policy_init(engine, desc);
> > > > +		}
> > > > +	}
> > > > +
> > > >    	/*
> > > >    	 * The context_lookup xarray is used to determine if the hardware
> > > >    	 * context is currently registered. There are two cases in which it
> > > > @@ -2858,6 +2959,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> > > >    		return NULL;
> > > >    	}
> > > > +	if (unlikely(intel_context_is_child(ce))) {
> > > > +		drm_err(&guc_to_gt(guc)->i915->drm,
> > > > +			"Context is child, desc_idx %u", desc_idx);
> > > > +		return NULL;
> > > > +	}
> > > > +
> > > >    	return ce;
> > > >    }
> 


* Re: [Intel-gfx] [PATCH 14/27] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-09-15 20:04   ` John Harrison
@ 2021-09-15 20:55     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-15 20:55 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Wed, Sep 15, 2021 at 01:04:45PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Assign contexts in parent-child relationship consecutive guc_ids. This
> > is accomplished by partitioning guc_id space between ones that need to
> > be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
> > available guc_ids). The consecutive search is implemented via the bitmap
> > API.
> > 
> > This is a precursor to the full GuC multi-lrc implementation but aligns
> > to how GuC mutli-lrc interface is defined - guc_ids must be consecutive
> > when using the GuC multi-lrc interface.
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Explictly state why we assign consecutive guc_ids
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 107 +++++++++++++-----
> >   2 files changed, 86 insertions(+), 27 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 023953e77553..3f95b1b4f15c 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -61,9 +61,13 @@ struct intel_guc {
> >   		 */
> >   		spinlock_t lock;
> >   		/**
> > -		 * @guc_ids: used to allocate new guc_ids
> > +		 * @guc_ids: used to allocate new guc_ids, single-lrc
> >   		 */
> >   		struct ida guc_ids;
> > +		/**
> > +		 * @guc_ids_bitmap: used to allocate new guc_ids, multi-lrc
> > +		 */
> > +		unsigned long *guc_ids_bitmap;
> >   		/** @num_guc_ids: number of guc_ids that can be used */
> >   		u32 num_guc_ids;
> >   		/** @max_guc_ids: max number of guc_ids that can be used */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 00d54bb00bfb..e9dfd43d29a0 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -125,6 +125,18 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> >   #define GUC_REQUEST_SIZE 64 /* bytes */
> > +/*
> > + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
> > + * per the GuC submission interface. A different allocation algorithm is used
> > + * (bitmap vs. ida) between multi-lrc and single-lrc hence the reason to
> The 'hence' clause seems to be attached to the wrong reason. The id space is
> partition because of the contiguous vs random requirements of multi vs
> single LRC, not because a different allocator is used in one partion vs the
> other.
> 

Kinda? The reason I partitioned it is that the two algorithms are different;
we could use a unified space with a single algorithm, right? It was just
easier to split the space and use 2 already existing data structures rather
than cook up an algorithm for a unified space. There isn't a requirement from
the GuC that the space is partitioned, the only requirement is that multi-lrc
IDs are contiguous. All that being said, I think the comment is correct.
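
For reference, the unified alternative would be something like a single
bitmap serving both cases (sketch only, not what the patch does). It works,
but doesn't buy us anything over reusing the two existing data structures:

	/* order 0 for single-lrc, a power-of-2 block for parent + children */
	ret = bitmap_find_free_region(guc_ids_bitmap, num_guc_ids,
				      intel_context_is_parent(ce) ?
				      order_base_2(ce->guc_number_children + 1) :
				      0);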

> > + * partition the guc_id space. We believe the number of multi-lrc contexts in
> > + * use should be low and 1/16 should be sufficient. Minimum of 32 guc_ids for
> > + * multi-lrc.
> > + */
> > +#define NUMBER_MULTI_LRC_GUC_ID(guc) \
> > +	((guc)->submission_state.num_guc_ids / 16 > 32 ? \
> > +	 (guc)->submission_state.num_guc_ids / 16 : 32)
> > +
> >   /*
> >    * Below is a set of functions which control the GuC scheduling state which
> >    * require a lock.
> > @@ -1176,6 +1188,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >   	INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts);
> >   	intel_gt_pm_unpark_work_init(&guc->submission_state.destroyed_worker,
> >   				     destroyed_worker_func);
> > +	guc->submission_state.guc_ids_bitmap =
> > +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> > +	if (!guc->submission_state.guc_ids_bitmap)
> > +		return -ENOMEM;
> >   	return 0;
> >   }
> > @@ -1188,6 +1204,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> >   	guc_lrc_desc_pool_destroy(guc);
> >   	guc_flush_destroyed_contexts(guc);
> >   	i915_sched_engine_put(guc->sched_engine);
> > +	bitmap_free(guc->submission_state.guc_ids_bitmap);
> >   }
> >   static void queue_request(struct i915_sched_engine *sched_engine,
> > @@ -1239,18 +1256,43 @@ static void guc_submit_request(struct i915_request *rq)
> >   	spin_unlock_irqrestore(&sched_engine->lock, flags);
> >   }
> > -static int new_guc_id(struct intel_guc *guc)
> > +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> > -	return ida_simple_get(&guc->submission_state.guc_ids, 0,
> > -			      guc->submission_state.num_guc_ids, GFP_KERNEL |
> > -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > +	int ret;
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> > +	if (intel_context_is_parent(ce))
> > +		ret = bitmap_find_free_region(guc->submission_state.guc_ids_bitmap,
> > +					      NUMBER_MULTI_LRC_GUC_ID(guc),
> > +					      order_base_2(ce->guc_number_children
> > +							   + 1));
> > +	else
> > +		ret = ida_simple_get(&guc->submission_state.guc_ids,
> > +				     NUMBER_MULTI_LRC_GUC_ID(guc),
> > +				     guc->submission_state.num_guc_ids,
> > +				     GFP_KERNEL | __GFP_RETRY_MAYFAIL |
> > +				     __GFP_NOWARN);
> > +	if (unlikely(ret < 0))
> > +		return ret;
> > +
> > +	ce->guc_id.id = ret;
> > +	return 0;
> >   }
> >   static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >   	if (!context_guc_id_invalid(ce)) {
> > -		ida_simple_remove(&guc->submission_state.guc_ids,
> > -				  ce->guc_id.id);
> > +		if (intel_context_is_parent(ce))
> > +			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
> > +					      ce->guc_id.id,
> > +					      order_base_2(ce->guc_number_children
> > +							   + 1));
> Is there any check against adding/removing children when the guc_ids are
> allocated? Presumably it shouldn't ever happen but if it did then the

I don't have any protection for that but adding something like this
isn't a bad idea.

> bitmap_release would not match the allocation. Maybe add BUG_ON(ce->guc_id)
> to the parent/child link functions (if it's not there already?).
>

Do you mean something like below in this function?

GEM_BUG_ON(guc_id_not_in_use());
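
Or, closer to your suggestion of guarding the link functions, something
like the below sketch (assuming the guc_id must still be invalid at the
point the parent-child relationship is built; names illustrative):

	/* in intel_context_bind_parent_child() */
	GEM_BUG_ON(!context_guc_id_invalid(parent));
	GEM_BUG_ON(!context_guc_id_invalid(child));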

> > +		else
> > +			ida_simple_remove(&guc->submission_state.guc_ids,
> > +					  ce->guc_id.id);
> >   		reset_lrc_desc(guc, ce->guc_id.id);
> >   		set_context_guc_id_invalid(ce);
> >   	}
> > @@ -1267,49 +1309,60 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> >   }
> > -static int steal_guc_id(struct intel_guc *guc)
> > +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> > -	struct intel_context *ce;
> > -	int guc_id;
> > +	struct intel_context *cn;
> Leaving this as 'ce' and calling the input parameter 'ce_in' would have made
> for significantly easier to read diffs!
> 

Yea, probably, but I don't think we should change the style just to make
the diff easier to read.

> >   	lockdep_assert_held(&guc->submission_state.lock);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +	GEM_BUG_ON(intel_context_is_parent(ce));
> >   	if (!list_empty(&guc->submission_state.guc_id_list)) {
> > -		ce = list_first_entry(&guc->submission_state.guc_id_list,
> > +		cn = list_first_entry(&guc->submission_state.guc_id_list,
> >   				      struct intel_context,
> >   				      guc_id.link);
> > -		GEM_BUG_ON(atomic_read(&ce->guc_id.ref));
> > -		GEM_BUG_ON(context_guc_id_invalid(ce));
> > -
> > -		list_del_init(&ce->guc_id.link);
> > -		guc_id = ce->guc_id.id;
> > +		GEM_BUG_ON(atomic_read(&cn->guc_id.ref));
> > +		GEM_BUG_ON(context_guc_id_invalid(cn));
> > +		GEM_BUG_ON(intel_context_is_child(cn));
> > +		GEM_BUG_ON(intel_context_is_parent(cn));
> > -		spin_lock(&ce->guc_state.lock);
> As far as I can tell, the only actual change to this function (beyond
> 'ce_in->id = id' vs 'return id' and adding anti-family asserts) is that this
> spinlock was dropped. However, I'm not seeing any replacement for it or any
> comment about why the spinlock is no longer necessary.
>

Good catch, the lock shouldn't be dropped.
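
Something like this for the next rev then (sketch, keeping the guc_state
lock around the registered bit flip as in the original code):

		list_del_init(&cn->guc_id.link);
		ce->guc_id = cn->guc_id;

		spin_lock(&cn->guc_state.lock);
		clr_context_registered(cn);
		spin_unlock(&cn->guc_state.lock);

		set_context_guc_id_invalid(cn);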

Matt
 
> John.
> 
> 
> > -		clr_context_registered(ce);
> > -		spin_unlock(&ce->guc_state.lock);
> > +		list_del_init(&cn->guc_id.link);
> > +		ce->guc_id = cn->guc_id;
> > +		clr_context_registered(cn);
> > +		set_context_guc_id_invalid(cn);
> > -		set_context_guc_id_invalid(ce);
> > -		return guc_id;
> > +		return 0;
> >   	} else {
> >   		return -EAGAIN;
> >   	}
> >   }
> > -static int assign_guc_id(struct intel_guc *guc, u16 *out)
> > +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   {
> >   	int ret;
> >   	lockdep_assert_held(&guc->submission_state.lock);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > -	ret = new_guc_id(guc);
> > +	ret = new_guc_id(guc, ce);
> >   	if (unlikely(ret < 0)) {
> > -		ret = steal_guc_id(guc);
> > +		if (intel_context_is_parent(ce))
> > +			return -ENOSPC;
> > +
> > +		ret = steal_guc_id(guc, ce);
> >   		if (ret < 0)
> >   			return ret;
> >   	}
> > -	*out = ret;
> > +	if (intel_context_is_parent(ce)) {
> > +		struct intel_context *child;
> > +		int i = 1;
> > +
> > +		for_each_child(ce, child)
> > +			child->guc_id.id = ce->guc_id.id + i++;
> > +	}
> > +
> >   	return 0;
> >   }
> > @@ -1327,7 +1380,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	might_lock(&ce->guc_state.lock);
> >   	if (context_guc_id_invalid(ce)) {
> > -		ret = assign_guc_id(guc, &ce->guc_id.id);
> > +		ret = assign_guc_id(guc, ce);
> >   		if (ret)
> >   			goto out_unlock;
> >   		ret = 1;	/* Indicates newly assigned guc_id */
> > @@ -1369,8 +1422,10 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   	unsigned long flags;
> >   	GEM_BUG_ON(atomic_read(&ce->guc_id.ref) < 0);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > -	if (unlikely(context_guc_id_invalid(ce)))
> > +	if (unlikely(context_guc_id_invalid(ce) ||
> > +		     intel_context_is_parent(ce)))
> >   		return;
> >   	spin_lock_irqsave(&guc->submission_state.lock, flags);
> 


* Re: [Intel-gfx] [PATCH 08/27] drm/i915: Add logical engine mapping
  2021-09-15 16:58                 ` Matthew Brost
@ 2021-09-16  8:31                   ` Tvrtko Ursulin
  0 siblings, 0 replies; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-16  8:31 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu


On 15/09/2021 17:58, Matthew Brost wrote:
> On Wed, Sep 15, 2021 at 09:24:15AM +0100, Tvrtko Ursulin wrote:
>>
>> On 14/09/2021 19:04, Matthew Brost wrote:
>>> On Tue, Sep 14, 2021 at 09:34:08AM +0100, Tvrtko Ursulin wrote:
>>>>
>>
>> 8<
>>
>>>> Today we have:
>>>>
>>>> for_each intel_engines: // intel_engines is a flat list of all engines
>>>> 	intel_engine_setup()
>>>>
>>>> You propose to change it to:
>>>>
>>>> for_each engine_class:
>>>>      for 0..max_global_engine_instance:
>>>>         for_each intel_engines:
>>>>            skip engine not present
>>>>            skip class not matching
>>>>
>>>>            count logical instance
>>>>
>>>>      for_each intel_engines:
>>>>         skip engine not present
>>>>         skip wrong class
>>>>
>>>>         intel_engine_setup()
>>>>
>>>>
>>>> I propose:
>>>>
>>>> // Leave as is:
>>>>
>>>> for_each intel_engines:
>>>>      intel_engine_setup()
>>>>
>>>> // Add:
>>>>
>>>> for_each engine_class:
>>>>      logical = 0
>>>>      for_each gt->engine_class[class]:
>>>>         skip engine not present
>>>>
>>>>         engine->logical_instance = logical++
>>>>
>>>>
>>>> When code which actually needs a perturbed "map" arrives, you add that
>>>> into this second loop.
>>>>
>>>
>>> See above, why introduce an algorithm that doesn't work for future parts
>>> + future patches that land imminently? It makes zero sense whatsoever.
>>> With your proposal we would literally land code just to throw it away a
>>> couple of months from now + break patches we intend to land soon. This
>>
>> It sure works, it just walks the per class list instead of walking the flat
>> list skipping one class at the time.
>>
>> Just add the map based transformation to the second pass later, when it
>> becomes required.
>>
> 
> I can flatten the algorithm if that helps alleviate your concerns but
> with that being said, I've played around with this locally and IMO it makes
> the code way more ugly. Sure it eliminates some iterations of the loop but
> who really cares about that in a one time setup function?
> 
>>> algorithm works and has no reason whatsoever to be optimal as it is a one
>>> time setup call. I really don't understand why we are still talking
>>> about this paint color.
>>
>> I don't think bike shedding is an appropriate term when the complaint is
>> how needlessly complicated the proposed algorithm is.
>>
> 
> Are you just ignoring the fact that the algorithm (map) is needed in
> pending patches? IMO it is more complicated to write throw-away code
> when the proper algorithm is already written. If the logical mapping was
> as straightforward on all platforms as the ones currently upstream I would
> 100% agree with your suggestion, but it isn't on unembargoed platforms
> imminently going upstream. The algorithm I have works for the current
> platforms + the pending platforms. IMO it is 100% acceptable to merge
> something looking towards a known future.

FWIW my 2c is that unused bits detract from review. And my gut feeling 
still is that code can be written in a simpler way and that the map 
transform can still plug in easily on top in a later series.
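
(For illustration, the whole second pass I have in mind is no more than the
below sketch; the map-based transform, when it actually lands, would replace
the body of the inner loop.)

	for (class = 0; class <= MAX_ENGINE_CLASS; class++) {
		unsigned int logical = 0;
		unsigned int instance;

		for (instance = 0; instance <= MAX_ENGINE_INSTANCE; instance++) {
			struct intel_engine_cs *engine =
				gt->engine_class[class][instance];

			if (!engine) /* skip engine not present */
				continue;

			engine->logical_instance = logical++;
		}
	}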

I said FWIW since even if I am right you can still view my comments as 
external/community inputs at this point and proceed however you wish.

Regards,

Tvrtko


* Re: [Intel-gfx] [PATCH 15/27] drm/i915/guc: Implement multi-lrc submission
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
                     ` (2 preceding siblings ...)
  (?)
@ 2021-09-20 21:48   ` John Harrison
  2021-09-22 16:25     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-20 21:48 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Implement multi-lrc submission via a single workqueue entry and single
> H2G. The workqueue entry contains an updated tail value for each
> request, of all the contexts in the multi-lrc submission, and updates
> these values simultaneously. As such, the tasklet and bypass path have
> been updated to coalesce requests into a single submission.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  21 ++
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   6 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 312 +++++++++++++++---
>   drivers/gpu/drm/i915/i915_request.h           |   8 +
>   6 files changed, 317 insertions(+), 62 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> index fbfcae727d7f..879aef662b2e 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> @@ -748,3 +748,24 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
>   		}
>   	}
>   }
> +
> +void intel_guc_write_barrier(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +
> +	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> +		GEM_BUG_ON(guc->send_regs.fw_domains);
Granted, this patch is just moving code from one file to another, not 
changing it. However, I think it would be worth adding a blank line in 
here. Otherwise the 'this register' comment below can be confusingly 
read as referring to the send_regs.fw_domain entry above.

And maybe add a comment explaining why it is a bug for the send_regs value 
to be set? I'm not seeing any obvious connection between it and the rest of 
this code.

> +		/*
> +		 * This register is used by the i915 and GuC for MMIO based
> +		 * communication. Once we are in this code CTBs are the only
> +		 * method the i915 uses to communicate with the GuC so it is
> +		 * safe to write to this register (a value of 0 is NOP for MMIO
> +		 * communication). If we ever start mixing CTBs and MMIOs a new
> +		 * register will have to be chosen.
> +		 */
> +		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> +	} else {
> +		/* wmb() sufficient for a barrier if in smem */
> +		wmb();
> +	}
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 3f95b1b4f15c..0ead2406d03c 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -37,6 +37,12 @@ struct intel_guc {
>   	/* Global engine used to submit requests to GuC */
>   	struct i915_sched_engine *sched_engine;
>   	struct i915_request *stalled_request;
> +	enum {
> +		STALL_NONE,
> +		STALL_REGISTER_CONTEXT,
> +		STALL_MOVE_LRC_TAIL,
> +		STALL_ADD_REQUEST,
> +	} submission_stall_reason;
>   
>   	/* intel_guc_recv interrupt related state */
>   	spinlock_t irq_lock;
> @@ -332,4 +338,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
>   
>   void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
>   
> +void intel_guc_write_barrier(struct intel_guc *guc);
> +
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index 20c710a74498..10d1878d2826 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
>   	return ++ct->requests.last_fence;
>   }
>   
> -static void write_barrier(struct intel_guc_ct *ct)
> -{
> -	struct intel_guc *guc = ct_to_guc(ct);
> -	struct intel_gt *gt = guc_to_gt(guc);
> -
> -	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> -		GEM_BUG_ON(guc->send_regs.fw_domains);
> -		/*
> -		 * This register is used by the i915 and GuC for MMIO based
> -		 * communication. Once we are in this code CTBs are the only
> -		 * method the i915 uses to communicate with the GuC so it is
> -		 * safe to write to this register (a value of 0 is NOP for MMIO
> -		 * communication). If we ever start mixing CTBs and MMIOs a new
> -		 * register will have to be chosen.
> -		 */
> -		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> -	} else {
> -		/* wmb() sufficient for a barrier if in smem */
> -		wmb();
> -	}
> -}
> -
>   static int ct_write(struct intel_guc_ct *ct,
>   		    const u32 *action,
>   		    u32 len /* in dwords */,
> @@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
>   	 * make sure H2G buffer update and LRC tail update (if this triggering a
>   	 * submission) are visible before updating the descriptor tail
>   	 */
> -	write_barrier(ct);
> +	intel_guc_write_barrier(ct_to_guc(ct));
>   
>   	/* update local copies */
>   	ctb->tail = tail;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 0e600a3b8f1e..6cd26dc060d1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -65,12 +65,14 @@
>   #define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
>   #define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
>   #define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
> -#define WQ_TARGET_SHIFT			10
> +#define   WQ_TYPE_MULTI_LRC		(0x5 << WQ_TYPE_SHIFT)
> +#define WQ_TARGET_SHIFT			8
>   #define WQ_LEN_SHIFT			16
>   #define WQ_NO_WCFLUSH_WAIT		(1 << 27)
>   #define WQ_PRESENT_WORKLOAD		(1 << 28)
>   
> -#define WQ_RING_TAIL_SHIFT		20
> +#define WQ_GUC_ID_SHIFT			0
> +#define WQ_RING_TAIL_SHIFT		18
Presumably all of these API changes are not actually new? They really 
came in with the rest of the v40 re-write? It's just that this is the 
first time we are using them and therefore need to finally update the 
defines?

>   #define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
>   #define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
>   
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index e9dfd43d29a0..b107ad095248 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -391,6 +391,29 @@ __get_process_desc(struct intel_context *ce)
>   		   LRC_STATE_OFFSET) / sizeof(u32)));
>   }
>   
> +static u32 *get_wq_pointer(struct guc_process_desc *desc,
> +			   struct intel_context *ce,
> +			   u32 wqi_size)
> +{
> +	/*
> +	 * Check for space in work queue. Caching a value of head pointer in
> +	 * intel_context structure in order to reduce the number of accesses to
> +	 * shared GPU memory which may be across a PCIe bus.
> +	 */
> +#define AVAILABLE_SPACE	\
> +	CIRC_SPACE(ce->guc_wqi_tail, ce->guc_wqi_head, GUC_WQ_SIZE)
> +	if (wqi_size > AVAILABLE_SPACE) {
> +		ce->guc_wqi_head = READ_ONCE(desc->head);
> +
> +		if (wqi_size > AVAILABLE_SPACE)
> +			return NULL;
> +	}
> +#undef AVAILABLE_SPACE
> +
> +	return ((u32 *)__get_process_desc(ce)) +
> +		((WQ_OFFSET + ce->guc_wqi_tail) / sizeof(u32));
> +}
> +
>   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>   {
>   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> @@ -547,10 +570,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
>   
>   static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
>   
> -static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> +static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   {
>   	int err = 0;
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	u32 action[3];
>   	int len = 0;
>   	u32 g2h_len_dw = 0;
> @@ -571,26 +594,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
>   	GEM_BUG_ON(context_guc_id_invalid(ce));
>   
> -	/*
> -	 * Corner case where the GuC firmware was blown away and reloaded while
> -	 * this context was pinned.
> -	 */
> -	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
> -		err = guc_lrc_desc_pin(ce, false);
> -		if (unlikely(err))
> -			return err;
> -	}
> -
>   	spin_lock(&ce->guc_state.lock);
>   
>   	/*
>   	 * The request / context will be run on the hardware when scheduling
> -	 * gets enabled in the unblock.
> +	 * gets enabled in the unblock. For multi-lrc we still submit the
> +	 * context to move the LRC tails.
>   	 */
> -	if (unlikely(context_blocked(ce)))
> +	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
>   		goto out;
>   
> -	enabled = context_enabled(ce);
> +	enabled = context_enabled(ce) || context_blocked(ce);
Would be better to say '|| is_parent(ce)' rather than blocked? The 
reason for claiming enabled when not is that it's a 
multi-LRC parent, right? Or can there be a parent that is neither 
enabled nor blocked for which we don't want to do the processing? But 
why would that make sense/be possible?

>   
>   	if (!enabled) {
>   		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
> @@ -609,6 +623,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   		trace_intel_context_sched_enable(ce);
>   		atomic_inc(&guc->outstanding_submission_g2h);
>   		set_context_enabled(ce);
> +
> +		/*
> +		 * Without multi-lrc KMD does the submission step (moving the
> +		 * lrc tail) so enabling scheduling is sufficient to submit the
> +		 * context. This isn't the case in multi-lrc submission as the
> +		 * GuC needs to move the tails, hence the need for another H2G
> +		 * to submit a multi-lrc context after enabling scheduling.
> +		 */
> +		if (intel_context_is_parent(ce)) {
> +			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
> +			err = intel_guc_send_nb(guc, action, len - 1, 0);
> +		}
>   	} else if (!enabled) {
>   		clr_context_pending_enable(ce);
>   		intel_context_put(ce);
> @@ -621,6 +647,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>   	return err;
>   }
>   
> +static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> +{
> +	int ret = __guc_add_request(guc, rq);
> +
> +	if (unlikely(ret == -EBUSY)) {
> +		guc->stalled_request = rq;
> +		guc->submission_stall_reason = STALL_ADD_REQUEST;
> +	}
> +
> +	return ret;
> +}
> +
>   static void guc_set_lrc_tail(struct i915_request *rq)
>   {
>   	rq->context->lrc_reg_state[CTX_RING_TAIL] =
> @@ -632,6 +670,127 @@ static int rq_prio(const struct i915_request *rq)
>   	return rq->sched.attr.priority;
>   }
>   
> +static bool is_multi_lrc_rq(struct i915_request *rq)
> +{
> +	return intel_context_is_child(rq->context) ||
> +		intel_context_is_parent(rq->context);
> +}
> +
> +static bool can_merge_rq(struct i915_request *rq,
> +			 struct i915_request *last)
> +{
> +	return request_to_scheduling_context(rq) ==
> +		request_to_scheduling_context(last);
> +}
> +
> +static u32 wq_space_until_wrap(struct intel_context *ce)
> +{
> +	return (GUC_WQ_SIZE - ce->guc_wqi_tail);
> +}
> +
> +static void write_wqi(struct guc_process_desc *desc,
> +		      struct intel_context *ce,
> +		      u32 wqi_size)
> +{
> +	/*
> +	 * Ensure WQE are visible before updating tail
WQE or WQI?

> +	 */
> +	intel_guc_write_barrier(ce_to_guc(ce));
> +
> +	ce->guc_wqi_tail = (ce->guc_wqi_tail + wqi_size) & (GUC_WQ_SIZE - 1);
> +	WRITE_ONCE(desc->tail, ce->guc_wqi_tail);
> +}
> +
> +static int guc_wq_noop_append(struct intel_context *ce)
> +{
> +	struct guc_process_desc *desc = __get_process_desc(ce);
> +	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
> +
> +	if (!wqi)
> +		return -EBUSY;
> +
> +	*wqi = WQ_TYPE_NOOP |
> +		((wq_space_until_wrap(ce) / sizeof(u32) - 1) << WQ_LEN_SHIFT);
This should have a BUG_ON check that the requested size fits within the 
WQ_LEN field?

Indeed, would be better to use the FIELD macros as they do that kind of 
thing for you.
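
E.g. something like this (the mask names are made up here, assuming they get
defined alongside the existing WQ_*_SHIFT values in intel_guc_fwif.h):

	#include <linux/bitfield.h>

	#define WQ_TYPE_MASK	GENMASK(7, 0)	/* field placement assumed */
	#define WQ_LEN_MASK	GENMASK(26, 16)

	*wqi = FIELD_PREP(WQ_TYPE_MASK, 0x4) |	/* NOOP */
	       FIELD_PREP(WQ_LEN_MASK,
			  wq_space_until_wrap(ce) / sizeof(u32) - 1);

FIELD_PREP() does the shifting and masking for you and catches out-of-range
constants at build time, which covers the BUG_ON concern at least for the
constant parts.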


> +	ce->guc_wqi_tail = 0;
> +
> +	return 0;
> +}
> +
> +static int __guc_wq_item_append(struct i915_request *rq)
> +{
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +	struct intel_context *child;
> +	struct guc_process_desc *desc = __get_process_desc(ce);
> +	unsigned int wqi_size = (ce->guc_number_children + 4) *
> +		sizeof(u32);
> +	u32 *wqi;
> +	int ret;
> +
> +	/* Ensure context is in correct state updating work queue */
> +	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
> +	GEM_BUG_ON(context_guc_id_invalid(ce));
> +	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
> +	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
> +
> +	/* Insert NOOP if this work queue item will wrap the tail pointer. */
> +	if (wqi_size > wq_space_until_wrap(ce)) {
> +		ret = guc_wq_noop_append(ce);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	wqi = get_wq_pointer(desc, ce, wqi_size);
> +	if (!wqi)
> +		return -EBUSY;
> +
> +	*wqi++ = WQ_TYPE_MULTI_LRC |
> +		((wqi_size / sizeof(u32) - 1) << WQ_LEN_SHIFT);
> +	*wqi++ = ce->lrc.lrca;
> +	*wqi++ = (ce->guc_id.id << WQ_GUC_ID_SHIFT) |
> +		 ((ce->ring->tail / sizeof(u64)) << WQ_RING_TAIL_SHIFT);
As above, would be better to use FIELD macros instead of manual shifting.

John.


> +	*wqi++ = 0;	/* fence_id */
> +	for_each_child(ce, child)
> +		*wqi++ = child->ring->tail / sizeof(u64);
> +
> +	write_wqi(desc, ce, wqi_size);
> +
> +	return 0;
> +}
> +
> +static int guc_wq_item_append(struct intel_guc *guc,
> +			      struct i915_request *rq)
> +{
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +	int ret = 0;
> +
> +	if (likely(!intel_context_is_banned(ce))) {
> +		ret = __guc_wq_item_append(rq);
> +
> +		if (unlikely(ret == -EBUSY)) {
> +			guc->stalled_request = rq;
> +			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +static bool multi_lrc_submit(struct i915_request *rq)
> +{
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +
> +	intel_ring_set_tail(rq->ring, rq->tail);
> +
> +	/*
> +	 * We expect the front end (execbuf IOCTL) to set this flag on the last
> +	 * request generated from a multi-BB submission. This indicates to the
> +	 * backend (GuC interface) that we should submit this context thus
> +	 * submitting all the requests generated in parallel.
> +	 */
> +	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
> +		intel_context_is_banned(ce);
> +}
> +
>   static int guc_dequeue_one_context(struct intel_guc *guc)
>   {
>   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> @@ -645,7 +804,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>   	if (guc->stalled_request) {
>   		submit = true;
>   		last = guc->stalled_request;
> -		goto resubmit;
> +
> +		switch (guc->submission_stall_reason) {
> +		case STALL_REGISTER_CONTEXT:
> +			goto register_context;
> +		case STALL_MOVE_LRC_TAIL:
> +			goto move_lrc_tail;
> +		case STALL_ADD_REQUEST:
> +			goto add_request;
> +		default:
> +			MISSING_CASE(guc->submission_stall_reason);
> +		}
>   	}
>   
>   	while ((rb = rb_first_cached(&sched_engine->queue))) {
> @@ -653,8 +822,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>   		struct i915_request *rq, *rn;
>   
>   		priolist_for_each_request_consume(rq, rn, p) {
> -			if (last && rq->context != last->context)
> -				goto done;
> +			if (last && !can_merge_rq(rq, last))
> +				goto register_context;
>   
>   			list_del_init(&rq->sched.link);
>   
> @@ -662,33 +831,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>   
>   			trace_i915_request_in(rq, 0);
>   			last = rq;
> -			submit = true;
> +
> +			if (is_multi_lrc_rq(rq)) {
> +				/*
> +				 * We need to coalesce all multi-lrc requests in
> +				 * a relationship into a single H2G. We are
> +				 * guaranteed that all of these requests will be
> +				 * submitted sequentially.
> +				 */
> +				if (multi_lrc_submit(rq)) {
> +					submit = true;
> +					goto register_context;
> +				}
> +			} else {
> +				submit = true;
> +			}
>   		}
>   
>   		rb_erase_cached(&p->node, &sched_engine->queue);
>   		i915_priolist_free(p);
>   	}
> -done:
> +
> +register_context:
>   	if (submit) {
> -		guc_set_lrc_tail(last);
> -resubmit:
> +		struct intel_context *ce = request_to_scheduling_context(last);
> +
> +		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
> +			     !intel_context_is_banned(ce))) {
> +			ret = guc_lrc_desc_pin(ce, false);
> +			if (unlikely(ret == -EPIPE)) {
> +				goto deadlk;
> +			} else if (ret == -EBUSY) {
> +				guc->stalled_request = last;
> +				guc->submission_stall_reason =
> +					STALL_REGISTER_CONTEXT;
> +				goto schedule_tasklet;
> +			} else if (ret != 0) {
> +				GEM_WARN_ON(ret);	/* Unexpected */
> +				goto deadlk;
> +			}
> +		}
> +
> +move_lrc_tail:
> +		if (is_multi_lrc_rq(last)) {
> +			ret = guc_wq_item_append(guc, last);
> +			if (ret == -EBUSY) {
> +				goto schedule_tasklet;
> +			} else if (ret != 0) {
> +				GEM_WARN_ON(ret);	/* Unexpected */
> +				goto deadlk;
> +			}
> +		} else {
> +			guc_set_lrc_tail(last);
> +		}
> +
> +add_request:
>   		ret = guc_add_request(guc, last);
> -		if (unlikely(ret == -EPIPE))
> +		if (unlikely(ret == -EPIPE)) {
> +			goto deadlk;
> +		} else if (ret == -EBUSY) {
> +			goto schedule_tasklet;
> +		} else if (ret != 0) {
> +			GEM_WARN_ON(ret);	/* Unexpected */
>   			goto deadlk;
> -		else if (ret == -EBUSY) {
> -			tasklet_schedule(&sched_engine->tasklet);
> -			guc->stalled_request = last;
> -			return false;
>   		}
>   	}
>   
>   	guc->stalled_request = NULL;
> +	guc->submission_stall_reason = STALL_NONE;
>   	return submit;
>   
>   deadlk:
>   	sched_engine->tasklet.callback = NULL;
>   	tasklet_disable_nosync(&sched_engine->tasklet);
>   	return false;
> +
> +schedule_tasklet:
> +	tasklet_schedule(&sched_engine->tasklet);
> +	return false;
>   }
>   
>   static void guc_submission_tasklet(struct tasklet_struct *t)
> @@ -1227,10 +1447,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
>   
>   	trace_i915_request_in(rq, 0);
>   
> -	guc_set_lrc_tail(rq);
> -	ret = guc_add_request(guc, rq);
> -	if (ret == -EBUSY)
> -		guc->stalled_request = rq;
> +	if (is_multi_lrc_rq(rq)) {
> +		if (multi_lrc_submit(rq)) {
> +			ret = guc_wq_item_append(guc, rq);
> +			if (!ret)
> +				ret = guc_add_request(guc, rq);
> +		}
> +	} else {
> +		guc_set_lrc_tail(rq);
> +		ret = guc_add_request(guc, rq);
> +	}
>   
>   	if (unlikely(ret == -EPIPE))
>   		disable_submission(guc);
> @@ -1238,6 +1464,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
>   	return ret;
>   }
>   
> +bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
> +{
> +	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +
> +	return submission_disabled(guc) || guc->stalled_request ||
> +		!i915_sched_engine_is_empty(sched_engine) ||
> +		!lrc_desc_registered(guc, ce->guc_id.id);
> +}
> +
>   static void guc_submit_request(struct i915_request *rq)
>   {
>   	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> @@ -1247,8 +1483,7 @@ static void guc_submit_request(struct i915_request *rq)
>   	/* Will be called from irq-context when using foreign fences. */
>   	spin_lock_irqsave(&sched_engine->lock, flags);
>   
> -	if (submission_disabled(guc) || guc->stalled_request ||
> -	    !i915_sched_engine_is_empty(sched_engine))
> +	if (need_tasklet(guc, rq))
>   		queue_request(sched_engine, rq, rq_prio(rq));
>   	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
>   		tasklet_hi_schedule(&sched_engine->tasklet);
> @@ -2241,9 +2476,10 @@ static bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
>   
>   static void add_to_context(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
>   
>   	spin_lock(&ce->guc_state.lock);
> @@ -2276,7 +2512,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
>   
>   static void remove_from_context(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
>   	spin_lock_irq(&ce->guc_state.lock);
>   
> @@ -2692,7 +2930,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
>   static void guc_bump_inflight_request_prio(struct i915_request *rq,
>   					   int prio)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
>   
>   	/* Short circuit function */
> @@ -2715,7 +2953,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
>   
>   static void guc_retire_inflight_request_prio(struct i915_request *rq)
>   {
> -	struct intel_context *ce = rq->context;
> +	struct intel_context *ce = request_to_scheduling_context(rq);
>   
>   	spin_lock(&ce->guc_state.lock);
>   	guc_prio_fini(rq, ce);
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 177eaf55adff..8f0073e19079 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -139,6 +139,14 @@ enum {
>   	 * the GPU. Here we track such boost requests on a per-request basis.
>   	 */
>   	I915_FENCE_FLAG_BOOST,
> +
> +	/*
> +	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
> +	 * parent-child relationship (parallel submission, multi-lrc) should
> +	 * trigger a submission to the GuC rather than just moving the context
> +	 * tail.
> +	 */
> +	I915_FENCE_FLAG_SUBMIT_PARALLEL,
>   };
>   
>   /**



* Re: [Intel-gfx] [PATCH 16/27] drm/i915/guc: Insert submit fences between requests in parent-child relationship
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-20 21:57   ` John Harrison
  -1 siblings, 0 replies; 145+ messages in thread
From: John Harrison @ 2021-09-20 21:57 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> The GuC must receive requests in the order submitted for contexts in a
> parent-child relationship to function correctly. To ensure this, insert
> a submit fence between the current request and last request submitted
> for requests / contexts in a parent child relationship. This is
> conceptually similar to a single timeline.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.h       |   5 +
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   5 +-
>   drivers/gpu/drm/i915/i915_request.c           | 120 ++++++++++++++----
>   4 files changed, 109 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index c2985822ab74..9dcc1b14697b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -75,6 +75,11 @@ intel_context_to_parent(struct intel_context *ce)
>           }
>   }
>   
> +static inline bool intel_context_is_parallel(struct intel_context *ce)
> +{
> +	return intel_context_is_child(ce) || intel_context_is_parent(ce);
> +}
> +
>   void intel_context_bind_parent_child(struct intel_context *parent,
>   				     struct intel_context *child);
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 6f567ebeb039..a63329520c35 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -246,6 +246,13 @@ struct intel_context {
>   		 * work queue descriptor
>   		 */
>   		u8 parent_page;
> +
> +		/**
> +		 * @last_rq: last request submitted on a parallel context, used
> +		 * to insert submit fences between request in the parallel
request -> requests

With that fixed:
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>


> +		 * context.
> +		 */
> +		struct i915_request *last_rq;
>   	};
>   
>   #ifdef CONFIG_DRM_I915_SELFTEST
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b107ad095248..f0b60fecf253 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -672,8 +672,7 @@ static int rq_prio(const struct i915_request *rq)
>   
>   static bool is_multi_lrc_rq(struct i915_request *rq)
>   {
> -	return intel_context_is_child(rq->context) ||
> -		intel_context_is_parent(rq->context);
> +	return intel_context_is_parallel(rq->context);
>   }
>   
>   static bool can_merge_rq(struct i915_request *rq,
> @@ -2843,6 +2842,8 @@ static void guc_parent_context_unpin(struct intel_context *ce)
>   	GEM_BUG_ON(!intel_context_is_parent(ce));
>   	GEM_BUG_ON(!intel_engine_is_virtual(ce->engine));
>   
> +	if (ce->last_rq)
> +		i915_request_put(ce->last_rq);
>   	unpin_guc_id(guc, ce);
>   	lrc_unpin(ce);
>   }
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index ce446716d092..2e51c8999088 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1546,36 +1546,62 @@ i915_request_await_object(struct i915_request *to,
>   	return ret;
>   }
>   
> +static inline bool is_parallel_rq(struct i915_request *rq)
> +{
> +	return intel_context_is_parallel(rq->context);
> +}
> +
> +static inline struct intel_context *request_to_parent(struct i915_request *rq)
> +{
> +	return intel_context_to_parent(rq->context);
> +}
> +
>   static struct i915_request *
> -__i915_request_add_to_timeline(struct i915_request *rq)
> +__i915_request_ensure_parallel_ordering(struct i915_request *rq,
> +					struct intel_timeline *timeline)
>   {
> -	struct intel_timeline *timeline = i915_request_timeline(rq);
>   	struct i915_request *prev;
>   
> -	/*
> -	 * Dependency tracking and request ordering along the timeline
> -	 * is special cased so that we can eliminate redundant ordering
> -	 * operations while building the request (we know that the timeline
> -	 * itself is ordered, and here we guarantee it).
> -	 *
> -	 * As we know we will need to emit tracking along the timeline,
> -	 * we embed the hooks into our request struct -- at the cost of
> -	 * having to have specialised no-allocation interfaces (which will
> -	 * be beneficial elsewhere).
> -	 *
> -	 * A second benefit to open-coding i915_request_await_request is
> -	 * that we can apply a slight variant of the rules specialised
> -	 * for timelines that jump between engines (such as virtual engines).
> -	 * If we consider the case of virtual engine, we must emit a dma-fence
> -	 * to prevent scheduling of the second request until the first is
> -	 * complete (to maximise our greedy late load balancing) and this
> -	 * precludes optimising to use semaphores serialisation of a single
> -	 * timeline across engines.
> -	 */
> +	GEM_BUG_ON(!is_parallel_rq(rq));
> +
> +	prev = request_to_parent(rq)->last_rq;
> +	if (prev) {
> +		if (!__i915_request_is_complete(prev)) {
> +			i915_sw_fence_await_sw_fence(&rq->submit,
> +						     &prev->submit,
> +						     &rq->submitq);
> +
> +			if (rq->engine->sched_engine->schedule)
> +				__i915_sched_node_add_dependency(&rq->sched,
> +								 &prev->sched,
> +								 &rq->dep,
> +								 0);
> +		}
> +		i915_request_put(prev);
> +	}
> +
> +	request_to_parent(rq)->last_rq = i915_request_get(rq);
> +
> +	return to_request(__i915_active_fence_set(&timeline->last_request,
> +						  &rq->fence));
> +}
> +
> +static struct i915_request *
> +__i915_request_ensure_ordering(struct i915_request *rq,
> +			       struct intel_timeline *timeline)
> +{
> +	struct i915_request *prev;
> +
> +	GEM_BUG_ON(is_parallel_rq(rq));
> +
>   	prev = to_request(__i915_active_fence_set(&timeline->last_request,
>   						  &rq->fence));
> +
>   	if (prev && !__i915_request_is_complete(prev)) {
>   		bool uses_guc = intel_engine_uses_guc(rq->engine);
> +		bool pow2 = is_power_of_2(READ_ONCE(prev->engine)->mask |
> +					  rq->engine->mask);
> +		bool same_context = prev->context == rq->context;
>   
>   		/*
>   		 * The requests are supposed to be kept in order. However,
> @@ -1583,13 +1609,11 @@ __i915_request_add_to_timeline(struct i915_request *rq)
>   		 * is used as a barrier for external modification to this
>   		 * context.
>   		 */
> -		GEM_BUG_ON(prev->context == rq->context &&
> +		GEM_BUG_ON(same_context &&
>   			   i915_seqno_passed(prev->fence.seqno,
>   					     rq->fence.seqno));
>   
> -		if ((!uses_guc &&
> -		     is_power_of_2(READ_ONCE(prev->engine)->mask | rq->engine->mask)) ||
> -		    (uses_guc && prev->context == rq->context))
> +		if ((same_context && uses_guc) || (!uses_guc && pow2))
>   			i915_sw_fence_await_sw_fence(&rq->submit,
>   						     &prev->submit,
>   						     &rq->submitq);
> @@ -1604,6 +1628,50 @@ __i915_request_add_to_timeline(struct i915_request *rq)
>   							 0);
>   	}
>   
> +	return prev;
> +}
> +
> +static struct i915_request *
> +__i915_request_add_to_timeline(struct i915_request *rq)
> +{
> +	struct intel_timeline *timeline = i915_request_timeline(rq);
> +	struct i915_request *prev;
> +
> +	/*
> +	 * Dependency tracking and request ordering along the timeline
> +	 * is special cased so that we can eliminate redundant ordering
> +	 * operations while building the request (we know that the timeline
> +	 * itself is ordered, and here we guarantee it).
> +	 *
> +	 * As we know we will need to emit tracking along the timeline,
> +	 * we embed the hooks into our request struct -- at the cost of
> +	 * having to have specialised no-allocation interfaces (which will
> +	 * be beneficial elsewhere).
> +	 *
> +	 * A second benefit to open-coding i915_request_await_request is
> +	 * that we can apply a slight variant of the rules specialised
> +	 * for timelines that jump between engines (such as virtual engines).
> +	 * If we consider the case of virtual engine, we must emit a dma-fence
> +	 * to prevent scheduling of the second request until the first is
> +	 * complete (to maximise our greedy late load balancing) and this
> +	 * precludes optimising to use semaphores serialisation of a single
> +	 * timeline across engines.
> +	 *
> +	 * We do not order parallel submission requests on the timeline as each
> +	 * parallel submission context has its own timeline and the ordering
> +	 * rules for parallel requests are that they must be submitted in the
> +	 * order received from the execbuf IOCTL. So rather than using the
> +	 * timeline we store a pointer to last request submitted in the
> +	 * relationship in the gem context and insert a submission fence
> +	 * between that request and request passed into this function or
> +	 * alternatively we use completion fence if gem context has a single
> +	 * timeline and this is the first submission of an execbuf IOCTL.
> +	 */
> +	if (likely(!is_parallel_rq(rq)))
> +		prev = __i915_request_ensure_ordering(rq, timeline);
> +	else
> +		prev = __i915_request_ensure_parallel_ordering(rq, timeline);
> +
>   	/*
>   	 * Make sure that no request gazumped us - if it was allocated after
>   	 * our i915_request_alloc() and called __i915_request_add() before



* Re: [Intel-gfx] [PATCH 17/27] drm/i915/guc: Implement multi-lrc reset
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-20 22:44   ` John Harrison
  2021-09-22 16:16     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-20 22:44 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu


On 8/20/2021 15:44, Matthew Brost wrote:
> Update context and full GPU reset to work with multi-lrc. The idea is
> parent context tracks all the active requests inflight for itself and
> its' children. The parent context owns the reset replaying / canceling
its' -> its

> requests as needed.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       | 11 ++--
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 63 +++++++++++++------
>   2 files changed, 51 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 00d1aee6d199..5615be32879c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -528,20 +528,21 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
>   
>   struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>   {
> +	struct intel_context *parent = intel_context_to_parent(ce);
>   	struct i915_request *rq, *active = NULL;
>   	unsigned long flags;
>   
>   	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
Should this not check the parent as well/instead?

And to be clear, this can be called on regular contexts (where ce == 
parent) and on either the parent or the child contexts of a multi-LRC 
group (where ce may or may not match parent)?


>   
> -	spin_lock_irqsave(&ce->guc_state.lock, flags);
> -	list_for_each_entry_reverse(rq, &ce->guc_state.requests,
> +	spin_lock_irqsave(&parent->guc_state.lock, flags);
> +	list_for_each_entry_reverse(rq, &parent->guc_state.requests,
>   				    sched.link) {
> -		if (i915_request_completed(rq))
> +		if (i915_request_completed(rq) && rq->context == ce)
'rq->context == ce' means:

 1. single-LRC context, rq is owned by ce
 2. multi-LRC context, ce is child, rq really belongs to ce but is being
    tracked by parent
 3. multi-LRC context, ce is parent, rq really is owned by ce

So when 'rq->context != ce', it means that the request is owned by a 
different child to 'ce' but within the same multi-LRC group. So we want 
to ignore that request and keep searching until we find one that is 
really owned by the target ce?

>   			break;
>   
> -		active = rq;
> +		active = (rq->context == ce) ? rq : active;
Would be clearer to say 'if (rq->context != ce) continue;' and leave 
'active = rq;' alone?

And again, the intention is to ignore requests that are owned by other 
members of the same multi-LRC group?

Would be good to add some documentation to this function to explain the 
above (assuming my description is correct?).
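
I.e. something along these lines, maybe (untested, just to show the shape):

	list_for_each_entry_reverse(rq, &parent->guc_state.requests,
				    sched.link) {
		if (rq->context != ce)
			continue;	/* owned by a sibling in the group */

		if (i915_request_completed(rq))
			break;

		active = rq;
	}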

>   	}
> -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>   
>   	return active;
>   }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f0b60fecf253..e34e0ea9136a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -670,6 +670,11 @@ static int rq_prio(const struct i915_request *rq)
>   	return rq->sched.attr.priority;
>   }
>   
> +static inline bool is_multi_lrc(struct intel_context *ce)
> +{
> +	return intel_context_is_parallel(ce);
> +}
> +
>   static bool is_multi_lrc_rq(struct i915_request *rq)
>   {
>   	return intel_context_is_parallel(rq->context);
> @@ -1179,10 +1184,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
>   
>   static void __guc_reset_context(struct intel_context *ce, bool stalled)
>   {
> +	bool local_stalled;
>   	struct i915_request *rq;
>   	unsigned long flags;
>   	u32 head;
> +	int i, number_children = ce->guc_number_children;
If this is a child context, does it not need to pull the child count 
from the parent? Likewise the list/link pointers below? Or does each 
child context have a full list of its siblings + parent?

>   	bool skip = false;
> +	struct intel_context *parent = ce;
>   
>   	intel_context_get(ce);
>   
> @@ -1209,25 +1217,34 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
>   	if (unlikely(skip))
>   		goto out_put;
>   
> -	rq = intel_context_find_active_request(ce);
> -	if (!rq) {
> -		head = ce->ring->tail;
> -		stalled = false;
> -		goto out_replay;
> -	}
> +	for (i = 0; i < number_children + 1; ++i) {
> +		if (!intel_context_is_pinned(ce))
> +			goto next_context;
> +
> +		local_stalled = false;
> +		rq = intel_context_find_active_request(ce);
> +		if (!rq) {
> +			head = ce->ring->tail;
> +			goto out_replay;
> +		}
>   
> -	if (!i915_request_started(rq))
> -		stalled = false;
> +		GEM_BUG_ON(i915_active_is_idle(&ce->active));
> +		head = intel_ring_wrap(ce->ring, rq->head);
>   
> -	GEM_BUG_ON(i915_active_is_idle(&ce->active));
> -	head = intel_ring_wrap(ce->ring, rq->head);
> -	__i915_request_reset(rq, stalled);
> +		if (i915_request_started(rq))
Why change the ordering of the started test versus the wrap/reset call? 
Is it significant? Why is it now important to be reversed?

> +			local_stalled = true;
>   
> +		__i915_request_reset(rq, local_stalled && stalled);
>   out_replay:
> -	guc_reset_state(ce, head, stalled);
> -	__unwind_incomplete_requests(ce);
> +		guc_reset_state(ce, head, local_stalled && stalled);
> +next_context:
> +		if (i != number_children)
> +			ce = list_next_entry(ce, guc_child_link);
Can this not be put into the step clause of the for statement?

> +	}
> +
> +	__unwind_incomplete_requests(parent);
>   out_put:
> -	intel_context_put(ce);
> +	intel_context_put(parent);
As above, I think this function would benefit from some comments to 
explain exactly what is being done and why.

John.

>   }
>   
>   void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> @@ -1248,7 +1265,8 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
>   
>   		xa_unlock(&guc->context_lookup);
>   
> -		if (intel_context_is_pinned(ce))
> +		if (intel_context_is_pinned(ce) &&
> +		    !intel_context_is_child(ce))
>   			__guc_reset_context(ce, stalled);
>   
>   		intel_context_put(ce);
> @@ -1340,7 +1358,8 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
>   
>   		xa_unlock(&guc->context_lookup);
>   
> -		if (intel_context_is_pinned(ce))
> +		if (intel_context_is_pinned(ce) &&
> +		    !intel_context_is_child(ce))
>   			guc_cancel_context_requests(ce);
>   
>   		intel_context_put(ce);
> @@ -2031,6 +2050,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>   	u16 guc_id;
>   	bool enabled;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   
>   	incr_context_blocked(ce);
> @@ -2068,6 +2089,7 @@ static void guc_context_unblock(struct intel_context *ce)
>   	bool enable;
>   
>   	GEM_BUG_ON(context_enabled(ce));
> +	GEM_BUG_ON(intel_context_is_child(ce));
>   
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   
> @@ -2099,11 +2121,14 @@ static void guc_context_unblock(struct intel_context *ce)
>   static void guc_context_cancel_request(struct intel_context *ce,
>   				       struct i915_request *rq)
>   {
> +	struct intel_context *block_context =
> +		request_to_scheduling_context(rq);
> +
>   	if (i915_sw_fence_signaled(&rq->submit)) {
>   		struct i915_sw_fence *fence;
>   
>   		intel_context_get(ce);
> -		fence = guc_context_block(ce);
> +		fence = guc_context_block(block_context);
>   		i915_sw_fence_wait(fence);
>   		if (!i915_request_completed(rq)) {
>   			__i915_request_skip(rq);
> @@ -2117,7 +2142,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
>   		 */
>   		flush_work(&ce_to_guc(ce)->ct.requests.worker);
>   
> -		guc_context_unblock(ce);
> +		guc_context_unblock(block_context);
>   		intel_context_put(ce);
>   	}
>   }
> @@ -2143,6 +2168,8 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
>   	intel_wakeref_t wakeref;
>   	unsigned long flags;
>   
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>   	guc_flush_submissions(guc);
>   
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);




* Re: [Intel-gfx] [PATCH 18/27] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-20 22:44   ` Matthew Brost
  (?)
@ 2021-09-20 22:48   ` John Harrison
  2021-09-21 19:13     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-20 22:48 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu



On 8/20/2021 15:44, Matthew Brost wrote:
> Display the workqueue status in debugfs for GuC contexts that are in
> parent-child relationship.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 51 ++++++++++++++-----
>   1 file changed, 37 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index e34e0ea9136a..07eee9a399c8 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3673,6 +3673,26 @@ static void guc_log_context_priority(struct drm_printer *p,
>   	drm_printf(p, "\n");
>   }
>   
> +
> +static inline void guc_log_context(struct drm_printer *p,
> +				   struct intel_context *ce)
> +{
> +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
> +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> +		   ce->ring->head,
> +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> +		   ce->ring->tail,
> +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> +		   atomic_read(&ce->pin_count));
> +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> +		   atomic_read(&ce->guc_id.ref));
> +	drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> +		   ce->guc_state.sched_state);
> +}
> +
>   void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   					     struct drm_printer *p)
>   {
> @@ -3682,22 +3702,25 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   
>   	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
> -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> -			   ce->ring->head,
> -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> -			   ce->ring->tail,
> -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> -			   atomic_read(&ce->pin_count));
> -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> -			   atomic_read(&ce->guc_id.ref));
> -		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> -			   ce->guc_state.sched_state);
> +		GEM_BUG_ON(intel_context_is_child(ce));
>   
> +		guc_log_context(p, ce);
>   		guc_log_context_priority(p, ce);
> +
> +		if (intel_context_is_parent(ce)) {
> +			struct guc_process_desc *desc = __get_process_desc(ce);
> +			struct intel_context *child;
> +
> +			drm_printf(p, "\t\tWQI Head: %u\n",
> +				   READ_ONCE(desc->head));
> +			drm_printf(p, "\t\tWQI Tail: %u\n",
> +				   READ_ONCE(desc->tail));
> +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> +				   READ_ONCE(desc->wq_status));
> +
> +			for_each_child(ce, child)
> +				guc_log_context(p, child);
There should be some indication that this is a child context and/or how 
many children there are. Otherwise how does the reader differentiate 
between the list of child contexts and the next parent/single context?

John.

> +		}
>   	}
>   	xa_unlock_irqrestore(&guc->context_lookup, flags);
>   }



* Re: [Intel-gfx] [PATCH 19/27] drm/i915: Fix bug in user proto-context creation that leaked contexts
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-20 22:57   ` John Harrison
  2021-09-21 14:49     ` Tvrtko Ursulin
  2021-09-21 19:28     ` Matthew Brost
  -1 siblings, 2 replies; 145+ messages in thread
From: John Harrison @ 2021-09-20 22:57 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Set number of engines before attempting to create contexts so the
> function free_engines can clean up properly.
>
> Fixes: d4433c7600f7 ("drm/i915/gem: Use the proto-context to handle create parameters (v5)")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index dbaeb924a437..bcaaf514876b 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -944,6 +944,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   	unsigned int n;
>   
>   	e = alloc_engines(num_engines);
This can return null when out of memory. There needs to be an early exit 
check before dereferencing a null pointer. Not sure whether that is a worse 
bug than leaking memory! Either way, it would be good to fix that 
too.
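
Presumably just an early out at the top, something like (sketch; the
function's other error paths already return ERR_PTR, so -ENOMEM seems
the natural fit):

	e = alloc_engines(num_engines);
	if (!e)
		return ERR_PTR(-ENOMEM);
	e->num_engines = num_engines;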

John.

> +	e->num_engines = num_engines;
>   	for (n = 0; n < num_engines; n++) {
>   		struct intel_context *ce;
>   		int ret;
> @@ -977,7 +978,6 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   			goto free_engines;
>   		}
>   	}
> -	e->num_engines = num_engines;
>   
>   	return e;
>   



* Re: [Intel-gfx] [PATCH 20/27] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
  2021-08-20 22:44   ` Matthew Brost
                     ` (2 preceding siblings ...)
  (?)
@ 2021-09-21  0:09   ` John Harrison
  2021-09-22 16:38     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-21  0:09 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Introduce 'set parallel submit' extension to connect UAPI to GuC
> multi-lrc interface. Kernel doc in new uAPI should explain it all.
>
> IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> media UMD: link to come
Is this link still not available?

Also, see 'kernel test robot' emails saying that sparse is complaining 
about something I don't understand but presumably needs to be fixed.


>
> v2:
>   (Daniel Vetter)
>    - Add IGT link and placeholder for media UMD link
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c   | 220 +++++++++++++++++-
>   .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +-
>   drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
>   .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
>   drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 ++++++++-
>   include/uapi/drm/i915_drm.h                   | 128 ++++++++++
>   9 files changed, 485 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index bcaaf514876b..de0fd145fb47 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -522,9 +522,149 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
>   	return 0;
>   }
>   
> +static int
> +set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
> +				      void *data)
> +{
> +	struct i915_context_engines_parallel_submit __user *ext =
> +		container_of_user(base, typeof(*ext), base);
> +	const struct set_proto_ctx_engines *set = data;
> +	struct drm_i915_private *i915 = set->i915;
> +	u64 flags;
> +	int err = 0, n, i, j;
> +	u16 slot, width, num_siblings;
> +	struct intel_engine_cs **siblings = NULL;
> +	intel_engine_mask_t prev_mask;
> +
> +	/* Disabling for now */
> +	return -ENODEV;
> +
> +	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
> +		return -ENODEV;
This needs a FIXME comment to say that exec list will be added later.
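
Something like this, perhaps (the wording is only a suggestion):

	/* FIXME: GuC submission only for now, execlist support will be added later */
	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
		return -ENODEV;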

> +
> +	if (get_user(slot, &ext->engine_index))
> +		return -EFAULT;
> +
> +	if (get_user(width, &ext->width))
> +		return -EFAULT;
> +
> +	if (get_user(num_siblings, &ext->num_siblings))
> +		return -EFAULT;
> +
> +	if (slot >= set->num_engines) {
> +		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
> +			slot, set->num_engines);
> +		return -EINVAL;
> +	}
> +
> +	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
> +		drm_dbg(&i915->drm,
> +			"Invalid placement[%d], already occupied\n", slot);
> +		return -EINVAL;
> +	}
> +
> +	if (get_user(flags, &ext->flags))
> +		return -EFAULT;
> +
> +	if (flags) {
> +		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
> +		return -EINVAL;
> +	}
> +
> +	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
> +		err = check_user_mbz(&ext->mbz64[n]);
> +		if (err)
> +			return err;
> +	}
> +
> +	if (width < 2) {
> +		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
> +		return -EINVAL;
> +	}
> +
> +	if (num_siblings < 1) {
> +		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
> +			num_siblings);
> +		return -EINVAL;
> +	}
> +
> +	siblings = kmalloc_array(num_siblings * width,
> +				 sizeof(*siblings),
> +				 GFP_KERNEL);
> +	if (!siblings)
> +		return -ENOMEM;
> +
> +	/* Create contexts / engines */
> +	for (i = 0; i < width; ++i) {
> +		intel_engine_mask_t current_mask = 0;
> +		struct i915_engine_class_instance prev_engine;
> +
> +		for (j = 0; j < num_siblings; ++j) {
> +			struct i915_engine_class_instance ci;
> +
> +			n = i * num_siblings + j;
> +			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
> +				err = -EFAULT;
> +				goto out_err;
> +			}
> +
> +			siblings[n] =
> +				intel_engine_lookup_user(i915, ci.engine_class,
> +							 ci.engine_instance);
> +			if (!siblings[n]) {
> +				drm_dbg(&i915->drm,
> +					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
> +					n, ci.engine_class, ci.engine_instance);
> +				err = -EINVAL;
> +				goto out_err;
> +			}
> +
> +			if (n) {
> +				if (prev_engine.engine_class !=
> +				    ci.engine_class) {
> +					drm_dbg(&i915->drm,
> +						"Mismatched class %d, %d\n",
> +						prev_engine.engine_class,
> +						ci.engine_class);
> +					err = -EINVAL;
> +					goto out_err;
> +				}
> +			}
> +
> +			prev_engine = ci;
> +			current_mask |= siblings[n]->logical_mask;
> +		}
> +
> +		if (i > 0) {
> +			if (current_mask != prev_mask << 1) {
> +				drm_dbg(&i915->drm,
> +					"Non contiguous logical mask 0x%x, 0x%x\n",
> +					prev_mask, current_mask);
> +				err = -EINVAL;
> +				goto out_err;
> +			}
> +		}
> +		prev_mask = current_mask;
> +	}
> +
> +	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
> +	set->engines[slot].num_siblings = num_siblings;
> +	set->engines[slot].width = width;
> +	set->engines[slot].siblings = siblings;
> +
> +	return 0;
> +
> +out_err:
> +	kfree(siblings);
> +
> +	return err;
> +}
> +
>   static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
>   	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
>   	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
> +	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
> +		set_proto_ctx_engines_parallel_submit,
>   };
>   
>   static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
> @@ -821,6 +961,25 @@ static int intel_context_set_gem(struct intel_context *ce,
>   	return ret;
>   }
>   
> +static void __unpin_engines(struct i915_gem_engines *e, unsigned int count)
> +{
> +	while (count--) {
> +		struct intel_context *ce = e->engines[count], *child;
> +
> +		if (!ce || !test_bit(CONTEXT_PERMA_PIN, &ce->flags))
> +			continue;
> +
> +		for_each_child(ce, child)
> +			intel_context_unpin(child);
> +		intel_context_unpin(ce);
> +	}
> +}
> +
> +static void unpin_engines(struct i915_gem_engines *e)
> +{
> +	__unpin_engines(e, e->num_engines);
> +}
> +
>   static void __free_engines(struct i915_gem_engines *e, unsigned int count)
>   {
>   	while (count--) {
> @@ -936,6 +1095,40 @@ static struct i915_gem_engines *default_engines(struct i915_gem_context *ctx,
>   	return err;
>   }
>   
> +static int perma_pin_contexts(struct intel_context *ce)
What is this perma_pin thing about?

> +{
> +	struct intel_context *child;
> +	int i = 0, j = 0, ret;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	ret = intel_context_pin(ce);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	for_each_child(ce, child) {
> +		ret = intel_context_pin(child);
> +		if (unlikely(ret))
> +			goto unwind;
> +		++i;
> +	}
> +
> +	set_bit(CONTEXT_PERMA_PIN, &ce->flags);
> +
> +	return 0;
> +
> +unwind:
> +	intel_context_unpin(ce);
> +	for_each_child(ce, child) {
> +		if (j++ < i)
> +			intel_context_unpin(child);
> +		else
> +			break;
> +	}
> +
> +	return ret;
> +}
> +
>   static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   					     unsigned int num_engines,
>   					     struct i915_gem_proto_engine *pe)
> @@ -946,7 +1139,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   	e = alloc_engines(num_engines);
>   	e->num_engines = num_engines;
>   	for (n = 0; n < num_engines; n++) {
> -		struct intel_context *ce;
> +		struct intel_context *ce, *child;
>   		int ret;
>   
>   		switch (pe[n].type) {
> @@ -956,7 +1149,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   
>   		case I915_GEM_ENGINE_TYPE_BALANCED:
>   			ce = intel_engine_create_virtual(pe[n].siblings,
> -							 pe[n].num_siblings);
> +							 pe[n].num_siblings, 0);
> +			break;
> +
> +		case I915_GEM_ENGINE_TYPE_PARALLEL:
> +			ce = intel_engine_create_parallel(pe[n].siblings,
> +							  pe[n].num_siblings,
> +							  pe[n].width);
>   			break;
>   
>   		case I915_GEM_ENGINE_TYPE_INVALID:
> @@ -977,6 +1176,22 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>   			err = ERR_PTR(ret);
>   			goto free_engines;
>   		}
> +		for_each_child(ce, child) {
> +			ret = intel_context_set_gem(child, ctx, pe->sseu);
> +			if (ret) {
> +				err = ERR_PTR(ret);
> +				goto free_engines;
> +			}
> +		}
> +
> +		/* XXX: Must be done after setting gem context */
Why the 'XXX'? Is it saying that the ordering is a problem that needs to 
be fixed?

> +		if (pe[n].type == I915_GEM_ENGINE_TYPE_PARALLEL) {
> +			ret = perma_pin_contexts(ce);
> +			if (ret) {
> +				err = ERR_PTR(ret);
> +				goto free_engines;
> +			}
> +		}
>   	}
>   
>   	return e;
> @@ -1200,6 +1415,7 @@ static void context_close(struct i915_gem_context *ctx)
>   
>   	/* Flush any concurrent set_engines() */
>   	mutex_lock(&ctx->engines_mutex);
> +	unpin_engines(ctx->engines);
>   	engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
>   	i915_gem_context_set_closed(ctx);
>   	mutex_unlock(&ctx->engines_mutex);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> index 94c03a97cb77..7b096d83bca1 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> @@ -78,6 +78,9 @@ enum i915_gem_engine_type {
>   
>   	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
>   	I915_GEM_ENGINE_TYPE_BALANCED,
> +
> +	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
> +	I915_GEM_ENGINE_TYPE_PARALLEL,
>   };
>   
>   /**
> @@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
>   	/** @num_siblings: Number of balanced siblings */
>   	unsigned int num_siblings;
>   
> +	/** @width: Width of each sibling */
> +	unsigned int width;
> +
>   	/** @siblings: Balanced siblings */
>   	struct intel_engine_cs **siblings;
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index a63329520c35..713d85b0b364 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -55,9 +55,13 @@ struct intel_context_ops {
>   	void (*reset)(struct intel_context *ce);
>   	void (*destroy)(struct kref *kref);
>   
> -	/* virtual engine/context interface */
> +	/* virtual/parallel engine/context interface */
>   	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
> -						unsigned int count);
> +						unsigned int count,
> +						unsigned long flags);
> +	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
> +						 unsigned int num_siblings,
> +						 unsigned int width);
>   	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
>   					       unsigned int sibling);
>   };
> @@ -113,6 +117,7 @@ struct intel_context {
>   #define CONTEXT_NOPREEMPT		8
>   #define CONTEXT_LRCA_DIRTY		9
>   #define CONTEXT_GUC_INIT		10
> +#define CONTEXT_PERMA_PIN		11
>   
>   	struct {
>   		u64 timeout_us;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 87579affb952..43f16a8347ee 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
>   	return intel_engine_has_preemption(engine);
>   }
>   
> +#define FORCE_VIRTUAL	BIT(0)
>   struct intel_context *
>   intel_engine_create_virtual(struct intel_engine_cs **siblings,
> -			    unsigned int count);
> +			    unsigned int count, unsigned long flags);
> +
> +static inline struct intel_context *
> +intel_engine_create_parallel(struct intel_engine_cs **engines,
> +			     unsigned int num_engines,
> +			     unsigned int width)
> +{
> +	GEM_BUG_ON(!engines[0]->cops->create_parallel);
> +	return engines[0]->cops->create_parallel(engines, num_engines, width);
> +}
>   
>   static inline bool
>   intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 4d790f9a65dd..f66c75c77584 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1923,16 +1923,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>   
>   struct intel_context *
>   intel_engine_create_virtual(struct intel_engine_cs **siblings,
> -			    unsigned int count)
> +			    unsigned int count, unsigned long flags)
>   {
>   	if (count == 0)
>   		return ERR_PTR(-EINVAL);
>   
> -	if (count == 1)
> +	if (count == 1 && !(flags & FORCE_VIRTUAL))
>   		return intel_context_create(siblings[0]);
>   
>   	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
> -	return siblings[0]->cops->create_virtual(siblings, count);
> +	return siblings[0]->cops->create_virtual(siblings, count, flags);
>   }
>   
>   struct i915_request *
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 813a6de01382..d1e2d6f8ff81 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
>   }
>   
>   static struct intel_context *
> -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +			 unsigned long flags);
>   
>   static struct i915_request *
>   __active_request(const struct intel_timeline * const tl,
> @@ -3782,7 +3783,8 @@ static void virtual_submit_request(struct i915_request *rq)
>   }
>   
>   static struct intel_context *
> -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +			 unsigned long flags)
>   {
>   	struct virtual_engine *ve;
>   	unsigned int n;
> diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> index f12ffe797639..e876a9d88a5c 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> @@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
>   	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
>   
>   	for (n = 0; n < nctx; n++) {
> -		ve[n] = intel_engine_create_virtual(siblings, nsibling);
> +		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
>   		if (IS_ERR(ve[n])) {
>   			err = PTR_ERR(ve[n]);
>   			nctx = n;
> @@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
>   	 * restrict it to our desired engine within the virtual engine.
>   	 */
>   
> -	ve = intel_engine_create_virtual(siblings, nsibling);
> +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
>   	if (IS_ERR(ve)) {
>   		err = PTR_ERR(ve);
>   		goto out_close;
> @@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
>   		i915_request_add(rq);
>   	}
>   
> -	ce = intel_engine_create_virtual(siblings, nsibling);
> +	ce = intel_engine_create_virtual(siblings, nsibling, 0);
>   	if (IS_ERR(ce)) {
>   		err = PTR_ERR(ce);
>   		goto out;
> @@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
>   
>   	/* XXX We do not handle oversubscription and fairness with normal rq */
>   	for (n = 0; n < nsibling; n++) {
> -		ce = intel_engine_create_virtual(siblings, nsibling);
> +		ce = intel_engine_create_virtual(siblings, nsibling, 0);
>   		if (IS_ERR(ce)) {
>   			err = PTR_ERR(ce);
>   			goto out;
> @@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
>   	if (err)
>   		goto out_scratch;
>   
> -	ve = intel_engine_create_virtual(siblings, nsibling);
> +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
>   	if (IS_ERR(ve)) {
>   		err = PTR_ERR(ve);
>   		goto out_scratch;
> @@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
>   	if (igt_spinner_init(&spin, gt))
>   		return -ENOMEM;
>   
> -	ve = intel_engine_create_virtual(siblings, nsibling);
> +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
>   	if (IS_ERR(ve)) {
>   		err = PTR_ERR(ve);
>   		goto out_spin;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 07eee9a399c8..2554d0eb4afd 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -121,7 +121,13 @@ struct guc_virtual_engine {
>   };
>   
>   static struct intel_context *
> -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +		   unsigned long flags);
> +
> +static struct intel_context *
> +guc_create_parallel(struct intel_engine_cs **engines,
> +		    unsigned int num_siblings,
> +		    unsigned int width);
>   
>   #define GUC_REQUEST_SIZE 64 /* bytes */
>   
> @@ -2581,6 +2587,7 @@ static const struct intel_context_ops guc_context_ops = {
>   	.destroy = guc_context_destroy,
>   
>   	.create_virtual = guc_create_virtual,
> +	.create_parallel = guc_create_parallel,
>   };
>   
>   static void submit_work_cb(struct irq_work *wrk)
> @@ -2827,8 +2834,6 @@ static const struct intel_context_ops virtual_guc_context_ops = {
>   	.get_sibling = guc_virtual_get_sibling,
>   };
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
>   {
>   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> @@ -2845,8 +2850,6 @@ static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
>   	return __guc_context_pin(ce, engine, vaddr);
>   }
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
>   {
>   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> @@ -2858,8 +2861,6 @@ static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
>   	return __guc_context_pin(ce, engine, vaddr);
>   }
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static void guc_parent_context_unpin(struct intel_context *ce)
>   {
>   	struct intel_guc *guc = ce_to_guc(ce);
> @@ -2875,8 +2876,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
>   	lrc_unpin(ce);
>   }
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static void guc_child_context_unpin(struct intel_context *ce)
>   {
>   	GEM_BUG_ON(context_enabled(ce));
> @@ -2887,8 +2886,6 @@ static void guc_child_context_unpin(struct intel_context *ce)
>   	lrc_unpin(ce);
>   }
>   
> -/* Future patches will use this function */
> -__maybe_unused
>   static void guc_child_context_post_unpin(struct intel_context *ce)
>   {
>   	GEM_BUG_ON(!intel_context_is_child(ce));
> @@ -2899,6 +2896,98 @@ static void guc_child_context_post_unpin(struct intel_context *ce)
>   	intel_context_unpin(ce->parent);
>   }
>   
> +static void guc_child_context_destroy(struct kref *kref)
> +{
> +	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> +
> +	__guc_context_destroy(ce);
> +}
> +
> +static const struct intel_context_ops virtual_parent_context_ops = {
> +	.alloc = guc_virtual_context_alloc,
> +
> +	.pre_pin = guc_context_pre_pin,
> +	.pin = guc_parent_context_pin,
> +	.unpin = guc_parent_context_unpin,
> +	.post_unpin = guc_context_post_unpin,
> +
> +	.ban = guc_context_ban,
> +
> +	.cancel_request = guc_context_cancel_request,
> +
> +	.enter = guc_virtual_context_enter,
> +	.exit = guc_virtual_context_exit,
> +
> +	.sched_disable = guc_context_sched_disable,
> +
> +	.destroy = guc_context_destroy,
> +
> +	.get_sibling = guc_virtual_get_sibling,
> +};
> +
> +static const struct intel_context_ops virtual_child_context_ops = {
> +	.alloc = guc_virtual_context_alloc,
> +
> +	.pre_pin = guc_context_pre_pin,
> +	.pin = guc_child_context_pin,
> +	.unpin = guc_child_context_unpin,
> +	.post_unpin = guc_child_context_post_unpin,
> +
> +	.cancel_request = guc_context_cancel_request,
> +
> +	.enter = guc_virtual_context_enter,
> +	.exit = guc_virtual_context_exit,
> +
> +	.destroy = guc_child_context_destroy,
> +
> +	.get_sibling = guc_virtual_get_sibling,
> +};
> +
> +static struct intel_context *
> +guc_create_parallel(struct intel_engine_cs **engines,
> +		    unsigned int num_siblings,
> +		    unsigned int width)
> +{
> +	struct intel_engine_cs **siblings = NULL;
> +	struct intel_context *parent = NULL, *ce, *err;
> +	int i, j;
> +
> +	siblings = kmalloc_array(num_siblings,
> +				 sizeof(*siblings),
> +				 GFP_KERNEL);
> +	if (!siblings)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for (i = 0; i < width; ++i) {
> +		for (j = 0; j < num_siblings; ++j)
> +			siblings[j] = engines[i * num_siblings + j];
> +
> +		ce = intel_engine_create_virtual(siblings, num_siblings,
> +						 FORCE_VIRTUAL);
> +		if (!ce) {
> +			err = ERR_PTR(-ENOMEM);
> +			goto unwind;
> +		}
> +
> +		if (i == 0) {
> +			parent = ce;
> +			parent->ops = &virtual_parent_context_ops;
> +		} else {
> +			ce->ops = &virtual_child_context_ops;
> +			intel_context_bind_parent_child(parent, ce);
> +		}
> +	}
> +
> +	kfree(siblings);
> +	return parent;
> +
> +unwind:
> +	if (parent)
> +		intel_context_put(parent);
> +	kfree(siblings);
> +	return err;
> +}
> +
>   static bool
>   guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
>   {
> @@ -3726,7 +3815,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   }
>   
>   static struct intel_context *
> -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +		   unsigned long flags)
>   {
>   	struct guc_virtual_engine *ve;
>   	struct intel_guc *guc;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index b1248a67b4f8..b153f8215403 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1824,6 +1824,7 @@ struct drm_i915_gem_context_param {
>    * Extensions:
>    *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
>    *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
>    */
>   #define I915_CONTEXT_PARAM_ENGINES	0xa
>   
> @@ -2049,6 +2050,132 @@ struct i915_context_engines_bond {
>   	struct i915_engine_class_instance engines[N__]; \
>   } __attribute__((packed)) name__
>   
> +/**
> + * struct i915_context_engines_parallel_submit - Configure engine for
> + * parallel submission.
> + *
> + * Setup a slot in the context engine map to allow multiple BBs to be submitted
> + * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
> + * in parallel. Multiple hardware contexts are created internally in the i915
i915 run -> i915 to run

> + * run these BBs. Once a slot is configured for N BBs only N BBs can be
> + * submitted in each execbuf IOCTL and this is implicit behavior e.g. The user
> + * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
> + * many BBs there are based on the slot's configuration. The N BBs are the last
> + * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
> + *
> + * The default placement behavior is to create implicit bonds between each
> + * context if each context maps to more than 1 physical engine (e.g. context is
> + * a virtual engine). Also we only allow contexts of same engine class and these
> + * contexts must be in logically contiguous order. Examples of the placement
> + * behavior described below. Lastly, the default is to not allow BBs to
behavior described -> behavior are described

> + * preempted mid BB rather insert coordinated preemption on all hardware
to preempted mid BB rather -> to be preempted mid-batch. Rather

coordinated preemption on -> coordinated preemption points on

> + * contexts between each set of BBs. Flags may be added in the future to change
may -> could - 'may' implies we are thinking about doing it (maybe just 
for fun or because we're bored), 'could' implies a user has to ask for 
the facility if they need it.

> + * both of these default behaviors.
> + *
> + * Returns -EINVAL if hardware context placement configuration is invalid or if
> + * the placement configuration isn't supported on the platform / submission
> + * interface.
> + * Returns -ENODEV if extension isn't supported on the platform / submission
> + * interface.
> + *
> + * .. code-block:: none
> + *
> + *	Example 1 pseudo code:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
I would put these two terminology explanations above the 'example 1' 
line given that they are generic to all examples.

> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=1,
> + *		     engines=CS[0],CS[1])
> + *
> + *	Results in the following valid placement:
> + *	CS[0], CS[1]
> + *
> + *	Example 2 pseudo code:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
And drop them from here.

> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> + *		     engines=CS[0],CS[2],CS[1],CS[3])
> + *
> + *	Results in the following valid placements:
> + *	CS[0], CS[1]
> + *	CS[2], CS[3]
> + *
> + *	This can also be thought of as 2 virtual engines described by 2-D array
> + *	in the engines the field with bonds placed between each index of the
> + *	virtual engines. e.g. CS[0] is bonded to CS[1], CS[2] is bonded to
> + *	CS[3].
I find this description just adds to the confusion. It doesn't help that 
the sentence is broken/unparsable - 'described by 2-D array in the 
engines the field with bonds'?

"This can be thought of as two virtual engines, each containing two 
engines thereby making a 2D array. However, there are bonds tying the 
entries together and placing restrictions on how they can be scheduled. 
Specifically, the scheduler can choose only vertical columns from the 2D 
array. That is, CS[0] is bonded to CS[1] and CS[2] to CS[3]. So if the 
scheduler wants to submit to CS[0], it must also choose CS[1] and vice 
versa. Likewise, choosing CS[2] requires also using CS[3]."

Does that make sense?

> + *	VE[0] = CS[0], CS[2]
> + *	VE[1] = CS[1], CS[3]
> + *
> + *	Example 3 pseudo code:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
And again.

> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> + *		     engines=CS[0],CS[1],CS[1],CS[3])
> + *
> + *	Results in the following valid and invalid placements:
> + *	CS[0], CS[1]
> + *	CS[1], CS[3] - Not logical contiguous, return -EINVAL
logical -> logically

> + */
> +struct i915_context_engines_parallel_submit {
> +	/**
> +	 * @base: base user extension.
> +	 */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @engine_index: slot for parallel engine
> +	 */
> +	__u16 engine_index;
> +
> +	/**
> +	 * @width: number of contexts per parallel engine
Meaning number of engines in the virtual engine? As in, width = 3 means 
that the scheduler has a choice of three different engines to submit the 
one single batch buffer to?

> +	 */
> +	__u16 width;
> +
> +	/**
> +	 * @num_siblings: number of siblings per context
> +	 */
> +	__u16 num_siblings;
Meaning the number of engines which must run in parallel. As in, 
num_siblings = 2 means that there will be two batch buffers submitted to 
every execbuf IOCTL call and that both must execute concurrently on two 
separate engines?
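
Trying to map these two fields onto example 2 earlier (width=2,
num_siblings=2, engines=CS[0],CS[2],CS[1],CS[3]) using the
'index = j + i * num_siblings' formula from @engines below:

	engines[0] = CS[0]	/* BB 0 (i=0), sibling 0 (j=0) */
	engines[1] = CS[2]	/* BB 0 (i=0), sibling 1 (j=1) */
	engines[2] = CS[1]	/* BB 1 (i=1), sibling 0 (j=0) */
	engines[3] = CS[3]	/* BB 1 (i=1), sibling 1 (j=1) */

That would make 'width' the number of BBs per execbuf (each with its
own context, all executing in parallel) and 'num_siblings' the number
of candidate engines each of those contexts can be placed on. Is that
the intended reading?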

John.

> +
> +	/**
> +	 * @mbz16: reserved for future use; must be zero
> +	 */
> +	__u16 mbz16;
> +
> +	/**
> +	 * @flags: all undefined flags must be zero, currently not defined flags
> +	 */
> +	__u64 flags;
> +
> +	/**
> +	 * @mbz64: reserved for future use; must be zero
> +	 */
> +	__u64 mbz64[3];
> +
> +	/**
> +	 * @engines: 2-d array of engine instances to configure parallel engine
> +	 *
> +	 * length = width (i) * num_siblings (j)
> +	 * index = j + i * num_siblings
> +	 */
> +	struct i915_engine_class_instance engines[0];
> +
> +} __packed;
> +
> +#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
> +	struct i915_user_extension base; \
> +	__u16 engine_index; \
> +	__u16 width; \
> +	__u16 num_siblings; \
> +	__u16 mbz16; \
> +	__u64 flags; \
> +	__u64 mbz64[3]; \
> +	struct i915_engine_class_instance engines[N__]; \
> +} __attribute__((packed)) name__
> +
>   /**
>    * DOC: Context Engine Map uAPI
>    *
> @@ -2108,6 +2235,7 @@ struct i915_context_param_engines {
>   	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
>   #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
>   #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
>   	struct i915_engine_class_instance engines[0];
>   } __attribute__((packed));
>   



* Re: [Intel-gfx] [PATCH 21/27] drm/i915/doc: Update parallel submit doc to point to i915_drm.h
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-21  0:12   ` John Harrison
  -1 siblings, 0 replies; 145+ messages in thread
From: John Harrison @ 2021-09-21  0:12 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Update parallel submit doc to point to i915_drm.h
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>


> ---
>   Documentation/gpu/rfc/i915_parallel_execbuf.h | 122 ------------------
>   Documentation/gpu/rfc/i915_scheduler.rst      |   4 +-
>   2 files changed, 2 insertions(+), 124 deletions(-)
>   delete mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
>
> diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> deleted file mode 100644
> index 8cbe2c4e0172..000000000000
> --- a/Documentation/gpu/rfc/i915_parallel_execbuf.h
> +++ /dev/null
> @@ -1,122 +0,0 @@
> -/* SPDX-License-Identifier: MIT */
> -/*
> - * Copyright © 2021 Intel Corporation
> - */
> -
> -#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> -
> -/**
> - * struct drm_i915_context_engines_parallel_submit - Configure engine for
> - * parallel submission.
> - *
> - * Setup a slot in the context engine map to allow multiple BBs to be submitted
> - * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
> - * in parallel. Multiple hardware contexts are created internally in the i915
> - * run these BBs. Once a slot is configured for N BBs only N BBs can be
> - * submitted in each execbuf IOCTL and this is implicit behavior e.g. The user
> - * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
> - * many BBs there are based on the slot's configuration. The N BBs are the last
> - * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
> - *
> - * The default placement behavior is to create implicit bonds between each
> - * context if each context maps to more than 1 physical engine (e.g. context is
> - * a virtual engine). Also we only allow contexts of same engine class and these
> - * contexts must be in logically contiguous order. Examples of the placement
> - * behavior described below. Lastly, the default is to not allow BBs to
> - * preempted mid BB rather insert coordinated preemption on all hardware
> - * contexts between each set of BBs. Flags may be added in the future to change
> - * both of these default behaviors.
> - *
> - * Returns -EINVAL if hardware context placement configuration is invalid or if
> - * the placement configuration isn't supported on the platform / submission
> - * interface.
> - * Returns -ENODEV if extension isn't supported on the platform / submission
> - * interface.
> - *
> - * .. code-block:: none
> - *
> - *	Example 1 pseudo code:
> - *	CS[X] = generic engine of same class, logical instance X
> - *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> - *	set_engines(INVALID)
> - *	set_parallel(engine_index=0, width=2, num_siblings=1,
> - *		     engines=CS[0],CS[1])
> - *
> - *	Results in the following valid placement:
> - *	CS[0], CS[1]
> - *
> - *	Example 2 pseudo code:
> - *	CS[X] = generic engine of same class, logical instance X
> - *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> - *	set_engines(INVALID)
> - *	set_parallel(engine_index=0, width=2, num_siblings=2,
> - *		     engines=CS[0],CS[2],CS[1],CS[3])
> - *
> - *	Results in the following valid placements:
> - *	CS[0], CS[1]
> - *	CS[2], CS[3]
> - *
> - *	This can also be thought of as 2 virtual engines described by 2-D array
> - *	in the engines the field with bonds placed between each index of the
> - *	virtual engines. e.g. CS[0] is bonded to CS[1], CS[2] is bonded to
> - *	CS[3].
> - *	VE[0] = CS[0], CS[2]
> - *	VE[1] = CS[1], CS[3]
> - *
> - *	Example 3 pseudo code:
> - *	CS[X] = generic engine of same class, logical instance X
> - *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> - *	set_engines(INVALID)
> - *	set_parallel(engine_index=0, width=2, num_siblings=2,
> - *		     engines=CS[0],CS[1],CS[1],CS[3])
> - *
> - *	Results in the following valid and invalid placements:
> - *	CS[0], CS[1]
> - *	CS[1], CS[3] - Not logical contiguous, return -EINVAL
> - */
> -struct drm_i915_context_engines_parallel_submit {
> -	/**
> -	 * @base: base user extension.
> -	 */
> -	struct i915_user_extension base;
> -
> -	/**
> -	 * @engine_index: slot for parallel engine
> -	 */
> -	__u16 engine_index;
> -
> -	/**
> -	 * @width: number of contexts per parallel engine
> -	 */
> -	__u16 width;
> -
> -	/**
> -	 * @num_siblings: number of siblings per context
> -	 */
> -	__u16 num_siblings;
> -
> -	/**
> -	 * @mbz16: reserved for future use; must be zero
> -	 */
> -	__u16 mbz16;
> -
> -	/**
> -	 * @flags: all undefined flags must be zero, currently not defined flags
> -	 */
> -	__u64 flags;
> -
> -	/**
> -	 * @mbz64: reserved for future use; must be zero
> -	 */
> -	__u64 mbz64[3];
> -
> -	/**
> -	 * @engines: 2-d array of engine instances to configure parallel engine
> -	 *
> -	 * length = width (i) * num_siblings (j)
> -	 * index = j + i * num_siblings
> -	 */
> -	struct i915_engine_class_instance engines[0];
> -
> -} __packed;
> -
> diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
> index cbda75065dad..d630f15ab795 100644
> --- a/Documentation/gpu/rfc/i915_scheduler.rst
> +++ b/Documentation/gpu/rfc/i915_scheduler.rst
> @@ -135,8 +135,8 @@ Add I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT and
>   drm_i915_context_engines_parallel_submit to the uAPI to implement this
>   extension.
>   
> -.. kernel-doc:: Documentation/gpu/rfc/i915_parallel_execbuf.h
> -        :functions: drm_i915_context_engines_parallel_submit
> +.. kernel-doc:: include/uapi/drm/i915_drm.h
> +        :functions: i915_context_engines_parallel_submit
>   
>   Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
>   -------------------------------------------------------------------



* Re: [Intel-gfx] [PATCH 19/27] drm/i915: Fix bug in user proto-context creation that leaked contexts
  2021-09-20 22:57   ` John Harrison
@ 2021-09-21 14:49     ` Tvrtko Ursulin
  2021-09-21 19:28       ` Matthew Brost
  2021-09-21 19:28     ` Matthew Brost
  1 sibling, 1 reply; 145+ messages in thread
From: Tvrtko Ursulin @ 2021-09-21 14:49 UTC (permalink / raw)
  To: John Harrison, Matthew Brost, intel-gfx, dri-devel
  Cc: daniel.vetter, tony.ye, zhengguo.xu


On 20/09/2021 23:57, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
>> Set number of engines before attempting to create contexts so the
>> function free_engines can clean up properly.
>>
>> Fixes: d4433c7600f7 ("drm/i915/gem: Use the proto-context to handle 
>> create parameters (v5)")
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> Cc: <stable@vger.kernel.org>
>> ---
>>   drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
>> b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>> index dbaeb924a437..bcaaf514876b 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>> @@ -944,6 +944,7 @@ static struct i915_gem_engines 
>> *user_engines(struct i915_gem_context *ctx,
>>       unsigned int n;
>>       e = alloc_engines(num_engines);
> This can return null when out of memory. There needs to be an early exit 
> check before dereferencing a null pointer. Not sure whether that is a worse 
> bug than leaking memory! Either way, it would be good to fix that 
> too.

Pull out from the series and send a fix standalone ASAP? Also suggest 
adding author and reviewer to cc for typically quicker turnaround time.

Regards,

Tvrtko


> John.
> 
>> +    e->num_engines = num_engines;
>>       for (n = 0; n < num_engines; n++) {
>>           struct intel_context *ce;
>>           int ret;
>> @@ -977,7 +978,6 @@ static struct i915_gem_engines 
>> *user_engines(struct i915_gem_context *ctx,
>>               goto free_engines;
>>           }
>>       }
>> -    e->num_engines = num_engines;
>>       return e;
> 


* Re: [Intel-gfx] [PATCH 18/27] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-09-20 22:48   ` [Intel-gfx] " John Harrison
@ 2021-09-21 19:13     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-21 19:13 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 20, 2021 at 03:48:59PM -0700, John Harrison wrote:
> 
> 
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Display the workqueue status in debugfs for GuC contexts that are in
> > parent-child relationship.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 51 ++++++++++++++-----
> >   1 file changed, 37 insertions(+), 14 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index e34e0ea9136a..07eee9a399c8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -3673,6 +3673,26 @@ static void guc_log_context_priority(struct drm_printer *p,
> >   	drm_printf(p, "\n");
> >   }
> > +
> > +static inline void guc_log_context(struct drm_printer *p,
> > +				   struct intel_context *ce)
> > +{
> > +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
> > +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > +		   ce->ring->head,
> > +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> > +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > +		   ce->ring->tail,
> > +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> > +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> > +		   atomic_read(&ce->pin_count));
> > +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > +		   atomic_read(&ce->guc_id.ref));
> > +	drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> > +		   ce->guc_state.sched_state);
> > +}
> > +
> >   void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   					     struct drm_printer *p)
> >   {
> > @@ -3682,22 +3702,25 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   	xa_lock_irqsave(&guc->context_lookup, flags);
> >   	xa_for_each(&guc->context_lookup, index, ce) {
> > -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id.id);
> > -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > -			   ce->ring->head,
> > -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> > -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > -			   ce->ring->tail,
> > -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> > -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> > -			   atomic_read(&ce->pin_count));
> > -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > -			   atomic_read(&ce->guc_id.ref));
> > -		drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
> > -			   ce->guc_state.sched_state);
> > +		GEM_BUG_ON(intel_context_is_child(ce));
> > +		guc_log_context(p, ce);
> >   		guc_log_context_priority(p, ce);
> > +
> > +		if (intel_context_is_parent(ce)) {
> > +			struct guc_process_desc *desc = __get_process_desc(ce);
> > +			struct intel_context *child;
> > +
> > +			drm_printf(p, "\t\tWQI Head: %u\n",
> > +				   READ_ONCE(desc->head));
> > +			drm_printf(p, "\t\tWQI Tail: %u\n",
> > +				   READ_ONCE(desc->tail));
> > +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> > +				   READ_ONCE(desc->wq_status));
> > +
> > +			for_each_child(ce, child)
> > +				guc_log_context(p, child);
> There should be some indication that this is a child context and/or how many
> children there are. Otherwise how does the reader differentiate between
> the list of child contexts and the next parent/single context?
> 

We don't log the priority info or work queue info for child contexts so
it is actually pretty easy to parse. That being said, I'll output the
number of children if the context is a parent.
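
Presumably something like this in the is-parent branch (sketch, reusing
the guc_number_children field from earlier in the series):

	drm_printf(p, "\t\tNumber children: %u\n",
		   ce->guc_number_children);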

Matt

> John.
> 
> > +		}
> >   	}
> >   	xa_unlock_irqrestore(&guc->context_lookup, flags);
> >   }
> 


* Re: [Intel-gfx] [PATCH 19/27] drm/i915: Fix bug in user proto-context creation that leaked contexts
  2021-09-20 22:57   ` John Harrison
  2021-09-21 14:49     ` Tvrtko Ursulin
@ 2021-09-21 19:28     ` Matthew Brost
  1 sibling, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-21 19:28 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 20, 2021 at 03:57:06PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Set number of engines before attempting to create contexts so the
> > function free_engines can clean up properly.
> > 
> > Fixes: d4433c7600f7 ("drm/i915/gem: Use the proto-context to handle create parameters (v5)")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index dbaeb924a437..bcaaf514876b 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -944,6 +944,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   	unsigned int n;
> >   	e = alloc_engines(num_engines);
> This can return null when out of memory. There needs to be an early exit
> check before dereferencing a null pointer. Not sure whether that is a worse
> bug than leaking memory! Either way, it would be good to fix that too.
> 

Indeed there is another bug. Will fix that one too.

Matt

> John.
> 
> > +	e->num_engines = num_engines;
> >   	for (n = 0; n < num_engines; n++) {
> >   		struct intel_context *ce;
> >   		int ret;
> > @@ -977,7 +978,6 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   			goto free_engines;
> >   		}
> >   	}
> > -	e->num_engines = num_engines;
> >   	return e;
> 


* Re: [Intel-gfx] [PATCH 19/27] drm/i915: Fix bug in user proto-context creation that leaked contexts
  2021-09-21 14:49     ` Tvrtko Ursulin
@ 2021-09-21 19:28       ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-21 19:28 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: John Harrison, intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Tue, Sep 21, 2021 at 03:49:53PM +0100, Tvrtko Ursulin wrote:
> 
> On 20/09/2021 23:57, John Harrison wrote:
> > On 8/20/2021 15:44, Matthew Brost wrote:
> > > Set number of engines before attempting to create contexts so the
> > > function free_engines can clean up properly.
> > > 
> > > Fixes: d4433c7600f7 ("drm/i915/gem: Use the proto-context to handle
> > > create parameters (v5)")
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Cc: <stable@vger.kernel.org>
> > > ---
> > >   drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > index dbaeb924a437..bcaaf514876b 100644
> > > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > @@ -944,6 +944,7 @@ static struct i915_gem_engines
> > > *user_engines(struct i915_gem_context *ctx,
> > >       unsigned int n;
> > >       e = alloc_engines(num_engines);
> > This can return null when out of memory. There needs to be an early exit
> > check before dereferencing a null pointer. Not sure whether that is a worse
> > bug than leaking memory! Either way, it would be good to fix that
> > too.
> 
> Pull out from the series and send a fix standalone ASAP? Also suggest adding

Sure, will do.

Matt

> author and reviewer to cc for typically quicker turnaround time.
> 
> Regards,
> 
> Tvrtko
> 
> 
> > John.
> > 
> > > +    e->num_engines = num_engines;
> > >       for (n = 0; n < num_engines; n++) {
> > >           struct intel_context *ce;
> > >           int ret;
> > > @@ -977,7 +978,6 @@ static struct i915_gem_engines
> > > *user_engines(struct i915_gem_context *ctx,
> > >               goto free_engines;
> > >           }
> > >       }
> > > -    e->num_engines = num_engines;
> > >       return e;
> > 


* Re: [Intel-gfx] [PATCH 17/27] drm/i915/guc: Implement multi-lrc reset
  2021-09-20 22:44   ` John Harrison
@ 2021-09-22 16:16     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-22 16:16 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 20, 2021 at 03:44:18PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> 
>     Update context and full GPU reset to work with multi-lrc. The idea is
>     parent context tracks all the active requests inflight for itself and
>     its' children. The parent context owns the reset replaying / canceling
> 
> its' -> its
> 
> 
>     requests as needed.
> 
>     Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>     ---
>      drivers/gpu/drm/i915/gt/intel_context.c       | 11 ++--
>      .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 63 +++++++++++++------
>      2 files changed, 51 insertions(+), 23 deletions(-)
> 
>     diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
>     index 00d1aee6d199..5615be32879c 100644
>     --- a/drivers/gpu/drm/i915/gt/intel_context.c
>     +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>     @@ -528,20 +528,21 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
> 
>      struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>      {
>     +       struct intel_context *parent = intel_context_to_parent(ce);
>             struct i915_request *rq, *active = NULL;
>             unsigned long flags;
> 
>             GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
> 
> Should this not check the parent as well/instead?
> 

I don't think so. The 'ce' could be a not parallel context, a parent
context, or a child context. 

> And to be clear, this can be called on regular contexts (where ce == parent)
> and on either the parent or child contexts of multi-LRC contexts (where ce may or
> may not match parent)?
>

Right. The parent owns parent->guc_state.lock and the requests list, and
we search that list for the first non-completed request that matches the
submitted 'ce'.
 
> 
> 
> 
>     -       spin_lock_irqsave(&ce->guc_state.lock, flags);
>     -       list_for_each_entry_reverse(rq, &ce->guc_state.requests,
>     +       spin_lock_irqsave(&parent->guc_state.lock, flags);
>     +       list_for_each_entry_reverse(rq, &parent->guc_state.requests,
>                                         sched.link) {
>     -               if (i915_request_completed(rq))
>     +               if (i915_request_completed(rq) && rq->context == ce)
> 
> 'rq->context == ce' means:
> 
>  1. single-LRC context, rq is owned by ce
>  2. multi-LRC context, ce is child, rq really belongs to ce but is being
>     tracked by parent
>  3. multi-LRC context, ce is parent, rq really is owned by ce
> 
> So when 'rq->ce != ce', it means that the request is owned by a different child
> to 'ce' but within the same multi-LRC group. So we want to ignore that request
> and keep searching until we find one that is really owned by the target ce?
>

All correct.
 
> 
>                             break;
> 
>     -               active = rq;
>     +               active = (rq->context == ce) ? rq : active;
> 
> Would be clearer to say 'if(rq->ce != ce) continue;' and leave 'active = rq;'
> alone?
>

Yes, that is probably cleaner.
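
i.e. something like (untested):

	list_for_each_entry_reverse(rq, &parent->guc_state.requests,
				    sched.link) {
		if (rq->context != ce)
			continue;

		if (i915_request_completed(rq))
			break;

		active = rq;
	}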
 
> And again, the intention is to ignore requests that are owned by other members
> of the same multi-LRC group?
> 
> Would be good to add some documentation to this function to explain the above
> (assuming my description is correct?).
>

Will add a comment explaining this.
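
Maybe something along these lines (just a sketch based on the above):

	/*
	 * When a request is submitted to a parallel context, it is
	 * tracked on the parent's guc_state.requests list rather than
	 * its own, so walk the parent's list and only consider the
	 * requests belonging to the 'ce' we were asked about.
	 */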
 
> 
>             }
>     -       spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>     +       spin_unlock_irqrestore(&parent->guc_state.lock, flags);
> 
>             return active;
>      }
>     diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>     index f0b60fecf253..e34e0ea9136a 100644
>     --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>     +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>     @@ -670,6 +670,11 @@ static int rq_prio(const struct i915_request *rq)
>             return rq->sched.attr.priority;
>      }
> 
>     +static inline bool is_multi_lrc(struct intel_context *ce)
>     +{
>     +       return intel_context_is_parallel(ce);
>     +}
>     +
>      static bool is_multi_lrc_rq(struct i915_request *rq)
>      {
>             return intel_context_is_parallel(rq->context);
>     @@ -1179,10 +1184,13 @@ __unwind_incomplete_requests(struct intel_context *ce)
> 
>      static void __guc_reset_context(struct intel_context *ce, bool stalled)
>      {
>     +       bool local_stalled;
>             struct i915_request *rq;
>             unsigned long flags;
>             u32 head;
>     +       int i, number_children = ce->guc_number_children;
> 
> If this is a child context, does it not need to pull the child count from the
> parent? Likewise the list/link pointers below? Or does each child context have
> a full list of its siblings + parent?
> 

This function shouldn't be called by a child. Will add
GEM_BUG_ON(intel_context_is_child(ce)) to this function.

> 
>             bool skip = false;
>     +       struct intel_context *parent = ce;
> 
>             intel_context_get(ce);
> 
>     @@ -1209,25 +1217,34 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
>             if (unlikely(skip))
>                     goto out_put;
> 
>     -       rq = intel_context_find_active_request(ce);
>     -       if (!rq) {
>     -               head = ce->ring->tail;
>     -               stalled = false;
>     -               goto out_replay;
>     -       }
>     +       for (i = 0; i < number_children + 1; ++i) {
>     +               if (!intel_context_is_pinned(ce))
>     +                       goto next_context;
>     +
>     +               local_stalled = false;
>     +               rq = intel_context_find_active_request(ce);
>     +               if (!rq) {
>     +                       head = ce->ring->tail;
>     +                       goto out_replay;
>     +               }
> 
>     -       if (!i915_request_started(rq))
>     -               stalled = false;
>     +               GEM_BUG_ON(i915_active_is_idle(&ce->active));
>     +               head = intel_ring_wrap(ce->ring, rq->head);
> 
>     -       GEM_BUG_ON(i915_active_is_idle(&ce->active));
>     -       head = intel_ring_wrap(ce->ring, rq->head);
>     -       __i915_request_reset(rq, stalled);
>     +               if (i915_request_started(rq))
> 
> Why change the ordering of the started test versus the wrap/reset call? Is it
> significant? Why is it now important to be reversed?
>

Off the top of my head I'm not sure why the ordering was changed. I'll
have to double check on this.
 
> 
>     +                       local_stalled = true;
> 
>     +               __i915_request_reset(rq, local_stalled && stalled);
>      out_replay:
>     -       guc_reset_state(ce, head, stalled);
>     -       __unwind_incomplete_requests(ce);
>     +               guc_reset_state(ce, head, local_stalled && stalled);
>     +next_context:
>     +               if (i != number_children)
>     +                       ce = list_next_entry(ce, guc_child_link);
> 
> Can this not be put in to the step clause of the for statement?
> 

Maybe? Does list_next_entry blow up on the last entry? I don't know
offhand. If not, then yes, this can be moved into the step clause. Will
double check on this.
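
FWIW, list_next_entry() itself won't oops on the last entry -- it just
computes the container of the list head, which isn't a valid context to
dereference -- so some guard is still needed either way. One (untested)
way to fold it into the step clause, with the 'goto next_context'
becoming a plain 'continue':

	for (i = 0; i <= number_children;
	     ++i, ce = (i <= number_children) ?
			list_next_entry(ce, guc_child_link) : NULL) {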

> 
>     +       }
>     +
>     +       __unwind_incomplete_requests(parent);
>      out_put:
>     -       intel_context_put(ce);
>     +       intel_context_put(parent);
> 
> As above, I think this function would benefit from some comments to explain
> exactly what is being done and why.
>

Sure. Will add some comments.

Matt
 
> John.
> 
> 
>      }
> 
>      void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
>     @@ -1248,7 +1265,8 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
> 
>                     xa_unlock(&guc->context_lookup);
> 
>     -               if (intel_context_is_pinned(ce))
>     +               if (intel_context_is_pinned(ce) &&
>     +                   !intel_context_is_child(ce))
>                             __guc_reset_context(ce, stalled);
> 
>                     intel_context_put(ce);
>     @@ -1340,7 +1358,8 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
> 
>                     xa_unlock(&guc->context_lookup);
> 
>     -               if (intel_context_is_pinned(ce))
>     +               if (intel_context_is_pinned(ce) &&
>     +                   !intel_context_is_child(ce))
>                             guc_cancel_context_requests(ce);
> 
>                     intel_context_put(ce);
>     @@ -2031,6 +2050,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>             u16 guc_id;
>             bool enabled;
> 
>     +       GEM_BUG_ON(intel_context_is_child(ce));
>     +
>             spin_lock_irqsave(&ce->guc_state.lock, flags);
> 
>             incr_context_blocked(ce);
>     @@ -2068,6 +2089,7 @@ static void guc_context_unblock(struct intel_context *ce)
>             bool enable;
> 
>             GEM_BUG_ON(context_enabled(ce));
>     +       GEM_BUG_ON(intel_context_is_child(ce));
> 
>             spin_lock_irqsave(&ce->guc_state.lock, flags);
> 
>     @@ -2099,11 +2121,14 @@ static void guc_context_unblock(struct intel_context *ce)
>      static void guc_context_cancel_request(struct intel_context *ce,
>                                            struct i915_request *rq)
>      {
>     +       struct intel_context *block_context =
>     +               request_to_scheduling_context(rq);
>     +
>             if (i915_sw_fence_signaled(&rq->submit)) {
>                     struct i915_sw_fence *fence;
> 
>                     intel_context_get(ce);
>     -               fence = guc_context_block(ce);
>     +               fence = guc_context_block(block_context);
>                     i915_sw_fence_wait(fence);
>                     if (!i915_request_completed(rq)) {
>                             __i915_request_skip(rq);
>     @@ -2117,7 +2142,7 @@ static void guc_context_cancel_request(struct intel_context *ce,
>                      */
>                     flush_work(&ce_to_guc(ce)->ct.requests.worker);
> 
>     -               guc_context_unblock(ce);
>     +               guc_context_unblock(block_context);
>                     intel_context_put(ce);
>             }
>      }
>     @@ -2143,6 +2168,8 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
>             intel_wakeref_t wakeref;
>             unsigned long flags;
> 
>     +       GEM_BUG_ON(intel_context_is_child(ce));
>     +
>             guc_flush_submissions(guc);
> 
>             spin_lock_irqsave(&ce->guc_state.lock, flags);
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 15/27] drm/i915/guc: Implement multi-lrc submission
  2021-09-20 21:48   ` John Harrison
@ 2021-09-22 16:25     ` Matthew Brost
  2021-09-22 20:15       ` John Harrison
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-22 16:25 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 20, 2021 at 02:48:52PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Implement multi-lrc submission via a single workqueue entry and single
> > H2G. The workqueue entry contains an updated tail value for each
> > request, of all the contexts in the multi-lrc submission, and updates
> > these values simultaneously. As such, the tasklet and bypass path have
> > been updated to coalesce requests into a single submission.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  21 ++
> >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   6 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 312 +++++++++++++++---
> >   drivers/gpu/drm/i915/i915_request.h           |   8 +
> >   6 files changed, 317 insertions(+), 62 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > index fbfcae727d7f..879aef662b2e 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > @@ -748,3 +748,24 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
> >   		}
> >   	}
> >   }
> > +
> > +void intel_guc_write_barrier(struct intel_guc *guc)
> > +{
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +
> > +	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> > +		GEM_BUG_ON(guc->send_regs.fw_domains);
> Granted, this patch is just moving code from one file to another not
> changing it. However, I think it would be worth adding a blank line in here.
> Otherwise the 'this register' comment below can be confusingly read as
> referring to the send_regs.fw_domain entry above.
> 
> And maybe add a comment why it is a bug for the send_regs value to be set?
> I'm not seeing any obvious connection between it and the reset of this code.
> 

Can add a blank line. I think the GEM_BUG_ON relates to being able to
use intel_uncore_write_fw vs. intel_uncore_write. Can add a comment.
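
Rough sketch of the comment (assuming the GEM_BUG_ON really is about
forcewake, which I still need to confirm):

	/*
	 * The send registers need no forcewake here, so the cheaper
	 * intel_uncore_write_fw() below is safe; assert that no
	 * forcewake domains are in play for them.
	 */
	GEM_BUG_ON(guc->send_regs.fw_domains);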

> > +		/*
> > +		 * This register is used by the i915 and GuC for MMIO based
> > +		 * communication. Once we are in this code CTBs are the only
> > +		 * method the i915 uses to communicate with the GuC so it is
> > +		 * safe to write to this register (a value of 0 is NOP for MMIO
> > +		 * communication). If we ever start mixing CTBs and MMIOs a new
> > +		 * register will have to be chosen.
> > +		 */
> > +		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> > +	} else {
> > +		/* wmb() sufficient for a barrier if in smem */
> > +		wmb();
> > +	}
> > +}
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 3f95b1b4f15c..0ead2406d03c 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -37,6 +37,12 @@ struct intel_guc {
> >   	/* Global engine used to submit requests to GuC */
> >   	struct i915_sched_engine *sched_engine;
> >   	struct i915_request *stalled_request;
> > +	enum {
> > +		STALL_NONE,
> > +		STALL_REGISTER_CONTEXT,
> > +		STALL_MOVE_LRC_TAIL,
> > +		STALL_ADD_REQUEST,
> > +	} submission_stall_reason;
> >   	/* intel_guc_recv interrupt related state */
> >   	spinlock_t irq_lock;
> > @@ -332,4 +338,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
> >   void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
> > +void intel_guc_write_barrier(struct intel_guc *guc);
> > +
> >   #endif
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > index 20c710a74498..10d1878d2826 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > @@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
> >   	return ++ct->requests.last_fence;
> >   }
> > -static void write_barrier(struct intel_guc_ct *ct)
> > -{
> > -	struct intel_guc *guc = ct_to_guc(ct);
> > -	struct intel_gt *gt = guc_to_gt(guc);
> > -
> > -	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> > -		GEM_BUG_ON(guc->send_regs.fw_domains);
> > -		/*
> > -		 * This register is used by the i915 and GuC for MMIO based
> > -		 * communication. Once we are in this code CTBs are the only
> > -		 * method the i915 uses to communicate with the GuC so it is
> > -		 * safe to write to this register (a value of 0 is NOP for MMIO
> > -		 * communication). If we ever start mixing CTBs and MMIOs a new
> > -		 * register will have to be chosen.
> > -		 */
> > -		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> > -	} else {
> > -		/* wmb() sufficient for a barrier if in smem */
> > -		wmb();
> > -	}
> > -}
> > -
> >   static int ct_write(struct intel_guc_ct *ct,
> >   		    const u32 *action,
> >   		    u32 len /* in dwords */,
> > @@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
> >   	 * make sure H2G buffer update and LRC tail update (if this triggering a
> >   	 * submission) are visible before updating the descriptor tail
> >   	 */
> > -	write_barrier(ct);
> > +	intel_guc_write_barrier(ct_to_guc(ct));
> >   	/* update local copies */
> >   	ctb->tail = tail;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > index 0e600a3b8f1e..6cd26dc060d1 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > @@ -65,12 +65,14 @@
> >   #define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
> >   #define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
> >   #define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
> > -#define WQ_TARGET_SHIFT			10
> > +#define   WQ_TYPE_MULTI_LRC		(0x5 << WQ_TYPE_SHIFT)
> > +#define WQ_TARGET_SHIFT			8
> >   #define WQ_LEN_SHIFT			16
> >   #define WQ_NO_WCFLUSH_WAIT		(1 << 27)
> >   #define WQ_PRESENT_WORKLOAD		(1 << 28)
> > -#define WQ_RING_TAIL_SHIFT		20
> > +#define WQ_GUC_ID_SHIFT			0
> > +#define WQ_RING_TAIL_SHIFT		18
> Presumably all of these API changes are not actually new? They really came
> in with the reset of the v40 re-write? It's just that this is the first time
> we are using them and therefore need to finally update the defines?
> 

Yes.

> >   #define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
> >   #define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index e9dfd43d29a0..b107ad095248 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -391,6 +391,29 @@ __get_process_desc(struct intel_context *ce)
> >   		   LRC_STATE_OFFSET) / sizeof(u32)));
> >   }
> > +static u32 *get_wq_pointer(struct guc_process_desc *desc,
> > +			   struct intel_context *ce,
> > +			   u32 wqi_size)
> > +{
> > +	/*
> > +	 * Check for space in work queue. Caching a value of head pointer in
> > +	 * intel_context structure in order reduce the number accesses to shared
> > +	 * GPU memory which may be across a PCIe bus.
> > +	 */
> > +#define AVAILABLE_SPACE	\
> > +	CIRC_SPACE(ce->guc_wqi_tail, ce->guc_wqi_head, GUC_WQ_SIZE)
> > +	if (wqi_size > AVAILABLE_SPACE) {
> > +		ce->guc_wqi_head = READ_ONCE(desc->head);
> > +
> > +		if (wqi_size > AVAILABLE_SPACE)
> > +			return NULL;
> > +	}
> > +#undef AVAILABLE_SPACE
> > +
> > +	return ((u32 *)__get_process_desc(ce)) +
> > +		((WQ_OFFSET + ce->guc_wqi_tail) / sizeof(u32));
> > +}
> > +
> >   static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> >   {
> >   	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > @@ -547,10 +570,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
> >   static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
> > -static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > +static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   {
> >   	int err = 0;
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> >   	u32 action[3];
> >   	int len = 0;
> >   	u32 g2h_len_dw = 0;
> > @@ -571,26 +594,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
> >   	GEM_BUG_ON(context_guc_id_invalid(ce));
> > -	/*
> > -	 * Corner case where the GuC firmware was blown away and reloaded while
> > -	 * this context was pinned.
> > -	 */
> > -	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
> > -		err = guc_lrc_desc_pin(ce, false);
> > -		if (unlikely(err))
> > -			return err;
> > -	}
> > -
> >   	spin_lock(&ce->guc_state.lock);
> >   	/*
> >   	 * The request / context will be run on the hardware when scheduling
> > -	 * gets enabled in the unblock.
> > +	 * gets enabled in the unblock. For multi-lrc we still submit the
> > +	 * context to move the LRC tails.
> >   	 */
> > -	if (unlikely(context_blocked(ce)))
> > +	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
> >   		goto out;
> > -	enabled = context_enabled(ce);
> > +	enabled = context_enabled(ce) || context_blocked(ce);
> Would be better to say '|| is_parent(ce)' rather than blocked? The reason
> for reason for claiming enabled when not is because it's a multi-LRC parent,
> right? Or can there be a parent that is neither enabled nor blocked for
> which we don't want to do the processing? But why would that make sense/be
> possible?
> 

No. If it is a parent and blocked we still want to do the submit step
(moving the LRC tails via the work queue item) but we don't want to
enable scheduling. In the non-multi-lrc case the submit has already
been done by the i915 (moving the LRC tail).
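
Could extend the comment in the patch to something like (sketch):

	/*
	 * The request / context will be run on the hardware when
	 * scheduling gets enabled in the unblock. For multi-lrc the
	 * submit step (the WQ item moving the LRC tails) has still been
	 * done, so treat a blocked parent as 'enabled' to avoid sending
	 * a schedule-enable H2G while blocked.
	 */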

> >   	if (!enabled) {
> >   		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
> > @@ -609,6 +623,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   		trace_intel_context_sched_enable(ce);
> >   		atomic_inc(&guc->outstanding_submission_g2h);
> >   		set_context_enabled(ce);
> > +
> > +		/*
> > +		 * Without multi-lrc KMD does the submission step (moving the
> > +		 * lrc tail) so enabling scheduling is sufficient to submit the
> > +		 * context. This isn't the case in multi-lrc submission as the
> > +		 * GuC needs to move the tails, hence the need for another H2G
> > +		 * to submit a multi-lrc context after enabling scheduling.
> > +		 */
> > +		if (intel_context_is_parent(ce)) {
> > +			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
> > +			err = intel_guc_send_nb(guc, action, len - 1, 0);
> > +		}
> >   	} else if (!enabled) {
> >   		clr_context_pending_enable(ce);
> >   		intel_context_put(ce);
> > @@ -621,6 +647,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> >   	return err;
> >   }
> > +static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > +{
> > +	int ret = __guc_add_request(guc, rq);
> > +
> > +	if (unlikely(ret == -EBUSY)) {
> > +		guc->stalled_request= rq;
> > +		guc->submission_stall_reason = STALL_ADD_REQUEST;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >   static void guc_set_lrc_tail(struct i915_request *rq)
> >   {
> >   	rq->context->lrc_reg_state[CTX_RING_TAIL] =
> > @@ -632,6 +670,127 @@ static int rq_prio(const struct i915_request *rq)
> >   	return rq->sched.attr.priority;
> >   }
> > +static bool is_multi_lrc_rq(struct i915_request *rq)
> > +{
> > +	return intel_context_is_child(rq->context) ||
> > +		intel_context_is_parent(rq->context);
> > +}
> > +
> > +static bool can_merge_rq(struct i915_request *rq,
> > +			 struct i915_request *last)
> > +{
> > +	return request_to_scheduling_context(rq) ==
> > +		request_to_scheduling_context(last);
> > +}
> > +
> > +static u32 wq_space_until_wrap(struct intel_context *ce)
> > +{
> > +	return (GUC_WQ_SIZE - ce->guc_wqi_tail);
> > +}
> > +
> > +static void write_wqi(struct guc_process_desc *desc,
> > +		      struct intel_context *ce,
> > +		      u32 wqi_size)
> > +{
> > +	/*
> > +	 * Ensure WQE are visible before updating tail
> WQE or WQI?
>

WQI (work queue item) is the convention used, but I actually like WQE
(work queue entry) better. Will change the name to WQE everywhere.
 
> > +	 */
> > +	intel_guc_write_barrier(ce_to_guc(ce));
> > +
> > +	ce->guc_wqi_tail = (ce->guc_wqi_tail + wqi_size) & (GUC_WQ_SIZE - 1);
> > +	WRITE_ONCE(desc->tail, ce->guc_wqi_tail);
> > +}
> > +
> > +static int guc_wq_noop_append(struct intel_context *ce)
> > +{
> > +	struct guc_process_desc *desc = __get_process_desc(ce);
> > +	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
> > +
> > +	if (!wqi)
> > +		return -EBUSY;
> > +
> > +	*wqi = WQ_TYPE_NOOP |
> > +		((wq_space_until_wrap(ce) / sizeof(u32) - 1) << WQ_LEN_SHIFT);
> This should have a BUG_ON check that the requested size fits within the
> WQ_LEN field?
>

I could add that.
 
> Indeed, would be better to use the FIELD macros as they do that kind of
> thing for you.
>

Yes, they do. I forget exactly how they work; will figure this out and
use the macros.
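
Something like this, I think (sketch; the *_MASK defines are
hypothetical -- only the *_SHIFT versions exist today -- and the
WQ_TYPE_* values would be redefined as raw values without the built-in
shift):

	#include <linux/bitfield.h>

	#define WQ_TYPE_MASK	GENMASK(7, 0)
	#define WQ_LEN_MASK	GENMASK(26, 16)

	*wqi = FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
	       FIELD_PREP(WQ_LEN_MASK,
			  wq_space_until_wrap(ce) / sizeof(u32) - 1);

FIELD_PREP() confines the value to the field and build-time checks
constant overflow, which covers most of the size check asked for above;
a runtime GEM_BUG_ON on the length would still be extra.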
 
> 
> > +	ce->guc_wqi_tail = 0;
> > +
> > +	return 0;
> > +}
> > +
> > +static int __guc_wq_item_append(struct i915_request *rq)
> > +{
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +	struct intel_context *child;
> > +	struct guc_process_desc *desc = __get_process_desc(ce);
> > +	unsigned int wqi_size = (ce->guc_number_children + 4) *
> > +		sizeof(u32);
> > +	u32 *wqi;
> > +	int ret;
> > +
> > +	/* Ensure context is in correct state updating work queue */
> > +	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
> > +	GEM_BUG_ON(context_guc_id_invalid(ce));
> > +	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
> > +	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
> > +
> > +	/* Insert NOOP if this work queue item will wrap the tail pointer. */
> > +	if (wqi_size > wq_space_until_wrap(ce)) {
> > +		ret = guc_wq_noop_append(ce);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	wqi = get_wq_pointer(desc, ce, wqi_size);
> > +	if (!wqi)
> > +		return -EBUSY;
> > +
> > +	*wqi++ = WQ_TYPE_MULTI_LRC |
> > +		((wqi_size / sizeof(u32) - 1) << WQ_LEN_SHIFT);
> > +	*wqi++ = ce->lrc.lrca;
> > +	*wqi++ = (ce->guc_id.id << WQ_GUC_ID_SHIFT) |
> > +		 ((ce->ring->tail / sizeof(u64)) << WQ_RING_TAIL_SHIFT);
> As above, would be better to use FIELD macros instead of manual shifting.
> 

Will do.

Matt

> John.
> 
> 
> > +	*wqi++ = 0;	/* fence_id */
> > +	for_each_child(ce, child)
> > +		*wqi++ = child->ring->tail / sizeof(u64);
> > +
> > +	write_wqi(desc, ce, wqi_size);
> > +
> > +	return 0;
> > +}
> > +
> > +static int guc_wq_item_append(struct intel_guc *guc,
> > +			      struct i915_request *rq)
> > +{
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +	int ret = 0;
> > +
> > +	if (likely(!intel_context_is_banned(ce))) {
> > +		ret = __guc_wq_item_append(rq);
> > +
> > +		if (unlikely(ret == -EBUSY)) {
> > +			guc->stalled_request = rq;
> > +			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
> > +		}
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static bool multi_lrc_submit(struct i915_request *rq)
> > +{
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +
> > +	intel_ring_set_tail(rq->ring, rq->tail);
> > +
> > +	/*
> > +	 * We expect the front end (execbuf IOCTL) to set this flag on the last
> > +	 * request generated from a multi-BB submission. This indicates to the
> > +	 * backend (GuC interface) that we should submit this context thus
> > +	 * submitting all the requests generated in parallel.
> > +	 */
> > +	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
> > +		intel_context_is_banned(ce);
> > +}
> > +
> >   static int guc_dequeue_one_context(struct intel_guc *guc)
> >   {
> >   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> > @@ -645,7 +804,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
> >   	if (guc->stalled_request) {
> >   		submit = true;
> >   		last = guc->stalled_request;
> > -		goto resubmit;
> > +
> > +		switch (guc->submission_stall_reason) {
> > +		case STALL_REGISTER_CONTEXT:
> > +			goto register_context;
> > +		case STALL_MOVE_LRC_TAIL:
> > +			goto move_lrc_tail;
> > +		case STALL_ADD_REQUEST:
> > +			goto add_request;
> > +		default:
> > +			MISSING_CASE(guc->submission_stall_reason);
> > +		}
> >   	}
> >   	while ((rb = rb_first_cached(&sched_engine->queue))) {
> > @@ -653,8 +822,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
> >   		struct i915_request *rq, *rn;
> >   		priolist_for_each_request_consume(rq, rn, p) {
> > -			if (last && rq->context != last->context)
> > -				goto done;
> > +			if (last && !can_merge_rq(rq, last))
> > +				goto register_context;
> >   			list_del_init(&rq->sched.link);
> > @@ -662,33 +831,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
> >   			trace_i915_request_in(rq, 0);
> >   			last = rq;
> > -			submit = true;
> > +
> > +			if (is_multi_lrc_rq(rq)) {
> > +				/*
> > +				 * We need to coalesce all multi-lrc requests in
> > +				 * a relationship into a single H2G. We are
> > +				 * guaranteed that all of these requests will be
> > +				 * submitted sequentially.
> > +				 */
> > +				if (multi_lrc_submit(rq)) {
> > +					submit = true;
> > +					goto register_context;
> > +				}
> > +			} else {
> > +				submit = true;
> > +			}
> >   		}
> >   		rb_erase_cached(&p->node, &sched_engine->queue);
> >   		i915_priolist_free(p);
> >   	}
> > -done:
> > +
> > +register_context:
> >   	if (submit) {
> > -		guc_set_lrc_tail(last);
> > -resubmit:
> > +		struct intel_context *ce = request_to_scheduling_context(last);
> > +
> > +		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
> > +			     !intel_context_is_banned(ce))) {
> > +			ret = guc_lrc_desc_pin(ce, false);
> > +			if (unlikely(ret == -EPIPE)) {
> > +				goto deadlk;
> > +			} else if (ret == -EBUSY) {
> > +				guc->stalled_request = last;
> > +				guc->submission_stall_reason =
> > +					STALL_REGISTER_CONTEXT;
> > +				goto schedule_tasklet;
> > +			} else if (ret != 0) {
> > +				GEM_WARN_ON(ret);	/* Unexpected */
> > +				goto deadlk;
> > +			}
> > +		}
> > +
> > +move_lrc_tail:
> > +		if (is_multi_lrc_rq(last)) {
> > +			ret = guc_wq_item_append(guc, last);
> > +			if (ret == -EBUSY) {
> > +				goto schedule_tasklet;
> > +			} else if (ret != 0) {
> > +				GEM_WARN_ON(ret);	/* Unexpected */
> > +				goto deadlk;
> > +			}
> > +		} else {
> > +			guc_set_lrc_tail(last);
> > +		}
> > +
> > +add_request:
> >   		ret = guc_add_request(guc, last);
> > -		if (unlikely(ret == -EPIPE))
> > +		if (unlikely(ret == -EPIPE)) {
> > +			goto deadlk;
> > +		} else if (ret == -EBUSY) {
> > +			goto schedule_tasklet;
> > +		} else if (ret != 0) {
> > +			GEM_WARN_ON(ret);	/* Unexpected */
> >   			goto deadlk;
> > -		else if (ret == -EBUSY) {
> > -			tasklet_schedule(&sched_engine->tasklet);
> > -			guc->stalled_request = last;
> > -			return false;
> >   		}
> >   	}
> >   	guc->stalled_request = NULL;
> > +	guc->submission_stall_reason = STALL_NONE;
> >   	return submit;
> >   deadlk:
> >   	sched_engine->tasklet.callback = NULL;
> >   	tasklet_disable_nosync(&sched_engine->tasklet);
> >   	return false;
> > +
> > +schedule_tasklet:
> > +	tasklet_schedule(&sched_engine->tasklet);
> > +	return false;
> >   }
> >   static void guc_submission_tasklet(struct tasklet_struct *t)
> > @@ -1227,10 +1447,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
> >   	trace_i915_request_in(rq, 0);
> > -	guc_set_lrc_tail(rq);
> > -	ret = guc_add_request(guc, rq);
> > -	if (ret == -EBUSY)
> > -		guc->stalled_request = rq;
> > +	if (is_multi_lrc_rq(rq)) {
> > +		if (multi_lrc_submit(rq)) {
> > +			ret = guc_wq_item_append(guc, rq);
> > +			if (!ret)
> > +				ret = guc_add_request(guc, rq);
> > +		}
> > +	} else {
> > +		guc_set_lrc_tail(rq);
> > +		ret = guc_add_request(guc, rq);
> > +	}
> >   	if (unlikely(ret == -EPIPE))
> >   		disable_submission(guc);
> > @@ -1238,6 +1464,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
> >   	return ret;
> >   }
> > +bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
> > +{
> > +	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +
> > +	return submission_disabled(guc) || guc->stalled_request ||
> > +		!i915_sched_engine_is_empty(sched_engine) ||
> > +		!lrc_desc_registered(guc, ce->guc_id.id);
> > +}
> > +
> >   static void guc_submit_request(struct i915_request *rq)
> >   {
> >   	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> > @@ -1247,8 +1483,7 @@ static void guc_submit_request(struct i915_request *rq)
> >   	/* Will be called from irq-context when using foreign fences. */
> >   	spin_lock_irqsave(&sched_engine->lock, flags);
> > -	if (submission_disabled(guc) || guc->stalled_request ||
> > -	    !i915_sched_engine_is_empty(sched_engine))
> > +	if (need_tasklet(guc, rq))
> >   		queue_request(sched_engine, rq, rq_prio(rq));
> >   	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
> >   		tasklet_hi_schedule(&sched_engine->tasklet);
> > @@ -2241,9 +2476,10 @@ static bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
> >   static void add_to_context(struct i915_request *rq)
> >   {
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> >   	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
> >   	spin_lock(&ce->guc_state.lock);
> > @@ -2276,7 +2512,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
> >   static void remove_from_context(struct i915_request *rq)
> >   {
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >   	spin_lock_irq(&ce->guc_state.lock);
> > @@ -2692,7 +2930,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
> >   static void guc_bump_inflight_request_prio(struct i915_request *rq,
> >   					   int prio)
> >   {
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> >   	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
> >   	/* Short circuit function */
> > @@ -2715,7 +2953,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
> >   static void guc_retire_inflight_request_prio(struct i915_request *rq)
> >   {
> > -	struct intel_context *ce = rq->context;
> > +	struct intel_context *ce = request_to_scheduling_context(rq);
> >   	spin_lock(&ce->guc_state.lock);
> >   	guc_prio_fini(rq, ce);
> > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > index 177eaf55adff..8f0073e19079 100644
> > --- a/drivers/gpu/drm/i915/i915_request.h
> > +++ b/drivers/gpu/drm/i915/i915_request.h
> > @@ -139,6 +139,14 @@ enum {
> >   	 * the GPU. Here we track such boost requests on a per-request basis.
> >   	 */
> >   	I915_FENCE_FLAG_BOOST,
> > +
> > +	/*
> > +	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
> > +	 * parent-child relationship (parallel submission, multi-lrc) should
> > +	 * trigger a submission to the GuC rather than just moving the context
> > +	 * tail.
> > +	 */
> > +	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> >   };
> >   /**
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 20/27] drm/i915/guc: Connect UAPI to GuC multi-lrc interface
  2021-09-21  0:09   ` John Harrison
@ 2021-09-22 16:38     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-22 16:38 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Mon, Sep 20, 2021 at 05:09:28PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > Introduce 'set parallel submit' extension to connect UAPI to GuC
> > multi-lrc interface. Kernel doc in new uAPI should explain it all.
> > 
> > IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> > media UMD: link to come
> Is this link still not available?
> 

Have it now: https://github.com/intel/media-driver/pull/1252

> Also, see 'kernel test robot' emails saying that sparse is complaining about
> something I don't understand but presumably needs to be fixed.
>

Yeah, those warnings need to be fixed.
 
> 
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add IGT link and placeholder for media UMD link
> > 
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_context.c   | 220 +++++++++++++++++-
> >   .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +-
> >   drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
> >   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
> >   .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
> >   drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 114 ++++++++-
> >   include/uapi/drm/i915_drm.h                   | 128 ++++++++++
> >   9 files changed, 485 insertions(+), 28 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index bcaaf514876b..de0fd145fb47 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -522,9 +522,149 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
> >   	return 0;
> >   }
> > +static int
> > +set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
> > +				      void *data)
> > +{
> > +	struct i915_context_engines_parallel_submit __user *ext =
> > +		container_of_user(base, typeof(*ext), base);
> > +	const struct set_proto_ctx_engines *set = data;
> > +	struct drm_i915_private *i915 = set->i915;
> > +	u64 flags;
> > +	int err = 0, n, i, j;
> > +	u16 slot, width, num_siblings;
> > +	struct intel_engine_cs **siblings = NULL;
> > +	intel_engine_mask_t prev_mask;
> > +
> > +	/* Disabling for now */
> > +	return -ENODEV;
> > +
> > +	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
> > +		return -ENODEV;
> This needs a FIXME comment to say that exec list will be added later.
> 

Sure.

> > +
> > +	if (get_user(slot, &ext->engine_index))
> > +		return -EFAULT;
> > +
> > +	if (get_user(width, &ext->width))
> > +		return -EFAULT;
> > +
> > +	if (get_user(num_siblings, &ext->num_siblings))
> > +		return -EFAULT;
> > +
> > +	if (slot >= set->num_engines) {
> > +		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
> > +			slot, set->num_engines);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
> > +		drm_dbg(&i915->drm,
> > +			"Invalid placement[%d], already occupied\n", slot);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (get_user(flags, &ext->flags))
> > +		return -EFAULT;
> > +
> > +	if (flags) {
> > +		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
> > +		return -EINVAL;
> > +	}
> > +
> > +	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
> > +		err = check_user_mbz(&ext->mbz64[n]);
> > +		if (err)
> > +			return err;
> > +	}
> > +
> > +	if (width < 2) {
> > +		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (num_siblings < 1) {
> > +		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
> > +			num_siblings);
> > +		return -EINVAL;
> > +	}
> > +
> > +	siblings = kmalloc_array(num_siblings * width,
> > +				 sizeof(*siblings),
> > +				 GFP_KERNEL);
> > +	if (!siblings)
> > +		return -ENOMEM;
> > +
> > +	/* Create contexts / engines */
> > +	for (i = 0; i < width; ++i) {
> > +		intel_engine_mask_t current_mask = 0;
> > +		struct i915_engine_class_instance prev_engine;
> > +
> > +		for (j = 0; j < num_siblings; ++j) {
> > +			struct i915_engine_class_instance ci;
> > +
> > +			n = i * num_siblings + j;
> > +			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
> > +				err = -EFAULT;
> > +				goto out_err;
> > +			}
> > +
> > +			siblings[n] =
> > +				intel_engine_lookup_user(i915, ci.engine_class,
> > +							 ci.engine_instance);
> > +			if (!siblings[n]) {
> > +				drm_dbg(&i915->drm,
> > +					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
> > +					n, ci.engine_class, ci.engine_instance);
> > +				err = -EINVAL;
> > +				goto out_err;
> > +			}
> > +
> > +			if (n) {
> > +				if (prev_engine.engine_class !=
> > +				    ci.engine_class) {
> > +					drm_dbg(&i915->drm,
> > +						"Mismatched class %d, %d\n",
> > +						prev_engine.engine_class,
> > +						ci.engine_class);
> > +					err = -EINVAL;
> > +					goto out_err;
> > +				}
> > +			}
> > +
> > +			prev_engine = ci;
> > +			current_mask |= siblings[n]->logical_mask;
> > +		}
> > +
> > +		if (i > 0) {
> > +			if (current_mask != prev_mask << 1) {
> > +				drm_dbg(&i915->drm,
> > +					"Non contiguous logical mask 0x%x, 0x%x\n",
> > +					prev_mask, current_mask);
> > +				err = -EINVAL;
> > +				goto out_err;
> > +			}
> > +		}
> > +		prev_mask = current_mask;
> > +	}
> > +
> > +	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
> > +	set->engines[slot].num_siblings = num_siblings;
> > +	set->engines[slot].width = width;
> > +	set->engines[slot].siblings = siblings;
> > +
> > +	return 0;
> > +
> > +out_err:
> > +	kfree(siblings);
> > +
> > +	return err;
> > +}
> > +
> >   static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
> >   	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
> >   	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
> > +	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
> > +		set_proto_ctx_engines_parallel_submit,
> >   };
> >   static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
> > @@ -821,6 +961,25 @@ static int intel_context_set_gem(struct intel_context *ce,
> >   	return ret;
> >   }
> > +static void __unpin_engines(struct i915_gem_engines *e, unsigned int count)
> > +{
> > +	while (count--) {
> > +		struct intel_context *ce = e->engines[count], *child;
> > +
> > +		if (!ce || !test_bit(CONTEXT_PERMA_PIN, &ce->flags))
> > +			continue;
> > +
> > +		for_each_child(ce, child)
> > +			intel_context_unpin(child);
> > +		intel_context_unpin(ce);
> > +	}
> > +}
> > +
> > +static void unpin_engines(struct i915_gem_engines *e)
> > +{
> > +	__unpin_engines(e, e->num_engines);
> > +}
> > +
> >   static void __free_engines(struct i915_gem_engines *e, unsigned int count)
> >   {
> >   	while (count--) {
> > @@ -936,6 +1095,40 @@ static struct i915_gem_engines *default_engines(struct i915_gem_context *ctx,
> >   	return err;
> >   }
> > +static int perma_pin_contexts(struct intel_context *ce)
> What is this perma_ping thing about?
>

This is per Daniel Vetter's suggestion in a previous rev. Basically, to
simplify the parallel submit implementation we pin the contexts at
creation time and unpin them when the contexts get destroyed. It
removes complexity from gt/intel_context.c, the execbuf IOCTL, and the
backend pinning / unpinning functions.
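
So the lifecycle is simply (sketch, using the functions from this
patch):

	/* user_engines(), at context creation */
	ce = intel_engine_create_parallel(siblings, num_siblings, width);
	ret = perma_pin_contexts(ce);	/* parent + children pinned */

	/* ... execbuf never has to pin / unpin these contexts ... */

	/* context_close(), at context destruction */
	unpin_engines(ctx->engines);	/* drops the perma-pins */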
 
> > +{
> > +	struct intel_context *child;
> > +	int i = 0, j = 0, ret;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	ret = intel_context_pin(ce);
> > +	if (unlikely(ret))
> > +		return ret;
> > +
> > +	for_each_child(ce, child) {
> > +		ret = intel_context_pin(child);
> > +		if (unlikely(ret))
> > +			goto unwind;
> > +		++i;
> > +	}
> > +
> > +	set_bit(CONTEXT_PERMA_PIN, &ce->flags);
> > +
> > +	return 0;
> > +
> > +unwind:
> > +	intel_context_unpin(ce);
> > +	for_each_child(ce, child) {
> > +		if (j++ < i)
> > +			intel_context_unpin(child);
> > +		else
> > +			break;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >   static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   					     unsigned int num_engines,
> >   					     struct i915_gem_proto_engine *pe)
> > @@ -946,7 +1139,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   	e = alloc_engines(num_engines);
> >   	e->num_engines = num_engines;
> >   	for (n = 0; n < num_engines; n++) {
> > -		struct intel_context *ce;
> > +		struct intel_context *ce, *child;
> >   		int ret;
> >   		switch (pe[n].type) {
> > @@ -956,7 +1149,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   		case I915_GEM_ENGINE_TYPE_BALANCED:
> >   			ce = intel_engine_create_virtual(pe[n].siblings,
> > -							 pe[n].num_siblings);
> > +							 pe[n].num_siblings, 0);
> > +			break;
> > +
> > +		case I915_GEM_ENGINE_TYPE_PARALLEL:
> > +			ce = intel_engine_create_parallel(pe[n].siblings,
> > +							  pe[n].num_siblings,
> > +							  pe[n].width);
> >   			break;
> >   		case I915_GEM_ENGINE_TYPE_INVALID:
> > @@ -977,6 +1176,22 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
> >   			err = ERR_PTR(ret);
> >   			goto free_engines;
> >   		}
> > +		for_each_child(ce, child) {
> > +			ret = intel_context_set_gem(child, ctx, pe->sseu);
> > +			if (ret) {
> > +				err = ERR_PTR(ret);
> > +				goto free_engines;
> > +			}
> > +		}
> > +
> > +		/* XXX: Must be done after setting gem context */
> Why the 'XXX'? Is it saying that the ordering is a problem that needs to be
> fixed?
> 

Added 'XXX' because originally I hid this behavior in the vfunc used by
intel_engine_create_parallel, but as the comment says we need the GEM
context set first. In theory we could fix this with a bit more rework
so all of this lives in the backend, hence the 'XXX'.

> > +		if (pe[n].type == I915_GEM_ENGINE_TYPE_PARALLEL) {
> > +			ret = perma_pin_contexts(ce);
> > +			if (ret) {
> > +				err = ERR_PTR(ret);
> > +				goto free_engines;
> > +			}
> > +		}
> >   	}
> >   	return e;
> > @@ -1200,6 +1415,7 @@ static void context_close(struct i915_gem_context *ctx)
> >   	/* Flush any concurrent set_engines() */
> >   	mutex_lock(&ctx->engines_mutex);
> > +	unpin_engines(ctx->engines);
> >   	engines_idle_release(ctx, rcu_replace_pointer(ctx->engines, NULL, 1));
> >   	i915_gem_context_set_closed(ctx);
> >   	mutex_unlock(&ctx->engines_mutex);
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> > index 94c03a97cb77..7b096d83bca1 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> > @@ -78,6 +78,9 @@ enum i915_gem_engine_type {
> >   	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
> >   	I915_GEM_ENGINE_TYPE_BALANCED,
> > +
> > +	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
> > +	I915_GEM_ENGINE_TYPE_PARALLEL,
> >   };
> >   /**
> > @@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
> >   	/** @num_siblings: Number of balanced siblings */
> >   	unsigned int num_siblings;
> > +	/** @width: Width of each sibling */
> > +	unsigned int width;
> > +
> >   	/** @siblings: Balanced siblings */
> >   	struct intel_engine_cs **siblings;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index a63329520c35..713d85b0b364 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -55,9 +55,13 @@ struct intel_context_ops {
> >   	void (*reset)(struct intel_context *ce);
> >   	void (*destroy)(struct kref *kref);
> > -	/* virtual engine/context interface */
> > +	/* virtual/parallel engine/context interface */
> >   	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
> > -						unsigned int count);
> > +						unsigned int count,
> > +						unsigned long flags);
> > +	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
> > +						 unsigned int num_siblings,
> > +						 unsigned int width);
> >   	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
> >   					       unsigned int sibling);
> >   };
> > @@ -113,6 +117,7 @@ struct intel_context {
> >   #define CONTEXT_NOPREEMPT		8
> >   #define CONTEXT_LRCA_DIRTY		9
> >   #define CONTEXT_GUC_INIT		10
> > +#define CONTEXT_PERMA_PIN		11
> >   	struct {
> >   		u64 timeout_us;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> > index 87579affb952..43f16a8347ee 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> > @@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
> >   	return intel_engine_has_preemption(engine);
> >   }
> > +#define FORCE_VIRTUAL	BIT(0)
> >   struct intel_context *
> >   intel_engine_create_virtual(struct intel_engine_cs **siblings,
> > -			    unsigned int count);
> > +			    unsigned int count, unsigned long flags);
> > +
> > +static inline struct intel_context *
> > +intel_engine_create_parallel(struct intel_engine_cs **engines,
> > +			     unsigned int num_engines,
> > +			     unsigned int width)
> > +{
> > +	GEM_BUG_ON(!engines[0]->cops->create_parallel);
> > +	return engines[0]->cops->create_parallel(engines, num_engines, width);
> > +}
> >   static inline bool
> >   intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > index 4d790f9a65dd..f66c75c77584 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > @@ -1923,16 +1923,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
> >   struct intel_context *
> >   intel_engine_create_virtual(struct intel_engine_cs **siblings,
> > -			    unsigned int count)
> > +			    unsigned int count, unsigned long flags)
> >   {
> >   	if (count == 0)
> >   		return ERR_PTR(-EINVAL);
> > -	if (count == 1)
> > +	if (count == 1 && !(flags & FORCE_VIRTUAL))
> >   		return intel_context_create(siblings[0]);
> >   	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
> > -	return siblings[0]->cops->create_virtual(siblings, count);
> > +	return siblings[0]->cops->create_virtual(siblings, count, flags);
> >   }
> >   struct i915_request *
> > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > index 813a6de01382..d1e2d6f8ff81 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > @@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
> >   }
> >   static struct intel_context *
> > -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> > +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> > +			 unsigned long flags);
> >   static struct i915_request *
> >   __active_request(const struct intel_timeline * const tl,
> > @@ -3782,7 +3783,8 @@ static void virtual_submit_request(struct i915_request *rq)
> >   }
> >   static struct intel_context *
> > -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> > +			 unsigned long flags)
> >   {
> >   	struct virtual_engine *ve;
> >   	unsigned int n;
> > diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> > index f12ffe797639..e876a9d88a5c 100644
> > --- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
> > +++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> > @@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
> >   	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
> >   	for (n = 0; n < nctx; n++) {
> > -		ve[n] = intel_engine_create_virtual(siblings, nsibling);
> > +		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
> >   		if (IS_ERR(ve[n])) {
> >   			err = PTR_ERR(ve[n]);
> >   			nctx = n;
> > @@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
> >   	 * restrict it to our desired engine within the virtual engine.
> >   	 */
> > -	ve = intel_engine_create_virtual(siblings, nsibling);
> > +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
> >   	if (IS_ERR(ve)) {
> >   		err = PTR_ERR(ve);
> >   		goto out_close;
> > @@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
> >   		i915_request_add(rq);
> >   	}
> > -	ce = intel_engine_create_virtual(siblings, nsibling);
> > +	ce = intel_engine_create_virtual(siblings, nsibling, 0);
> >   	if (IS_ERR(ce)) {
> >   		err = PTR_ERR(ce);
> >   		goto out;
> > @@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
> >   	/* XXX We do not handle oversubscription and fairness with normal rq */
> >   	for (n = 0; n < nsibling; n++) {
> > -		ce = intel_engine_create_virtual(siblings, nsibling);
> > +		ce = intel_engine_create_virtual(siblings, nsibling, 0);
> >   		if (IS_ERR(ce)) {
> >   			err = PTR_ERR(ce);
> >   			goto out;
> > @@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
> >   	if (err)
> >   		goto out_scratch;
> > -	ve = intel_engine_create_virtual(siblings, nsibling);
> > +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
> >   	if (IS_ERR(ve)) {
> >   		err = PTR_ERR(ve);
> >   		goto out_scratch;
> > @@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
> >   	if (igt_spinner_init(&spin, gt))
> >   		return -ENOMEM;
> > -	ve = intel_engine_create_virtual(siblings, nsibling);
> > +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
> >   	if (IS_ERR(ve)) {
> >   		err = PTR_ERR(ve);
> >   		goto out_spin;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 07eee9a399c8..2554d0eb4afd 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -121,7 +121,13 @@ struct guc_virtual_engine {
> >   };
> >   static struct intel_context *
> > -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> > +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> > +		   unsigned long flags);
> > +
> > +static struct intel_context *
> > +guc_create_parallel(struct intel_engine_cs **engines,
> > +		    unsigned int num_siblings,
> > +		    unsigned int width);
> >   #define GUC_REQUEST_SIZE 64 /* bytes */
> > @@ -2581,6 +2587,7 @@ static const struct intel_context_ops guc_context_ops = {
> >   	.destroy = guc_context_destroy,
> >   	.create_virtual = guc_create_virtual,
> > +	.create_parallel = guc_create_parallel,
> >   };
> >   static void submit_work_cb(struct irq_work *wrk)
> > @@ -2827,8 +2834,6 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> >   	.get_sibling = guc_virtual_get_sibling,
> >   };
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> >   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > @@ -2845,8 +2850,6 @@ static int guc_parent_context_pin(struct intel_context *ce, void *vaddr)
> >   	return __guc_context_pin(ce, engine, vaddr);
> >   }
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
> >   {
> >   	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > @@ -2858,8 +2861,6 @@ static int guc_child_context_pin(struct intel_context *ce, void *vaddr)
> >   	return __guc_context_pin(ce, engine, vaddr);
> >   }
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static void guc_parent_context_unpin(struct intel_context *ce)
> >   {
> >   	struct intel_guc *guc = ce_to_guc(ce);
> > @@ -2875,8 +2876,6 @@ static void guc_parent_context_unpin(struct intel_context *ce)
> >   	lrc_unpin(ce);
> >   }
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static void guc_child_context_unpin(struct intel_context *ce)
> >   {
> >   	GEM_BUG_ON(context_enabled(ce));
> > @@ -2887,8 +2886,6 @@ static void guc_child_context_unpin(struct intel_context *ce)
> >   	lrc_unpin(ce);
> >   }
> > -/* Future patches will use this function */
> > -__maybe_unused
> >   static void guc_child_context_post_unpin(struct intel_context *ce)
> >   {
> >   	GEM_BUG_ON(!intel_context_is_child(ce));
> > @@ -2899,6 +2896,98 @@ static void guc_child_context_post_unpin(struct intel_context *ce)
> >   	intel_context_unpin(ce->parent);
> >   }
> > +static void guc_child_context_destroy(struct kref *kref)
> > +{
> > +	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> > +
> > +	__guc_context_destroy(ce);
> > +}
> > +
> > +static const struct intel_context_ops virtual_parent_context_ops = {
> > +	.alloc = guc_virtual_context_alloc,
> > +
> > +	.pre_pin = guc_context_pre_pin,
> > +	.pin = guc_parent_context_pin,
> > +	.unpin = guc_parent_context_unpin,
> > +	.post_unpin = guc_context_post_unpin,
> > +
> > +	.ban = guc_context_ban,
> > +
> > +	.cancel_request = guc_context_cancel_request,
> > +
> > +	.enter = guc_virtual_context_enter,
> > +	.exit = guc_virtual_context_exit,
> > +
> > +	.sched_disable = guc_context_sched_disable,
> > +
> > +	.destroy = guc_context_destroy,
> > +
> > +	.get_sibling = guc_virtual_get_sibling,
> > +};
> > +
> > +static const struct intel_context_ops virtual_child_context_ops = {
> > +	.alloc = guc_virtual_context_alloc,
> > +
> > +	.pre_pin = guc_context_pre_pin,
> > +	.pin = guc_child_context_pin,
> > +	.unpin = guc_child_context_unpin,
> > +	.post_unpin = guc_child_context_post_unpin,
> > +
> > +	.cancel_request = guc_context_cancel_request,
> > +
> > +	.enter = guc_virtual_context_enter,
> > +	.exit = guc_virtual_context_exit,
> > +
> > +	.destroy = guc_child_context_destroy,
> > +
> > +	.get_sibling = guc_virtual_get_sibling,
> > +};
> > +
> > +static struct intel_context *
> > +guc_create_parallel(struct intel_engine_cs **engines,
> > +		    unsigned int num_siblings,
> > +		    unsigned int width)
> > +{
> > +	struct intel_engine_cs **siblings = NULL;
> > +	struct intel_context *parent = NULL, *ce, *err;
> > +	int i, j;
> > +
> > +	siblings = kmalloc_array(num_siblings,
> > +				 sizeof(*siblings),
> > +				 GFP_KERNEL);
> > +	if (!siblings)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	for (i = 0; i < width; ++i) {
> > +		for (j = 0; j < num_siblings; ++j)
> > +			siblings[j] = engines[i * num_siblings + j];
> > +
> > +		ce = intel_engine_create_virtual(siblings, num_siblings,
> > +						 FORCE_VIRTUAL);
> > +		if (!ce) {
> > +			err = ERR_PTR(-ENOMEM);
> > +			goto unwind;
> > +		}
> > +
> > +		if (i == 0) {
> > +			parent = ce;
> > +			parent->ops = &virtual_parent_context_ops;
> > +		} else {
> > +			ce->ops = &virtual_child_context_ops;
> > +			intel_context_bind_parent_child(parent, ce);
> > +		}
> > +	}
> > +
> > +	kfree(siblings);
> > +	return parent;
> > +
> > +unwind:
> > +	if (parent)
> > +		intel_context_put(parent);
> > +	kfree(siblings);
> > +	return err;
> > +}
> > +
> >   static bool
> >   guc_irq_enable_breadcrumbs(struct intel_breadcrumbs *b)
> >   {
> > @@ -3726,7 +3815,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   }
> >   static struct intel_context *
> > -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> > +		   unsigned long flags)
> >   {
> >   	struct guc_virtual_engine *ve;
> >   	struct intel_guc *guc;
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index b1248a67b4f8..b153f8215403 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -1824,6 +1824,7 @@ struct drm_i915_gem_context_param {
> >    * Extensions:
> >    *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> >    *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> >    */
> >   #define I915_CONTEXT_PARAM_ENGINES	0xa
> > @@ -2049,6 +2050,132 @@ struct i915_context_engines_bond {
> >   	struct i915_engine_class_instance engines[N__]; \
> >   } __attribute__((packed)) name__
> > +/**
> > + * struct i915_context_engines_parallel_submit - Configure engine for
> > + * parallel submission.
> > + *
> > + * Setup a slot in the context engine map to allow multiple BBs to be submitted
> > + * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
> > + * in parallel. Multiple hardware contexts are created internally in the i915
> i915 run -> i915 to run
> 
> > + * run these BBs. Once a slot is configured for N BBs only N BBs can be
> > + * submitted in each execbuf IOCTL and this is implicit behavior e.g. The user
> > + * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
> > + * many BBs there are based on the slot's configuration. The N BBs are the last
> > + * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
> > + *
> > + * The default placement behavior is to create implicit bonds between each
> > + * context if each context maps to more than 1 physical engine (e.g. context is
> > + * a virtual engine). Also we only allow contexts of same engine class and these
> > + * contexts must be in logically contiguous order. Examples of the placement
> > + * behavior described below. Lastly, the default is to not allow BBs to
> behaviour described -> behaviour are described
> 
> > + * preempted mid BB rather insert coordinated preemption on all hardware
> to preempted mid BB rather -> to be preempted mid-batch. Rather
> 
> coordinated preemption on -> coordinated preemption points on
> 
> > + * contexts between each set of BBs. Flags may be added in the future to change
> may -> could - 'may' implies we are thinking about doing it (maybe just for
> fun or because we're bored), 'could' implies a user has to ask for the
> facility if they need it.
>

Will reword all of this.
 
> > + * both of these default behaviors.
> > + *
> > + * Returns -EINVAL if hardware context placement configuration is invalid or if
> > + * the placement configuration isn't supported on the platform / submission
> > + * interface.
> > + * Returns -ENODEV if extension isn't supported on the platform / submission
> > + * interface.
> > + *
> > + * .. code-block:: none
> > + *
> > + *	Example 1 pseudo code:
> > + *	CS[X] = generic engine of same class, logical instance X
> > + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> I would put these two terminology explanations above the 'example 1' line
> given that they are generic to all examples.
> 
> > + *	set_engines(INVALID)
> > + *	set_parallel(engine_index=0, width=2, num_siblings=1,
> > + *		     engines=CS[0],CS[1])
> > + *
> > + *	Results in the following valid placement:
> > + *	CS[0], CS[1]
> > + *
> > + *	Example 2 pseudo code:
> > + *	CS[X] = generic engine of same class, logical instance X
> > + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> And drop them from here.
>

Sure.
 
> > + *	set_engines(INVALID)
> > + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> > + *		     engines=CS[0],CS[2],CS[1],CS[3])
> > + *
> > + *	Results in the following valid placements:
> > + *	CS[0], CS[1]
> > + *	CS[2], CS[3]
> > + *
> > + *	This can also be thought of as 2 virtual engines described by 2-D array
> > + *	in the engines the field with bonds placed between each index of the
> > + *	virtual engines. e.g. CS[0] is bonded to CS[1], CS[2] is bonded to
> > + *	CS[3].
> I find this description just adds to the confusion. It doesn't help that the
> sentence is broken/unparsable - 'described by 2-D array in the engines the
> field with bonds'?
> 
> "This can be thought of as two virtual engines, each containing two engines
> thereby making a 2D array. However, there are bonds tying the entries
> together and placing restrictions on how they can be scheduled.
> Specifically, the scheduler can choose only vertical columns from the 2D
> array. That is, CS[0] is bonded to CS[1] and CS[2] to CS[3]. So if the
> scheduler wants to submit to CS[0], it must also choose CS[1] and vice
> versa. Same for CS[2] requires also using CS[3]."
> 
> Does that make sense?
>

Yours is better. Will add.
 
> > + *	VE[0] = CS[0], CS[2]
> > + *	VE[1] = CS[1], CS[3]
> > + *
> > + *	Example 3 pseudo code:
> > + *	CS[X] = generic engine of same class, logical instance X
> > + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> And again.
> 
> > + *	set_engines(INVALID)
> > + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> > + *		     engines=CS[0],CS[1],CS[1],CS[3])
> > + *
> > + *	Results in the following valid and invalid placements:
> > + *	CS[0], CS[1]
> > + *	CS[1], CS[3] - Not logical contiguous, return -EINVAL
> logical -> logically
>

Yep.
 
> > + */
> > +struct i915_context_engines_parallel_submit {
> > +	/**
> > +	 * @base: base user extension.
> > +	 */
> > +	struct i915_user_extension base;
> > +
> > +	/**
> > +	 * @engine_index: slot for parallel engine
> > +	 */
> > +	__u16 engine_index;
> > +
> > +	/**
> > +	 * @width: number of contexts per parallel engine
> Meaning number of engines in the virtual engine? As in, width = 3 means that
> the scheduler has a choice of three different engines to submit the one
> single batch buffer to?
>

No, width is the number of BBs in a single submission. Will update the
comment to reflect that.
 
> > +	 */
> > +	__u16 width;
> > +
> > +	/**
> > +	 * @num_siblings: number of siblings per context
> > +	 */
> > +	__u16 num_siblings;
> Meaning the number of engines which must run in parallel. As in,
> num_siblings = 2 means that there will be two batch buffers submitted to
> every execbuf IOCTL call and that both must execute concurrently on two
> separate engines?
>

This is the number of possible engine sets that the N (width) batch
buffers could be placed on. Will update this comment too.

Matt
 
> John.
> 
> > +
> > +	/**
> > +	 * @mbz16: reserved for future use; must be zero
> > +	 */
> > +	__u16 mbz16;
> > +
> > +	/**
> > +	 * @flags: all undefined flags must be zero, currently not defined flags
> > +	 */
> > +	__u64 flags;
> > +
> > +	/**
> > +	 * @mbz64: reserved for future use; must be zero
> > +	 */
> > +	__u64 mbz64[3];
> > +
> > +	/**
> > +	 * @engines: 2-d array of engine instances to configure parallel engine
> > +	 *
> > +	 * length = width (i) * num_siblings (j)
> > +	 * index = j + i * num_siblings
> > +	 */
> > +	struct i915_engine_class_instance engines[0];
> > +
> > +} __packed;
> > +
> > +#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
> > +	struct i915_user_extension base; \
> > +	__u16 engine_index; \
> > +	__u16 width; \
> > +	__u16 num_siblings; \
> > +	__u16 mbz16; \
> > +	__u64 flags; \
> > +	__u64 mbz64[3]; \
> > +	struct i915_engine_class_instance engines[N__]; \
> > +} __attribute__((packed)) name__
> > +
> >   /**
> >    * DOC: Context Engine Map uAPI
> >    *
> > @@ -2108,6 +2235,7 @@ struct i915_context_param_engines {
> >   	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> >   #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> >   #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> >   	struct i915_engine_class_instance engines[0];
> >   } __attribute__((packed));
> 
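
For illustration, a hedged userspace sketch of how "Example 2" from the
uAPI kerneldoc above could be wired up (the struct, macro and flag names
are from the header in this patch; the video engine class, the context id
and the omitted error handling are assumptions of the sketch, not part of
the series):

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static void setup_parallel(int fd, uint32_t ctx_id)
{
	/*
	 * width=2, num_siblings=2: two BBs per execbuf with two possible
	 * placements. engines[] is indexed j + i * num_siblings, so this
	 * is example 2: placements (CS[0], CS[1]) or (CS[2], CS[3]).
	 */
	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 4) = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.engine_index = 0,
		.width = 2,
		.num_siblings = 2,
		.engines = {
			/* i = 0 (first BB): logical instances 0 and 2 */
			{ I915_ENGINE_CLASS_VIDEO, 0 },
			{ I915_ENGINE_CLASS_VIDEO, 2 },
			/* i = 1 (second BB): logical instances 1 and 3 */
			{ I915_ENGINE_CLASS_VIDEO, 1 },
			{ I915_ENGINE_CLASS_VIDEO, 3 },
		},
	};
	/* One INVALID slot, i.e. the set_engines(INVALID) step above */
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
		.extensions = (uintptr_t)&parallel.base,
		.engines = { { I915_ENGINE_CLASS_INVALID,
			       I915_ENGINE_CLASS_INVALID_NONE } },
	};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_ENGINES,
		.size = sizeof(engines),
		.value = (uintptr_t)&engines,
	};

	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);
}

After this slot is configured, each execbuf on engine_index 0 must carry
exactly two BBs (the last two objects, or the first two with
I915_EXEC_BATCH_FIRST), per the implicit-width rule described above.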


* Re: [Intel-gfx] [PATCH 15/27] drm/i915/guc: Implement multi-lrc submission
  2021-09-22 16:25     ` Matthew Brost
@ 2021-09-22 20:15       ` John Harrison
  2021-09-23  2:44         ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-22 20:15 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On 9/22/2021 09:25, Matthew Brost wrote:
> On Mon, Sep 20, 2021 at 02:48:52PM -0700, John Harrison wrote:
>> On 8/20/2021 15:44, Matthew Brost wrote:
>>> Implement multi-lrc submission via a single workqueue entry and single
>>> H2G. The workqueue entry contains an updated tail value for each
>>> request, of all the contexts in the multi-lrc submission, and updates
>>> these values simultaneously. As such, the tasklet and bypass path have
>>> been updated to coalesce requests into a single submission.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  21 ++
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   6 +-
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 312 +++++++++++++++---
>>>    drivers/gpu/drm/i915/i915_request.h           |   8 +
>>>    6 files changed, 317 insertions(+), 62 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
>>> index fbfcae727d7f..879aef662b2e 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
>>> @@ -748,3 +748,24 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
>>>    		}
>>>    	}
>>>    }
>>> +
>>> +void intel_guc_write_barrier(struct intel_guc *guc)
>>> +{
>>> +	struct intel_gt *gt = guc_to_gt(guc);
>>> +
>>> +	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
>>> +		GEM_BUG_ON(guc->send_regs.fw_domains);
>> Granted, this patch is just moving code from one file to another not
>> changing it. However, I think it would be worth adding a blank line in here.
>> Otherwise the 'this register' comment below can be confusingly read as
>> referring to the send_regs.fw_domain entry above.
>>
>> And maybe add a comment why it is a bug for the send_regs value to be set?
>> I'm not seeing any obvious connection between it and the rest of this code.
>>
> Can add a blank line. I think the GEM_BUG_ON relates to being able to
> use intel_uncore_write_fw vs intel_uncore_write. Can add comment.
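
A hedged guess at the comment in question, inferred from this reply (the
forcewake rationale is an assumption, not something the patch states):

/*
 * Sketch only: the GEM_BUG_ON asserts that the GuC send registers
 * need no forcewake domains (send_regs.fw_domains == 0), which is
 * what makes the raw intel_uncore_write_fw() below safe to use
 * instead of intel_uncore_write().
 */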
>
>>> +		/*
>>> +		 * This register is used by the i915 and GuC for MMIO based
>>> +		 * communication. Once we are in this code CTBs are the only
>>> +		 * method the i915 uses to communicate with the GuC so it is
>>> +		 * safe to write to this register (a value of 0 is NOP for MMIO
>>> +		 * communication). If we ever start mixing CTBs and MMIOs a new
>>> +		 * register will have to be chosen.
>>> +		 */
>>> +		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
>>> +	} else {
>>> +		/* wmb() sufficient for a barrier if in smem */
>>> +		wmb();
>>> +	}
>>> +}
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> index 3f95b1b4f15c..0ead2406d03c 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> @@ -37,6 +37,12 @@ struct intel_guc {
>>>    	/* Global engine used to submit requests to GuC */
>>>    	struct i915_sched_engine *sched_engine;
>>>    	struct i915_request *stalled_request;
>>> +	enum {
>>> +		STALL_NONE,
>>> +		STALL_REGISTER_CONTEXT,
>>> +		STALL_MOVE_LRC_TAIL,
>>> +		STALL_ADD_REQUEST,
>>> +	} submission_stall_reason;
>>>    	/* intel_guc_recv interrupt related state */
>>>    	spinlock_t irq_lock;
>>> @@ -332,4 +338,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
>>>    void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
>>> +void intel_guc_write_barrier(struct intel_guc *guc);
>>> +
>>>    #endif
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>>> index 20c710a74498..10d1878d2826 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>>> @@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
>>>    	return ++ct->requests.last_fence;
>>>    }
>>> -static void write_barrier(struct intel_guc_ct *ct)
>>> -{
>>> -	struct intel_guc *guc = ct_to_guc(ct);
>>> -	struct intel_gt *gt = guc_to_gt(guc);
>>> -
>>> -	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
>>> -		GEM_BUG_ON(guc->send_regs.fw_domains);
>>> -		/*
>>> -		 * This register is used by the i915 and GuC for MMIO based
>>> -		 * communication. Once we are in this code CTBs are the only
>>> -		 * method the i915 uses to communicate with the GuC so it is
>>> -		 * safe to write to this register (a value of 0 is NOP for MMIO
>>> -		 * communication). If we ever start mixing CTBs and MMIOs a new
>>> -		 * register will have to be chosen.
>>> -		 */
>>> -		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
>>> -	} else {
>>> -		/* wmb() sufficient for a barrier if in smem */
>>> -		wmb();
>>> -	}
>>> -}
>>> -
>>>    static int ct_write(struct intel_guc_ct *ct,
>>>    		    const u32 *action,
>>>    		    u32 len /* in dwords */,
>>> @@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
>>>    	 * make sure H2G buffer update and LRC tail update (if this triggering a
>>>    	 * submission) are visible before updating the descriptor tail
>>>    	 */
>>> -	write_barrier(ct);
>>> +	intel_guc_write_barrier(ct_to_guc(ct));
>>>    	/* update local copies */
>>>    	ctb->tail = tail;
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> index 0e600a3b8f1e..6cd26dc060d1 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> @@ -65,12 +65,14 @@
>>>    #define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
>>>    #define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
>>>    #define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
>>> -#define WQ_TARGET_SHIFT			10
>>> +#define   WQ_TYPE_MULTI_LRC		(0x5 << WQ_TYPE_SHIFT)
>>> +#define WQ_TARGET_SHIFT			8
>>>    #define WQ_LEN_SHIFT			16
>>>    #define WQ_NO_WCFLUSH_WAIT		(1 << 27)
>>>    #define WQ_PRESENT_WORKLOAD		(1 << 28)
>>> -#define WQ_RING_TAIL_SHIFT		20
>>> +#define WQ_GUC_ID_SHIFT			0
>>> +#define WQ_RING_TAIL_SHIFT		18
>> Presumably all of these API changes are not actually new? They really came
>> in with the rest of the v40 re-write? It's just that this is the first time
>> we are using them and therefore need to finally update the defines?
>>
> Yes.
>
>>>    #define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
>>>    #define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index e9dfd43d29a0..b107ad095248 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -391,6 +391,29 @@ __get_process_desc(struct intel_context *ce)
>>>    		   LRC_STATE_OFFSET) / sizeof(u32)));
>>>    }
>>> +static u32 *get_wq_pointer(struct guc_process_desc *desc,
>>> +			   struct intel_context *ce,
>>> +			   u32 wqi_size)
>>> +{
>>> +	/*
>>> +	 * Check for space in work queue. Caching a value of head pointer in
>>> +	 * intel_context structure in order reduce the number accesses to shared
>>> +	 * GPU memory which may be across a PCIe bus.
>>> +	 */
>>> +#define AVAILABLE_SPACE	\
>>> +	CIRC_SPACE(ce->guc_wqi_tail, ce->guc_wqi_head, GUC_WQ_SIZE)
>>> +	if (wqi_size > AVAILABLE_SPACE) {
>>> +		ce->guc_wqi_head = READ_ONCE(desc->head);
>>> +
>>> +		if (wqi_size > AVAILABLE_SPACE)
>>> +			return NULL;
>>> +	}
>>> +#undef AVAILABLE_SPACE
>>> +
>>> +	return ((u32 *)__get_process_desc(ce)) +
>>> +		((WQ_OFFSET + ce->guc_wqi_tail) / sizeof(u32));
>>> +}
>>> +
>>>    static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
>>>    {
>>>    	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
>>> @@ -547,10 +570,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
>>>    static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
>>> -static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>>> +static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>>>    {
>>>    	int err = 0;
>>> -	struct intel_context *ce = rq->context;
>>> +	struct intel_context *ce = request_to_scheduling_context(rq);
>>>    	u32 action[3];
>>>    	int len = 0;
>>>    	u32 g2h_len_dw = 0;
>>> @@ -571,26 +594,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>>>    	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
>>>    	GEM_BUG_ON(context_guc_id_invalid(ce));
>>> -	/*
>>> -	 * Corner case where the GuC firmware was blown away and reloaded while
>>> -	 * this context was pinned.
>>> -	 */
>>> -	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
>>> -		err = guc_lrc_desc_pin(ce, false);
>>> -		if (unlikely(err))
>>> -			return err;
>>> -	}
>>> -
>>>    	spin_lock(&ce->guc_state.lock);
>>>    	/*
>>>    	 * The request / context will be run on the hardware when scheduling
>>> -	 * gets enabled in the unblock.
>>> +	 * gets enabled in the unblock. For multi-lrc we still submit the
>>> +	 * context to move the LRC tails.
>>>    	 */
>>> -	if (unlikely(context_blocked(ce)))
>>> +	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
>>>    		goto out;
>>> -	enabled = context_enabled(ce);
>>> +	enabled = context_enabled(ce) || context_blocked(ce);
>> Would be better to say '|| is_parent(ce)' rather than blocked? The reason
>> for claiming enabled when not is because it's a multi-LRC parent,
>> right? Or can there be a parent that is neither enabled nor blocked for
>> which we don't want to do the processing? But why would that make sense/be
>> possible?
>>
> No. If it is a parent and blocked we still want to submit the context
> (to move the LRC tails) but not enable scheduling. In the non-multi-lrc
> case the submit has already been done by the i915 (moving the lrc tail).
>
>>>    	if (!enabled) {
>>>    		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
>>> @@ -609,6 +623,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>>>    		trace_intel_context_sched_enable(ce);
>>>    		atomic_inc(&guc->outstanding_submission_g2h);
>>>    		set_context_enabled(ce);
>>> +
>>> +		/*
>>> +		 * Without multi-lrc KMD does the submission step (moving the
>>> +		 * lrc tail) so enabling scheduling is sufficient to submit the
>>> +		 * context. This isn't the case in multi-lrc submission as the
>>> +		 * GuC needs to move the tails, hence the need for another H2G
>>> +		 * to submit a multi-lrc context after enabling scheduling.
>>> +		 */
>>> +		if (intel_context_is_parent(ce)) {
>>> +			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
>>> +			err = intel_guc_send_nb(guc, action, len - 1, 0);
>>> +		}
>>>    	} else if (!enabled) {
>>>    		clr_context_pending_enable(ce);
>>>    		intel_context_put(ce);
>>> @@ -621,6 +647,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>>>    	return err;
>>>    }
>>> +static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>>> +{
>>> +	int ret = __guc_add_request(guc, rq);
>>> +
>>> +	if (unlikely(ret == -EBUSY)) {
>>> +		guc->stalled_request= rq;
>>> +		guc->submission_stall_reason = STALL_ADD_REQUEST;
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>    static void guc_set_lrc_tail(struct i915_request *rq)
>>>    {
>>>    	rq->context->lrc_reg_state[CTX_RING_TAIL] =
>>> @@ -632,6 +670,127 @@ static int rq_prio(const struct i915_request *rq)
>>>    	return rq->sched.attr.priority;
>>>    }
>>> +static bool is_multi_lrc_rq(struct i915_request *rq)
>>> +{
>>> +	return intel_context_is_child(rq->context) ||
>>> +		intel_context_is_parent(rq->context);
>>> +}
>>> +
>>> +static bool can_merge_rq(struct i915_request *rq,
>>> +			 struct i915_request *last)
>>> +{
>>> +	return request_to_scheduling_context(rq) ==
>>> +		request_to_scheduling_context(last);
>>> +}
>>> +
>>> +static u32 wq_space_until_wrap(struct intel_context *ce)
>>> +{
>>> +	return (GUC_WQ_SIZE - ce->guc_wqi_tail);
>>> +}
>>> +
>>> +static void write_wqi(struct guc_process_desc *desc,
>>> +		      struct intel_context *ce,
>>> +		      u32 wqi_size)
>>> +{
>>> +	/*
>>> +	 * Ensure WQE are visible before updating tail
>> WQE or WQI?
>>
> WQI (work queue instance) is the convention used but I actually like WQE
> (work queue entry) better. Will change the name to WQE everywhere.
I thought it was Work Queue Item. Which is basically the same as entry. 
Either works but simpler just to change this one instance to WQI. And 
maybe make sure the abbreviation is actually written out in full 
somewhere to be clear? I think patch #12 is adding these fields to the 
ce, maybe update the comments to explain what wqi means.

John.

>   
>>> +	 */
>>> +	intel_guc_write_barrier(ce_to_guc(ce));
>>> +
>>> +	ce->guc_wqi_tail = (ce->guc_wqi_tail + wqi_size) & (GUC_WQ_SIZE - 1);
>>> +	WRITE_ONCE(desc->tail, ce->guc_wqi_tail);
>>> +}
>>> +
>>> +static int guc_wq_noop_append(struct intel_context *ce)
>>> +{
>>> +	struct guc_process_desc *desc = __get_process_desc(ce);
>>> +	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
>>> +
>>> +	if (!wqi)
>>> +		return -EBUSY;
>>> +
>>> +	*wqi = WQ_TYPE_NOOP |
>>> +		((wq_space_until_wrap(ce) / sizeof(u32) - 1) << WQ_LEN_SHIFT);
>> This should have a BUG_ON check that the requested size fits within the
>> WQ_LEN field?
>>
> I could add that.
>   
>> Indeed, would be better to use the FIELD macros as they do that kind of
>> thing for you.
>>
> Yes, they do. I forget how they work, will figure this out and use the
> macros.
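
A hedged sketch of what that FIELD_PREP() conversion could look like for
the dwords built above (the GENMASK() ranges are inferred from the
WQ_*_SHIFT / WQ_RING_TAIL_MAX defines in this patch and are assumptions,
not the final GuC ABI):

#include <linux/bitfield.h>

/* Assumed field ranges replacing the shift-based defines */
#define WQ_TYPE_MASK		GENMASK(7, 0)
#define WQ_LEN_MASK		GENMASK(26, 16)
#define WQ_GUC_ID_MASK		GENMASK(15, 0)
#define WQ_RING_TAIL_MASK	GENMASK(28, 18)

/* Inside __guc_wq_item_append(); FIELD_FIT() gives the suggested check */
GEM_BUG_ON(!FIELD_FIT(WQ_LEN_MASK, wqi_size / sizeof(u32) - 1));

*wqi++ = FIELD_PREP(WQ_TYPE_MASK, 0x5) |	/* WQ_TYPE_MULTI_LRC */
	 FIELD_PREP(WQ_LEN_MASK, wqi_size / sizeof(u32) - 1);
*wqi++ = ce->lrc.lrca;
*wqi++ = FIELD_PREP(WQ_GUC_ID_MASK, ce->guc_id.id) |
	 FIELD_PREP(WQ_RING_TAIL_MASK, ce->ring->tail / sizeof(u64));

FIELD_PREP() also BUILD_BUG()s when a compile-time constant value
overflows its mask, which covers the constant cases for free.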
>   
>>> +	ce->guc_wqi_tail = 0;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int __guc_wq_item_append(struct i915_request *rq)
>>> +{
>>> +	struct intel_context *ce = request_to_scheduling_context(rq);
>>> +	struct intel_context *child;
>>> +	struct guc_process_desc *desc = __get_process_desc(ce);
>>> +	unsigned int wqi_size = (ce->guc_number_children + 4) *
>>> +		sizeof(u32);
>>> +	u32 *wqi;
>>> +	int ret;
>>> +
>>> +	/* Ensure context is in correct state updating work queue */
>>> +	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
>>> +	GEM_BUG_ON(context_guc_id_invalid(ce));
>>> +	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
>>> +	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
>>> +
>>> +	/* Insert NOOP if this work queue item will wrap the tail pointer. */
>>> +	if (wqi_size > wq_space_until_wrap(ce)) {
>>> +		ret = guc_wq_noop_append(ce);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +
>>> +	wqi = get_wq_pointer(desc, ce, wqi_size);
>>> +	if (!wqi)
>>> +		return -EBUSY;
>>> +
>>> +	*wqi++ = WQ_TYPE_MULTI_LRC |
>>> +		((wqi_size / sizeof(u32) - 1) << WQ_LEN_SHIFT);
>>> +	*wqi++ = ce->lrc.lrca;
>>> +	*wqi++ = (ce->guc_id.id << WQ_GUC_ID_SHIFT) |
>>> +		 ((ce->ring->tail / sizeof(u64)) << WQ_RING_TAIL_SHIFT);
>> As above, would be better to use FIELD macros instead of manual shifting.
>>
> Will do.
>
> Matt
>
>> John.
>>
>>
>>> +	*wqi++ = 0;	/* fence_id */
>>> +	for_each_child(ce, child)
>>> +		*wqi++ = child->ring->tail / sizeof(u64);
>>> +
>>> +	write_wqi(desc, ce, wqi_size);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int guc_wq_item_append(struct intel_guc *guc,
>>> +			      struct i915_request *rq)
>>> +{
>>> +	struct intel_context *ce = request_to_scheduling_context(rq);
>>> +	int ret = 0;
>>> +
>>> +	if (likely(!intel_context_is_banned(ce))) {
>>> +		ret = __guc_wq_item_append(rq);
>>> +
>>> +		if (unlikely(ret == -EBUSY)) {
>>> +			guc->stalled_request = rq;
>>> +			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
>>> +		}
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +static bool multi_lrc_submit(struct i915_request *rq)
>>> +{
>>> +	struct intel_context *ce = request_to_scheduling_context(rq);
>>> +
>>> +	intel_ring_set_tail(rq->ring, rq->tail);
>>> +
>>> +	/*
>>> +	 * We expect the front end (execbuf IOCTL) to set this flag on the last
>>> +	 * request generated from a multi-BB submission. This indicates to the
>>> +	 * backend (GuC interface) that we should submit this context thus
>>> +	 * submitting all the requests generated in parallel.
>>> +	 */
>>> +	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
>>> +		intel_context_is_banned(ce);
>>> +}
>>> +
>>>    static int guc_dequeue_one_context(struct intel_guc *guc)
>>>    {
>>>    	struct i915_sched_engine * const sched_engine = guc->sched_engine;
>>> @@ -645,7 +804,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>>>    	if (guc->stalled_request) {
>>>    		submit = true;
>>>    		last = guc->stalled_request;
>>> -		goto resubmit;
>>> +
>>> +		switch (guc->submission_stall_reason) {
>>> +		case STALL_REGISTER_CONTEXT:
>>> +			goto register_context;
>>> +		case STALL_MOVE_LRC_TAIL:
>>> +			goto move_lrc_tail;
>>> +		case STALL_ADD_REQUEST:
>>> +			goto add_request;
>>> +		default:
>>> +			MISSING_CASE(guc->submission_stall_reason);
>>> +		}
>>>    	}
>>>    	while ((rb = rb_first_cached(&sched_engine->queue))) {
>>> @@ -653,8 +822,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>>>    		struct i915_request *rq, *rn;
>>>    		priolist_for_each_request_consume(rq, rn, p) {
>>> -			if (last && rq->context != last->context)
>>> -				goto done;
>>> +			if (last && !can_merge_rq(rq, last))
>>> +				goto register_context;
>>>    			list_del_init(&rq->sched.link);
>>> @@ -662,33 +831,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
>>>    			trace_i915_request_in(rq, 0);
>>>    			last = rq;
>>> -			submit = true;
>>> +
>>> +			if (is_multi_lrc_rq(rq)) {
>>> +				/*
>>> +				 * We need to coalesce all multi-lrc requests in
>>> +				 * a relationship into a single H2G. We are
>>> +				 * guaranteed that all of these requests will be
>>> +				 * submitted sequentially.
>>> +				 */
>>> +				if (multi_lrc_submit(rq)) {
>>> +					submit = true;
>>> +					goto register_context;
>>> +				}
>>> +			} else {
>>> +				submit = true;
>>> +			}
>>>    		}
>>>    		rb_erase_cached(&p->node, &sched_engine->queue);
>>>    		i915_priolist_free(p);
>>>    	}
>>> -done:
>>> +
>>> +register_context:
>>>    	if (submit) {
>>> -		guc_set_lrc_tail(last);
>>> -resubmit:
>>> +		struct intel_context *ce = request_to_scheduling_context(last);
>>> +
>>> +		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
>>> +			     !intel_context_is_banned(ce))) {
>>> +			ret = guc_lrc_desc_pin(ce, false);
>>> +			if (unlikely(ret == -EPIPE)) {
>>> +				goto deadlk;
>>> +			} else if (ret == -EBUSY) {
>>> +				guc->stalled_request = last;
>>> +				guc->submission_stall_reason =
>>> +					STALL_REGISTER_CONTEXT;
>>> +				goto schedule_tasklet;
>>> +			} else if (ret != 0) {
>>> +				GEM_WARN_ON(ret);	/* Unexpected */
>>> +				goto deadlk;
>>> +			}
>>> +		}
>>> +
>>> +move_lrc_tail:
>>> +		if (is_multi_lrc_rq(last)) {
>>> +			ret = guc_wq_item_append(guc, last);
>>> +			if (ret == -EBUSY) {
>>> +				goto schedule_tasklet;
>>> +			} else if (ret != 0) {
>>> +				GEM_WARN_ON(ret);	/* Unexpected */
>>> +				goto deadlk;
>>> +			}
>>> +		} else {
>>> +			guc_set_lrc_tail(last);
>>> +		}
>>> +
>>> +add_request:
>>>    		ret = guc_add_request(guc, last);
>>> -		if (unlikely(ret == -EPIPE))
>>> +		if (unlikely(ret == -EPIPE)) {
>>> +			goto deadlk;
>>> +		} else if (ret == -EBUSY) {
>>> +			goto schedule_tasklet;
>>> +		} else if (ret != 0) {
>>> +			GEM_WARN_ON(ret);	/* Unexpected */
>>>    			goto deadlk;
>>> -		else if (ret == -EBUSY) {
>>> -			tasklet_schedule(&sched_engine->tasklet);
>>> -			guc->stalled_request = last;
>>> -			return false;
>>>    		}
>>>    	}
>>>    	guc->stalled_request = NULL;
>>> +	guc->submission_stall_reason = STALL_NONE;
>>>    	return submit;
>>>    deadlk:
>>>    	sched_engine->tasklet.callback = NULL;
>>>    	tasklet_disable_nosync(&sched_engine->tasklet);
>>>    	return false;
>>> +
>>> +schedule_tasklet:
>>> +	tasklet_schedule(&sched_engine->tasklet);
>>> +	return false;
>>>    }
>>>    static void guc_submission_tasklet(struct tasklet_struct *t)
>>> @@ -1227,10 +1447,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
>>>    	trace_i915_request_in(rq, 0);
>>> -	guc_set_lrc_tail(rq);
>>> -	ret = guc_add_request(guc, rq);
>>> -	if (ret == -EBUSY)
>>> -		guc->stalled_request = rq;
>>> +	if (is_multi_lrc_rq(rq)) {
>>> +		if (multi_lrc_submit(rq)) {
>>> +			ret = guc_wq_item_append(guc, rq);
>>> +			if (!ret)
>>> +				ret = guc_add_request(guc, rq);
>>> +		}
>>> +	} else {
>>> +		guc_set_lrc_tail(rq);
>>> +		ret = guc_add_request(guc, rq);
>>> +	}
>>>    	if (unlikely(ret == -EPIPE))
>>>    		disable_submission(guc);
>>> @@ -1238,6 +1464,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
>>>    	return ret;
>>>    }
>>> +bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
>>> +{
>>> +	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
>>> +	struct intel_context *ce = request_to_scheduling_context(rq);
>>> +
>>> +	return submission_disabled(guc) || guc->stalled_request ||
>>> +		!i915_sched_engine_is_empty(sched_engine) ||
>>> +		!lrc_desc_registered(guc, ce->guc_id.id);
>>> +}
>>> +
>>>    static void guc_submit_request(struct i915_request *rq)
>>>    {
>>>    	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
>>> @@ -1247,8 +1483,7 @@ static void guc_submit_request(struct i915_request *rq)
>>>    	/* Will be called from irq-context when using foreign fences. */
>>>    	spin_lock_irqsave(&sched_engine->lock, flags);
>>> -	if (submission_disabled(guc) || guc->stalled_request ||
>>> -	    !i915_sched_engine_is_empty(sched_engine))
>>> +	if (need_tasklet(guc, rq))
>>>    		queue_request(sched_engine, rq, rq_prio(rq));
>>>    	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
>>>    		tasklet_hi_schedule(&sched_engine->tasklet);
>>> @@ -2241,9 +2476,10 @@ static bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
>>>    static void add_to_context(struct i915_request *rq)
>>>    {
>>> -	struct intel_context *ce = rq->context;
>>> +	struct intel_context *ce = request_to_scheduling_context(rq);
>>>    	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>>    	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
>>>    	spin_lock(&ce->guc_state.lock);
>>> @@ -2276,7 +2512,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
>>>    static void remove_from_context(struct i915_request *rq)
>>>    {
>>> -	struct intel_context *ce = rq->context;
>>> +	struct intel_context *ce = request_to_scheduling_context(rq);
>>> +
>>> +	GEM_BUG_ON(intel_context_is_child(ce));
>>>    	spin_lock_irq(&ce->guc_state.lock);
>>> @@ -2692,7 +2930,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
>>>    static void guc_bump_inflight_request_prio(struct i915_request *rq,
>>>    					   int prio)
>>>    {
>>> -	struct intel_context *ce = rq->context;
>>> +	struct intel_context *ce = request_to_scheduling_context(rq);
>>>    	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
>>>    	/* Short circuit function */
>>> @@ -2715,7 +2953,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
>>>    static void guc_retire_inflight_request_prio(struct i915_request *rq)
>>>    {
>>> -	struct intel_context *ce = rq->context;
>>> +	struct intel_context *ce = request_to_scheduling_context(rq);
>>>    	spin_lock(&ce->guc_state.lock);
>>>    	guc_prio_fini(rq, ce);
>>> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
>>> index 177eaf55adff..8f0073e19079 100644
>>> --- a/drivers/gpu/drm/i915/i915_request.h
>>> +++ b/drivers/gpu/drm/i915/i915_request.h
>>> @@ -139,6 +139,14 @@ enum {
>>>    	 * the GPU. Here we track such boost requests on a per-request basis.
>>>    	 */
>>>    	I915_FENCE_FLAG_BOOST,
>>> +
>>> +	/*
>>> +	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
>>> +	 * parent-child relationship (parallel submission, multi-lrc) should
>>> +	 * trigger a submission to the GuC rather than just moving the context
>>> +	 * tail.
>>> +	 */
>>> +	I915_FENCE_FLAG_SUBMIT_PARALLEL,
>>>    };
>>>    /**



* Re: [Intel-gfx] [PATCH 15/27] drm/i915/guc: Implement multi-lrc submission
  2021-09-22 20:15       ` John Harrison
@ 2021-09-23  2:44         ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-23  2:44 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Wed, Sep 22, 2021 at 01:15:46PM -0700, John Harrison wrote:
> On 9/22/2021 09:25, Matthew Brost wrote:
> > On Mon, Sep 20, 2021 at 02:48:52PM -0700, John Harrison wrote:
> > > On 8/20/2021 15:44, Matthew Brost wrote:
> > > > Implement multi-lrc submission via a single workqueue entry and single
> > > > H2G. The workqueue entry contains an updated tail value for each
> > > > request, of all the contexts in the multi-lrc submission, and updates
> > > > these values simultaneously. As such, the tasklet and bypass path have
> > > > been updated to coalesce requests into a single submission.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  21 ++
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   8 +
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  24 +-
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   6 +-
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 312 +++++++++++++++---
> > > >    drivers/gpu/drm/i915/i915_request.h           |   8 +
> > > >    6 files changed, 317 insertions(+), 62 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > > > index fbfcae727d7f..879aef662b2e 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > > > @@ -748,3 +748,24 @@ void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p)
> > > >    		}
> > > >    	}
> > > >    }
> > > > +
> > > > +void intel_guc_write_barrier(struct intel_guc *guc)
> > > > +{
> > > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > > +
> > > > +	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> > > > +		GEM_BUG_ON(guc->send_regs.fw_domains);
> > > Granted, this patch is just moving code from one file to another not
> > > changing it. However, I think it would be worth adding a blank line in here.
> > > Otherwise the 'this register' comment below can be confusingly read as
> > > referring to the send_regs.fw_domain entry above.
> > > 
> > > And maybe add a comment why it is a bug for the send_regs value to be set?
> > > I'm not seeing any obvious connection between it and the rest of this code.
> > > 
> > Can add a blank line. I think the GEM_BUG_ON relates to being able to
> > use intel_uncore_write_fw vs intel_uncore_write. Can add comment.
> > 
> > > > +		/*
> > > > +		 * This register is used by the i915 and GuC for MMIO based
> > > > +		 * communication. Once we are in this code CTBs are the only
> > > > +		 * method the i915 uses to communicate with the GuC so it is
> > > > +		 * safe to write to this register (a value of 0 is NOP for MMIO
> > > > +		 * communication). If we ever start mixing CTBs and MMIOs a new
> > > > +		 * register will have to be chosen.
> > > > +		 */
> > > > +		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> > > > +	} else {
> > > > +		/* wmb() sufficient for a barrier if in smem */
> > > > +		wmb();
> > > > +	}
> > > > +}
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > index 3f95b1b4f15c..0ead2406d03c 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > @@ -37,6 +37,12 @@ struct intel_guc {
> > > >    	/* Global engine used to submit requests to GuC */
> > > >    	struct i915_sched_engine *sched_engine;
> > > >    	struct i915_request *stalled_request;
> > > > +	enum {
> > > > +		STALL_NONE,
> > > > +		STALL_REGISTER_CONTEXT,
> > > > +		STALL_MOVE_LRC_TAIL,
> > > > +		STALL_ADD_REQUEST,
> > > > +	} submission_stall_reason;
> > > >    	/* intel_guc_recv interrupt related state */
> > > >    	spinlock_t irq_lock;
> > > > @@ -332,4 +338,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc);
> > > >    void intel_guc_load_status(struct intel_guc *guc, struct drm_printer *p);
> > > > +void intel_guc_write_barrier(struct intel_guc *guc);
> > > > +
> > > >    #endif
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > > index 20c710a74498..10d1878d2826 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > > @@ -377,28 +377,6 @@ static u32 ct_get_next_fence(struct intel_guc_ct *ct)
> > > >    	return ++ct->requests.last_fence;
> > > >    }
> > > > -static void write_barrier(struct intel_guc_ct *ct)
> > > > -{
> > > > -	struct intel_guc *guc = ct_to_guc(ct);
> > > > -	struct intel_gt *gt = guc_to_gt(guc);
> > > > -
> > > > -	if (i915_gem_object_is_lmem(guc->ct.vma->obj)) {
> > > > -		GEM_BUG_ON(guc->send_regs.fw_domains);
> > > > -		/*
> > > > -		 * This register is used by the i915 and GuC for MMIO based
> > > > -		 * communication. Once we are in this code CTBs are the only
> > > > -		 * method the i915 uses to communicate with the GuC so it is
> > > > -		 * safe to write to this register (a value of 0 is NOP for MMIO
> > > > -		 * communication). If we ever start mixing CTBs and MMIOs a new
> > > > -		 * register will have to be chosen.
> > > > -		 */
> > > > -		intel_uncore_write_fw(gt->uncore, GEN11_SOFT_SCRATCH(0), 0);
> > > > -	} else {
> > > > -		/* wmb() sufficient for a barrier if in smem */
> > > > -		wmb();
> > > > -	}
> > > > -}
> > > > -
> > > >    static int ct_write(struct intel_guc_ct *ct,
> > > >    		    const u32 *action,
> > > >    		    u32 len /* in dwords */,
> > > > @@ -468,7 +446,7 @@ static int ct_write(struct intel_guc_ct *ct,
> > > >    	 * make sure H2G buffer update and LRC tail update (if this triggering a
> > > >    	 * submission) are visible before updating the descriptor tail
> > > >    	 */
> > > > -	write_barrier(ct);
> > > > +	intel_guc_write_barrier(ct_to_guc(ct));
> > > >    	/* update local copies */
> > > >    	ctb->tail = tail;
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > index 0e600a3b8f1e..6cd26dc060d1 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > @@ -65,12 +65,14 @@
> > > >    #define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
> > > >    #define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
> > > >    #define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
> > > > -#define WQ_TARGET_SHIFT			10
> > > > +#define   WQ_TYPE_MULTI_LRC		(0x5 << WQ_TYPE_SHIFT)
> > > > +#define WQ_TARGET_SHIFT			8
> > > >    #define WQ_LEN_SHIFT			16
> > > >    #define WQ_NO_WCFLUSH_WAIT		(1 << 27)
> > > >    #define WQ_PRESENT_WORKLOAD		(1 << 28)
> > > > -#define WQ_RING_TAIL_SHIFT		20
> > > > +#define WQ_GUC_ID_SHIFT			0
> > > > +#define WQ_RING_TAIL_SHIFT		18
> > > Presumably all of these API changes are not actually new? They really came
> > > in with the rest of the v40 re-write? It's just that this is the first time
> > > we are using them and therefore need to finally update the defines?
> > > 
> > Yes.
> > 
> > > >    #define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
> > > >    #define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index e9dfd43d29a0..b107ad095248 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -391,6 +391,29 @@ __get_process_desc(struct intel_context *ce)
> > > >    		   LRC_STATE_OFFSET) / sizeof(u32)));
> > > >    }
> > > > +static u32 *get_wq_pointer(struct guc_process_desc *desc,
> > > > +			   struct intel_context *ce,
> > > > +			   u32 wqi_size)
> > > > +{
> > > > +	/*
> > > > +	 * Check for space in work queue. Caching a value of head pointer in
> > > > +	 * intel_context structure in order reduce the number accesses to shared
> > > > +	 * GPU memory which may be across a PCIe bus.
> > > > +	 */
> > > > +#define AVAILABLE_SPACE	\
> > > > +	CIRC_SPACE(ce->guc_wqi_tail, ce->guc_wqi_head, GUC_WQ_SIZE)
> > > > +	if (wqi_size > AVAILABLE_SPACE) {
> > > > +		ce->guc_wqi_head = READ_ONCE(desc->head);
> > > > +
> > > > +		if (wqi_size > AVAILABLE_SPACE)
> > > > +			return NULL;
> > > > +	}
> > > > +#undef AVAILABLE_SPACE
> > > > +
> > > > +	return ((u32 *)__get_process_desc(ce)) +
> > > > +		((WQ_OFFSET + ce->guc_wqi_tail) / sizeof(u32));
> > > > +}
> > > > +
> > > >    static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
> > > >    {
> > > >    	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
> > > > @@ -547,10 +570,10 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
> > > >    static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
> > > > -static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > > > +static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > > >    {
> > > >    	int err = 0;
> > > > -	struct intel_context *ce = rq->context;
> > > > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > > >    	u32 action[3];
> > > >    	int len = 0;
> > > >    	u32 g2h_len_dw = 0;
> > > > @@ -571,26 +594,17 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > > >    	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
> > > >    	GEM_BUG_ON(context_guc_id_invalid(ce));
> > > > -	/*
> > > > -	 * Corner case where the GuC firmware was blown away and reloaded while
> > > > -	 * this context was pinned.
> > > > -	 */
> > > > -	if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id))) {
> > > > -		err = guc_lrc_desc_pin(ce, false);
> > > > -		if (unlikely(err))
> > > > -			return err;
> > > > -	}
> > > > -
> > > >    	spin_lock(&ce->guc_state.lock);
> > > >    	/*
> > > >    	 * The request / context will be run on the hardware when scheduling
> > > > -	 * gets enabled in the unblock.
> > > > +	 * gets enabled in the unblock. For multi-lrc we still submit the
> > > > +	 * context to move the LRC tails.
> > > >    	 */
> > > > -	if (unlikely(context_blocked(ce)))
> > > > +	if (unlikely(context_blocked(ce) && !intel_context_is_parent(ce)))
> > > >    		goto out;
> > > > -	enabled = context_enabled(ce);
> > > > +	enabled = context_enabled(ce) || context_blocked(ce);
> > > Would be better to say '|| is_parent(ce)' rather than blocked? The reason
> > > for claiming enabled when not is because it's a multi-LRC parent,
> > > right? Or can there be a parent that is neither enabled nor blocked for
> > > which we don't want to do the processing? But why would that make sense/be
> > > possible?
> > > 
> > No. If it is a parent and blocked we still want to submit the context
> > (to move the LRC tails) but not enable scheduling. In the non-multi-lrc
> > case the submit has already been done by the i915 (moving the lrc tail).
> > 
> > > >    	if (!enabled) {
> > > >    		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
> > > > @@ -609,6 +623,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > > >    		trace_intel_context_sched_enable(ce);
> > > >    		atomic_inc(&guc->outstanding_submission_g2h);
> > > >    		set_context_enabled(ce);
> > > > +
> > > > +		/*
> > > > +		 * Without multi-lrc KMD does the submission step (moving the
> > > > +		 * lrc tail) so enabling scheduling is sufficient to submit the
> > > > +		 * context. This isn't the case in multi-lrc submission as the
> > > > +		 * GuC needs to move the tails, hence the need for another H2G
> > > > +		 * to submit a multi-lrc context after enabling scheduling.
> > > > +		 */
> > > > +		if (intel_context_is_parent(ce)) {
> > > > +			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
> > > > +			err = intel_guc_send_nb(guc, action, len - 1, 0);
> > > > +		}
> > > >    	} else if (!enabled) {
> > > >    		clr_context_pending_enable(ce);
> > > >    		intel_context_put(ce);
> > > > @@ -621,6 +647,18 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > > >    	return err;
> > > >    }
> > > > +static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> > > > +{
> > > > +	int ret = __guc_add_request(guc, rq);
> > > > +
> > > > +	if (unlikely(ret == -EBUSY)) {
> > > > +		guc->stalled_request= rq;
> > > > +		guc->submission_stall_reason = STALL_ADD_REQUEST;
> > > > +	}
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > >    static void guc_set_lrc_tail(struct i915_request *rq)
> > > >    {
> > > >    	rq->context->lrc_reg_state[CTX_RING_TAIL] =
> > > > @@ -632,6 +670,127 @@ static int rq_prio(const struct i915_request *rq)
> > > >    	return rq->sched.attr.priority;
> > > >    }
> > > > +static bool is_multi_lrc_rq(struct i915_request *rq)
> > > > +{
> > > > +	return intel_context_is_child(rq->context) ||
> > > > +		intel_context_is_parent(rq->context);
> > > > +}
> > > > +
> > > > +static bool can_merge_rq(struct i915_request *rq,
> > > > +			 struct i915_request *last)
> > > > +{
> > > > +	return request_to_scheduling_context(rq) ==
> > > > +		request_to_scheduling_context(last);
> > > > +}
> > > > +
> > > > +static u32 wq_space_until_wrap(struct intel_context *ce)
> > > > +{
> > > > +	return (GUC_WQ_SIZE - ce->guc_wqi_tail);
> > > > +}
> > > > +
> > > > +static void write_wqi(struct guc_process_desc *desc,
> > > > +		      struct intel_context *ce,
> > > > +		      u32 wqi_size)
> > > > +{
> > > > +	/*
> > > > +	 * Ensure WQE are visible before updating tail
> > > WQE or WQI?
> > > 
> > WQI (work queue instance) is the convention used but I actually like WQE
> > (work queue entry) better. Will change the name to WQE everywhere.
> I thought it was Work Queue Item. Which is basically the same as entry.
> Either works but simpler just to change this one instance to WQI. And maybe
> make sure the abbreviation is actually written out in full somewhere to be
> clear? I think patch #12 is adding these fields to the ce, maybe update the
> comments to explain what wqi means.
> 

I think in the GuC spec it is 'work queue item', guess I'll stick with that.
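
For reference, the multi-lrc work queue item built by
__guc_wq_item_append() above has this dword layout (a recap of the patch,
not new ABI documentation):

dw0:     WQ_TYPE_MULTI_LRC | (wqi_size / sizeof(u32) - 1) << WQ_LEN_SHIFT
dw1:     ce->lrc.lrca (parent context LRC address)
dw2:     ce->guc_id.id << WQ_GUC_ID_SHIFT |
         (parent ring tail in qwords) << WQ_RING_TAIL_SHIFT
dw3:     fence_id (always 0 in this patch)
dw4..n:  one ring tail in qwords per child, in for_each_child() order

which is why wqi_size is (guc_number_children + 4) dwords.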

Matt

> John.
> 
> > > > +	 */
> > > > +	intel_guc_write_barrier(ce_to_guc(ce));
> > > > +
> > > > +	ce->guc_wqi_tail = (ce->guc_wqi_tail + wqi_size) & (GUC_WQ_SIZE - 1);
> > > > +	WRITE_ONCE(desc->tail, ce->guc_wqi_tail);
> > > > +}
> > > > +
> > > > +static int guc_wq_noop_append(struct intel_context *ce)
> > > > +{
> > > > +	struct guc_process_desc *desc = __get_process_desc(ce);
> > > > +	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
> > > > +
> > > > +	if (!wqi)
> > > > +		return -EBUSY;
> > > > +
> > > > +	*wqi = WQ_TYPE_NOOP |
> > > > +		((wq_space_until_wrap(ce) / sizeof(u32) - 1) << WQ_LEN_SHIFT);
> > > This should have a BUG_ON check that the requested size fits within the
> > > WQ_LEN field?
> > > 
> > I could add that.
> > > Indeed, would be better to use the FIELD macros as they do that kind of
> > > thing for you.
> > > 
> > Yes, they do. I forget how they work, will figure this out and use the
> > macros.
> > > > +	ce->guc_wqi_tail = 0;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int __guc_wq_item_append(struct i915_request *rq)
> > > > +{
> > > > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > > > +	struct intel_context *child;
> > > > +	struct guc_process_desc *desc = __get_process_desc(ce);
> > > > +	unsigned int wqi_size = (ce->guc_number_children + 4) *
> > > > +		sizeof(u32);
> > > > +	u32 *wqi;
> > > > +	int ret;
> > > > +
> > > > +	/* Ensure context is in correct state updating work queue */
> > > > +	GEM_BUG_ON(!atomic_read(&ce->guc_id.ref));
> > > > +	GEM_BUG_ON(context_guc_id_invalid(ce));
> > > > +	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
> > > > +	GEM_BUG_ON(!lrc_desc_registered(ce_to_guc(ce), ce->guc_id.id));
> > > > +
> > > > +	/* Insert NOOP if this work queue item will wrap the tail pointer. */
> > > > +	if (wqi_size > wq_space_until_wrap(ce)) {
> > > > +		ret = guc_wq_noop_append(ce);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +	}
> > > > +
> > > > +	wqi = get_wq_pointer(desc, ce, wqi_size);
> > > > +	if (!wqi)
> > > > +		return -EBUSY;
> > > > +
> > > > +	*wqi++ = WQ_TYPE_MULTI_LRC |
> > > > +		((wqi_size / sizeof(u32) - 1) << WQ_LEN_SHIFT);
> > > > +	*wqi++ = ce->lrc.lrca;
> > > > +	*wqi++ = (ce->guc_id.id << WQ_GUC_ID_SHIFT) |
> > > > +		 ((ce->ring->tail / sizeof(u64)) << WQ_RING_TAIL_SHIFT);
> > > As above, would be better to use FIELD macros instead of manual shifting.
> > > 
> > Will do.
> > 
> > Matt
> > 
> > > John.
> > > 
> > > 
> > > > +	*wqi++ = 0;	/* fence_id */
> > > > +	for_each_child(ce, child)
> > > > +		*wqi++ = child->ring->tail / sizeof(u64);
> > > > +
> > > > +	write_wqi(desc, ce, wqi_size);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int guc_wq_item_append(struct intel_guc *guc,
> > > > +			      struct i915_request *rq)
> > > > +{
> > > > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > > > +	int ret = 0;
> > > > +
> > > > +	if (likely(!intel_context_is_banned(ce))) {
> > > > +		ret = __guc_wq_item_append(rq);
> > > > +
> > > > +		if (unlikely(ret == -EBUSY)) {
> > > > +			guc->stalled_request = rq;
> > > > +			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static bool multi_lrc_submit(struct i915_request *rq)
> > > > +{
> > > > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > > > +
> > > > +	intel_ring_set_tail(rq->ring, rq->tail);
> > > > +
> > > > +	/*
> > > > +	 * We expect the front end (execbuf IOCTL) to set this flag on the last
> > > > +	 * request generated from a multi-BB submission. This indicates to the
> > > > +	 * backend (GuC interface) that we should submit this context thus
> > > > +	 * submitting all the requests generated in parallel.
> > > > +	 */
> > > > +	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
> > > > +		intel_context_is_banned(ce);
> > > > +}
> > > > +
> > > >    static int guc_dequeue_one_context(struct intel_guc *guc)
> > > >    {
> > > >    	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> > > > @@ -645,7 +804,17 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
> > > >    	if (guc->stalled_request) {
> > > >    		submit = true;
> > > >    		last = guc->stalled_request;
> > > > -		goto resubmit;
> > > > +
> > > > +		switch (guc->submission_stall_reason) {
> > > > +		case STALL_REGISTER_CONTEXT:
> > > > +			goto register_context;
> > > > +		case STALL_MOVE_LRC_TAIL:
> > > > +			goto move_lrc_tail;
> > > > +		case STALL_ADD_REQUEST:
> > > > +			goto add_request;
> > > > +		default:
> > > > +			MISSING_CASE(guc->submission_stall_reason);
> > > > +		}
> > > >    	}
> > > >    	while ((rb = rb_first_cached(&sched_engine->queue))) {
> > > > @@ -653,8 +822,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
> > > >    		struct i915_request *rq, *rn;
> > > >    		priolist_for_each_request_consume(rq, rn, p) {
> > > > -			if (last && rq->context != last->context)
> > > > -				goto done;
> > > > +			if (last && !can_merge_rq(rq, last))
> > > > +				goto register_context;
> > > >    			list_del_init(&rq->sched.link);
> > > > @@ -662,33 +831,84 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
> > > >    			trace_i915_request_in(rq, 0);
> > > >    			last = rq;
> > > > -			submit = true;
> > > > +
> > > > +			if (is_multi_lrc_rq(rq)) {
> > > > +				/*
> > > > +				 * We need to coalesce all multi-lrc requests in
> > > > +				 * a relationship into a single H2G. We are
> > > > +				 * guaranteed that all of these requests will be
> > > > +				 * submitted sequentially.
> > > > +				 */
> > > > +				if (multi_lrc_submit(rq)) {
> > > > +					submit = true;
> > > > +					goto register_context;
> > > > +				}
> > > > +			} else {
> > > > +				submit = true;
> > > > +			}
> > > >    		}
> > > >    		rb_erase_cached(&p->node, &sched_engine->queue);
> > > >    		i915_priolist_free(p);
> > > >    	}
> > > > -done:
> > > > +
> > > > +register_context:
> > > >    	if (submit) {
> > > > -		guc_set_lrc_tail(last);
> > > > -resubmit:
> > > > +		struct intel_context *ce = request_to_scheduling_context(last);
> > > > +
> > > > +		if (unlikely(!lrc_desc_registered(guc, ce->guc_id.id) &&
> > > > +			     !intel_context_is_banned(ce))) {
> > > > +			ret = guc_lrc_desc_pin(ce, false);
> > > > +			if (unlikely(ret == -EPIPE)) {
> > > > +				goto deadlk;
> > > > +			} else if (ret == -EBUSY) {
> > > > +				guc->stalled_request = last;
> > > > +				guc->submission_stall_reason =
> > > > +					STALL_REGISTER_CONTEXT;
> > > > +				goto schedule_tasklet;
> > > > +			} else if (ret != 0) {
> > > > +				GEM_WARN_ON(ret);	/* Unexpected */
> > > > +				goto deadlk;
> > > > +			}
> > > > +		}
> > > > +
> > > > +move_lrc_tail:
> > > > +		if (is_multi_lrc_rq(last)) {
> > > > +			ret = guc_wq_item_append(guc, last);
> > > > +			if (ret == -EBUSY) {
> > > > +				goto schedule_tasklet;
> > > > +			} else if (ret != 0) {
> > > > +				GEM_WARN_ON(ret);	/* Unexpected */
> > > > +				goto deadlk;
> > > > +			}
> > > > +		} else {
> > > > +			guc_set_lrc_tail(last);
> > > > +		}
> > > > +
> > > > +add_request:
> > > >    		ret = guc_add_request(guc, last);
> > > > -		if (unlikely(ret == -EPIPE))
> > > > +		if (unlikely(ret == -EPIPE)) {
> > > > +			goto deadlk;
> > > > +		} else if (ret == -EBUSY) {
> > > > +			goto schedule_tasklet;
> > > > +		} else if (ret != 0) {
> > > > +			GEM_WARN_ON(ret);	/* Unexpected */
> > > >    			goto deadlk;
> > > > -		else if (ret == -EBUSY) {
> > > > -			tasklet_schedule(&sched_engine->tasklet);
> > > > -			guc->stalled_request = last;
> > > > -			return false;
> > > >    		}
> > > >    	}
> > > >    	guc->stalled_request = NULL;
> > > > +	guc->submission_stall_reason = STALL_NONE;
> > > >    	return submit;
> > > >    deadlk:
> > > >    	sched_engine->tasklet.callback = NULL;
> > > >    	tasklet_disable_nosync(&sched_engine->tasklet);
> > > >    	return false;
> > > > +
> > > > +schedule_tasklet:
> > > > +	tasklet_schedule(&sched_engine->tasklet);
> > > > +	return false;
> > > >    }
> > > >    static void guc_submission_tasklet(struct tasklet_struct *t)
> > > > @@ -1227,10 +1447,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
> > > >    	trace_i915_request_in(rq, 0);
> > > > -	guc_set_lrc_tail(rq);
> > > > -	ret = guc_add_request(guc, rq);
> > > > -	if (ret == -EBUSY)
> > > > -		guc->stalled_request = rq;
> > > > +	if (is_multi_lrc_rq(rq)) {
> > > > +		if (multi_lrc_submit(rq)) {
> > > > +			ret = guc_wq_item_append(guc, rq);
> > > > +			if (!ret)
> > > > +				ret = guc_add_request(guc, rq);
> > > > +		}
> > > > +	} else {
> > > > +		guc_set_lrc_tail(rq);
> > > > +		ret = guc_add_request(guc, rq);
> > > > +	}
> > > >    	if (unlikely(ret == -EPIPE))
> > > >    		disable_submission(guc);
> > > > @@ -1238,6 +1464,16 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
> > > >    	return ret;
> > > >    }
> > > > +bool need_tasklet(struct intel_guc *guc, struct i915_request *rq)
> > > > +{
> > > > +	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> > > > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > > > +
> > > > +	return submission_disabled(guc) || guc->stalled_request ||
> > > > +		!i915_sched_engine_is_empty(sched_engine) ||
> > > > +		!lrc_desc_registered(guc, ce->guc_id.id);
> > > > +}
> > > > +
> > > >    static void guc_submit_request(struct i915_request *rq)
> > > >    {
> > > >    	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
> > > > @@ -1247,8 +1483,7 @@ static void guc_submit_request(struct i915_request *rq)
> > > >    	/* Will be called from irq-context when using foreign fences. */
> > > >    	spin_lock_irqsave(&sched_engine->lock, flags);
> > > > -	if (submission_disabled(guc) || guc->stalled_request ||
> > > > -	    !i915_sched_engine_is_empty(sched_engine))
> > > > +	if (need_tasklet(guc, rq))
> > > >    		queue_request(sched_engine, rq, rq_prio(rq));
> > > >    	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
> > > >    		tasklet_hi_schedule(&sched_engine->tasklet);
> > > > @@ -2241,9 +2476,10 @@ static bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
> > > >    static void add_to_context(struct i915_request *rq)
> > > >    {
> > > > -	struct intel_context *ce = rq->context;
> > > > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > > >    	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > >    	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
> > > >    	spin_lock(&ce->guc_state.lock);
> > > > @@ -2276,7 +2512,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
> > > >    static void remove_from_context(struct i915_request *rq)
> > > >    {
> > > > -	struct intel_context *ce = rq->context;
> > > > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > > > +
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > >    	spin_lock_irq(&ce->guc_state.lock);
> > > > @@ -2692,7 +2930,7 @@ static void guc_init_breadcrumbs(struct intel_engine_cs *engine)
> > > >    static void guc_bump_inflight_request_prio(struct i915_request *rq,
> > > >    					   int prio)
> > > >    {
> > > > -	struct intel_context *ce = rq->context;
> > > > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > > >    	u8 new_guc_prio = map_i915_prio_to_guc_prio(prio);
> > > >    	/* Short circuit function */
> > > > @@ -2715,7 +2953,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
> > > >    static void guc_retire_inflight_request_prio(struct i915_request *rq)
> > > >    {
> > > > -	struct intel_context *ce = rq->context;
> > > > +	struct intel_context *ce = request_to_scheduling_context(rq);
> > > >    	spin_lock(&ce->guc_state.lock);
> > > >    	guc_prio_fini(rq, ce);
> > > > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > > > index 177eaf55adff..8f0073e19079 100644
> > > > --- a/drivers/gpu/drm/i915/i915_request.h
> > > > +++ b/drivers/gpu/drm/i915/i915_request.h
> > > > @@ -139,6 +139,14 @@ enum {
> > > >    	 * the GPU. Here we track such boost requests on a per-request basis.
> > > >    	 */
> > > >    	I915_FENCE_FLAG_BOOST,
> > > > +
> > > > +	/*
> > > > +	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
> > > > +	 * parent-child relationship (parallel submission, multi-lrc) should
> > > > +	 * trigger a submission to the GuC rather than just moving the context
> > > > +	 * tail.
> > > > +	 */
> > > > +	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> > > >    };
> > > >    /**
> 


* Re: [Intel-gfx] [PATCH 22/27] drm/i915/guc: Add basic GuC multi-lrc selftest
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
@ 2021-09-28 20:47   ` John Harrison
  -1 siblings, 0 replies; 145+ messages in thread
From: John Harrison @ 2021-09-28 20:47 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> Add very basic (single submission) multi-lrc selftest.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   1 +
>   .../drm/i915/gt/uc/selftest_guc_multi_lrc.c   | 180 ++++++++++++++++++
>   .../drm/i915/selftests/i915_live_selftests.h  |   1 +
>   3 files changed, 182 insertions(+)
>   create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 2554d0eb4afd..91330525330d 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3924,4 +3924,5 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
>   
>   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>   #include "selftest_guc.c"
> +#include "selftest_guc_multi_lrc.c"
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
> new file mode 100644
> index 000000000000..dacfc5dfadd6
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
> @@ -0,0 +1,180 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2019 Intel Corporation
> + */
> +
> +#include "selftests/igt_spinner.h"
> +#include "selftests/igt_reset.h"
> +#include "selftests/intel_scheduler_helpers.h"
> +#include "gt/intel_engine_heartbeat.h"
> +#include "gem/selftests/mock_context.h"
> +
> +static void logical_sort(struct intel_engine_cs **engines, int num_engines)
> +{
> +	struct intel_engine_cs *sorted[MAX_ENGINE_INSTANCE + 1];
> +	int i, j;
> +
> +	for (i = 0; i < num_engines; ++i)
> +		for (j = 0; j < MAX_ENGINE_INSTANCE + 1; ++j) {
> +			if (engines[j]->logical_mask & BIT(i)) {
> +				sorted[i] = engines[j];
> +				break;
> +			}
> +		}
> +
> +	memcpy(*engines, *sorted,
> +	       sizeof(struct intel_engine_cs *) * num_engines);
> +}
> +
> +static struct intel_context *
> +multi_lrc_create_parent(struct intel_gt *gt, u8 class,
> +			unsigned long flags)
> +{
> +	struct intel_engine_cs *siblings[MAX_ENGINE_INSTANCE + 1];
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	int i = 0;
> +
> +	for_each_engine(engine, gt, id) {
> +		if (engine->class != class)
> +			continue;
> +
> +		siblings[i++] = engine;
> +	}
> +
> +	if (i <= 1)
> +		return ERR_PTR(0);
> +
> +	logical_sort(siblings, i);
> +
> +	return intel_engine_create_parallel(siblings, 1, i);
> +}
> +
> +static void multi_lrc_context_unpin(struct intel_context *ce)
> +{
> +	struct intel_context *child;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	for_each_child(ce, child)
> +		intel_context_unpin(child);
> +	intel_context_unpin(ce);
> +}
> +
> +static void multi_lrc_context_put(struct intel_context *ce)
> +{
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	/*
> +	 * Only the parent gets the creation ref put in the uAPI, the parent
> +	 * itself is responsible for creation ref put on the children.
> +	 */
> +	intel_context_put(ce);
> +}
> +
> +static struct i915_request *
> +multi_lrc_nop_request(struct intel_context *ce)
> +{
> +	struct intel_context *child;
> +	struct i915_request *rq, *child_rq;
> +	int i = 0;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	rq = intel_context_create_request(ce);
> +	if (IS_ERR(rq))
> +		return rq;
> +
> +	i915_request_get(rq);
> +	i915_request_add(rq);
> +
> +	for_each_child(ce, child) {
> +		child_rq = intel_context_create_request(child);
> +		if (IS_ERR(child_rq))
> +			goto child_error;
> +
> +		if (++i == ce->guc_number_children)
> +			set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +				&child_rq->fence.flags);
> +		i915_request_add(child_rq);
> +	}
> +
> +	return rq;
> +
> +child_error:
> +	i915_request_put(rq);
> +
> +	return ERR_PTR(-ENOMEM);
> +}
> +
> +static int __intel_guc_multi_lrc_basic(struct intel_gt *gt, unsigned int class)
> +{
> +	struct intel_context *parent;
> +	struct i915_request *rq;
> +	int ret;
> +
> +	parent = multi_lrc_create_parent(gt, class, 0);
> +	if (IS_ERR(parent)) {
> +		pr_err("Failed creating contexts: %ld", PTR_ERR(parent));
> +		return PTR_ERR(parent);
> +	} else if (!parent) {
> +		pr_debug("Not enough engines in class: %d",
> +			 VIDEO_DECODE_CLASS);
Should be 'class'.

With that fixed:
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> +		return 0;
> +	}
> +
> +	rq = multi_lrc_nop_request(parent);
> +	if (IS_ERR(rq)) {
> +		ret = PTR_ERR(rq);
> +		pr_err("Failed creating requests: %d", ret);
> +		goto out;
> +	}
> +
> +	ret = intel_selftest_wait_for_rq(rq);
> +	if (ret)
> +		pr_err("Failed waiting on request: %d", ret);
> +
> +	i915_request_put(rq);
> +
> +	if (ret >= 0) {
> +		ret = intel_gt_wait_for_idle(gt, HZ * 5);
> +		if (ret < 0)
> +			pr_err("GT failed to idle: %d\n", ret);
> +	}
> +
> +out:
> +	multi_lrc_context_unpin(parent);
> +	multi_lrc_context_put(parent);
> +	return ret;
> +}
> +
> +static int intel_guc_multi_lrc_basic(void *arg)
> +{
> +	struct intel_gt *gt = arg;
> +	unsigned int class;
> +	int ret;
> +
> +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
> +		ret = __intel_guc_multi_lrc_basic(gt, class);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +int intel_guc_multi_lrc_live_selftests(struct drm_i915_private *i915)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(intel_guc_multi_lrc_basic),
> +	};
> +	struct intel_gt *gt = &i915->gt;
> +
> +	if (intel_gt_is_wedged(gt))
> +		return 0;
> +
> +	if (!intel_uc_uses_guc_submission(&gt->uc))
> +		return 0;
> +
> +	return intel_gt_live_subtests(tests, gt);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> index 3cf6758931f9..bdd290f2bf3c 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
> @@ -48,5 +48,6 @@ selftest(ring_submission, intel_ring_submission_live_selftests)
>   selftest(perf, i915_perf_live_selftests)
>   selftest(slpc, intel_slpc_live_selftests)
>   selftest(guc, intel_guc_live_selftests)
> +selftest(guc_multi_lrc, intel_guc_multi_lrc_live_selftests)
>   /* Here be dragons: keep last to run last! */
>   selftest(late_gt_pm, intel_gt_pm_late_selftests)



* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
@ 2021-09-28 22:20   ` John Harrison
  2021-09-28 22:33     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-28 22:20 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
> mid BB. To safely enable preemption at the BB boundary, a handshake
> between the parent and child is needed. This is implemented via custom
> emit_bb_start & emit_fini_breadcrumb functions and enabled via by
via by -> by

> default if a context is configured by set parallel extension.
I tend to agree with Tvrtko that this should probably be an opt in 
change. Is there a flags word passed in when creating the context?

Also, it's not just a change in pre-emption behaviour but a change in 
synchronisation too, right? Previously, if you had a whole bunch of back 
to back submissions then one child could run ahead of another and/or the 
parent. After this change, there is a forced regroup at the end of each 
batch. So while one could end sooner/later than the others, they can't 
ever get an entire batch (or more) ahead or behind. Or was that 
synchronisation already in there through other means anyway?

>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 283 +++++++++++++++++-
>   4 files changed, 287 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 5615be32879c..2de62649e275 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -561,7 +561,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
>   	GEM_BUG_ON(intel_context_is_child(child));
>   	GEM_BUG_ON(intel_context_is_parent(child));
>   
> -	parent->guc_number_children++;
> +	child->guc_child_index = parent->guc_number_children++;
>   	list_add_tail(&child->guc_child_link,
>   		      &parent->guc_child_list);
>   	child->parent = parent;
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 713d85b0b364..727f91e7f7c2 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -246,6 +246,9 @@ struct intel_context {
>   		/** @guc_number_children: number of children if parent */
>   		u8 guc_number_children;
>   
> +		/** @guc_child_index: index into guc_child_list if child */
> +		u8 guc_child_index;
> +
>   		/**
>   		 * @parent_page: page in context used by parent for work queue,
>   		 * work queue descriptor
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 6cd26dc060d1..9f61cfa5566a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -188,7 +188,7 @@ struct guc_process_desc {
>   	u32 wq_status;
>   	u32 engine_presence;
>   	u32 priority;
> -	u32 reserved[30];
> +	u32 reserved[36];
What is this extra space for? All the extra storage is grabbed from 
after the end of this structure, isn't it?

>   } __packed;
>   
>   #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 91330525330d..1a18f99bf12a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -11,6 +11,7 @@
>   #include "gt/intel_context.h"
>   #include "gt/intel_engine_pm.h"
>   #include "gt/intel_engine_heartbeat.h"
> +#include "gt/intel_gpu_commands.h"
>   #include "gt/intel_gt.h"
>   #include "gt/intel_gt_irq.h"
>   #include "gt/intel_gt_pm.h"
> @@ -366,10 +367,14 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
>   
>   /*
>    * When using multi-lrc submission an extra page in the context state is
> - * reserved for the process descriptor and work queue.
> + * reserved for the process descriptor, work queue, and preempt BB boundary
> + * handshake between the parent + childlren contexts.
>    *
>    * The layout of this page is below:
>    * 0						guc_process_desc
> + * + sizeof(struct guc_process_desc)		child go
> + * + CACHELINE_BYTES				child join ...
> + * + CACHELINE_BYTES ...
Would be better written as '[num_children]' instead of '...' to make it 
clear it is a per child array.

Also, maybe create a struct for this to get rid of the magic '+1's and 
'BYTES / sizeof' constructs in the functions below.

>    * ...						unused
>    * PAGE_SIZE / 2				work queue start
>    * ...						work queue
> @@ -1785,6 +1790,30 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
>   	return __guc_action_deregister_context(guc, guc_id, loop);
>   }
>   
> +static inline void clear_children_join_go_memory(struct intel_context *ce)
> +{
> +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> +	u8 i;
> +
> +	for (i = 0; i < ce->guc_number_children + 1; ++i)
> +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
> +}
> +
> +static inline u32 get_children_go_value(struct intel_context *ce)
> +{
> +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> +
> +	return mem[0];
> +}
> +
> +static inline u32 get_children_join_value(struct intel_context *ce,
> +					  u8 child_index)
> +{
> +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> +
> +	return mem[(child_index + 1) * (CACHELINE_BYTES / sizeof(u32))];
> +}
> +
>   static void guc_context_policy_init(struct intel_engine_cs *engine,
>   				    struct guc_lrc_desc *desc)
>   {
> @@ -1867,6 +1896,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>   			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>   			guc_context_policy_init(engine, desc);
>   		}
> +
> +		clear_children_join_go_memory(ce);
>   	}
>   
>   	/*
> @@ -2943,6 +2974,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
>   	.get_sibling = guc_virtual_get_sibling,
>   };
>   
> +/*
> + * The below override of the breadcrumbs is enabled when the user configures a
> + * context for parallel submission (multi-lrc, parent-child).
> + *
> + * The overridden breadcrumbs implements an algorithm which allows the GuC to
> + * safely preempt all the hw contexts configured for parallel submission
> + * between each BB. The contract between the i915 and GuC is if the parent
> + * context can be preempted, all the children can be preempted, and the GuC will
> + * always try to preempt the parent before the children. A handshake between the
> + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
> + * creating a window to preempt between each set of BBs.
> + */
> +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						     u64 offset, u32 len,
> +						     const unsigned int flags);
> +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> +						    u64 offset, u32 len,
> +						    const unsigned int flags);
> +static u32 *
> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						 u32 *cs);
> +static u32 *
> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> +						u32 *cs);
> +
>   static struct intel_context *
>   guc_create_parallel(struct intel_engine_cs **engines,
>   		    unsigned int num_siblings,
> @@ -2978,6 +3034,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
>   		}
>   	}
>   
> +	parent->engine->emit_bb_start =
> +		emit_bb_start_parent_no_preempt_mid_batch;
> +	parent->engine->emit_fini_breadcrumb =
> +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
> +	parent->engine->emit_fini_breadcrumb_dw =
> +		12 + 4 * parent->guc_number_children;
> +	for_each_child(parent, ce) {
> +		ce->engine->emit_bb_start =
> +			emit_bb_start_child_no_preempt_mid_batch;
> +		ce->engine->emit_fini_breadcrumb =
> +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
> +		ce->engine->emit_fini_breadcrumb_dw = 16;
> +	}
> +
>   	kfree(siblings);
>   	return parent;
>   
> @@ -3362,6 +3432,204 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
>   	guc->submission_selected = __guc_submission_selected(guc);
>   }
>   
> +static inline u32 get_children_go_addr(struct intel_context *ce)
> +{
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	return i915_ggtt_offset(ce->state) +
> +		__get_process_desc_offset(ce) +
> +		sizeof(struct guc_process_desc);
> +}
> +
> +static inline u32 get_children_join_addr(struct intel_context *ce,
> +					 u8 child_index)
> +{
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
> +}
> +
> +#define PARENT_GO_BB			1
> +#define PARENT_GO_FINI_BREADCRUMB	0
> +#define CHILD_GO_BB			1
> +#define CHILD_GO_FINI_BREADCRUMB	0
> +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						     u64 offset, u32 len,
> +						     const unsigned int flags)
> +{
> +	struct intel_context *ce = rq->context;
> +	u32 *cs;
> +	u8 i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	cs = intel_ring_begin(rq, 10 + 4 * ce->guc_number_children);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	/* Wait on chidlren */
chidlren -> children

> +	for (i = 0; i < ce->guc_number_children; ++i) {
> +		*cs++ = (MI_SEMAPHORE_WAIT |
> +			 MI_SEMAPHORE_GLOBAL_GTT |
> +			 MI_SEMAPHORE_POLL |
> +			 MI_SEMAPHORE_SAD_EQ_SDD);
> +		*cs++ = PARENT_GO_BB;
> +		*cs++ = get_children_join_addr(ce, i);
> +		*cs++ = 0;
> +	}
> +
> +	/* Turn off preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> +	*cs++ = MI_NOOP;
> +
> +	/* Tell children go */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  CHILD_GO_BB,
> +				  get_children_go_addr(ce),
> +				  0);
> +
> +	/* Jump to batch */
> +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> +	*cs++ = lower_32_bits(offset);
> +	*cs++ = upper_32_bits(offset);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> +						    u64 offset, u32 len,
> +						    const unsigned int flags)
> +{
> +	struct intel_context *ce = rq->context;
> +	u32 *cs;
> +
> +	GEM_BUG_ON(!intel_context_is_child(ce));
> +
> +	cs = intel_ring_begin(rq, 12);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	/* Signal parent */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  PARENT_GO_BB,
> +				  get_children_join_addr(ce->parent,
> +							 ce->guc_child_index),
> +				  0);
> +
> +	/* Wait parent on for go */
parent on -> on parent

> +	*cs++ = (MI_SEMAPHORE_WAIT |
> +		 MI_SEMAPHORE_GLOBAL_GTT |
> +		 MI_SEMAPHORE_POLL |
> +		 MI_SEMAPHORE_SAD_EQ_SDD);
> +	*cs++ = CHILD_GO_BB;
> +	*cs++ = get_children_go_addr(ce->parent);
> +	*cs++ = 0;
> +
> +	/* Turn off preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> +
> +	/* Jump to batch */
> +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> +	*cs++ = lower_32_bits(offset);
> +	*cs++ = upper_32_bits(offset);
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						 u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +	u8 i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	/* Wait on children */
> +	for (i = 0; i < ce->guc_number_children; ++i) {
> +		*cs++ = (MI_SEMAPHORE_WAIT |
> +			 MI_SEMAPHORE_GLOBAL_GTT |
> +			 MI_SEMAPHORE_POLL |
> +			 MI_SEMAPHORE_SAD_EQ_SDD);
> +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
> +		*cs++ = get_children_join_addr(ce, i);
> +		*cs++ = 0;
> +	}
> +
> +	/* Turn on preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> +	*cs++ = MI_NOOP;
> +
> +	/* Tell children go */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  CHILD_GO_FINI_BREADCRUMB,
> +				  get_children_go_addr(ce),
> +				  0);
> +
> +	/* Emit fini breadcrumb */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  rq->fence.seqno,
> +				  i915_request_active_timeline(rq)->hwsp_offset,
> +				  0);
> +
> +	/* User interrupt */
> +	*cs++ = MI_USER_INTERRUPT;
> +	*cs++ = MI_NOOP;
> +
> +	rq->tail = intel_ring_offset(rq, cs);
> +
> +	return cs;
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +
> +	GEM_BUG_ON(!intel_context_is_child(ce));
> +
> +	/* Turn on preemption */
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> +	*cs++ = MI_NOOP;
> +
> +	/* Signal parent */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  PARENT_GO_FINI_BREADCRUMB,
> +				  get_children_join_addr(ce->parent,
> +							 ce->guc_child_index),
> +				  0);
> +
This is backwards compared to the parent?

Parent: wait on children then enable pre-emption
Child: enable pre-emption then signal parent

Makes for a window where the parent is waiting in atomic context for a 
signal from a child that has been pre-empted and might not get to run 
for some time?

John.


> +	/* Wait on parent for go */
> +	*cs++ = (MI_SEMAPHORE_WAIT |
> +		 MI_SEMAPHORE_GLOBAL_GTT |
> +		 MI_SEMAPHORE_POLL |
> +		 MI_SEMAPHORE_SAD_EQ_SDD);
> +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
> +	*cs++ = get_children_go_addr(ce->parent);
> +	*cs++ = 0;
> +
> +	/* Emit fini breadcrumb */
> +	cs = gen8_emit_ggtt_write(cs,
> +				  rq->fence.seqno,
> +				  i915_request_active_timeline(rq)->hwsp_offset,
> +				  0);
> +
> +	/* User interrupt */
> +	*cs++ = MI_USER_INTERRUPT;
> +	*cs++ = MI_NOOP;
> +
> +	rq->tail = intel_ring_offset(rq, cs);
> +
> +	return cs;
> +}
> +
>   static struct intel_context *
>   g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>   {
> @@ -3807,6 +4075,19 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   			drm_printf(p, "\t\tWQI Status: %u\n\n",
>   				   READ_ONCE(desc->wq_status));
>   
> +			drm_printf(p, "\t\tNumber Children: %u\n\n",
> +				   ce->guc_number_children);
> +			if (ce->engine->emit_bb_start ==
> +			    emit_bb_start_parent_no_preempt_mid_batch) {
> +				u8 i;
> +
> +				drm_printf(p, "\t\tChildren Go: %u\n\n",
> +					   get_children_go_value(ce));
> +				for (i = 0; i < ce->guc_number_children; ++i)
> +					drm_printf(p, "\t\tChildren Join: %u\n",
> +						   get_children_join_value(ce, i));
> +			}
> +
>   			for_each_child(ce, child)
>   				guc_log_context(p, child);
>   		}



* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-09-28 22:20   ` John Harrison
@ 2021-09-28 22:33     ` Matthew Brost
  2021-09-28 23:33       ` John Harrison
  0 siblings, 1 reply; 145+ messages in thread
From: Matthew Brost @ 2021-09-28 22:33 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Tue, Sep 28, 2021 at 03:20:42PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
> > mid BB. To safely enable preemption at the BB boundary, a handshake
> > between the parent and child is needed. This is implemented via custom
> > emit_bb_start & emit_fini_breadcrumb functions and enabled via by
> via by -> by
> 
> > default if a context is configured by set parallel extension.
> I tend to agree with Tvrtko that this should probably be an opt in change.
> Is there a flags word passed in when creating the context?
> 

I don't disagree, but the uAPI in this series is where we landed. It has
been acked by all the relevant parties in the RFC, ported to our
internal tree, and the media UMD has been updated / posted. Concerns
with the uAPI should've been raised in the RFC phase, not now. I really
don't feel like changing this uAPI another time.

> Also, it's not just a change in pre-emption behaviour but a change in
> synchronisation too, right? Previously, if you had a whole bunch of back to
> back submissions then one child could run ahead of another and/or the
> parent. After this change, there is a forced regroup at the end of each
> batch. So while one could end sooner/later than the others, they can't ever
> get an entire batch (or more) ahead or behind. Or was that synchronisation
> already in there through other means anyway?
> 

Yes, the parent / children sync at the end of each batch - this is the
only way to safely insert preemption points. Without this the GuC could
attempt a preemption and hang the batches.
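
Roughly, per batch, the two emit_bb_start paths boil down to this
(condensed pseudocode of the functions below, not the literal emitted
dwords):

	/* parent emit_bb_start (condensed) */
	for (i = 0; i < number_children; ++i)
		SEMAPHORE_WAIT(join[i] == PARENT_GO_BB);
	MI_ARB_DISABLE;			/* preemption off */
	GGTT_WRITE(go, CHILD_GO_BB);	/* release the children */
	MI_BATCH_BUFFER_START;

	/* child emit_bb_start (condensed) */
	GGTT_WRITE(join[child_index], PARENT_GO_BB);	/* signal parent */
	SEMAPHORE_WAIT(go == CHILD_GO_BB);		/* wait for release */
	MI_ARB_DISABLE;					/* preemption off */
	MI_BATCH_BUFFER_START;

No context jumps to its batch until every context has checked in, so the
whole set enters the BBs non-preemptible together.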

> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
> >   drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 283 +++++++++++++++++-
> >   4 files changed, 287 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 5615be32879c..2de62649e275 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -561,7 +561,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> >   	GEM_BUG_ON(intel_context_is_child(child));
> >   	GEM_BUG_ON(intel_context_is_parent(child));
> > -	parent->guc_number_children++;
> > +	child->guc_child_index = parent->guc_number_children++;
> >   	list_add_tail(&child->guc_child_link,
> >   		      &parent->guc_child_list);
> >   	child->parent = parent;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 713d85b0b364..727f91e7f7c2 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -246,6 +246,9 @@ struct intel_context {
> >   		/** @guc_number_children: number of children if parent */
> >   		u8 guc_number_children;
> > +		/** @guc_child_index: index into guc_child_list if child */
> > +		u8 guc_child_index;
> > +
> >   		/**
> >   		 * @parent_page: page in context used by parent for work queue,
> >   		 * work queue descriptor
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > index 6cd26dc060d1..9f61cfa5566a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > @@ -188,7 +188,7 @@ struct guc_process_desc {
> >   	u32 wq_status;
> >   	u32 engine_presence;
> >   	u32 priority;
> > -	u32 reserved[30];
> > +	u32 reserved[36];
> What is this extra space for? All the extra storage is grabbed from after
> the end of this structure, isn't it?
> 

This is the size of the process descriptor in the GuC spec. Even though
this is unused space, we really don't want the child go / join memory
using anything within the process descriptor.
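
If the intent is that the go / join dwords start exactly at the end of
the spec'd descriptor, a compile-time guard would make that explicit,
e.g. (sketch only - GUC_PROCESS_DESC_SPEC_SIZE is a made-up name for
whatever size the spec mandates):

	/* hypothetical: catch the descriptor drifting from the spec */
	BUILD_BUG_ON(sizeof(struct guc_process_desc) !=
		     GUC_PROCESS_DESC_SPEC_SIZE);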

> >   } __packed;
> >   #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 91330525330d..1a18f99bf12a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -11,6 +11,7 @@
> >   #include "gt/intel_context.h"
> >   #include "gt/intel_engine_pm.h"
> >   #include "gt/intel_engine_heartbeat.h"
> > +#include "gt/intel_gpu_commands.h"
> >   #include "gt/intel_gt.h"
> >   #include "gt/intel_gt_irq.h"
> >   #include "gt/intel_gt_pm.h"
> > @@ -366,10 +367,14 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
> >   /*
> >    * When using multi-lrc submission an extra page in the context state is
> > - * reserved for the process descriptor and work queue.
> > + * reserved for the process descriptor, work queue, and preempt BB boundary
> > + * handshake between the parent + children contexts.
> >    *
> >    * The layout of this page is below:
> >    * 0						guc_process_desc
> > + * + sizeof(struct guc_process_desc)		child go
> > + * + CACHELINE_BYTES				child join ...
> > + * + CACHELINE_BYTES ...
> Would be better written as '[num_children]' instead of '...' to make it
> clear it is a per child array.
>

I think this description is pretty clear.

> Also, maybe create a struct for this to get rid of the magic '+1's and
> 'BYTES / sizeof' constructs in the functions below.
> 

Let me see if I can create a struct that describes the layout.
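
Maybe something like this (a sketch only - the struct name is made up
and the MAX_ENGINE_INSTANCE bound on the child count is an assumption):

	struct parent_scratch {
		struct guc_process_desc pdesc;

		/* parent -> children "go" semaphore, own cacheline */
		u32 go_semaphore;
		u8 unused0[CACHELINE_BYTES - sizeof(u32)];

		/* one "join" semaphore per child, a cacheline each */
		struct {
			u32 semaphore;
			u8 unused[CACHELINE_BYTES - sizeof(u32)];
		} join[MAX_ENGINE_INSTANCE + 1];
	};

Then get_children_join_value() reduces to reading
join[child_index].semaphore with no magic offset math.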

> >    * ...						unused
> >    * PAGE_SIZE / 2				work queue start
> >    * ...						work queue
> > @@ -1785,6 +1790,30 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> >   	return __guc_action_deregister_context(guc, guc_id, loop);
> >   }
> > +static inline void clear_children_join_go_memory(struct intel_context *ce)
> > +{
> > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > +	u8 i;
> > +
> > +	for (i = 0; i < ce->guc_number_children + 1; ++i)
> > +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
> > +}
> > +
> > +static inline u32 get_children_go_value(struct intel_context *ce)
> > +{
> > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > +
> > +	return mem[0];
> > +}
> > +
> > +static inline u32 get_children_join_value(struct intel_context *ce,
> > +					  u8 child_index)
> > +{
> > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > +
> > +	return mem[(child_index + 1) * (CACHELINE_BYTES / sizeof(u32))];
> > +}
> > +
> >   static void guc_context_policy_init(struct intel_engine_cs *engine,
> >   				    struct guc_lrc_desc *desc)
> >   {
> > @@ -1867,6 +1896,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >   			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> >   			guc_context_policy_init(engine, desc);
> >   		}
> > +
> > +		clear_children_join_go_memory(ce);
> >   	}
> >   	/*
> > @@ -2943,6 +2974,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
> >   	.get_sibling = guc_virtual_get_sibling,
> >   };
> > +/*
> > + * The below override of the breadcrumbs is enabled when the user configures a
> > + * context for parallel submission (multi-lrc, parent-child).
> > + *
> > + * The overridden breadcrumbs implements an algorithm which allows the GuC to
> > + * safely preempt all the hw contexts configured for parallel submission
> > + * between each BB. The contract between the i915 and GuC is if the parent
> > + * context can be preempted, all the children can be preempted, and the GuC will
> > + * always try to preempt the parent before the children. A handshake between the
> > + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
> > + * creating a window to preempt between each set of BBs.
> > + */
> > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						     u64 offset, u32 len,
> > +						     const unsigned int flags);
> > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						    u64 offset, u32 len,
> > +						    const unsigned int flags);
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs);
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						u32 *cs);
> > +
> >   static struct intel_context *
> >   guc_create_parallel(struct intel_engine_cs **engines,
> >   		    unsigned int num_siblings,
> > @@ -2978,6 +3034,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
> >   		}
> >   	}
> > +	parent->engine->emit_bb_start =
> > +		emit_bb_start_parent_no_preempt_mid_batch;
> > +	parent->engine->emit_fini_breadcrumb =
> > +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
> > +	parent->engine->emit_fini_breadcrumb_dw =
> > +		12 + 4 * parent->guc_number_children;
> > +	for_each_child(parent, ce) {
> > +		ce->engine->emit_bb_start =
> > +			emit_bb_start_child_no_preempt_mid_batch;
> > +		ce->engine->emit_fini_breadcrumb =
> > +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
> > +		ce->engine->emit_fini_breadcrumb_dw = 16;
> > +	}
> > +
> >   	kfree(siblings);
> >   	return parent;
> > @@ -3362,6 +3432,204 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
> >   	guc->submission_selected = __guc_submission_selected(guc);
> >   }
> > +static inline u32 get_children_go_addr(struct intel_context *ce)
> > +{
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	return i915_ggtt_offset(ce->state) +
> > +		__get_process_desc_offset(ce) +
> > +		sizeof(struct guc_process_desc);
> > +}
> > +
> > +static inline u32 get_children_join_addr(struct intel_context *ce,
> > +					 u8 child_index)
> > +{
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
> > +}
> > +
> > +#define PARENT_GO_BB			1
> > +#define PARENT_GO_FINI_BREADCRUMB	0
> > +#define CHILD_GO_BB			1
> > +#define CHILD_GO_FINI_BREADCRUMB	0
> > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						     u64 offset, u32 len,
> > +						     const unsigned int flags)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u32 *cs;
> > +	u8 i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	cs = intel_ring_begin(rq, 10 + 4 * ce->guc_number_children);
> > +	if (IS_ERR(cs))
> > +		return PTR_ERR(cs);
> > +
> > +	/* Wait on chidlren */
> chidlren -> children
>

Yep.
 
> > +	for (i = 0; i < ce->guc_number_children; ++i) {
> > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > +			 MI_SEMAPHORE_POLL |
> > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > +		*cs++ = PARENT_GO_BB;
> > +		*cs++ = get_children_join_addr(ce, i);
> > +		*cs++ = 0;
> > +	}
> > +
> > +	/* Turn off preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Tell children go */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  CHILD_GO_BB,
> > +				  get_children_go_addr(ce),
> > +				  0);
> > +
> > +	/* Jump to batch */
> > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > +	*cs++ = lower_32_bits(offset);
> > +	*cs++ = upper_32_bits(offset);
> > +	*cs++ = MI_NOOP;
> > +
> > +	intel_ring_advance(rq, cs);
> > +
> > +	return 0;
> > +}
> > +
> > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						    u64 offset, u32 len,
> > +						    const unsigned int flags)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u32 *cs;
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	cs = intel_ring_begin(rq, 12);
> > +	if (IS_ERR(cs))
> > +		return PTR_ERR(cs);
> > +
> > +	/* Signal parent */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  PARENT_GO_BB,
> > +				  get_children_join_addr(ce->parent,
> > +							 ce->guc_child_index),
> > +				  0);
> > +
> > +	/* Wait parent on for go */
> parent on -> on parent
>

Yep.
 
> > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > +		 MI_SEMAPHORE_POLL |
> > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > +	*cs++ = CHILD_GO_BB;
> > +	*cs++ = get_children_go_addr(ce->parent);
> > +	*cs++ = 0;
> > +
> > +	/* Turn off preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > +
> > +	/* Jump to batch */
> > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > +	*cs++ = lower_32_bits(offset);
> > +	*cs++ = upper_32_bits(offset);
> > +
> > +	intel_ring_advance(rq, cs);
> > +
> > +	return 0;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +	u8 i;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	/* Wait on children */
> > +	for (i = 0; i < ce->guc_number_children; ++i) {
> > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > +			 MI_SEMAPHORE_POLL |
> > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
> > +		*cs++ = get_children_join_addr(ce, i);
> > +		*cs++ = 0;
> > +	}
> > +
> > +	/* Turn on preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Tell children go */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  CHILD_GO_FINI_BREADCRUMB,
> > +				  get_children_go_addr(ce),
> > +				  0);
> > +
> > +	/* Emit fini breadcrumb */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  rq->fence.seqno,
> > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > +				  0);
> > +
> > +	/* User interrupt */
> > +	*cs++ = MI_USER_INTERRUPT;
> > +	*cs++ = MI_NOOP;
> > +
> > +	rq->tail = intel_ring_offset(rq, cs);
> > +
> > +	return cs;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	/* Turn on preemption */
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +	*cs++ = MI_NOOP;
> > +
> > +	/* Signal parent */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  PARENT_GO_FINI_BREADCRUMB,
> > +				  get_children_join_addr(ce->parent,
> > +							 ce->guc_child_index),
> > +				  0);
> > +
> This is backwards compared to the parent?
> 
> Parent: wait on children then enable pre-emption
> Child: enable pre-emption then signal parent
> 
> Makes for a window where the parent is waiting in atomic context for a
> signal from a child that has been pre-empted and might not get to run for
> some time?
>

No, this is correct. The rule is that if the parent can be preempted,
all the children can be preempted, so we can't enable preemption on the
parent until all the children have preemption enabled; hence the parent
waits for all the children to join before enabling its own preemption.
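
Condensed, the ordering of the two fini breadcrumbs is (pseudocode, not
the literal dwords):

	/* child fini breadcrumb (condensed) */
	MI_ARB_ENABLE;				/* child preemptible first */
	GGTT_WRITE(join[child_index], PARENT_GO_FINI_BREADCRUMB);
	SEMAPHORE_WAIT(go == CHILD_GO_FINI_BREADCRUMB);
	GGTT_WRITE(hwsp, rq->fence.seqno);
	MI_USER_INTERRUPT;

	/* parent fini breadcrumb (condensed) */
	for (i = 0; i < number_children; ++i)
		SEMAPHORE_WAIT(join[i] == PARENT_GO_FINI_BREADCRUMB);
	MI_ARB_ENABLE;				/* parent preemptible last */
	GGTT_WRITE(go, CHILD_GO_FINI_BREADCRUMB);
	GGTT_WRITE(hwsp, rq->fence.seqno);
	MI_USER_INTERRUPT;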

Matt
 
> John.
> 
> 
> > +	/* Wait on parent for go */
> > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > +		 MI_SEMAPHORE_POLL |
> > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
> > +	*cs++ = get_children_go_addr(ce->parent);
> > +	*cs++ = 0;
> > +
> > +	/* Emit fini breadcrumb */
> > +	cs = gen8_emit_ggtt_write(cs,
> > +				  rq->fence.seqno,
> > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > +				  0);
> > +
> > +	/* User interrupt */
> > +	*cs++ = MI_USER_INTERRUPT;
> > +	*cs++ = MI_NOOP;
> > +
> > +	rq->tail = intel_ring_offset(rq, cs);
> > +
> > +	return cs;
> > +}
> > +
> >   static struct intel_context *
> >   g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> >   {
> > @@ -3807,6 +4075,19 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >   			drm_printf(p, "\t\tWQI Status: %u\n\n",
> >   				   READ_ONCE(desc->wq_status));
> > +			drm_printf(p, "\t\tNumber Children: %u\n\n",
> > +				   ce->guc_number_children);
> > +			if (ce->engine->emit_bb_start ==
> > +			    emit_bb_start_parent_no_preempt_mid_batch) {
> > +				u8 i;
> > +
> > +				drm_printf(p, "\t\tChildren Go: %u\n\n",
> > +					   get_children_go_value(ce));
> > +				for (i = 0; i < ce->guc_number_children; ++i)
> > +					drm_printf(p, "\t\tChildren Join: %u\n",
> > +						   get_children_join_value(ce, i));
> > +			}
> > +
> >   			for_each_child(ce, child)
> >   				guc_log_context(p, child);
> >   		}
> 


* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-09-28 22:33     ` Matthew Brost
@ 2021-09-28 23:33       ` John Harrison
  2021-09-29  0:22         ` Matthew Brost
  0 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-28 23:33 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On 9/28/2021 15:33, Matthew Brost wrote:
> On Tue, Sep 28, 2021 at 03:20:42PM -0700, John Harrison wrote:
>> On 8/20/2021 15:44, Matthew Brost wrote:
>>> For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
>>> mid BB. To safely enable preemption at the BB boundary, a handshake
>>> between the parent and child is needed. This is implemented via custom
>>> emit_bb_start & emit_fini_breadcrumb functions and enabled via by
>> via by -> by
>>
>>> default if a context is configured by set parallel extension.
>> I tend to agree with Tvrtko that this should probably be an opt in change.
>> Is there a flags word passed in when creating the context?
>>
> I don't disagree, but the uAPI in this series is where we landed. It has
> been acked by all the relevant parties in the RFC, ported to our
> internal tree, and the media UMD has been updated / posted. Concerns
> with the uAPI should've been raised in the RFC phase, not now. I really
> don't feel like changing this uAPI another time.
The counter argument is that once a UAPI has been merged, it cannot be 
changed. Ever. So it is worth taking the trouble to get it right first 
time.

The proposal isn't a major re-write of the interface. It is simply a 
request to set an extra flag when creating the context.
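
For instance (purely illustrative - the flag name is made up, and the
set_parallel extension's flags field is MBZ in the current series):

	#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1ull << 0)

	/* in guc_create_parallel(), gate the overrides on the opt-in bit */
	if (flags & I915_PARALLEL_NO_PREEMPT_MID_BATCH) {
		parent->engine->emit_bb_start =
			emit_bb_start_parent_no_preempt_mid_batch;
		parent->engine->emit_fini_breadcrumb =
			emit_fini_breadcrumb_parent_no_preempt_mid_batch;
		parent->engine->emit_fini_breadcrumb_dw =
			12 + 4 * parent->guc_number_children;
	}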


>
>> Also, it's not just a change in pre-emption behaviour but a change in
>> synchronisation too, right? Previously, if you had a whole bunch of back to
>> back submissions then one child could run ahead of another and/or the
>> parent. After this change, there is a forced regroup at the end of each
>> batch. So while one could end sooner/later than the others, they can't ever
>> get an entire batch (or more) ahead or behind. Or was that synchronisation
>> already in there through other means anyway?
>>
> Yes, the parent / children sync at the end of each batch - this is the
> only way to safely insert preemption points. Without this the GuC could
> attempt a preemption and hang the batches.
To be clear, I'm not saying that this is wrong. I'm just saying that 
this appears to be new behaviour with this patch but it is not 
explicitly called out in the description of the patch.


>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
>>>    drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 283 +++++++++++++++++-
>>>    4 files changed, 287 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
>>> index 5615be32879c..2de62649e275 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>>> @@ -561,7 +561,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
>>>    	GEM_BUG_ON(intel_context_is_child(child));
>>>    	GEM_BUG_ON(intel_context_is_parent(child));
>>> -	parent->guc_number_children++;
>>> +	child->guc_child_index = parent->guc_number_children++;
>>>    	list_add_tail(&child->guc_child_link,
>>>    		      &parent->guc_child_list);
>>>    	child->parent = parent;
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> index 713d85b0b364..727f91e7f7c2 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>> @@ -246,6 +246,9 @@ struct intel_context {
>>>    		/** @guc_number_children: number of children if parent */
>>>    		u8 guc_number_children;
>>> +		/** @guc_child_index: index into guc_child_list if child */
>>> +		u8 guc_child_index;
>>> +
>>>    		/**
>>>    		 * @parent_page: page in context used by parent for work queue,
>>>    		 * work queue descriptor
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> index 6cd26dc060d1..9f61cfa5566a 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> @@ -188,7 +188,7 @@ struct guc_process_desc {
>>>    	u32 wq_status;
>>>    	u32 engine_presence;
>>>    	u32 priority;
>>> -	u32 reserved[30];
>>> +	u32 reserved[36];
>> What is this extra space for? All the extra storage is grabbed from after
>> the end of this structure, isn't it?
>>
> This is the size of the process descriptor in the GuC spec. Even though
> this is unused space, we really don't want the child go / join memory
> using anything within the process descriptor.
Okay. So it's more that the code was previously broken and we just 
hadn't hit a problem because of it? Again, worth adding a comment in the 
description to call it out as a bug fix.

>
>>>    } __packed;
>>>    #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 91330525330d..1a18f99bf12a 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -11,6 +11,7 @@
>>>    #include "gt/intel_context.h"
>>>    #include "gt/intel_engine_pm.h"
>>>    #include "gt/intel_engine_heartbeat.h"
>>> +#include "gt/intel_gpu_commands.h"
>>>    #include "gt/intel_gt.h"
>>>    #include "gt/intel_gt_irq.h"
>>>    #include "gt/intel_gt_pm.h"
>>> @@ -366,10 +367,14 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
>>>    /*
>>>     * When using multi-lrc submission an extra page in the context state is
>>> - * reserved for the process descriptor and work queue.
>>> + * reserved for the process descriptor, work queue, and preempt BB boundary
>>> + * handshake between the parent + children contexts.
>>>     *
>>>     * The layout of this page is below:
>>>     * 0						guc_process_desc
>>> + * + sizeof(struct guc_process_desc)		child go
>>> + * + CACHELINE_BYTES				child join ...
>>> + * + CACHELINE_BYTES ...
>> Would be better written as '[num_children]' instead of '...' to make it
>> clear it is a per child array.
>>
> I think this description is pretty clear.
Evidently not because it confused me for a moment.

>
>> Also, maybe create a struct for this to get rid of the magic '+1's and
>> 'BYTES / sizeof' constructs in the functions below.
>>
> Let me see if I can create a struct that describes the layout.
That would definitely make the code a lot clearer.

>
>>>     * ...						unused
>>>     * PAGE_SIZE / 2				work queue start
>>>     * ...						work queue
>>> @@ -1785,6 +1790,30 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
>>>    	return __guc_action_deregister_context(guc, guc_id, loop);
>>>    }
>>> +static inline void clear_children_join_go_memory(struct intel_context *ce)
>>> +{
>>> +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
>>> +	u8 i;
>>> +
>>> +	for (i = 0; i < ce->guc_number_children + 1; ++i)
>>> +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
>>> +}
>>> +
>>> +static inline u32 get_children_go_value(struct intel_context *ce)
>>> +{
>>> +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
>>> +
>>> +	return mem[0];
>>> +}
>>> +
>>> +static inline u32 get_children_join_value(struct intel_context *ce,
>>> +					  u8 child_index)
>>> +{
>>> +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
>>> +
>>> +	return mem[(child_index + 1) * (CACHELINE_BYTES / sizeof(u32))];
>>> +}
>>> +
>>>    static void guc_context_policy_init(struct intel_engine_cs *engine,
>>>    				    struct guc_lrc_desc *desc)
>>>    {
>>> @@ -1867,6 +1896,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>>>    			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
>>>    			guc_context_policy_init(engine, desc);
>>>    		}
>>> +
>>> +		clear_children_join_go_memory(ce);
>>>    	}
>>>    	/*
>>> @@ -2943,6 +2974,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
>>>    	.get_sibling = guc_virtual_get_sibling,
>>>    };
>>> +/*
>>> + * The below override of the breadcrumbs is enabled when the user configures a
>>> + * context for parallel submission (multi-lrc, parent-child).
>>> + *
>>> + * The overridden breadcrumbs implements an algorithm which allows the GuC to
>>> + * safely preempt all the hw contexts configured for parallel submission
>>> + * between each BB. The contract between the i915 and GuC is if the parent
>>> + * context can be preempted, all the children can be preempted, and the GuC will
>>> + * always try to preempt the parent before the children. A handshake between the
>>> + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
>>> + * creating a window to preempt between each set of BBs.
>>> + */
>>> +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
>>> +						     u64 offset, u32 len,
>>> +						     const unsigned int flags);
>>> +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
>>> +						    u64 offset, u32 len,
>>> +						    const unsigned int flags);
>>> +static u32 *
>>> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
>>> +						 u32 *cs);
>>> +static u32 *
>>> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
>>> +						u32 *cs);
>>> +
>>>    static struct intel_context *
>>>    guc_create_parallel(struct intel_engine_cs **engines,
>>>    		    unsigned int num_siblings,
>>> @@ -2978,6 +3034,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
>>>    		}
>>>    	}
>>> +	parent->engine->emit_bb_start =
>>> +		emit_bb_start_parent_no_preempt_mid_batch;
>>> +	parent->engine->emit_fini_breadcrumb =
>>> +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
>>> +	parent->engine->emit_fini_breadcrumb_dw =
>>> +		12 + 4 * parent->guc_number_children;
>>> +	for_each_child(parent, ce) {
>>> +		ce->engine->emit_bb_start =
>>> +			emit_bb_start_child_no_preempt_mid_batch;
>>> +		ce->engine->emit_fini_breadcrumb =
>>> +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
>>> +		ce->engine->emit_fini_breadcrumb_dw = 16;
>>> +	}
>>> +
>>>    	kfree(siblings);
>>>    	return parent;
>>> @@ -3362,6 +3432,204 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
>>>    	guc->submission_selected = __guc_submission_selected(guc);
>>>    }
>>> +static inline u32 get_children_go_addr(struct intel_context *ce)
>>> +{
>>> +	GEM_BUG_ON(!intel_context_is_parent(ce));
>>> +
>>> +	return i915_ggtt_offset(ce->state) +
>>> +		__get_process_desc_offset(ce) +
>>> +		sizeof(struct guc_process_desc);
>>> +}
>>> +
>>> +static inline u32 get_children_join_addr(struct intel_context *ce,
>>> +					 u8 child_index)
>>> +{
>>> +	GEM_BUG_ON(!intel_context_is_parent(ce));
>>> +
>>> +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
>>> +}
>>> +
>>> +#define PARENT_GO_BB			1
>>> +#define PARENT_GO_FINI_BREADCRUMB	0
>>> +#define CHILD_GO_BB			1
>>> +#define CHILD_GO_FINI_BREADCRUMB	0
>>> +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
>>> +						     u64 offset, u32 len,
>>> +						     const unsigned int flags)
>>> +{
>>> +	struct intel_context *ce = rq->context;
>>> +	u32 *cs;
>>> +	u8 i;
>>> +
>>> +	GEM_BUG_ON(!intel_context_is_parent(ce));
>>> +
>>> +	cs = intel_ring_begin(rq, 10 + 4 * ce->guc_number_children);
>>> +	if (IS_ERR(cs))
>>> +		return PTR_ERR(cs);
>>> +
>>> +	/* Wait on chidlren */
>> chidlren -> children
>>
> Yep.
>   
>>> +	for (i = 0; i < ce->guc_number_children; ++i) {
>>> +		*cs++ = (MI_SEMAPHORE_WAIT |
>>> +			 MI_SEMAPHORE_GLOBAL_GTT |
>>> +			 MI_SEMAPHORE_POLL |
>>> +			 MI_SEMAPHORE_SAD_EQ_SDD);
>>> +		*cs++ = PARENT_GO_BB;
>>> +		*cs++ = get_children_join_addr(ce, i);
>>> +		*cs++ = 0;
>>> +	}
>>> +
>>> +	/* Turn off preemption */
>>> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
>>> +	*cs++ = MI_NOOP;
>>> +
>>> +	/* Tell children go */
>>> +	cs = gen8_emit_ggtt_write(cs,
>>> +				  CHILD_GO_BB,
>>> +				  get_children_go_addr(ce),
>>> +				  0);
>>> +
>>> +	/* Jump to batch */
>>> +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
>>> +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
>>> +	*cs++ = lower_32_bits(offset);
>>> +	*cs++ = upper_32_bits(offset);
>>> +	*cs++ = MI_NOOP;
>>> +
>>> +	intel_ring_advance(rq, cs);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
>>> +						    u64 offset, u32 len,
>>> +						    const unsigned int flags)
>>> +{
>>> +	struct intel_context *ce = rq->context;
>>> +	u32 *cs;
>>> +
>>> +	GEM_BUG_ON(!intel_context_is_child(ce));
>>> +
>>> +	cs = intel_ring_begin(rq, 12);
>>> +	if (IS_ERR(cs))
>>> +		return PTR_ERR(cs);
>>> +
>>> +	/* Signal parent */
>>> +	cs = gen8_emit_ggtt_write(cs,
>>> +				  PARENT_GO_BB,
>>> +				  get_children_join_addr(ce->parent,
>>> +							 ce->guc_child_index),
>>> +				  0);
>>> +
>>> +	/* Wait parent on for go */
>> parent on -> on parent
>>
> Yep.
>   
>>> +	*cs++ = (MI_SEMAPHORE_WAIT |
>>> +		 MI_SEMAPHORE_GLOBAL_GTT |
>>> +		 MI_SEMAPHORE_POLL |
>>> +		 MI_SEMAPHORE_SAD_EQ_SDD);
>>> +	*cs++ = CHILD_GO_BB;
>>> +	*cs++ = get_children_go_addr(ce->parent);
>>> +	*cs++ = 0;
>>> +
>>> +	/* Turn off preemption */
>>> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
>>> +
>>> +	/* Jump to batch */
>>> +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
>>> +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
>>> +	*cs++ = lower_32_bits(offset);
>>> +	*cs++ = upper_32_bits(offset);
>>> +
>>> +	intel_ring_advance(rq, cs);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static u32 *
>>> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
>>> +						 u32 *cs)
>>> +{
>>> +	struct intel_context *ce = rq->context;
>>> +	u8 i;
>>> +
>>> +	GEM_BUG_ON(!intel_context_is_parent(ce));
>>> +
>>> +	/* Wait on children */
>>> +	for (i = 0; i < ce->guc_number_children; ++i) {
>>> +		*cs++ = (MI_SEMAPHORE_WAIT |
>>> +			 MI_SEMAPHORE_GLOBAL_GTT |
>>> +			 MI_SEMAPHORE_POLL |
>>> +			 MI_SEMAPHORE_SAD_EQ_SDD);
>>> +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
>>> +		*cs++ = get_children_join_addr(ce, i);
>>> +		*cs++ = 0;
>>> +	}
>>> +
>>> +	/* Turn on preemption */
>>> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
>>> +	*cs++ = MI_NOOP;
>>> +
>>> +	/* Tell children go */
>>> +	cs = gen8_emit_ggtt_write(cs,
>>> +				  CHILD_GO_FINI_BREADCRUMB,
>>> +				  get_children_go_addr(ce),
>>> +				  0);
>>> +
>>> +	/* Emit fini breadcrumb */
>>> +	cs = gen8_emit_ggtt_write(cs,
>>> +				  rq->fence.seqno,
>>> +				  i915_request_active_timeline(rq)->hwsp_offset,
>>> +				  0);
>>> +
>>> +	/* User interrupt */
>>> +	*cs++ = MI_USER_INTERRUPT;
>>> +	*cs++ = MI_NOOP;
>>> +
>>> +	rq->tail = intel_ring_offset(rq, cs);
>>> +
>>> +	return cs;
>>> +}
>>> +
>>> +static u32 *
>>> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
>>> +{
>>> +	struct intel_context *ce = rq->context;
>>> +
>>> +	GEM_BUG_ON(!intel_context_is_child(ce));
>>> +
>>> +	/* Turn on preemption */
>>> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
>>> +	*cs++ = MI_NOOP;
>>> +
>>> +	/* Signal parent */
>>> +	cs = gen8_emit_ggtt_write(cs,
>>> +				  PARENT_GO_FINI_BREADCRUMB,
>>> +				  get_children_join_addr(ce->parent,
>>> +							 ce->guc_child_index),
>>> +				  0);
>>> +
>> This is backwards compared to the parent?
>>
>> Parent: wait on children then enable pre-emption
>> Child: enable pre-emption then signal parent
>>
>> Makes for a window where the parent is waiting in atomic context for a
>> signal from a child that has been pre-empted and might not get to run for
>> some time?
>>
> No, this is correct. The rule is that if the parent can be preempted,
> all the children can be preempted, so we can't enable preemption on the
> parent until all the children have preemption enabled; hence the parent
> waits for all the children to join before enabling its own preemption.
>
> Matt
But,

The write to PARENT_GO_FINI can't fail or stall, right? So if it happens 
before the ARB_ON then the child is guaranteed to execute the ARB_ON 
once it has signalled the parent. Indeed, by the time the parent context 
gets to see the updated memory value, the child is practically certain to 
have passed the ARB_ON. So, by the time the parent becomes pre-emptible, 
the children will all be pre-emptible. Even if the parent is superfast, 
the children are guaranteed to become pre-emptible immediately - 
certainly before any fail-to-preempt timeout could occur.

Whereas, with the current ordering, it is possible for the child to be 
preempted before it has issued the signal to the parent. So now you have 
a non-preemptible parent hogging the hardware, waiting for a signal that 
isn't going to come for an entire execution quantum. Indeed, it is 
actually quite likely the child would be preempted before it can signal 
the parent because any pre-emption request that was issued at any time 
during the child's execution will take effect immediately on the ARB_ON 
instruction.
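
For illustration, a sketch of the reordering being suggested in
emit_fini_breadcrumb_child_no_preempt_mid_batch (reusing the helpers
from the patch, not a tested change):

	/* Signal parent first - the GGTT write cannot stall */
	cs = gen8_emit_ggtt_write(cs,
				  PARENT_GO_FINI_BREADCRUMB,
				  get_children_join_addr(ce->parent,
							 ce->guc_child_index),
				  0);

	/* Only then turn on preemption, after the signal is in flight */
	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
	*cs++ = MI_NOOP;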

John.


>   
>> John.
>>
>>
>>> +	/* Wait parent on for go */
>>> +	*cs++ = (MI_SEMAPHORE_WAIT |
>>> +		 MI_SEMAPHORE_GLOBAL_GTT |
>>> +		 MI_SEMAPHORE_POLL |
>>> +		 MI_SEMAPHORE_SAD_EQ_SDD);
>>> +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
>>> +	*cs++ = get_children_go_addr(ce->parent);
>>> +	*cs++ = 0;
>>> +
>>> +	/* Emit fini breadcrumb */
>>> +	cs = gen8_emit_ggtt_write(cs,
>>> +				  rq->fence.seqno,
>>> +				  i915_request_active_timeline(rq)->hwsp_offset,
>>> +				  0);
>>> +
>>> +	/* User interrupt */
>>> +	*cs++ = MI_USER_INTERRUPT;
>>> +	*cs++ = MI_NOOP;
>>> +
>>> +	rq->tail = intel_ring_offset(rq, cs);
>>> +
>>> +	return cs;
>>> +}
>>> +
>>>    static struct intel_context *
>>>    g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
>>>    {
>>> @@ -3807,6 +4075,19 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>>>    			drm_printf(p, "\t\tWQI Status: %u\n\n",
>>>    				   READ_ONCE(desc->wq_status));
>>> +			drm_printf(p, "\t\tNumber Children: %u\n\n",
>>> +				   ce->guc_number_children);
>>> +			if (ce->engine->emit_bb_start ==
>>> +			    emit_bb_start_parent_no_preempt_mid_batch) {
>>> +				u8 i;
>>> +
>>> +				drm_printf(p, "\t\tChildren Go: %u\n\n",
>>> +					   get_children_go_value(ce));
>>> +				for (i = 0; i < ce->guc_number_children; ++i)
>>> +					drm_printf(p, "\t\tChildren Join: %u\n",
>>> +						   get_children_join_value(ce, i));
>>> +			}
>>> +
>>>    			for_each_child(ce, child)
>>>    				guc_log_context(p, child);
>>>    		}


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-09-28 23:33       ` John Harrison
@ 2021-09-29  0:22         ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-29  0:22 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Tue, Sep 28, 2021 at 04:33:24PM -0700, John Harrison wrote:
> On 9/28/2021 15:33, Matthew Brost wrote:
> > On Tue, Sep 28, 2021 at 03:20:42PM -0700, John Harrison wrote:
> > > On 8/20/2021 15:44, Matthew Brost wrote:
> > > > For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
> > > > mid BB. To safely enable preemption at the BB boundary, a handshake
> > > > between the parent and child is needed. This is implemented via custom
> > > > emit_bb_start & emit_fini_breadcrumb functions and enabled via by
> > > via by -> by
> > > 
> > > > default if a context is configured by set parallel extension.
> > > I tend to agree with Tvrtko that this should probably be an opt in change.
> > > Is there a flags word passed in when creating the context?
> > > 
> > I don't disagree but the uAPI in this series is where we landed. It has
> > been acked all by the relevant parties in the RFC, ported to our
> > internal tree, and the media UMD has been updated / posted. Concerns
> > with the uAPI should've been raised in the RFC phase, not now. I really
> > don't feel like changing this uAPI another time.
> The counter argument is that once a UAPI has been merged, it cannot be
> changed. Ever. So it is worth taking the trouble to get it right first time.
> 
> The proposal isn't a major re-write of the interface. It is simply a request
> to set an extra flag when creating the context.
> 

We are basically just talking about the polarity of a flag at this
point. Either by default you can't be preempted mid-batch (the current
GPU / UMD requirement) or by default you can be preempted mid-batch
(which no current GPU / UMD can handle yet, so we'd add a flag that
everyone opts into). I think Daniel's opinion was to just default to
what the current GPU / UMD wants and add flags to the interface if
future requirements arise. I understand both points of view on flag vs.
no flag, but in the end it doesn't really matter: either way the
interface works now and will in the future too.
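
Purely to illustrate the polarity question (hypothetical flag names, not
part of any posted uAPI):

	/*
	 * Option A (what the series does): default is no mid-batch
	 * preemption; a future GPU / UMD that could tolerate it would
	 * set something like this to opt in to mid-batch preemption.
	 */
	#define I915_PARALLEL_ALLOW_MID_BATCH_PREEMPT	(1ull << 0)

	/*
	 * Option B: default is mid-batch preemption allowed; every
	 * current GPU / UMD would have to set something like this.
	 */
	#define I915_PARALLEL_NO_MID_BATCH_PREEMPT	(1ull << 0)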

> 
> > 
> > > Also, it's not just a change in pre-emption behaviour but a change in
> > > synchronisation too, right? Previously, if you had a whole bunch of back to
> > > back submissions then one child could run ahead of another and/or the
> > > parent. After this change, there is a forced regroup at the end of each
> > > batch. So while one could end sooner/later than the others, they can't ever
> > > get an entire batch (or more) ahead or behind. Or was that synchronisation
> > > already in there through other means anyway?
> > > 
> > Yes, each parent / child sync at the end of each batch - this is the only
> > way to safely insert preemption points. Without this the GuC could attempt
> > a preemption and hang the batches.
> To be clear, I'm not saying that this is wrong. I'm just saying that this
> appears to be new behaviour with this patch but it is not explicitly called
> out in the description of the patch.
> 

Will add some comments explaining this behavior (unless I already have
them).

> 
> > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
> > > >    drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
> > > >    drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
> > > >    .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 283 +++++++++++++++++-
> > > >    4 files changed, 287 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > index 5615be32879c..2de62649e275 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > @@ -561,7 +561,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> > > >    	GEM_BUG_ON(intel_context_is_child(child));
> > > >    	GEM_BUG_ON(intel_context_is_parent(child));
> > > > -	parent->guc_number_children++;
> > > > +	child->guc_child_index = parent->guc_number_children++;
> > > >    	list_add_tail(&child->guc_child_link,
> > > >    		      &parent->guc_child_list);
> > > >    	child->parent = parent;
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > index 713d85b0b364..727f91e7f7c2 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > @@ -246,6 +246,9 @@ struct intel_context {
> > > >    		/** @guc_number_children: number of children if parent */
> > > >    		u8 guc_number_children;
> > > > +		/** @guc_child_index: index into guc_child_list if child */
> > > > +		u8 guc_child_index;
> > > > +
> > > >    		/**
> > > >    		 * @parent_page: page in context used by parent for work queue,
> > > >    		 * work queue descriptor
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > index 6cd26dc060d1..9f61cfa5566a 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > @@ -188,7 +188,7 @@ struct guc_process_desc {
> > > >    	u32 wq_status;
> > > >    	u32 engine_presence;
> > > >    	u32 priority;
> > > > -	u32 reserved[30];
> > > > +	u32 reserved[36];
> > > What is this extra space for? All the extra storage is grabbed from after
> > > the end of this structure, isn't it?
> > > 
> > This is the size of the process descriptor in the GuC spec. Even though this
> > is unused space we really don't want the child go / join memory using
> > anything within the process descriptor.
> Okay. So it's more that the code was previously broken and we just hadn't
> hit a problem because of it? Again, worth adding a comment in the
> description to call it out as a bug fix.
>

Sure.
 
> > 
> > > >    } __packed;
> > > >    #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 91330525330d..1a18f99bf12a 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -11,6 +11,7 @@
> > > >    #include "gt/intel_context.h"
> > > >    #include "gt/intel_engine_pm.h"
> > > >    #include "gt/intel_engine_heartbeat.h"
> > > > +#include "gt/intel_gpu_commands.h"
> > > >    #include "gt/intel_gt.h"
> > > >    #include "gt/intel_gt_irq.h"
> > > >    #include "gt/intel_gt_pm.h"
> > > > @@ -366,10 +367,14 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
> > > >    /*
> > > >     * When using multi-lrc submission an extra page in the context state is
> > > > - * reserved for the process descriptor and work queue.
> > > > + * reserved for the process descriptor, work queue, and preempt BB boundary
> > > > + * handshake between the parent + children contexts.
> > > >     *
> > > >     * The layout of this page is below:
> > > >     * 0						guc_process_desc
> > > > + * + sizeof(struct guc_process_desc)		child go
> > > > + * + CACHELINE_BYTES				child join ...
> > > > + * + CACHELINE_BYTES ...
> > > Would be better written as '[num_children]' instead of '...' to make it
> > > clear it is a per child array.
> > > 
> > I think this description is pretty clear.
> Evidently not because it confused me for a moment.
> 

Ok, let me see if I can make this a bit more clear.

> > 
> > > Also, maybe create a struct for this to get rid of the magic '+1's and
> > > 'BYTES / sizeof' constructs in the functions below.
> > > 
> > Let me see if I can create a struct that describes the layout.
> That would definitely make the code a lot clearer.
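
Something along these lines, perhaps (a hypothetical sketch - the names
are invented here, not necessarily what will land):

	struct sync_semaphore {
		u32 semaphore;
		u8 unused[CACHELINE_BYTES - sizeof(u32)];
	};

	struct parent_page_layout {
		struct guc_process_desc pdesc;
		struct sync_semaphore go;	/* children go */
		struct sync_semaphore join[MAX_ENGINE_INSTANCE + 1];
		/* remainder unused, work queue starts at PAGE_SIZE / 2 */
	};

That would also let clear_children_join_go_memory() and friends lose the
CACHELINE_BYTES / sizeof(u32) arithmetic.
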
> 
> > 
> > > >     * ...						unused
> > > >     * PAGE_SIZE / 2				work queue start
> > > >     * ...						work queue
> > > > @@ -1785,6 +1790,30 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> > > >    	return __guc_action_deregister_context(guc, guc_id, loop);
> > > >    }
> > > > +static inline void clear_children_join_go_memory(struct intel_context *ce)
> > > > +{
> > > > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > > > +	u8 i;
> > > > +
> > > > +	for (i = 0; i < ce->guc_number_children + 1; ++i)
> > > > +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
> > > > +}
> > > > +
> > > > +static inline u32 get_children_go_value(struct intel_context *ce)
> > > > +{
> > > > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > > > +
> > > > +	return mem[0];
> > > > +}
> > > > +
> > > > +static inline u32 get_children_join_value(struct intel_context *ce,
> > > > +					  u8 child_index)
> > > > +{
> > > > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > > > +
> > > > +	return mem[(child_index + 1) * (CACHELINE_BYTES / sizeof(u32))];
> > > > +}
> > > > +
> > > >    static void guc_context_policy_init(struct intel_engine_cs *engine,
> > > >    				    struct guc_lrc_desc *desc)
> > > >    {
> > > > @@ -1867,6 +1896,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > >    			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > > >    			guc_context_policy_init(engine, desc);
> > > >    		}
> > > > +
> > > > +		clear_children_join_go_memory(ce);
> > > >    	}
> > > >    	/*
> > > > @@ -2943,6 +2974,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
> > > >    	.get_sibling = guc_virtual_get_sibling,
> > > >    };
> > > > +/*
> > > > + * The below override of the breadcrumbs is enabled when the user configures a
> > > > + * context for parallel submission (multi-lrc, parent-child).
> > > > + *
> > > > + * The overridden breadcrumbs implements an algorithm which allows the GuC to
> > > > + * safely preempt all the hw contexts configured for parallel submission
> > > > + * between each BB. The contract between the i915 and GuC is if the parent
> > > > + * context can be preempted, all the children can be preempted, and the GuC will
> > > > + * always try to preempt the parent before the children. A handshake between the
> > > > + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
> > > > + * creating a window to preempt between each set of BBs.
> > > > + */
> > > > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						     u64 offset, u32 len,
> > > > +						     const unsigned int flags);
> > > > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						    u64 offset, u32 len,
> > > > +						    const unsigned int flags);
> > > > +static u32 *
> > > > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						 u32 *cs);
> > > > +static u32 *
> > > > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						u32 *cs);
> > > > +
> > > >    static struct intel_context *
> > > >    guc_create_parallel(struct intel_engine_cs **engines,
> > > >    		    unsigned int num_siblings,
> > > > @@ -2978,6 +3034,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
> > > >    		}
> > > >    	}
> > > > +	parent->engine->emit_bb_start =
> > > > +		emit_bb_start_parent_no_preempt_mid_batch;
> > > > +	parent->engine->emit_fini_breadcrumb =
> > > > +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
> > > > +	parent->engine->emit_fini_breadcrumb_dw =
> > > > +		12 + 4 * parent->guc_number_children;
> > > > +	for_each_child(parent, ce) {
> > > > +		ce->engine->emit_bb_start =
> > > > +			emit_bb_start_child_no_preempt_mid_batch;
> > > > +		ce->engine->emit_fini_breadcrumb =
> > > > +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
> > > > +		ce->engine->emit_fini_breadcrumb_dw = 16;
> > > > +	}
> > > > +
> > > >    	kfree(siblings);
> > > >    	return parent;
> > > > @@ -3362,6 +3432,204 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
> > > >    	guc->submission_selected = __guc_submission_selected(guc);
> > > >    }
> > > > +static inline u32 get_children_go_addr(struct intel_context *ce)
> > > > +{
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	return i915_ggtt_offset(ce->state) +
> > > > +		__get_process_desc_offset(ce) +
> > > > +		sizeof(struct guc_process_desc);
> > > > +}
> > > > +
> > > > +static inline u32 get_children_join_addr(struct intel_context *ce,
> > > > +					 u8 child_index)
> > > > +{
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
> > > > +}
> > > > +
> > > > +#define PARENT_GO_BB			1
> > > > +#define PARENT_GO_FINI_BREADCRUMB	0
> > > > +#define CHILD_GO_BB			1
> > > > +#define CHILD_GO_FINI_BREADCRUMB	0
> > > > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						     u64 offset, u32 len,
> > > > +						     const unsigned int flags)
> > > > +{
> > > > +	struct intel_context *ce = rq->context;
> > > > +	u32 *cs;
> > > > +	u8 i;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	cs = intel_ring_begin(rq, 10 + 4 * ce->guc_number_children);
> > > > +	if (IS_ERR(cs))
> > > > +		return PTR_ERR(cs);
> > > > +
> > > > +	/* Wait on chidlren */
> > > chidlren -> children
> > > 
> > Yep.
> > > > +	for (i = 0; i < ce->guc_number_children; ++i) {
> > > > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > > > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > > > +			 MI_SEMAPHORE_POLL |
> > > > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > > > +		*cs++ = PARENT_GO_BB;
> > > > +		*cs++ = get_children_join_addr(ce, i);
> > > > +		*cs++ = 0;
> > > > +	}
> > > > +
> > > > +	/* Turn off preemption */
> > > > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	/* Tell children go */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  CHILD_GO_BB,
> > > > +				  get_children_go_addr(ce),
> > > > +				  0);
> > > > +
> > > > +	/* Jump to batch */
> > > > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > > > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > > > +	*cs++ = lower_32_bits(offset);
> > > > +	*cs++ = upper_32_bits(offset);
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	intel_ring_advance(rq, cs);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						    u64 offset, u32 len,
> > > > +						    const unsigned int flags)
> > > > +{
> > > > +	struct intel_context *ce = rq->context;
> > > > +	u32 *cs;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > > > +
> > > > +	cs = intel_ring_begin(rq, 12);
> > > > +	if (IS_ERR(cs))
> > > > +		return PTR_ERR(cs);
> > > > +
> > > > +	/* Signal parent */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  PARENT_GO_BB,
> > > > +				  get_children_join_addr(ce->parent,
> > > > +							 ce->guc_child_index),
> > > > +				  0);
> > > > +
> > > > +	/* Wait parent on for go */
> > > parent on -> on parent
> > > 
> > Yep.
> > > > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > > > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > > > +		 MI_SEMAPHORE_POLL |
> > > > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > > > +	*cs++ = CHILD_GO_BB;
> > > > +	*cs++ = get_children_go_addr(ce->parent);
> > > > +	*cs++ = 0;
> > > > +
> > > > +	/* Turn off preemption */
> > > > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > > > +
> > > > +	/* Jump to batch */
> > > > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > > > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > > > +	*cs++ = lower_32_bits(offset);
> > > > +	*cs++ = upper_32_bits(offset);
> > > > +
> > > > +	intel_ring_advance(rq, cs);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static u32 *
> > > > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						 u32 *cs)
> > > > +{
> > > > +	struct intel_context *ce = rq->context;
> > > > +	u8 i;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	/* Wait on children */
> > > > +	for (i = 0; i < ce->guc_number_children; ++i) {
> > > > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > > > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > > > +			 MI_SEMAPHORE_POLL |
> > > > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > > > +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
> > > > +		*cs++ = get_children_join_addr(ce, i);
> > > > +		*cs++ = 0;
> > > > +	}
> > > > +
> > > > +	/* Turn on preemption */
> > > > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	/* Tell children go */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  CHILD_GO_FINI_BREADCRUMB,
> > > > +				  get_children_go_addr(ce),
> > > > +				  0);
> > > > +
> > > > +	/* Emit fini breadcrumb */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  rq->fence.seqno,
> > > > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > > > +				  0);
> > > > +
> > > > +	/* User interrupt */
> > > > +	*cs++ = MI_USER_INTERRUPT;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	rq->tail = intel_ring_offset(rq, cs);
> > > > +
> > > > +	return cs;
> > > > +}
> > > > +
> > > > +static u32 *
> > > > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> > > > +{
> > > > +	struct intel_context *ce = rq->context;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > > > +
> > > > +	/* Turn on preemption */
> > > > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	/* Signal parent */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  PARENT_GO_FINI_BREADCRUMB,
> > > > +				  get_children_join_addr(ce->parent,
> > > > +							 ce->guc_child_index),
> > > > +				  0);
> > > > +
> > > This is backwards compared to the parent?
> > > 
> > > Parent: wait on children then enable pre-emption
> > > Child: enable pre-emption then signal parent
> > > 
> > > Makes for a window where the parent is waiting in atomic context for a
> > > signal from a child that has been pre-empted and might not get to run for
> > > some time?
> > > 
> > No, this is correct. The rule is if the parent can be preempted all the
> > children can be preempted, thus we can't enable preemption on the parent
> > until all the children have preemption enabled, thus the parent waits
> > for all the children to join before enabling its preemption.
> > 
> > Matt
> But,
> 
> The write to PARENT_GO_FINI can't fail or stall, right? So if it happens
> before the ARB_ON then the child is guaranteed to execute the ARB_ON once it
> has signalled the parent. Indeed, by the time the parent context gets to see
> the update memory value, the child is practically certain to have passed the
> ARB_ON. So, by the time the parent becomes pre-emptible, the children will
> all be pre-emptible. Even if the parent is superfast, the children are
> guaranteed to become pre-emptible immediately - certainly before any
> fail-to-preempt timeout could occur.
>
> Whereas, with the current ordering, it is possible for the child to be
> preempted before it has issued the signal to the parent. So now you have a
> non-preemptible parent hogging the hardware, waiting for a signal that isn't

To be clear, the parent is always preempted first by the GuC. The parent
can't be running when a preempt of a child is attempted.

> going to come for an entire execution quantum. Indeed, it is actually quite
> likely the child would be preempted before it can signal the parent because
> any pre-emption request that was issued at any time during the child's
> execution will take effect immediately on the ARB_ON instruction.
> 

Looking at the code, I do think I have a bug though.

I think I'm missing an MI_ARB_CHECK in the parent after turning on
preemption, before releasing the children, right?

This covers the case where the GuC issues a preemption to the parent
while it is waiting on the children: all the children join, the parent
turns on preemption and is preempted at the added MI_ARB_CHECK
instruction, and all the children can then be preempted while waiting
on the parent go semaphore. Does that sound correct?
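
Something like this in emit_fini_breadcrumb_parent_no_preempt_mid_batch
(sketch of the fix being described, untested):

	/* Turn on preemption */
	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;

	/*
	 * Explicit preemption point before releasing the children; this
	 * takes the slot of the old MI_NOOP so emit_fini_breadcrumb_dw
	 * stays the same.
	 */
	*cs++ = MI_ARB_CHECK;

	/* Tell children go */
	cs = gen8_emit_ggtt_write(cs,
				  CHILD_GO_FINI_BREADCRUMB,
				  get_children_go_addr(ce),
				  0);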

Matt

> John.
> 
> 
> > > John.
> > > 
> > > 
> > > > +	/* Wait parent on for go */
> > > > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > > > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > > > +		 MI_SEMAPHORE_POLL |
> > > > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > > > +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
> > > > +	*cs++ = get_children_go_addr(ce->parent);
> > > > +	*cs++ = 0;
> > > > +
> > > > +	/* Emit fini breadcrumb */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  rq->fence.seqno,
> > > > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > > > +				  0);
> > > > +
> > > > +	/* User interrupt */
> > > > +	*cs++ = MI_USER_INTERRUPT;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	rq->tail = intel_ring_offset(rq, cs);
> > > > +
> > > > +	return cs;
> > > > +}
> > > > +
> > > >    static struct intel_context *
> > > >    g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> > > >    {
> > > > @@ -3807,6 +4075,19 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > > >    			drm_printf(p, "\t\tWQI Status: %u\n\n",
> > > >    				   READ_ONCE(desc->wq_status));
> > > > +			drm_printf(p, "\t\tNumber Children: %u\n\n",
> > > > +				   ce->guc_number_children);
> > > > +			if (ce->engine->emit_bb_start ==
> > > > +			    emit_bb_start_parent_no_preempt_mid_batch) {
> > > > +				u8 i;
> > > > +
> > > > +				drm_printf(p, "\t\tChildren Go: %u\n\n",
> > > > +					   get_children_go_value(ce));
> > > > +				for (i = 0; i < ce->guc_number_children; ++i)
> > > > +					drm_printf(p, "\t\tChildren Join: %u\n",
> > > > +						   get_children_join_value(ce, i));
> > > > +			}
> > > > +
> > > >    			for_each_child(ce, child)
> > > >    				guc_log_context(p, child);
> > > >    		}
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 25/27] drm/i915/guc: Handle errors in multi-lrc requests
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2021-09-29 20:44   ` John Harrison
  2021-09-29 20:58     ` Matthew Brost
  -1 siblings, 1 reply; 145+ messages in thread
From: John Harrison @ 2021-09-29 20:44 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On 8/20/2021 15:44, Matthew Brost wrote:
> If an error occurs in the front end when multi-lrc requests are getting
> generated we need to skip these in the backend but we still need to
> emit the breadcrumbs seqno. An issues arrises because with multi-lrc
arrises -> arises

> breadcrumbs there is a handshake between the parent and children to make
> forwad progress. If all the requests are not present this handshake
forwad -> forward

> doesn't work. To work around this, if multi-lrc request has an error we
> skip the handshake but still emit the breadcrumbs seqno.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 61 ++++++++++++++++++-
>   1 file changed, 58 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 2ef38557b0f0..61e737fd1eee 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3546,8 +3546,8 @@ static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
>   }
>   
>   static u32 *
> -emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> -						 u32 *cs)
> +__emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						   u32 *cs)
>   {
>   	struct intel_context *ce = rq->context;
>   	u8 i;
> @@ -3575,6 +3575,41 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
>   				  get_children_go_addr(ce),
>   				  0);
>   
> +	return cs;
> +}
> +
> +/*
> + * If this is true, a submission of multi-lrc requests had an error and the
> + * requests need to be skipped. The front end (execbuf IOCTL) should've called
> + * i915_request_skip which squashes the BB but we still need to emit the fini
> + * breadcrumbs seqno write. At this point we don't know how many of the
> + * requests in the multi-lrc submission were generated so we can't do the
> + * handshake between the parent and children (e.g. if 4 requests should be
> + * generated but the 2nd hit an error only 1 would be seen by the GuC backend).
> + * Simply skip the handshake, but still emit the breadcrumb seqno, if an error
> + * has occurred on any of the requests in submission / relationship.
> + */
> +static inline bool skip_handshake(struct i915_request *rq)
> +{
> +	return test_bit(I915_FENCE_FLAG_SKIP_PARALLEL, &rq->fence.flags);
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> +						 u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	if (unlikely(skip_handshake(rq))) {
> +		memset(cs, 0, sizeof(u32) *
> +		       (ce->engine->emit_fini_breadcrumb_dw - 6));
> +		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
Why -6? There are 12 words about to be written. Indeed the value of 
emit_..._dw is '12 + 4*num_children'. This should only be skipping over 
the 4*children, right? As it stands, it will skip all but the last six 
words, then write an extra twelve words and thus overflow the 
reservation by six. Unless I am totally confused?

I assume there is some reason why the amount of data written must 
exactly match the space reserved? It's a while since I've looked at the 
ring buffer code!

Seems like it would be clearer to not split the semaphore writes out but 
have them right next to the skip code that is meant to replicate them 
but with no-ops.

> +	} else {
> +		cs = __emit_fini_breadcrumb_parent_no_preempt_mid_batch(rq, cs);
> +	}
> +
>   	/* Emit fini breadcrumb */
>   	cs = gen8_emit_ggtt_write(cs,
>   				  rq->fence.seqno,
> @@ -3591,7 +3626,8 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
>   }
>   
>   static u32 *
> -emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> +__emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> +						  u32 *cs)
>   {
>   	struct intel_context *ce = rq->context;
>   
> @@ -3617,6 +3653,25 @@ emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs
>   	*cs++ = get_children_go_addr(ce->parent);
>   	*cs++ = 0;
>   
> +	return cs;
> +}
> +
> +static u32 *
> +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> +						u32 *cs)
> +{
> +	struct intel_context *ce = rq->context;
> +
> +	GEM_BUG_ON(!intel_context_is_child(ce));
> +
> +	if (unlikely(skip_handshake(rq))) {
> +		memset(cs, 0, sizeof(u32) *
> +		       (ce->engine->emit_fini_breadcrumb_dw - 6));
> +		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
> +	} else {
> +		cs = __emit_fini_breadcrumb_child_no_preempt_mid_batch(rq, cs);
> +	}
Same points as above - why -6 not -12 and would be clearer to keep the 
no-ops and the writes adjacent.

John.

> +
>   	/* Emit fini breadcrumb */
>   	cs = gen8_emit_ggtt_write(cs,
>   				  rq->fence.seqno,


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 25/27] drm/i915/guc: Handle errors in multi-lrc requests
  2021-09-29 20:44   ` John Harrison
@ 2021-09-29 20:58     ` Matthew Brost
  0 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-29 20:58 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, daniel.vetter, tony.ye, zhengguo.xu

On Wed, Sep 29, 2021 at 01:44:10PM -0700, John Harrison wrote:
> On 8/20/2021 15:44, Matthew Brost wrote:
> > If an error occurs in the front end when multi-lrc requests are getting
> > generated we need to skip these in the backend but we still need to
> > emit the breadcrumbs seqno. An issues arrises because with multi-lrc
> arrises -> arises
> 

Yep.

> > breadcrumbs there is a handshake between the parent and children to make
> > forwad progress. If all the requests are not present this handshake
> forwad -> forward
> 

Yep.

> > doesn't work. To work around this, if multi-lrc request has an error we
> > skip the handshake but still emit the breadcrumbs seqno.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 61 ++++++++++++++++++-
> >   1 file changed, 58 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 2ef38557b0f0..61e737fd1eee 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -3546,8 +3546,8 @@ static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> >   }
> >   static u32 *
> > -emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > -						 u32 *cs)
> > +__emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						   u32 *cs)
> >   {
> >   	struct intel_context *ce = rq->context;
> >   	u8 i;
> > @@ -3575,6 +3575,41 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> >   				  get_children_go_addr(ce),
> >   				  0);
> > +	return cs;
> > +}
> > +
> > +/*
> > + * If this is true, a submission of multi-lrc requests had an error and the
> > + * requests need to be skipped. The front end (execbuf IOCTL) should've called
> > + * i915_request_skip which squashes the BB but we still need to emit the fini
> > + * breadcrumbs seqno write. At this point we don't know how many of the
> > + * requests in the multi-lrc submission were generated so we can't do the
> > + * handshake between the parent and children (e.g. if 4 requests should be
> > + * generated but the 2nd hit an error only 1 would be seen by the GuC backend).
> > + * Simply skip the handshake, but still emit the breadcrumb seqno, if an error
> > + * has occurred on any of the requests in submission / relationship.
> > + */
> > +static inline bool skip_handshake(struct i915_request *rq)
> > +{
> > +	return test_bit(I915_FENCE_FLAG_SKIP_PARALLEL, &rq->fence.flags);
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > +						 u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	if (unlikely(skip_handshake(rq))) {
> > +		memset(cs, 0, sizeof(u32) *
> > +		       (ce->engine->emit_fini_breadcrumb_dw - 6));
> > +		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
> Why -6? There are 12 words about to be written. Indeed the value of
> emit_..._dw is '12 + 4*num_children'. This should only be skipping over the
> 4*children, right? As it stands, it will skip all but the last six words,
> then write an extra twelve words and thus overflow the reservation by six.
> Unless I am totally confused?
> 

Let me decode the length:

'Wait on children' (in __emit_fini_breadcrumb_parent_no_preempt_mid_batch) = 4 * num_children
'Turn on preemption' (in __emit_fini_breadcrumb_parent_no_preempt_mid_batch) = 2
'Tell children go' (in __emit_fini_breadcrumb_parent_no_preempt_mid_batch) = 4
'Emit fini breadcrumb' (in emit_fini_breadcrumb_child_no_preempt_mid_batch) = 4
'User interrupt' (in emit_fini_breadcrumb_child_no_preempt_mid_batch) = 2

So for a total (emit_fini_breadcrumb_dw) we have '12 + 4 * num_children'

We want to skip everything in __emit_fini_breadcrumb_parent_no_preempt_mid_batch, so that is
'6 + 4 * num_children' or 'emit_fini_breadcrumb_dw - 6'
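
For example, with two children (numbers worked from the above):
emit_fini_breadcrumb_dw = 12 + 4 * 2 = 20, so the skip path NOPs the
first 20 - 6 = 14 dwords (8 for the waits, 2 for the ARB_ON, 4 for the
go write) and keeps the last 6 (4 for the seqno write, 2 for the user
interrupt).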

Make sense?

> I assume there is some reason why the amount of data written must exactly
> match the space reserved? It's a while since I've looked at the ring buffer
> code!
> 

I think it's because the ring space is reserved at request creation time
but the fini breadcrumbs are not written until submission time.

> Seems like it would be clearer to not split the semaphore writes out but
> have them right next to the skip code that is meant to replicate them but
> with no-ops.
>

I guess that works too. I personally like the way it is, but if you
insist I can change it.

> > +	} else {
> > +		cs = __emit_fini_breadcrumb_parent_no_preempt_mid_batch(rq, cs);
> > +	}
> > +
> >   	/* Emit fini breadcrumb */
> >   	cs = gen8_emit_ggtt_write(cs,
> >   				  rq->fence.seqno,
> > @@ -3591,7 +3626,8 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> >   }
> >   static u32 *
> > -emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> > +__emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						  u32 *cs)
> >   {
> >   	struct intel_context *ce = rq->context;
> > @@ -3617,6 +3653,25 @@ emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs
> >   	*cs++ = get_children_go_addr(ce->parent);
> >   	*cs++ = 0;
> > +	return cs;
> > +}
> > +
> > +static u32 *
> > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > +						u32 *cs)
> > +{
> > +	struct intel_context *ce = rq->context;
> > +
> > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > +
> > +	if (unlikely(skip_handshake(rq))) {
> > +		memset(cs, 0, sizeof(u32) *
> > +		       (ce->engine->emit_fini_breadcrumb_dw - 6));
> > +		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
> > +	} else {
> > +		cs = __emit_fini_breadcrumb_child_no_preempt_mid_batch(rq, cs);
> > +	}
> Same points as above - why -6 not -12 and would be clearer to keep the
> no-ops and the writes adjacent.
>

Same as above - we NOP out the length of
__emit_fini_breadcrumb_child_no_preempt_mid_batch and still want to
emit the breadcrumbs below.
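
(Concretely for the child: emit_fini_breadcrumb_dw = 16, i.e. 2 for the
ARB_ON, 4 for the signal, 4 for the wait, 4 for the seqno write, 2 for
the user interrupt - so the skip path NOPs the first 16 - 6 = 10 dwords
and keeps the last 6.)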

Matt

 
> John.
> 
> > +
> >   	/* Emit fini breadcrumb */
> >   	cs = gen8_emit_ggtt_write(cs,
> >   				  rq->fence.seqno,
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH 24/27] drm/i915: Multi-BB execbuf
  2021-08-20 22:44   ` [Intel-gfx] " Matthew Brost
                     ` (2 preceding siblings ...)
  (?)
@ 2021-09-30 22:16   ` Matthew Brost
  -1 siblings, 0 replies; 145+ messages in thread
From: Matthew Brost @ 2021-09-30 22:16 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, tony.ye, zhengguo.xu

On Fri, Aug 20, 2021 at 03:44:43PM -0700, Matthew Brost wrote:

Did a review offline with John Harrison, adding notes for what we found. 

> Allow multiple batch buffers to be submitted in a single execbuf IOCTL
> after a context has been configured with the 'set_parallel' extension.
> The number of batches is implicit based on the context's configuration.
> 
> This is implemented with a series of loops. First a loop is used to find
> all the batches, a loop to pin all the HW contexts, a loop to generate
> all the requests, a loop to submit all the requests, a loop to commit
> all the requests, and finally a loop to tie the requests to the VMAs
> they touch.

Clarify these steps a bit; also, tying requests to the VMAs is the
second-to-last step, with committing the requests being the last.
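
In outline, the corrected flow reads something like (a sketch only, not
the actual function bodies):

	/*
	 * Multi-BB execbuf, one loop per step:
	 *
	 *   1. find all the batches
	 *   2. pin all the HW contexts (parent + children)
	 *   3. generate all the requests
	 *   4. submit all the requests
	 *   5. tie the requests to the VMAs they touch
	 *   6. commit all the requests
	 */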

> 
> A composite fence is also created for the generated requests to
> return to the user and to stick in dma resv slots.
>

Add a comment saying there should be no change in behavior for existing
IOCTLs, except when throttling due to lack of space in the ring: the wait
is now done with the VMA locks held rather than dropping the locks and
falling back to the slow path.

> IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> media UMD: link to come
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 765 ++++++++++++------
>  drivers/gpu/drm/i915/gt/intel_context.h       |   8 +-
>  drivers/gpu/drm/i915/gt/intel_context_types.h |  12 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +
>  drivers/gpu/drm/i915/i915_request.h           |   9 +
>  drivers/gpu/drm/i915/i915_vma.c               |  21 +-
>  drivers/gpu/drm/i915/i915_vma.h               |  13 +-
>  7 files changed, 573 insertions(+), 257 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 8290bdadd167..481978974627 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -244,17 +244,23 @@ struct i915_execbuffer {
>  	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
>  	struct eb_vma *vma;
>  
> -	struct intel_engine_cs *engine; /** engine to queue the request to */
> +	struct intel_gt *gt; /* gt for the execbuf */
>  	struct intel_context *context; /* logical state for the request */
>  	struct i915_gem_context *gem_context; /** caller's context */
>  
> -	struct i915_request *request; /** our request to build */
> -	struct eb_vma *batch; /** identity of the batch obj/vma */

Fix the line wrapping of these comments.

> +	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1]; /** our requests to build */
> +	struct eb_vma *batches[MAX_ENGINE_INSTANCE + 1]; /** identity of the batch obj/vma */
>  	struct i915_vma *trampoline; /** trampoline used for chaining */
>  
> +	/** used for excl fence in dma_resv objects when > 1 BB submitted */
> +	struct dma_fence *composite_fence;
> +
>  	/** actual size of execobj[] as we may extend it for the cmdparser */
>  	unsigned int buffer_count;
>  
> +	/* number of batches in execbuf IOCTL */
> +	unsigned int num_batches;
> +
>  	/** list of vma not yet bound during reservation phase */
>  	struct list_head unbound;
>  
> @@ -281,7 +287,7 @@ struct i915_execbuffer {
>  
>  	u64 invalid_flags; /** Set of execobj.flags that are invalid */
>  
> -	u64 batch_len; /** Length of batch within object */
> +	u64 batch_len[MAX_ENGINE_INSTANCE + 1]; /** Length of batch within object */
>  	u32 batch_start_offset; /** Location within object of batch */
>  	u32 batch_flags; /** Flags composed for emit_bb_start() */
>  	struct intel_gt_buffer_pool_node *batch_pool; /** pool node for batch buffer */
> @@ -299,14 +305,13 @@ struct i915_execbuffer {
>  };
>  
>  static int eb_parse(struct i915_execbuffer *eb);
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb,
> -					  bool throttle);
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle);
>  static void eb_unpin_engine(struct i915_execbuffer *eb);
>  
>  static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
>  {
> -	return intel_engine_requires_cmd_parser(eb->engine) ||
> -		(intel_engine_using_cmd_parser(eb->engine) &&
> +	return intel_engine_requires_cmd_parser(eb->context->engine) ||
> +		(intel_engine_using_cmd_parser(eb->context->engine) &&
>  		 eb->args->batch_len);
>  }
>  
> @@ -544,11 +549,21 @@ eb_validate_vma(struct i915_execbuffer *eb,
>  	return 0;
>  }
>  
> -static void
> +static inline bool
> +is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
> +{
> +	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
> +		buffer_idx < eb->num_batches :
> +		buffer_idx >= eb->args->buffer_count - eb->num_batches;
> +}
> +
> +static int
>  eb_add_vma(struct i915_execbuffer *eb,
> -	   unsigned int i, unsigned batch_idx,
> +	   unsigned int *current_batch,
> +	   unsigned int i,
>  	   struct i915_vma *vma)
>  {
> +	struct drm_i915_private *i915 = eb->i915;
>  	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
>  	struct eb_vma *ev = &eb->vma[i];
>  
> @@ -575,15 +590,41 @@ eb_add_vma(struct i915_execbuffer *eb,
>  	 * Note that actual hangs have only been observed on gen7, but for
>  	 * paranoia do it everywhere.
>  	 */
> -	if (i == batch_idx) {
> +	if (is_batch_buffer(eb, i)) {
>  		if (entry->relocation_count &&
>  		    !(ev->flags & EXEC_OBJECT_PINNED))
>  			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
>  		if (eb->reloc_cache.has_fence)
>  			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
>  
> -		eb->batch = ev;
> +		eb->batches[*current_batch] = ev;
> +
> +		if (unlikely(ev->flags & EXEC_OBJECT_WRITE)) {
> +			drm_dbg(&i915->drm,
> +				"Attempting to use self-modifying batch buffer\n");
> +			return -EINVAL;
> +		}
> +
> +		if (range_overflows_t(u64,
> +				      eb->batch_start_offset,
> +				      eb->args->batch_len,
> +				      ev->vma->size)) {
> +			drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> +			return -EINVAL;
> +		}
> +
> +		if (eb->args->batch_len == 0)
> +			eb->batch_len[*current_batch] = ev->vma->size -
> +				eb->batch_start_offset;
> +		if (unlikely(eb->batch_len == 0)) { /* impossible! */
> +			drm_dbg(&i915->drm, "Invalid batch length\n");
> +			return -EINVAL;
> +		}
> +
> +		++*current_batch;
>  	}
> +
> +	return 0;
>  }
>  
>  static inline int use_cpu_reloc(const struct reloc_cache *cache,
> @@ -727,14 +768,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
>  	} while (1);
>  }
>  
> -static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
> -{
> -	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
> -		return 0;
> -	else
> -		return eb->buffer_count - 1;
> -}
> -
>  static int eb_select_context(struct i915_execbuffer *eb)
>  {
>  	struct i915_gem_context *ctx;
> @@ -843,9 +876,7 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
>  
>  static int eb_lookup_vmas(struct i915_execbuffer *eb)
>  {
> -	struct drm_i915_private *i915 = eb->i915;
> -	unsigned int batch = eb_batch_index(eb);
> -	unsigned int i;
> +	unsigned int i, current_batch = 0;
>  	int err = 0;
>  
>  	INIT_LIST_HEAD(&eb->relocs);
> @@ -865,7 +896,9 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>  			goto err;
>  		}
>  
> -		eb_add_vma(eb, i, batch, vma);
> +		err = eb_add_vma(eb, &current_batch, i, vma);
> +		if (err)
> +			return err;
>  
>  		if (i915_gem_object_is_userptr(vma->obj)) {
>  			err = i915_gem_object_userptr_submit_init(vma->obj);
> @@ -888,26 +921,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
>  		}
>  	}
>  
> -	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
> -		drm_dbg(&i915->drm,
> -			"Attempting to use self-modifying batch buffer\n");
> -		return -EINVAL;
> -	}
> -
> -	if (range_overflows_t(u64,
> -			      eb->batch_start_offset, eb->batch_len,
> -			      eb->batch->vma->size)) {
> -		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
> -		return -EINVAL;
> -	}
> -
> -	if (eb->batch_len == 0)
> -		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
> -	if (unlikely(eb->batch_len == 0)) { /* impossible! */
> -		drm_dbg(&i915->drm, "Invalid batch length\n");
> -		return -EINVAL;
> -	}
> -
>  	return 0;
>  
>  err:
> @@ -1640,8 +1653,7 @@ static int eb_reinit_userptr(struct i915_execbuffer *eb)
>  	return 0;
>  }
>  
> -static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
> -					   struct i915_request *rq)
> +static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
>  {
>  	bool have_copy = false;
>  	struct eb_vma *ev;
> @@ -1657,21 +1669,6 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  	eb_release_vmas(eb, false);
>  	i915_gem_ww_ctx_fini(&eb->ww);
>  
> -	if (rq) {
> -		/* nonblocking is always false */
> -		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> -				      MAX_SCHEDULE_TIMEOUT) < 0) {
> -			i915_request_put(rq);
> -			rq = NULL;
> -
> -			err = -EINTR;
> -			goto err_relock;
> -		}
> -
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>  	/*
>  	 * We take 3 passes through the slowpatch.
>  	 *
> @@ -1698,28 +1695,21 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  	if (!err)
>  		err = eb_reinit_userptr(eb);
>  
> -err_relock:
>  	i915_gem_ww_ctx_init(&eb->ww, true);
>  	if (err)
>  		goto out;
>  
>  	/* reacquire the objects */
>  repeat_validate:
> -	rq = eb_pin_engine(eb, false);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, false);
> +	if (err)
>  		goto err;
> -	}
> -
> -	/* We didn't throttle, should be NULL */
> -	GEM_WARN_ON(rq);
>  
>  	err = eb_validate_vmas(eb);
>  	if (err)
>  		goto err;
>  
> -	GEM_BUG_ON(!eb->batch);
> +	GEM_BUG_ON(!eb->batches[0]);
>  
>  	list_for_each_entry(ev, &eb->relocs, reloc_link) {
>  		if (!have_copy) {
> @@ -1783,46 +1773,23 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  		}
>  	}
>  
> -	if (rq)
> -		i915_request_put(rq);
> -
>  	return err;
>  }
>  
>  static int eb_relocate_parse(struct i915_execbuffer *eb)
>  {
>  	int err;
> -	struct i915_request *rq = NULL;
>  	bool throttle = true;
>  
>  retry:
> -	rq = eb_pin_engine(eb, throttle);
> -	if (IS_ERR(rq)) {
> -		err = PTR_ERR(rq);
> -		rq = NULL;
> +	err = eb_pin_engine(eb, throttle);
> +	if (err) {
>  		if (err != -EDEADLK)
>  			return err;
>  
>  		goto err;
>  	}
>  
> -	if (rq) {
> -		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> -
> -		/* Need to drop all locks now for throttling, take slowpath */
> -		err = i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE, 0);
> -		if (err == -ETIME) {
> -			if (nonblock) {
> -				err = -EWOULDBLOCK;
> -				i915_request_put(rq);
> -				goto err;
> -			}
> -			goto slow;
> -		}
> -		i915_request_put(rq);
> -		rq = NULL;
> -	}
> -
>  	/* only throttle once, even if we didn't need to throttle */
>  	throttle = false;
>  
> @@ -1862,7 +1829,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>  	return err;
>  
>  slow:
> -	err = eb_relocate_parse_slow(eb, rq);
> +	err = eb_relocate_parse_slow(eb);
>  	if (err)
>  		/*
>  		 * If the user expects the execobject.offset and
> @@ -1876,11 +1843,31 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>  	return err;
>  }
>  

Add comments generally about the lock nesting of the timeline mutexes.

> +#define for_each_batch_create_order(_eb, _i) \
> +	for (_i = 0; _i < _eb->num_batches; ++_i)
> +#define for_each_batch_add_order(_eb, _i) \
> +	BUILD_BUG_ON(!typecheck(int, _i)); \
> +	for (_i = _eb->num_batches - 1; _i >= 0; --_i)
> +
> +static struct i915_request *
> +eb_find_first_request(struct i915_execbuffer *eb)

eb_find_first_request_added

> +{
> +	int i;
> +
> +	for_each_batch_add_order(eb, i)
> +		if (eb->requests[i])
> +			return eb->requests[i];
> +
> +	GEM_BUG_ON("Request not found");
> +
> +	return NULL;
> +}
> +
>  static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  {
>  	const unsigned int count = eb->buffer_count;
>  	unsigned int i = count;
> -	int err = 0;
> +	int err = 0, j;
>  
>  	while (i--) {
>  		struct eb_vma *ev = &eb->vma[i];
> @@ -1893,11 +1880,17 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  		if (flags & EXEC_OBJECT_CAPTURE) {
>  			struct i915_capture_list *capture;
>  
> -			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> -			if (capture) {
> -				capture->next = eb->request->capture_list;
> -				capture->vma = vma;
> -				eb->request->capture_list = capture;
> +			for_each_batch_create_order(eb, j) {
> +				if (!eb->requests[j])
> +					break;
> +
> +				capture = kmalloc(sizeof(*capture), GFP_KERNEL);
> +				if (capture) {
> +					capture->next =
> +						eb->requests[j]->capture_list;
> +					capture->vma = vma;
> +					eb->requests[j]->capture_list = capture;
> +				}
>  			}
>  		}
>  
> @@ -1918,14 +1911,26 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  				flags &= ~EXEC_OBJECT_ASYNC;
>  		}
>  
> +		/* We only need to await on the first request */
>  		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
>  			err = i915_request_await_object
> -				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
> +				(eb_find_first_request(eb), obj,
> +				 flags & EXEC_OBJECT_WRITE);
>  		}
>  
> -		if (err == 0)
> -			err = i915_vma_move_to_active(vma, eb->request,
> -						      flags | __EXEC_OBJECT_NO_RESERVE);
> +		for_each_batch_add_order(eb, j) {
> +			if (err)
> +				break;
> +			if (!eb->requests[j])
> +				continue;
> +
> +			err = _i915_vma_move_to_active(vma, eb->requests[j],
> +						       j ? NULL :
> +						       eb->composite_fence ?
> +						       eb->composite_fence :
> +						       &eb->requests[j]->fence,
> +						       flags | __EXEC_OBJECT_NO_RESERVE);
> +		}
>  	}
>  
>  #ifdef CONFIG_MMU_NOTIFIER
> @@ -1956,11 +1961,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
>  		goto err_skip;
>  
>  	/* Unconditionally flush any chipset caches (for streaming writes). */
> -	intel_gt_chipset_flush(eb->engine->gt);
> +	intel_gt_chipset_flush(eb->gt);
>  	return 0;
>  
>  err_skip:
> -	i915_request_set_error_once(eb->request, err);
> +	for_each_batch_create_order(eb, j) {
> +		if (!eb->requests[j])
> +			break;
> +
> +		i915_request_set_error_once(eb->requests[j], err);
> +	}
>  	return err;
>  }
>  
> @@ -2055,14 +2065,16 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	int err;
>  
>  	if (!eb_use_cmdparser(eb)) {
> -		batch = eb_dispatch_secure(eb, eb->batch->vma);
> +		batch = eb_dispatch_secure(eb, eb->batches[0]->vma);
>  		if (IS_ERR(batch))
>  			return PTR_ERR(batch);
>  
>  		goto secure_batch;
>  	}
>  
> -	len = eb->batch_len;
> +	GEM_BUG_ON(intel_context_is_parallel(eb->context));

Return -EINVAL (or -ENODEV) rather than BUG_ON

> +
> +	len = eb->batch_len[0];
>  	if (!CMDPARSER_USES_GGTT(eb->i915)) {
>  		/*
>  		 * ppGTT backed shadow buffers must be mapped RO, to prevent
> @@ -2076,11 +2088,11 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	} else {
>  		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
>  	}
> -	if (unlikely(len < eb->batch_len)) /* last paranoid check of overflow */
> +	if (unlikely(len < eb->batch_len[0])) /* last paranoid check of overflow */
>  		return -EINVAL;
>  
>  	if (!pool) {
> -		pool = intel_gt_get_buffer_pool(eb->engine->gt, len,
> +		pool = intel_gt_get_buffer_pool(eb->gt, len,
>  						I915_MAP_WB);
>  		if (IS_ERR(pool))
>  			return PTR_ERR(pool);
> @@ -2105,7 +2117,7 @@ static int eb_parse(struct i915_execbuffer *eb)
>  		trampoline = shadow;
>  
>  		shadow = shadow_batch_pin(eb, pool->obj,
> -					  &eb->engine->gt->ggtt->vm,
> +					  &eb->gt->ggtt->vm,
>  					  PIN_GLOBAL);
>  		if (IS_ERR(shadow)) {
>  			err = PTR_ERR(shadow);
> @@ -2127,26 +2139,27 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	if (err)
>  		goto err_trampoline;
>  
> -	err = intel_engine_cmd_parser(eb->engine,
> -				      eb->batch->vma,
> +	err = intel_engine_cmd_parser(eb->context->engine,
> +				      eb->batches[0]->vma,
>  				      eb->batch_start_offset,
> -				      eb->batch_len,
> +				      eb->batch_len[0],
>  				      shadow, trampoline);
>  	if (err)
>  		goto err_unpin_batch;
>  
> -	eb->batch = &eb->vma[eb->buffer_count++];
> -	eb->batch->vma = i915_vma_get(shadow);
> -	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> +	eb->batches[0] = &eb->vma[eb->buffer_count++];
> +	eb->batches[0]->vma = i915_vma_get(shadow);
> +	eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
>  
>  	eb->trampoline = trampoline;
>  	eb->batch_start_offset = 0;
>  
>  secure_batch:
>  	if (batch) {
> -		eb->batch = &eb->vma[eb->buffer_count++];
> -		eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
> -		eb->batch->vma = i915_vma_get(batch);
> +		GEM_BUG_ON(intel_context_is_parallel(eb->context));

Same as above, no BUG_ON

> +		eb->batches[0] = &eb->vma[eb->buffer_count++];
> +		eb->batches[0]->flags = __EXEC_OBJECT_HAS_PIN;
> +		eb->batches[0]->vma = i915_vma_get(batch);
>  	}
>  	return 0;
>  
> @@ -2162,19 +2175,18 @@ static int eb_parse(struct i915_execbuffer *eb)
>  	return err;
>  }
>  
> -static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
> +static int eb_request_submit(struct i915_execbuffer *eb,
> +			     struct i915_request *rq,
> +			     struct i915_vma *batch,
> +			     u64 batch_len)
>  {
>  	int err;
>  
> -	if (intel_context_nopreempt(eb->context))
> -		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
> -
> -	err = eb_move_to_gpu(eb);
> -	if (err)
> -		return err;
> +	if (intel_context_nopreempt(rq->context))
> +		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &rq->fence.flags);
>  
>  	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
> -		err = i915_reset_gen7_sol_offsets(eb->request);
> +		err = i915_reset_gen7_sol_offsets(rq);
>  		if (err)
>  			return err;
>  	}
> @@ -2185,26 +2197,26 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>  	 * allows us to determine if the batch is still waiting on the GPU
>  	 * or actually running by checking the breadcrumb.
>  	 */
> -	if (eb->engine->emit_init_breadcrumb) {
> -		err = eb->engine->emit_init_breadcrumb(eb->request);
> +	if (rq->context->engine->emit_init_breadcrumb) {
> +		err = rq->context->engine->emit_init_breadcrumb(rq);
>  		if (err)
>  			return err;
>  	}
>  
> -	err = eb->engine->emit_bb_start(eb->request,
> -					batch->node.start +
> -					eb->batch_start_offset,
> -					eb->batch_len,
> -					eb->batch_flags);
> +	err = rq->context->engine->emit_bb_start(rq,
> +						 batch->node.start +
> +						 eb->batch_start_offset,
> +						 batch_len,
> +						 eb->batch_flags);
>  	if (err)
>  		return err;
>  
>  	if (eb->trampoline) {
> +		GEM_BUG_ON(intel_context_is_parallel(rq->context));
>  		GEM_BUG_ON(eb->batch_start_offset);
> -		err = eb->engine->emit_bb_start(eb->request,
> -						eb->trampoline->node.start +
> -						eb->batch_len,
> -						0, 0);
> +		err = rq->context->engine->emit_bb_start(rq,
> +							 eb->trampoline->node.start +
> +							 batch_len, 0, 0);
>  		if (err)
>  			return err;
>  	}
> @@ -2212,6 +2224,27 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
>  	return 0;
>  }
>  
> +static int eb_submit(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +	int err;
> +
> +	err = eb_move_to_gpu(eb);
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		trace_i915_request_queue(eb->requests[i], eb->batch_flags);
> +		if (!err)
> +			err = eb_request_submit(eb, eb->requests[i],
> +						eb->batches[i]->vma,
> +						eb->batch_len[i]);
> +	}
> +
> +	return err;
> +}
> +
>  static int num_vcs_engines(const struct drm_i915_private *i915)
>  {
>  	return hweight_long(VDBOX_MASK(&i915->gt));
> @@ -2277,26 +2310,11 @@ static struct i915_request *eb_throttle(struct i915_execbuffer *eb, struct intel
>  	return i915_request_get(rq);
>  }
>  
> -static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +static int eb_pin_timeline(struct i915_execbuffer *eb, struct intel_context *ce,
> +			   bool throttle)
>  {
> -	struct intel_context *ce = eb->context;
>  	struct intel_timeline *tl;
> -	struct i915_request *rq = NULL;
> -	int err;
> -
> -	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> -
> -	if (unlikely(intel_context_is_banned(ce)))
> -		return ERR_PTR(-EIO);
> -
> -	/*
> -	 * Pinning the contexts may generate requests in order to acquire
> -	 * GGTT space, so do this first before we reserve a seqno for
> -	 * ourselves.
> -	 */
> -	err = intel_context_pin_ww(ce, &eb->ww);
> -	if (err)
> -		return ERR_PTR(err);
> +	struct i915_request *rq = NULL;
>  
>  	/*
>  	 * Take a local wakeref for preparing to dispatch the execbuf as
> @@ -2307,33 +2325,108 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
>  	 * taken on the engine, and the parent device.
>  	 */
>  	tl = intel_context_timeline_lock(ce);
> -	if (IS_ERR(tl)) {
> -		intel_context_unpin(ce);
> -		return ERR_CAST(tl);
> -	}
> +	if (IS_ERR(tl))
> +		return PTR_ERR(tl);
>  
>  	intel_context_enter(ce);
>  	if (throttle)
>  		rq = eb_throttle(eb, ce);
>  	intel_context_timeline_unlock(tl);
>  
> +	if (rq) {
> +		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
> +		long timeout = nonblock ? 0 : MAX_SCHEDULE_TIMEOUT;
> +
> +		if (i915_request_wait(rq, I915_WAIT_INTERRUPTIBLE,
> +				      timeout) < 0) {
> +			i915_request_put(rq);
> +
> +			tl = intel_context_timeline_lock(ce);
> +			intel_context_exit(ce);
> +			intel_context_timeline_unlock(tl);
> +
> +			if (nonblock)
> +				return -EWOULDBLOCK;
> +			else
> +				return -EINTR;
> +		}
> +		i915_request_put(rq);
> +	}
> +
> +	return 0;
> +}
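
For completeness, the refactor keeps the userspace-visible throttle
contract: with the DRM fd opened O_NONBLOCK, a throttled execbuf fails
immediately instead of sleeping. Hypothetical userspace fragment (device
node path and a populated execbuf struct are assumed, not part of this
patch):

	#include <errno.h>
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <drm/i915_drm.h>

	int fd = open("/dev/dri/renderD128", O_RDWR | O_NONBLOCK);
	struct drm_i915_gem_execbuffer2 execbuf = { /* buffers, flags, ... */ };

	/* EWOULDBLOCK here means the ring is full and we asked not to wait */
	if (ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf) < 0 &&
	    errno == EWOULDBLOCK)
		; /* back off and retry later */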
> +
> +static int eb_pin_engine(struct i915_execbuffer *eb, bool throttle)
> +{
> +	struct intel_context *ce = eb->context, *child;
> +	int err;
> +	int i = 0, j = 0;
> +
> +	GEM_BUG_ON(eb->args->flags & __EXEC_ENGINE_PINNED);
> +
> +	if (unlikely(intel_context_is_banned(ce)))
> +		return -EIO;
> +
> +	/*
> +	 * Pinning the contexts may generate requests in order to acquire
> +	 * GGTT space, so do this first before we reserve a seqno for
> +	 * ourselves.
> +	 */
> +	err = intel_context_pin_ww(ce, &eb->ww);
> +	if (err)
> +		return err;
> +	for_each_child(ce, child) {
> +		err = intel_context_pin_ww(child, &eb->ww);
> +		GEM_BUG_ON(err);	/* perma-pinned: pin should just increment a counter */
> +	}
> +
> +	for_each_child(ce, child) {
> +		err = eb_pin_timeline(eb, child, throttle);
> +		if (err)
> +			goto unwind;
> +		++i;
> +	}
> +	err = eb_pin_timeline(eb, ce, throttle);
> +	if (err)
> +		goto unwind;
> +
>  	eb->args->flags |= __EXEC_ENGINE_PINNED;
> -	return rq;
> +	return 0;
> +
> +unwind:
> +	for_each_child(ce, child) {
> +		if (j++ < i) {
> +			mutex_lock(&child->timeline->mutex);
> +			intel_context_exit(child);
> +			mutex_unlock(&child->timeline->mutex);
> +		}
> +	}
> +	for_each_child(ce, child)
> +		intel_context_unpin(child);
> +	intel_context_unpin(ce);
> +	return err;
>  }
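
The i/j dance in the unwind is subtle enough that it may deserve a
comment: i counts how many child timelines were successfully entered, the
first unwind loop exits exactly those, and the second loop then unpins
every child unconditionally, since the pins were all taken up front.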
>  
>  static void eb_unpin_engine(struct i915_execbuffer *eb)
>  {
> -	struct intel_context *ce = eb->context;
> -	struct intel_timeline *tl = ce->timeline;
> +	struct intel_context *ce = eb->context, *child;
>  
>  	if (!(eb->args->flags & __EXEC_ENGINE_PINNED))
>  		return;
>  
>  	eb->args->flags &= ~__EXEC_ENGINE_PINNED;
>  
> -	mutex_lock(&tl->mutex);
> +	for_each_child(ce, child) {
> +		mutex_lock(&child->timeline->mutex);
> +		intel_context_exit(child);
> +		mutex_unlock(&child->timeline->mutex);
> +
> +		intel_context_unpin(child);
> +	}
> +
> +	mutex_lock(&ce->timeline->mutex);
>  	intel_context_exit(ce);
> -	mutex_unlock(&tl->mutex);
> +	mutex_unlock(&ce->timeline->mutex);
>  
>  	intel_context_unpin(ce);
>  }
> @@ -2384,7 +2477,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
>  static int
>  eb_select_engine(struct i915_execbuffer *eb)
>  {
> -	struct intel_context *ce;
> +	struct intel_context *ce, *child;
>  	unsigned int idx;
>  	int err;
>  
> @@ -2397,6 +2490,20 @@ eb_select_engine(struct i915_execbuffer *eb)
>  	if (IS_ERR(ce))
>  		return PTR_ERR(ce);
>  
> +	if (intel_context_is_parallel(ce)) {
> +		if (eb->buffer_count < ce->guc_number_children + 1) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +		if (eb->batch_start_offset || eb->args->batch_len) {
> +			intel_context_put(ce);
> +			return -EINVAL;
> +		}
> +	}
> +	eb->num_batches = ce->guc_number_children + 1;
> +
> +	for_each_child(ce, child)
> +		intel_context_get(child);
>  	intel_gt_pm_get(ce->engine->gt);
>  
>  	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
> @@ -2404,6 +2511,13 @@ eb_select_engine(struct i915_execbuffer *eb)
>  		if (err)
>  			goto err;
>  	}
> +	for_each_child(ce, child) {
> +		if (!test_bit(CONTEXT_ALLOC_BIT, &child->flags)) {
> +			err = intel_context_alloc_state(child);
> +			if (err)
> +				goto err;
> +		}
> +	}

This could probably be deleted, since the state allocation should already
happen when the parallel context is perma-pinned, but it is harmless, so
it can stay.

>  
>  	/*
>  	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
> @@ -2414,7 +2528,7 @@ eb_select_engine(struct i915_execbuffer *eb)
>  		goto err;
>  
>  	eb->context = ce;
> -	eb->engine = ce->engine;
> +	eb->gt = ce->engine->gt;
>  
>  	/*
>  	 * Make sure engine pool stays alive even if we call intel_context_put
> @@ -2425,6 +2539,8 @@ eb_select_engine(struct i915_execbuffer *eb)
>  
>  err:
>  	intel_gt_pm_put(ce->engine->gt);
> +	for_each_child(ce, child)
> +		intel_context_put(child);
>  	intel_context_put(ce);
>  	return err;
>  }
> @@ -2432,7 +2548,11 @@ eb_select_engine(struct i915_execbuffer *eb)
>  static void
>  eb_put_engine(struct i915_execbuffer *eb)
>  {
> -	intel_gt_pm_put(eb->engine->gt);
> +	struct intel_context *child;
> +
> +	intel_gt_pm_put(eb->gt);
> +	for_each_child(eb->context, child)
> +		intel_context_put(child);
>  	intel_context_put(eb->context);
>  }
>  
> @@ -2655,7 +2775,8 @@ static void put_fence_array(struct eb_fence *fences, int num_fences)
>  }
>  
>  static int
> -await_fence_array(struct i915_execbuffer *eb)
> +await_fence_array(struct i915_execbuffer *eb,
> +		  struct i915_request *rq)
>  {
>  	unsigned int n;
>  	int err;
> @@ -2669,8 +2790,7 @@ await_fence_array(struct i915_execbuffer *eb)
>  		if (!eb->fences[n].dma_fence)
>  			continue;
>  
> -		err = i915_request_await_dma_fence(eb->request,
> -						   eb->fences[n].dma_fence);
> +		err = i915_request_await_dma_fence(rq, eb->fences[n].dma_fence);
>  		if (err < 0)
>  			return err;
>  	}
> @@ -2678,9 +2798,9 @@ await_fence_array(struct i915_execbuffer *eb)
>  	return 0;
>  }
>  
> -static void signal_fence_array(const struct i915_execbuffer *eb)
> +static void signal_fence_array(const struct i915_execbuffer *eb,
> +			       struct dma_fence * const fence)
>  {
> -	struct dma_fence * const fence = &eb->request->fence;
>  	unsigned int n;
>  
>  	for (n = 0; n < eb->num_fences; n++) {
> @@ -2728,12 +2848,12 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
>  			break;
>  }
>  
> -static int eb_request_add(struct i915_execbuffer *eb, int err)
> +static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)

With the -ENOENT handling moved to eb_requests_add(), err is always zero
here, so this can be 'static void'; see the sketch below.

Matt
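
Sketch of that change, which also lets the odd 'err |= eb_request_add(eb, rq)'
in eb_requests_add() collapse to a plain call (OR-ing two different
negative errnos together would produce a garbage value anyway):

	-static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
	+static void eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq)
	 {
	 	struct intel_timeline * const tl = i915_request_timeline(rq);
	 	struct i915_sched_attr attr = {};
	 	struct i915_request *prev;
	-	int err = 0;
	 	...
	-	return err;
	 }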

>  {
> -	struct i915_request *rq = eb->request;
>  	struct intel_timeline * const tl = i915_request_timeline(rq);
>  	struct i915_sched_attr attr = {};
>  	struct i915_request *prev;
> +	int err = 0;
>  
>  	lockdep_assert_held(&tl->mutex);
>  	lockdep_unpin_lock(&tl->mutex, rq->cookie);
> @@ -2745,11 +2865,6 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>  	/* Check that the context wasn't destroyed before submission */
>  	if (likely(!intel_context_is_closed(eb->context))) {
>  		attr = eb->gem_context->sched;
> -	} else {
> -		/* Serialise with context_close via the add_to_timeline */
> -		i915_request_set_error_once(rq, -ENOENT);
> -		__i915_request_skip(rq);
> -		err = -ENOENT; /* override any transient errors */
>  	}
>  
>  	__i915_request_queue(rq, &attr);
> @@ -2763,6 +2878,44 @@ static int eb_request_add(struct i915_execbuffer *eb, int err)
>  	return err;
>  }
>  
> +static int eb_requests_add(struct i915_execbuffer *eb, int err)
> +{
> +	int i;
> +
> +	/*
> +	 * We iterate in reverse order of creation to release the timeline
> +	 * mutexes in the same order.
> +	 */
> +	for_each_batch_add_order(eb, i) {
> +		struct i915_request *rq = eb->requests[i];
> +
> +		if (!rq)
> +			continue;
> +
> +		if (unlikely(intel_context_is_closed(eb->context))) {
> +			/* Serialise with context_close via the add_to_timeline */
> +			i915_request_set_error_once(rq, -ENOENT);
> +			__i915_request_skip(rq);
> +			err = -ENOENT; /* override any transient errors */
> +		}
> +
> +		if (intel_context_is_parallel(eb->context)) {
> +			if (err) {
> +				__i915_request_skip(rq);
> +				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
> +					&rq->fence.flags);
> +			}
> +			if (i == 0)
> +				set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +					&rq->fence.flags);
> +		}
> +
> +		err |= eb_request_add(eb, rq);
> +	}
> +
> +	return err;
> +}
> +
>  static const i915_user_extension_fn execbuf_extensions[] = {
>  	[DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES] = parse_timeline_fences,
>  };
> @@ -2789,6 +2942,166 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
>  				    eb);
>  }
>  
> +static void eb_requests_get(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_get(eb->requests[i]);
> +	}
> +}
> +
> +static void eb_requests_put(struct i915_execbuffer *eb)
> +{
> +	unsigned int i;
> +
> +	for_each_batch_create_order(eb, i) {
> +		if (!eb->requests[i])
> +			break;
> +
> +		i915_request_put(eb->requests[i]);
> +	}
> +}
> +
> +static struct sync_file *
> +eb_composite_fence_create(struct i915_execbuffer *eb, int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	struct dma_fence_array *fence_array;
> +	struct dma_fence **fences;
> +	unsigned int i;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(eb->context));
> +
> +	fences = kmalloc(eb->num_batches * sizeof(*fences), GFP_KERNEL);
> +	if (!fences)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for_each_batch_create_order(eb, i)
> +		fences[i] = &eb->requests[i]->fence;
> +
> +	fence_array = dma_fence_array_create(eb->num_batches,
> +					     fences,
> +					     eb->context->fence_context,
> +					     eb->context->seqno,
> +					     false);
> +	if (!fence_array) {
> +		kfree(fences);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* Move ownership to the dma_fence_array created above */
> +	for_each_batch_create_order(eb, i)
> +		dma_fence_get(fences[i]);
> +
> +	if (out_fence_fd != -1) {
> +		out_fence = sync_file_create(&fence_array->base);
> +		/* sync_file now owns fence_array, drop creation ref */
> +		dma_fence_put(&fence_array->base);
> +		if (!out_fence)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
> +	eb->composite_fence = &fence_array->base;
> +
> +	return out_fence;
> +}
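
A note on the reference flow here, since it is easy to misread:
dma_fence_array_create() takes over the fences[] allocation on success
(hence the kfree() only on the failure path), the dma_fence_get() loop
gives the array its own reference on each request fence, and the array's
creation reference is either handed off to the sync_file or kept in
eb->composite_fence and dropped at the bottom of i915_gem_do_execbuffer().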
> +
> +static struct intel_context *
> +eb_find_context(struct i915_execbuffer *eb, unsigned int context_number)
> +{
> +	struct intel_context *child;
> +
> +	if (likely(context_number == 0))
> +		return eb->context;
> +
> +	for_each_child(eb->context, child)
> +		if (!--context_number)
> +			return child;
> +
> +	GEM_BUG_ON("Context not found");
> +
> +	return NULL;
> +}
> +
> +static struct sync_file *
> +eb_requests_create(struct i915_execbuffer *eb, struct dma_fence *in_fence,
> +		   int out_fence_fd)
> +{
> +	struct sync_file *out_fence = NULL;
> +	unsigned int i;
> +	int err;
> +
> +	for_each_batch_create_order(eb, i) {
> +		bool first_request_to_add = i + 1 == eb->num_batches;
> +
> +		/* Allocate a request for this batch buffer nice and early. */
> +		eb->requests[i] = i915_request_create(eb_find_context(eb, i));
> +		if (IS_ERR(eb->requests[i])) {
> +			err = PTR_ERR(eb->requests[i]);
> +			eb->requests[i] = NULL;
> +			return ERR_PTR(err);
> +		}
> +
> +		if (unlikely(eb->gem_context->syncobj &&
> +			     first_request_to_add)) {
> +			struct dma_fence *fence;
> +
> +			fence = drm_syncobj_fence_get(eb->gem_context->syncobj);
> +			err = i915_request_await_dma_fence(eb->requests[i], fence);
> +			dma_fence_put(fence);
> +			if (err)
> +				return ERR_PTR(err);
> +		}
> +
> +		if (in_fence && first_request_to_add) {
> +			if (eb->args->flags & I915_EXEC_FENCE_SUBMIT)
> +				err = i915_request_await_execution(eb->requests[i],
> +								   in_fence);
> +			else
> +				err = i915_request_await_dma_fence(eb->requests[i],
> +								   in_fence);
> +			if (err < 0)
> +				return ERR_PTR(err);
> +		}
> +
> +		if (eb->fences && first_request_to_add) {
> +			err = await_fence_array(eb, eb->requests[i]);
> +			if (err)
> +				return ERR_PTR(err);
> +		}
> +
> +		if (first_request_to_add) {
> +			if (intel_context_is_parallel(eb->context)) {
> +				out_fence = eb_composite_fence_create(eb, out_fence_fd);
> +				if (IS_ERR(out_fence))
> +					return out_fence;
> +			} else if (out_fence_fd != -1) {
> +				out_fence = sync_file_create(&eb->requests[i]->fence);
> +				if (!out_fence)
> +					return ERR_PTR(-ENOMEM);
> +			}
> +		}
> +
> +		/*
> +		 * Whilst this request exists, batch_obj will be on the
> +		 * active_list, and so will hold the active reference. Only when
> +		 * this request is retired will the batch_obj be moved onto
> +		 * the inactive_list and lose its active reference. Hence we do
> +		 * not need to explicitly hold another reference here.
> +		 */
> +		eb->requests[i]->batch = eb->batches[i]->vma;
> +		if (eb->batch_pool) {
> +			GEM_BUG_ON(intel_context_is_parallel(eb->context));
> +			intel_gt_buffer_pool_mark_active(eb->batch_pool,
> +							 eb->requests[i]);
> +		}
> +	}
> +
> +	return out_fence;
> +}
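
One more spot that could use a comment: requests are created in
0 .. num_batches - 1 order but added in reverse, so 'first_request_to_add'
is the request that is committed to the backend first. Only it has to
await the external fences; everything else is ordered by the submit
fences inserted between the parent and children. Suggested wording
(sketch):

	/*
	 * Requests are added in reverse order of creation, so the request
	 * created last here is the first one committed to the backend.
	 * Only that request needs to await the external in / syncobj
	 * fences; the rest are ordered by the parent <-> child submit
	 * fences.
	 */
	bool first_request_to_add = i + 1 == eb->num_batches;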
> +
>  static int
>  i915_gem_do_execbuffer(struct drm_device *dev,
>  		       struct drm_file *file,
> @@ -2799,7 +3112,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	struct i915_execbuffer eb;
>  	struct dma_fence *in_fence = NULL;
>  	struct sync_file *out_fence = NULL;
> -	struct i915_vma *batch;
>  	int out_fence_fd = -1;
>  	int err;
>  
> @@ -2823,12 +3135,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	eb.buffer_count = args->buffer_count;
>  	eb.batch_start_offset = args->batch_start_offset;
> -	eb.batch_len = args->batch_len;
>  	eb.trampoline = NULL;
>  
>  	eb.fences = NULL;
>  	eb.num_fences = 0;
>  
> +	memset(eb.requests, 0, sizeof(eb.requests));
> +	eb.composite_fence = NULL;
> +
>  	eb.batch_flags = 0;
>  	if (args->flags & I915_EXEC_SECURE) {
>  		if (GRAPHICS_VER(i915) >= 11)
> @@ -2912,70 +3227,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	ww_acquire_done(&eb.ww.ctx);
>  
> -	batch = eb.batch->vma;
> -
> -	/* Allocate a request for this batch buffer nice and early. */
> -	eb.request = i915_request_create(eb.context);
> -	if (IS_ERR(eb.request)) {
> -		err = PTR_ERR(eb.request);
> -		goto err_vma;
> -	}
> -
> -	if (unlikely(eb.gem_context->syncobj)) {
> -		struct dma_fence *fence;
> -
> -		fence = drm_syncobj_fence_get(eb.gem_context->syncobj);
> -		err = i915_request_await_dma_fence(eb.request, fence);
> -		dma_fence_put(fence);
> -		if (err)
> -			goto err_ext;
> -	}
> -
> -	if (in_fence) {
> -		if (args->flags & I915_EXEC_FENCE_SUBMIT)
> -			err = i915_request_await_execution(eb.request,
> -							   in_fence);
> -		else
> -			err = i915_request_await_dma_fence(eb.request,
> -							   in_fence);
> -		if (err < 0)
> -			goto err_request;
> -	}
> -
> -	if (eb.fences) {
> -		err = await_fence_array(&eb);
> -		if (err)
> -			goto err_request;
> -	}
> -
> -	if (out_fence_fd != -1) {
> -		out_fence = sync_file_create(&eb.request->fence);
> -		if (!out_fence) {
> -			err = -ENOMEM;
> +	out_fence = eb_requests_create(&eb, in_fence, out_fence_fd);
> +	if (IS_ERR(out_fence)) {
> +		err = PTR_ERR(out_fence);
> +		if (eb.requests[0])
>  			goto err_request;
> -		}
> +		else
> +			goto err_vma;
>  	}
>  
> -	/*
> -	 * Whilst this request exists, batch_obj will be on the
> -	 * active_list, and so will hold the active reference. Only when this
> -	 * request is retired will the the batch_obj be moved onto the
> -	 * inactive_list and lose its active reference. Hence we do not need
> -	 * to explicitly hold another reference here.
> -	 */
> -	eb.request->batch = batch;
> -	if (eb.batch_pool)
> -		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
> -
> -	trace_i915_request_queue(eb.request, eb.batch_flags);
> -	err = eb_submit(&eb, batch);
> +	err = eb_submit(&eb);
>  
>  err_request:
> -	i915_request_get(eb.request);
> -	err = eb_request_add(&eb, err);
> +	eb_requests_get(&eb);
> +	err = eb_requests_add(&eb, err);
>  
>  	if (eb.fences)
> -		signal_fence_array(&eb);
> +		signal_fence_array(&eb, eb.composite_fence ?
> +				   eb.composite_fence :
> +				   &eb.requests[0]->fence);
>  
>  	if (out_fence) {
>  		if (err == 0) {
> @@ -2990,10 +3260,15 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	if (unlikely(eb.gem_context->syncobj)) {
>  		drm_syncobj_replace_fence(eb.gem_context->syncobj,
> -					  &eb.request->fence);
> +					  eb.composite_fence ?
> +					  eb.composite_fence :
> +					  &eb.requests[0]->fence);
>  	}
>  
> -	i915_request_put(eb.request);
> +	if (!out_fence && eb.composite_fence)
> +		dma_fence_put(eb.composite_fence);
> +
> +	eb_requests_put(&eb);
>  
>  err_vma:
>  	eb_release_vmas(&eb, true);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index 9dcc1b14697b..1f6a5ae3e33e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -237,7 +237,13 @@ intel_context_timeline_lock(struct intel_context *ce)
>  	struct intel_timeline *tl = ce->timeline;
>  	int err;
>  
> -	err = mutex_lock_interruptible(&tl->mutex);
> +	if (intel_context_is_parent(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex, 0);
> +	else if (intel_context_is_child(ce))
> +		err = mutex_lock_interruptible_nested(&tl->mutex,
> +						      ce->guc_child_index + 1);
> +	else
> +		err = mutex_lock_interruptible(&tl->mutex);
>  	if (err)
>  		return ERR_PTR(err);
>  
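
For anyone wondering about the _nested() variants above: the parent and
every child own distinct timeline mutexes that all get taken on the
execbuf path, so each context is given its own lockdep subclass (0 for
the parent, guc_child_index + 1 for a child) to keep lockdep from
flagging the parent-then-child acquisitions as recursive locking.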
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 727f91e7f7c2..094fcfb5cbe1 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -261,6 +261,18 @@ struct intel_context {
>  		 * context.
>  		 */
>  		struct i915_request *last_rq;
> +
> +		/**
> +		 * @fence_context: fence context used for the composite fence
> +		 * when doing parallel submission
> +		 */
> +		u64 fence_context;
> +
> +		/**
> +		 * @seqno: seqno for composite fence when doing parallel
> +		 * submission
> +		 */
> +		u32 seqno;
>  	};
>  
>  #ifdef CONFIG_DRM_I915_SELFTEST
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 1a18f99bf12a..2ef38557b0f0 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3034,6 +3034,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
>  		}
>  	}
>  
> +	parent->fence_context = dma_fence_context_alloc(1);
> +
>  	parent->engine->emit_bb_start =
>  		emit_bb_start_parent_no_preempt_mid_batch;
>  	parent->engine->emit_fini_breadcrumb =
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 8f0073e19079..602cc246ba85 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -147,6 +147,15 @@ enum {
>  	 * tail.
>  	 */
>  	I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +
> +	/*
> +	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
> +	 * parent-child relationship (parallel submission, multi-lrc) that
> +	 * hit an error while generating requests in the execbuf IOCTL.
> +	 * Indicates this request should be skipped as another request in
> +	 * the submission / relationship encountered an error.
> +	 */
> +	I915_FENCE_FLAG_SKIP_PARALLEL,
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index 4b7fc4647e46..90546fa58fc1 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -1234,9 +1234,10 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>  	return i915_active_add_request(&vma->active, rq);
>  }
>  
> -int i915_vma_move_to_active(struct i915_vma *vma,
> -			    struct i915_request *rq,
> -			    unsigned int flags)
> +int _i915_vma_move_to_active(struct i915_vma *vma,
> +			     struct i915_request *rq,
> +			     struct dma_fence *fence,
> +			     unsigned int flags)
>  {
>  	struct drm_i915_gem_object *obj = vma->obj;
>  	int err;
> @@ -1257,9 +1258,11 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>  			intel_frontbuffer_put(front);
>  		}
>  
> -		dma_resv_add_excl_fence(vma->resv, &rq->fence);
> -		obj->write_domain = I915_GEM_DOMAIN_RENDER;
> -		obj->read_domains = 0;
> +		if (fence) {
> +			dma_resv_add_excl_fence(vma->resv, fence);
> +			obj->write_domain = I915_GEM_DOMAIN_RENDER;
> +			obj->read_domains = 0;
> +		}
>  	} else {
>  		if (!(flags & __EXEC_OBJECT_NO_RESERVE)) {
>  			err = dma_resv_reserve_shared(vma->resv, 1);
> @@ -1267,8 +1270,10 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>  				return err;
>  		}
>  
> -		dma_resv_add_shared_fence(vma->resv, &rq->fence);
> -		obj->write_domain = 0;
> +		if (fence) {
> +			dma_resv_add_shared_fence(vma->resv, fence);
> +			obj->write_domain = 0;
> +		}
>  	}
>  
>  	if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence)
> diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> index ed69f66c7ab0..648dbe744c96 100644
> --- a/drivers/gpu/drm/i915/i915_vma.h
> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -57,9 +57,16 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
>  
>  int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
>  					   struct i915_request *rq);
> -int __must_check i915_vma_move_to_active(struct i915_vma *vma,
> -					 struct i915_request *rq,
> -					 unsigned int flags);
> +int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
> +					  struct i915_request *rq,
> +					  struct dma_fence *fence,
> +					  unsigned int flags);
> +static inline int __must_check
> +i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq,
> +			unsigned int flags)
> +{
> +	return _i915_vma_move_to_active(vma, rq, &rq->fence, flags);
> +}
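
With the wrapper in place all existing callers compile unchanged, while
the multi-BB execbuf path can attach the composite fence instead of the
per-request fence. Presumably eb_move_to_gpu() (earlier in this patch)
ends up with something along these lines; hypothetical sketch, not the
exact hunk:

	err = _i915_vma_move_to_active(vma, rq,
				       eb->composite_fence ?
				       eb->composite_fence :
				       &rq->fence,
				       flags);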
>  
>  #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
>  
> -- 
> 2.32.0
> 
