intel-gfx.lists.freedesktop.org archive mirror
* [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf
@ 2021-08-03 22:28 Matthew Brost
  2021-08-03 22:28 ` [Intel-gfx] [PATCH 01/46] drm/i915/guc: Allow flexible number of context ids Matthew Brost
                   ` (50 more replies)
  0 siblings, 51 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:28 UTC (permalink / raw)
  To: intel-gfx, dri-devel

As discussed in [1] we are introducing a new parallel submission uAPI
for the i915 driver which allows more than 1 BB to be submitted in a
single execbuf IOCTL. This is the implementation for both GuC and
execlists.
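
For context, a rough sketch of the intended userspace flow, based on the
uAPI proposal in [1], is below. The struct and field names follow the
proposed i915_drm.h additions; the exact details may differ from what
finally lands with this series, and error handling / full setup is
omitted:

	/* Configure a parallel slot on the context (illustrative sketch) */
	struct i915_context_engines_parallel_submit parallel = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.engine_index = 0,
		.width = 2,		/* two BBs submitted together */
		.num_siblings = 1,
		/* .engines[] lists the physical engine placements */
	};
	/* ...chained into I915_CONTEXT_PARAM_ENGINES via set_param... */

	/* A single execbuf then carries both BBs on that context */
	struct drm_i915_gem_execbuffer2 execbuf = {
		.buffers_ptr = (uintptr_t)objs,	/* BBs at the end of the list */
		.buffer_count = num_objs,
		.rsvd1 = ctx_id,		/* the parallel context */
	};
	drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);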

In addition to the selftests included in this series, an IGT test is
available; it is implemented in the first 4 patches of [2].

Media UMD changes are expected to land soon.

This series is broken into 5 parts.

1. A series of GuC patches which introduces a state machine to handle
flow control conditions gracefully (e.g. without punting them to the
user). These are patches 1-12.

2. Update the GuC backend and its connections to the uAPI to configure
it for parallel submission. These are patches 13-30.

3. Update the execbuf IOCTL to accept more than 1 BB in a single IOCTL.
These are patches 31-44.

4. A weak execlists implementation for parallel submission. Patch 45.

5. Add a heuristic for when to issue schedule disables after unpin
(rather than always doing so immediately). Not all that related to the
rest, but I wanted to get this out on the list for review; it is based
on the tip of all of these patches. Patch 46.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

[1] https://patchwork.freedesktop.org/series/92028/
[2] https://patchwork.freedesktop.org/series/93071/

Matthew Brost (46):
  drm/i915/guc: Allow flexible number of context ids
  drm/i915/guc: Connect the number of guc_ids to debugfs
  drm/i915/guc: Don't return -EAGAIN to user when guc_ids exhausted
  drm/i915/guc: Don't allow requests not ready to consume all guc_ids
  drm/i915/guc: Introduce guc_submit_engine object
  drm/i915/guc: Check return of __xa_store when registering a context
  drm/i915/guc: Non-static lrc descriptor registration buffer
  drm/i915/guc: Take GT PM ref when deregistering context
  drm/i915: Add GT PM unpark worker
  drm/i915/guc: Take engine PM when a context is pinned with GuC
    submission
  drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  drm/i915/guc: Selftest for GuC flow control
  drm/i915: Add logical engine mapping
  drm/i915: Expose logical engine instance to user
  drm/i915/guc: Introduce context parent-child relationship
  drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  drm/i915/guc: Add multi-lrc context registration
  drm/i915/guc: Ensure GuC schedule operations do not operate on child
    contexts
  drm/i915/guc: Assign contexts in parent-child relationship consecutive
    guc_ids
  drm/i915/guc: Add hang check to GuC submit engine
  drm/i915/guc: Add guc_child_context_destroy
  drm/i915/guc: Implement multi-lrc submission
  drm/i915/guc: Insert submit fences between requests in parent-child
    relationship
  drm/i915/guc: Implement multi-lrc reset
  drm/i915/guc: Update debugfs for GuC multi-lrc
  drm/i915: Connect UAPI to GuC multi-lrc interface
  drm/i915/doc: Update parallel submit doc to point to i915_drm.h
  drm/i915/guc: Add basic GuC multi-lrc selftest
  drm/i915/guc: Extend GuC flow control selftest for multi-lrc
  drm/i915/guc: Implement no mid batch preemption for multi-lrc
  drm/i915: Move secure execbuf check to execbuf2
  drm/i915: Move input/exec fence handling to i915_gem_execbuffer2
  drm/i915: Move output fence handling to i915_gem_execbuffer2
  drm/i915: Return output fence from i915_gem_do_execbuffer
  drm/i915: Store batch index in struct i915_execbuffer
  drm/i915: Allow callers of i915_gem_do_execbuffer to override the
    batch index
  drm/i915: Teach execbuf there can be more than one batch in the
    objects list
  drm/i915: Only track object dependencies on first request
  drm/i915: Force parallel contexts to use copy engine for reloc
  drm/i915: Multi-batch execbuffer2
  drm/i915: Eliminate unnecessary VMA calls for multi-BB submission
  drm/i915: Hold all parallel requests until last request, properly
    handle error
  drm/i915/guc: Handle errors in multi-lrc requests
  drm/i915: Enable multi-bb execbuf
  drm/i915/execlists: Weak parallel submission support for execlists
  drm/i915/guc: Add delay before disabling scheduling on contexts

 Documentation/gpu/rfc/i915_parallel_execbuf.h |  122 -
 Documentation/gpu/rfc/i915_scheduler.rst      |    4 +-
 drivers/gpu/drm/i915/Makefile                 |    1 +
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  159 +-
 .../gpu/drm/i915/gem/i915_gem_context_types.h |    6 +
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  582 ++-
 .../i915/gem/selftests/i915_gem_coherency.c   |    2 +-
 .../drm/i915/gem/selftests/i915_gem_dmabuf.c  |    2 +-
 .../i915/gem/selftests/i915_gem_execbuffer.c  |   14 +-
 .../drm/i915/gem/selftests/i915_gem_mman.c    |    2 +-
 .../drm/i915/gem/selftests/i915_gem_object.c  |    2 +-
 drivers/gpu/drm/i915/gt/intel_context.c       |  236 +-
 drivers/gpu/drm/i915/gt/intel_context.h       |   81 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |   62 +-
 drivers/gpu/drm/i915/gt/intel_engine.h        |   12 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   66 +-
 drivers/gpu/drm/i915/gt/intel_engine_pm.c     |    4 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.h     |    5 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |    1 +
 .../drm/i915/gt/intel_execlists_submission.c  |  233 +-
 drivers/gpu/drm/i915/gt/intel_gt.c            |    3 +
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         |    8 +
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |   13 +
 .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c |   35 +
 .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h |   32 +
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |    3 +
 drivers/gpu/drm/i915/gt/intel_lrc.c           |   31 +-
 drivers/gpu/drm/i915/gt/intel_lrc.h           |    6 +-
 drivers/gpu/drm/i915/gt/intel_reset.c         |   10 +
 .../gpu/drm/i915/gt/intel_ring_submission.c   |    5 +-
 drivers/gpu/drm/i915/gt/mock_engine.c         |    4 +-
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |   12 +-
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |    1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   46 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |    2 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |   43 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h     |    9 +
 .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |   59 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   10 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 3159 +++++++++++++++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |    2 +
 .../i915/gt/uc/intel_guc_submission_types.h   |   67 +
 .../i915/gt/uc/selftest_guc_flow_control.c    |  891 +++++
 .../drm/i915/gt/uc/selftest_guc_multi_lrc.c   |  179 +
 drivers/gpu/drm/i915/i915_query.c             |    2 +
 drivers/gpu/drm/i915/i915_request.c           |  120 +-
 drivers/gpu/drm/i915/i915_request.h           |   23 +
 drivers/gpu/drm/i915/i915_scheduler.c         |   22 +-
 drivers/gpu/drm/i915/i915_scheduler.h         |    3 +
 drivers/gpu/drm/i915/i915_selftest.h          |    2 +
 drivers/gpu/drm/i915/i915_trace.h             |   10 +
 drivers/gpu/drm/i915/i915_vma.c               |   13 +-
 drivers/gpu/drm/i915/i915_vma.h               |   16 +-
 drivers/gpu/drm/i915/intel_wakeref.c          |    5 +
 drivers/gpu/drm/i915/intel_wakeref.h          |    1 +
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |    2 +-
 .../drm/i915/selftests/i915_live_selftests.h  |    2 +
 drivers/gpu/drm/i915/selftests/i915_perf.c    |    2 +-
 drivers/gpu/drm/i915/selftests/i915_request.c |    2 +-
 drivers/gpu/drm/i915/selftests/i915_vma.c     |    2 +-
 .../i915/selftests/intel_scheduler_helpers.c  |   12 +
 .../i915/selftests/intel_scheduler_helpers.h  |    2 +
 include/uapi/drm/i915_drm.h                   |  136 +-
 63 files changed, 5741 insertions(+), 862 deletions(-)
 delete mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
 create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
 create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
 create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c

-- 
2.28.0



* [Intel-gfx] [PATCH 01/46] drm/i915/guc: Allow flexible number of context ids
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
@ 2021-08-03 22:28 ` Matthew Brost
  2021-08-03 22:28 ` [Intel-gfx] [PATCH 02/46] drm/i915/guc: Connect the number of guc_ids to debugfs Matthew Brost
                   ` (49 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:28 UTC (permalink / raw)
  To: intel-gfx, dri-devel

The number of available GuC context ids might be limited.
Stop referring to the macro directly in the code and use a variable instead.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h           |  2 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c    | 16 +++++++++-------
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index a9547069ee7e..1d7cb118e70f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -53,6 +53,8 @@ struct intel_guc {
 	 */
 	spinlock_t contexts_lock;
 	struct ida guc_ids;
+	u32 num_guc_ids;
+	u32 max_guc_ids;
 	struct list_head guc_id_list;
 
 	bool submission_supported;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 89ff0e4b4bc7..abfccec7d062 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -275,7 +275,7 @@ static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
 {
 	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
 
-	GEM_BUG_ON(index >= GUC_MAX_LRC_DESCRIPTORS);
+	GEM_BUG_ON(index >= guc->max_guc_ids);
 
 	return &base[index];
 }
@@ -284,7 +284,7 @@ static inline struct intel_context *__get_context(struct intel_guc *guc, u32 id)
 {
 	struct intel_context *ce = xa_load(&guc->context_lookup, id);
 
-	GEM_BUG_ON(id >= GUC_MAX_LRC_DESCRIPTORS);
+	GEM_BUG_ON(id >= guc->max_guc_ids);
 
 	return ce;
 }
@@ -294,8 +294,7 @@ static int guc_lrc_desc_pool_create(struct intel_guc *guc)
 	u32 size;
 	int ret;
 
-	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) *
-			  GUC_MAX_LRC_DESCRIPTORS);
+	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
 	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
 					     (void **)&guc->lrc_desc_pool_vaddr);
 	if (ret)
@@ -1070,7 +1069,7 @@ static void guc_submit_request(struct i915_request *rq)
 static int new_guc_id(struct intel_guc *guc)
 {
 	return ida_simple_get(&guc->guc_ids, 0,
-			      GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL |
+			      guc->num_guc_ids, GFP_KERNEL |
 			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
 }
 
@@ -2562,6 +2561,8 @@ static bool __guc_submission_selected(struct intel_guc *guc)
 
 void intel_guc_submission_init_early(struct intel_guc *guc)
 {
+	guc->max_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
+	guc->num_guc_ids = GUC_MAX_LRC_DESCRIPTORS;
 	guc->submission_supported = __guc_submission_supported(guc);
 	guc->submission_selected = __guc_submission_selected(guc);
 }
@@ -2571,7 +2572,7 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 {
 	struct intel_context *ce;
 
-	if (unlikely(desc_idx >= GUC_MAX_LRC_DESCRIPTORS)) {
+	if (unlikely(desc_idx >= guc->max_guc_ids)) {
 		drm_err(&guc_to_gt(guc)->i915->drm,
 			"Invalid desc_idx %u", desc_idx);
 		return NULL;
@@ -2874,6 +2875,8 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 
 	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
 		   atomic_read(&guc->outstanding_submission_g2h));
+	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
+	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
 	drm_printf(p, "GuC tasklet count: %u\n\n",
 		   atomic_read(&sched_engine->tasklet.count));
 
@@ -2913,7 +2916,6 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 {
 	struct intel_context *ce;
 	unsigned long index;
-
 	xa_for_each(&guc->context_lookup, index, ce) {
 		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
 		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
-- 
2.28.0



* [Intel-gfx] [PATCH 02/46] drm/i915/guc: Connect the number of guc_ids to debugfs
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
  2021-08-03 22:28 ` [Intel-gfx] [PATCH 01/46] drm/i915/guc: Allow flexible number of context ids Matthew Brost
@ 2021-08-03 22:28 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 03/46] drm/i915/guc: Don't return -EAGAIN to user when guc_ids exhausted Matthew Brost
                   ` (48 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:28 UTC (permalink / raw)
  To: intel-gfx, dri-devel

For testing purposes it may make sense to reduce the number of guc_ids
available to be allocated. Add debugfs support for setting the number of
guc_ids.
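
As a usage note, the new guc_num_id entry is registered alongside the
existing GuC debugfs files (e.g. guc_info), typically under the GT's
"uc" debugfs directory, so the current value can be read back with cat
and reduced for testing with a plain echo. Writes are clamped to the
[256, max_guc_ids] range.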

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    | 31 +++++++++++++++++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  3 +-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
index 72ddfff42f7d..7c479c5e7b3a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
@@ -50,11 +50,42 @@ static int guc_registered_contexts_show(struct seq_file *m, void *data)
 }
 DEFINE_GT_DEBUGFS_ATTRIBUTE(guc_registered_contexts);
 
+static int guc_num_id_get(void *data, u64 *val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	*val = guc->num_guc_ids;
+
+	return 0;
+}
+
+static int guc_num_id_set(void *data, u64 val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	if (val > guc->max_guc_ids)
+		val = guc->max_guc_ids;
+	else if (val < 256)
+		val = 256;
+
+	guc->num_guc_ids = val;
+
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
+
 void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
 {
 	static const struct debugfs_gt_file files[] = {
 		{ "guc_info", &guc_info_fops, NULL },
 		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
+		{ "guc_num_id", &guc_num_id_fops, NULL },
 	};
 
 	if (!intel_guc_is_supported(guc))
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index abfccec7d062..3b555c05c01c 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2574,7 +2574,8 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 
 	if (unlikely(desc_idx >= guc->max_guc_ids)) {
 		drm_err(&guc_to_gt(guc)->i915->drm,
-			"Invalid desc_idx %u", desc_idx);
+			"Invalid desc_idx %u, max %u",
+			desc_idx, guc->max_guc_ids);
 		return NULL;
 	}
 
-- 
2.28.0



* [Intel-gfx] [PATCH 03/46] drm/i915/guc: Don't return -EAGAIN to user when guc_ids exhausted
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
  2021-08-03 22:28 ` [Intel-gfx] [PATCH 01/46] drm/i915/guc: Allow flexible number of context ids Matthew Brost
  2021-08-03 22:28 ` [Intel-gfx] [PATCH 02/46] drm/i915/guc: Connect the number of guc_ids to debugfs Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-05  8:27   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 04/46] drm/i915/guc: Don't allow requests not ready to consume all guc_ids Matthew Brost
                   ` (47 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Rather than returning -EAGAIN to the user when no guc_ids are available,
implement a fair sharing algorithm in the kernel which blocks submissions
until guc_ids become available. Submissions are released one at a time,
based on priority, until the guc_id pressure is relieved, to ensure fair
sharing of the guc_ids. Once the pressure is fully relieved, the normal
guc_id allocation (at request creation time in guc_request_alloc) can
resume, as this allocation path should be significantly faster and a fair
sharing algorithm isn't needed when guc_ids are plentiful.

The fair sharing algorithm is implemented by forcing all submissions
through the tasklet, which serializes submissions and dequeues them one
at a time.

If the submission doesn't have a guc_id and a new guc_id can't be
allocated, two lists are searched: one list with contexts that are not
pinned but still registered with the GuC (searched first) and another
list with contexts that are pinned but do not have any submissions
actively in flight (scheduling enabled + registered, searched second). If
no guc_id can be found we kick a workqueue which will retire requests,
hopefully freeing a guc_id. The workqueue and tasklet ping-pong back and
forth until a guc_id can be found.
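
A condensed sketch of that tasklet / workqueue handoff is below. The
pin_guc_id_from_lists() helper is a stand-in for illustration; the other
names are taken from the patch:

	/* Tasklet side (simplified, illustrative only) */
	if (pin_guc_id_from_lists(guc, rq) < 0) {
		guc->stalled_rq = rq;
		guc->submission_stall_reason = STALL_GUC_ID_WORKQUEUE;
		queue_work(system_unbound_wq, &guc->retire_worker);
		return;	/* tasklet stays blocked until kicked again */
	}

	/* retire_worker side (simplified, illustrative only) */
	intel_gt_retire_requests(guc_to_gt(guc));	/* may free guc_ids */
	guc->submission_stall_reason = STALL_GUC_ID_TASKLET;
	clr_tasklet_blocked(guc);
	kick_tasklet(guc);	/* ping-pong back to the tasklet */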

Once a guc_id is found, we may have to disable context scheduling
depending on which list the context is stolen from. When we disable
scheduling, we block the tasklet from executing until the completion G2H
returns. The schedule disable must be issued from the workqueue because
of the locking structure. When we deregister a context, we also do the
same thing (wait on the G2H) but we can safely issue the deregister H2G
from the tasklet.

Once all the G2H have returned we can trigger a submission on the
context.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 805 ++++++++++++++++--
 drivers/gpu/drm/i915/i915_request.h           |   6 +
 4 files changed, 754 insertions(+), 86 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index e54351a170e2..8ed964ef967b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -185,6 +185,9 @@ struct intel_context {
 	/* GuC LRC descriptor reference count */
 	atomic_t guc_id_ref;
 
+	/* Number of rq submitted without a guc_id */
+	u16 guc_num_rq_submit_no_id;
+
 	/*
 	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
 	 */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 1d7cb118e70f..e76579396efd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -33,7 +33,28 @@ struct intel_guc {
 
 	/* Global engine used to submit requests to GuC */
 	struct i915_sched_engine *sched_engine;
-	struct i915_request *stalled_request;
+
+	/* Global state related to submission tasklet */
+	struct i915_request *stalled_rq;
+	struct intel_context *stalled_context;
+	struct work_struct retire_worker;
+	unsigned long flags;
+	int total_num_rq_with_no_guc_id;
+
+	/*
+	 * Submission stall reason. See intel_guc_submission.c for detailed
+	 * description.
+	 */
+	enum {
+		STALL_NONE,
+		STALL_GUC_ID_WORKQUEUE,
+		STALL_GUC_ID_TASKLET,
+		STALL_SCHED_DISABLE,
+		STALL_REGISTER_CONTEXT,
+		STALL_DEREGISTER_CONTEXT,
+		STALL_MOVE_LRC_TAIL,
+		STALL_ADD_REQUEST,
+	} submission_stall_reason;
 
 	/* intel_guc_recv interrupt related state */
 	spinlock_t irq_lock;
@@ -55,7 +76,8 @@ struct intel_guc {
 	struct ida guc_ids;
 	u32 num_guc_ids;
 	u32 max_guc_ids;
-	struct list_head guc_id_list;
+	struct list_head guc_id_list_no_ref;
+	struct list_head guc_id_list_unpinned;
 
 	bool submission_supported;
 	bool submission_selected;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3b555c05c01c..f42a707f60ca 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -59,6 +59,25 @@
  * ELSP context descriptor dword into Work Item.
  * See guc_add_request()
  *
+ * GuC flow control state machine:
+ * The tasklet, the workqueue (retire_worker), and the G2H handlers together
+ * more or less form a state machine which is used to submit requests and
+ * flow control those requests while waiting on resources / actions, if
+ * necessary. The enum submission_stall_reason controls the handoff of stalls
+ * between these entities, with stalled_rq & stalled_context being the
+ * arguments. Each state is described below.
+ *
+ * STALL_NONE			No stall condition
+ * STALL_GUC_ID_WORKQUEUE	Workqueue will try to free guc_ids
+ * STALL_GUC_ID_TASKLET		Tasklet will try to find guc_id
+ * STALL_SCHED_DISABLE		Workqueue will issue context schedule disable
+ *				H2G
+ * STALL_REGISTER_CONTEXT	Tasklet needs to register context
+ * STALL_DEREGISTER_CONTEXT	G2H handler is waiting for context deregister,
+ *				will register context upon receipt of G2H
+ * STALL_MOVE_LRC_TAIL		Tasklet will try to move LRC tail
+ * STALL_ADD_REQUEST		Tasklet will try to add the request (submit
+ *				context)
  */
 
 /* GuC Virtual Engine */
@@ -72,6 +91,83 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
+/*
+ * Global GuC flags helper functions
+ */
+enum {
+	GUC_STATE_TASKLET_BLOCKED,
+	GUC_STATE_GUC_IDS_EXHAUSTED,
+};
+
+static bool tasklet_blocked(struct intel_guc *guc)
+{
+	return test_bit(GUC_STATE_TASKLET_BLOCKED, &guc->flags);
+}
+
+static void set_tasklet_blocked(struct intel_guc *guc)
+{
+	lockdep_assert_held(&guc->sched_engine->lock);
+	set_bit(GUC_STATE_TASKLET_BLOCKED, &guc->flags);
+}
+
+static void __clr_tasklet_blocked(struct intel_guc *guc)
+{
+	lockdep_assert_held(&guc->sched_engine->lock);
+	clear_bit(GUC_STATE_TASKLET_BLOCKED, &guc->flags);
+}
+
+static void clr_tasklet_blocked(struct intel_guc *guc)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->sched_engine->lock, flags);
+	__clr_tasklet_blocked(guc);
+	spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
+}
+
+static bool guc_ids_exhausted(struct intel_guc *guc)
+{
+	return test_bit(GUC_STATE_GUC_IDS_EXHAUSTED, &guc->flags);
+}
+
+static bool test_and_update_guc_ids_exhausted(struct intel_guc *guc)
+{
+	unsigned long flags;
+	bool ret = false;
+
+	/*
+	 * Strict ordering on checking if guc_ids are exhausted isn't required,
+	 * so let's avoid grabbing the submission lock if possible.
+	 */
+	if (guc_ids_exhausted(guc)) {
+		spin_lock_irqsave(&guc->sched_engine->lock, flags);
+		ret = guc_ids_exhausted(guc);
+		if (ret)
+			++guc->total_num_rq_with_no_guc_id;
+		spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
+	}
+
+	return ret;
+}
+
+static void set_and_update_guc_ids_exhausted(struct intel_guc *guc)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->sched_engine->lock, flags);
+	++guc->total_num_rq_with_no_guc_id;
+	set_bit(GUC_STATE_GUC_IDS_EXHAUSTED, &guc->flags);
+	spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
+}
+
+static void clr_guc_ids_exhausted(struct intel_guc *guc)
+{
+	lockdep_assert_held(&guc->sched_engine->lock);
+	GEM_BUG_ON(guc->total_num_rq_with_no_guc_id);
+
+	clear_bit(GUC_STATE_GUC_IDS_EXHAUSTED, &guc->flags);
+}
+
 /*
  * Below is a set of functions which control the GuC scheduling state which do
  * not require a lock as all state transitions are mutually exclusive. i.e. It
@@ -82,6 +178,9 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 #define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
 #define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
 #define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
+#define SCHED_STATE_NO_LOCK_BLOCK_TASKLET		BIT(3)
+#define SCHED_STATE_NO_LOCK_GUC_ID_STOLEN		BIT(4)
+#define SCHED_STATE_NO_LOCK_NEEDS_REGISTER		BIT(5)
 static inline bool context_enabled(struct intel_context *ce)
 {
 	return (atomic_read(&ce->guc_sched_state_no_lock) &
@@ -135,6 +234,60 @@ static inline void clr_context_registered(struct intel_context *ce)
 		   &ce->guc_sched_state_no_lock);
 }
 
+static inline bool context_block_tasklet(struct intel_context *ce)
+{
+	return (atomic_read(&ce->guc_sched_state_no_lock) &
+		SCHED_STATE_NO_LOCK_BLOCK_TASKLET);
+}
+
+static inline void set_context_block_tasklet(struct intel_context *ce)
+{
+	atomic_or(SCHED_STATE_NO_LOCK_BLOCK_TASKLET,
+		  &ce->guc_sched_state_no_lock);
+}
+
+static inline void clr_context_block_tasklet(struct intel_context *ce)
+{
+	atomic_and((u32)~SCHED_STATE_NO_LOCK_BLOCK_TASKLET,
+		   &ce->guc_sched_state_no_lock);
+}
+
+static inline bool context_guc_id_stolen(struct intel_context *ce)
+{
+	return (atomic_read(&ce->guc_sched_state_no_lock) &
+		SCHED_STATE_NO_LOCK_GUC_ID_STOLEN);
+}
+
+static inline void set_context_guc_id_stolen(struct intel_context *ce)
+{
+	atomic_or(SCHED_STATE_NO_LOCK_GUC_ID_STOLEN,
+		  &ce->guc_sched_state_no_lock);
+}
+
+static inline void clr_context_guc_id_stolen(struct intel_context *ce)
+{
+	atomic_and((u32)~SCHED_STATE_NO_LOCK_GUC_ID_STOLEN,
+		   &ce->guc_sched_state_no_lock);
+}
+
+static inline bool context_needs_register(struct intel_context *ce)
+{
+	return (atomic_read(&ce->guc_sched_state_no_lock) &
+		SCHED_STATE_NO_LOCK_NEEDS_REGISTER);
+}
+
+static inline void set_context_needs_register(struct intel_context *ce)
+{
+	atomic_or(SCHED_STATE_NO_LOCK_NEEDS_REGISTER,
+		  &ce->guc_sched_state_no_lock);
+}
+
+static inline void clr_context_needs_register(struct intel_context *ce)
+{
+	atomic_and((u32)~SCHED_STATE_NO_LOCK_NEEDS_REGISTER,
+		   &ce->guc_sched_state_no_lock);
+}
+
 /*
  * Below is a set of functions which control the GuC scheduling state which
  * require a lock, aside from the special case where the functions are called
@@ -418,9 +571,12 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
 					      true, timeout);
 }
 
-static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
+static inline bool request_has_no_guc_id(struct i915_request *rq)
+{
+	return test_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
+}
 
-static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
+static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 {
 	int err = 0;
 	struct intel_context *ce = rq->context;
@@ -439,18 +595,15 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		goto out;
 	}
 
+	/* Ensure context is in correct state before a submission */
+	GEM_BUG_ON(ce->guc_num_rq_submit_no_id);
+	GEM_BUG_ON(request_has_no_guc_id(rq));
 	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
+	GEM_BUG_ON(context_needs_register(ce));
 	GEM_BUG_ON(context_guc_id_invalid(ce));
-
-	/*
-	 * Corner case where the GuC firmware was blown away and reloaded while
-	 * this context was pinned.
-	 */
-	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
-		err = guc_lrc_desc_pin(ce, false);
-		if (unlikely(err))
-			goto out;
-	}
+	GEM_BUG_ON(context_pending_disable(ce));
+	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
+	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id));
 
 	/*
 	 * The request / context will be run on the hardware when scheduling
@@ -462,6 +615,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	enabled = context_enabled(ce);
 
 	if (!enabled) {
+		GEM_BUG_ON(context_pending_enable(ce));
+
 		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
 		action[len++] = ce->guc_id;
 		action[len++] = GUC_CONTEXT_ENABLE;
@@ -489,6 +644,67 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	return err;
 }
 
+static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
+{
+	int ret;
+
+	lockdep_assert_held(&guc->sched_engine->lock);
+
+	ret = __guc_add_request(guc, rq);
+	if (ret == -EBUSY) {
+		guc->stalled_rq = rq;
+		guc->submission_stall_reason = STALL_ADD_REQUEST;
+	} else {
+		guc->stalled_rq = NULL;
+		guc->submission_stall_reason = STALL_NONE;
+	}
+
+	return ret;
+}
+
+static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
+
+static int tasklet_register_context(struct intel_guc *guc,
+				    struct i915_request *rq)
+{
+	struct intel_context *ce = rq->context;
+	int ret = 0;
+
+	/* Check state */
+	lockdep_assert_held(&guc->sched_engine->lock);
+	GEM_BUG_ON(ce->guc_num_rq_submit_no_id);
+	GEM_BUG_ON(request_has_no_guc_id(rq));
+	GEM_BUG_ON(context_guc_id_invalid(ce));
+	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
+
+	/*
+	 * The guc_id was pinned from within the tasklet and we need to
+	 * register this context, or we hit the corner case where the GuC
+	 * firmware was blown away and reloaded while this context was pinned
+	 */
+	if (unlikely((!lrc_desc_registered(guc, ce->guc_id) ||
+		      context_needs_register(ce)) &&
+		     !intel_context_is_banned(ce))) {
+		GEM_BUG_ON(context_pending_disable(ce));
+		GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
+
+		ret = guc_lrc_desc_pin(ce, false);
+
+		if (likely(ret != -EBUSY))
+			clr_context_needs_register(ce);
+
+		if (unlikely(ret == -EBUSY)) {
+			guc->stalled_rq = rq;
+			guc->submission_stall_reason = STALL_REGISTER_CONTEXT;
+		} else if (unlikely(ret == -EINPROGRESS)) {
+			guc->stalled_rq = rq;
+			guc->submission_stall_reason = STALL_DEREGISTER_CONTEXT;
+		}
+	}
+
+	return ret;
+}
+
 static inline void guc_set_lrc_tail(struct i915_request *rq)
 {
 	rq->context->lrc_reg_state[CTX_RING_TAIL] =
@@ -500,77 +716,142 @@ static inline int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
+static void kick_retire_wq(struct intel_guc *guc)
+{
+	queue_work(system_unbound_wq, &guc->retire_worker);
+}
+
+static int tasklet_pin_guc_id(struct intel_guc *guc, struct i915_request *rq);
+
 static int guc_dequeue_one_context(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
-	struct i915_request *last = NULL;
-	bool submit = false;
+	struct i915_request *last = guc->stalled_rq;
+	bool submit = !!last;
 	struct rb_node *rb;
 	int ret;
 
 	lockdep_assert_held(&sched_engine->lock);
+	GEM_BUG_ON(guc->stalled_context);
+	GEM_BUG_ON(!submit && guc->submission_stall_reason);
 
-	if (guc->stalled_request) {
-		submit = true;
-		last = guc->stalled_request;
-		goto resubmit;
-	}
+	if (submit) {
+		/* Flow control conditions */
+		switch (guc->submission_stall_reason) {
+		case STALL_GUC_ID_TASKLET:
+			goto done;
+		case STALL_REGISTER_CONTEXT:
+			goto register_context;
+		case STALL_MOVE_LRC_TAIL:
+			goto move_lrc_tail;
+		case STALL_ADD_REQUEST:
+			goto add_request;
+		default:
+			GEM_BUG_ON("Invalid stall state");
+		}
+	} else {
+		GEM_BUG_ON(!guc->total_num_rq_with_no_guc_id &&
+			   guc_ids_exhausted(guc));
 
-	while ((rb = rb_first_cached(&sched_engine->queue))) {
-		struct i915_priolist *p = to_priolist(rb);
-		struct i915_request *rq, *rn;
+		while ((rb = rb_first_cached(&sched_engine->queue))) {
+			struct i915_priolist *p = to_priolist(rb);
+			struct i915_request *rq, *rn;
 
-		priolist_for_each_request_consume(rq, rn, p) {
-			if (last && rq->context != last->context)
-				goto done;
+			priolist_for_each_request_consume(rq, rn, p) {
+				if (last && rq->context != last->context)
+					goto done;
 
-			list_del_init(&rq->sched.link);
+				list_del_init(&rq->sched.link);
 
-			__i915_request_submit(rq);
+				__i915_request_submit(rq);
 
-			trace_i915_request_in(rq, 0);
-			last = rq;
-			submit = true;
-		}
+				trace_i915_request_in(rq, 0);
+				last = rq;
+				submit = true;
+			}
 
-		rb_erase_cached(&p->node, &sched_engine->queue);
-		i915_priolist_free(p);
+			rb_erase_cached(&p->node, &sched_engine->queue);
+			i915_priolist_free(p);
+		}
 	}
+
 done:
 	if (submit) {
+		struct intel_context *ce = last->context;
+
+		if (ce->guc_num_rq_submit_no_id) {
+			ret = tasklet_pin_guc_id(guc, last);
+			if (ret)
+				goto blk_tasklet_kick;
+		}
+
+register_context:
+		ret = tasklet_register_context(guc, last);
+		if (unlikely(ret == -EINPROGRESS)) {
+			goto blk_tasklet;
+		} else if (unlikely(ret == -EPIPE)) {
+			goto deadlk;
+		} else if (ret == -EBUSY) {
+			goto schedule_tasklet;
+		} else if (unlikely(ret != 0)) {
+			GEM_WARN_ON(ret);	/* Unexpected */
+			goto deadlk;
+		}
+
+move_lrc_tail:
 		guc_set_lrc_tail(last);
-resubmit:
+
+add_request:
 		ret = guc_add_request(guc, last);
-		if (unlikely(ret == -EPIPE))
+		if (unlikely(ret == -EPIPE)) {
+			goto deadlk;
+		} else if (ret == -EBUSY) {
+			goto schedule_tasklet;
+		} else if (unlikely(ret != 0)) {
+			GEM_WARN_ON(ret);	/* Unexpected */
 			goto deadlk;
-		else if (ret == -EBUSY) {
-			tasklet_schedule(&sched_engine->tasklet);
-			guc->stalled_request = last;
-			return false;
 		}
 	}
 
-	guc->stalled_request = NULL;
+	/*
+	 * No requests without a guc_id, enable guc_id allocation at request
+	 * creation time (guc_request_alloc).
+	 */
+	if (!guc->total_num_rq_with_no_guc_id)
+		clr_guc_ids_exhausted(guc);
+
 	return submit;
 
+schedule_tasklet:
+	tasklet_schedule(&sched_engine->tasklet);
+	return false;
+
 deadlk:
 	sched_engine->tasklet.callback = NULL;
 	tasklet_disable_nosync(&sched_engine->tasklet);
 	return false;
+
+blk_tasklet_kick:
+	kick_retire_wq(guc);
+blk_tasklet:
+	set_tasklet_blocked(guc);
+	return false;
 }
 
 static void guc_submission_tasklet(struct tasklet_struct *t)
 {
 	struct i915_sched_engine *sched_engine =
 		from_tasklet(sched_engine, t, tasklet);
+	struct intel_guc *guc = sched_engine->private_data;
 	unsigned long flags;
 	bool loop;
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
 
-	do {
-		loop = guc_dequeue_one_context(sched_engine->private_data);
-	} while (loop);
+	if (likely(!tasklet_blocked(guc)))
+		do {
+			loop = guc_dequeue_one_context(guc);
+		} while (loop);
 
 	i915_sched_engine_reset_on_empty(sched_engine);
 
@@ -653,6 +934,14 @@ submission_disabled(struct intel_guc *guc)
 			!__tasklet_is_enabled(&sched_engine->tasklet));
 }
 
+static void kick_tasklet(struct intel_guc *guc)
+{
+	struct i915_sched_engine * const sched_engine = guc->sched_engine;
+
+	if (likely(!tasklet_blocked(guc)))
+		tasklet_hi_schedule(&sched_engine->tasklet);
+}
+
 static void disable_submission(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
@@ -676,8 +965,16 @@ static void enable_submission(struct intel_guc *guc)
 	    __tasklet_enable(&sched_engine->tasklet)) {
 		GEM_BUG_ON(!guc->ct.enabled);
 
+		/* Reset tasklet state */
+		guc->stalled_rq = NULL;
+		if (guc->stalled_context)
+			intel_context_put(guc->stalled_context);
+		guc->stalled_context = NULL;
+		guc->submission_stall_reason = STALL_NONE;
+		guc->flags = 0;
+
 		/* And kick in case we missed a new request submission. */
-		tasklet_hi_schedule(&sched_engine->tasklet);
+		kick_tasklet(guc);
 	}
 	spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
 }
@@ -856,6 +1153,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
 out_replay:
 	guc_reset_state(ce, head, stalled);
 	__unwind_incomplete_requests(ce);
+	ce->guc_num_rq_submit_no_id = 0;
 	intel_context_put(ce);
 }
 
@@ -888,6 +1186,7 @@ static void guc_cancel_context_requests(struct intel_context *ce)
 	spin_lock(&ce->guc_active.lock);
 	list_for_each_entry(rq, &ce->guc_active.requests, sched.link)
 		i915_request_put(i915_request_mark_eio(rq));
+	ce->guc_num_rq_submit_no_id = 0;
 	spin_unlock(&ce->guc_active.lock);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
@@ -924,11 +1223,15 @@ guc_cancel_sched_engine_requests(struct i915_sched_engine *sched_engine)
 		struct i915_priolist *p = to_priolist(rb);
 
 		priolist_for_each_request_consume(rq, rn, p) {
+			struct intel_context *ce = rq->context;
+
 			list_del_init(&rq->sched.link);
 
 			__i915_request_submit(rq);
 
 			i915_request_put(i915_request_mark_eio(rq));
+
+			ce->guc_num_rq_submit_no_id = 0;
 		}
 
 		rb_erase_cached(&p->node, &sched_engine->queue);
@@ -980,6 +1283,51 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 	intel_gt_unpark_heartbeats(guc_to_gt(guc));
 }
 
+static void retire_worker_sched_disable(struct intel_guc *guc,
+					struct intel_context *ce);
+
+static void retire_worker_func(struct work_struct *w)
+{
+	struct intel_guc *guc =
+		container_of(w, struct intel_guc, retire_worker);
+
+	/*
+	 * It is possible that another thread issues the schedule disable + that
+	 * G2H completes moving the state machine further along to a point
+	 * where nothing needs to be done here. Let's be paranoid and kick the
+	 * tasklet in that case.
+	 */
+	if (guc->submission_stall_reason != STALL_SCHED_DISABLE &&
+	    guc->submission_stall_reason != STALL_GUC_ID_WORKQUEUE) {
+		kick_tasklet(guc);
+		return;
+	}
+
+	if (guc->submission_stall_reason == STALL_SCHED_DISABLE) {
+		GEM_BUG_ON(!guc->stalled_context);
+		GEM_BUG_ON(context_guc_id_invalid(guc->stalled_context));
+
+		retire_worker_sched_disable(guc, guc->stalled_context);
+	}
+
+	/*
+	 * guc_id pressure, always try to release it regardless of state,
+	 * albeit after possibly issuing a schedule disable as that is async
+	 * operation.
+	 */
+	intel_gt_retire_requests(guc_to_gt(guc));
+
+	if (guc->submission_stall_reason == STALL_GUC_ID_WORKQUEUE) {
+		GEM_BUG_ON(guc->stalled_context);
+
+		/* Hopefully guc_ids are now available, kick tasklet */
+		guc->submission_stall_reason = STALL_GUC_ID_TASKLET;
+		clr_tasklet_blocked(guc);
+
+		kick_tasklet(guc);
+	}
+}
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
@@ -1003,9 +1351,12 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
 
 	spin_lock_init(&guc->contexts_lock);
-	INIT_LIST_HEAD(&guc->guc_id_list);
+	INIT_LIST_HEAD(&guc->guc_id_list_no_ref);
+	INIT_LIST_HEAD(&guc->guc_id_list_unpinned);
 	ida_init(&guc->guc_ids);
 
+	INIT_WORK(&guc->retire_worker, retire_worker_func);
+
 	return 0;
 }
 
@@ -1022,10 +1373,28 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
 				 struct i915_request *rq,
 				 int prio)
 {
+	bool empty = i915_sched_engine_is_empty(sched_engine);
+
 	GEM_BUG_ON(!list_empty(&rq->sched.link));
 	list_add_tail(&rq->sched.link,
 		      i915_sched_lookup_priolist(sched_engine, prio));
 	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+
+	if (empty)
+		kick_tasklet(&rq->engine->gt->uc.guc);
+}
+
+static bool need_tasklet(struct intel_guc *guc, struct intel_context *ce)
+{
+	struct i915_sched_engine * const sched_engine =
+		ce->engine->sched_engine;
+
+	lockdep_assert_held(&sched_engine->lock);
+
+	return guc_ids_exhausted(guc) || submission_disabled(guc) ||
+		guc->stalled_rq || guc->stalled_context ||
+		!lrc_desc_registered(guc, ce->guc_id) ||
+		!i915_sched_engine_is_empty(sched_engine);
 }
 
 static int guc_bypass_tasklet_submit(struct intel_guc *guc,
@@ -1039,8 +1408,6 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
 
 	guc_set_lrc_tail(rq);
 	ret = guc_add_request(guc, rq);
-	if (ret == -EBUSY)
-		guc->stalled_request = rq;
 
 	if (unlikely(ret == -EPIPE))
 		disable_submission(guc);
@@ -1057,11 +1424,10 @@ static void guc_submit_request(struct i915_request *rq)
 	/* Will be called from irq-context when using foreign fences. */
 	spin_lock_irqsave(&sched_engine->lock, flags);
 
-	if (submission_disabled(guc) || guc->stalled_request ||
-	    !i915_sched_engine_is_empty(sched_engine))
+	if (need_tasklet(guc, rq->context))
 		queue_request(sched_engine, rq, rq_prio(rq));
 	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
-		tasklet_hi_schedule(&sched_engine->tasklet);
+		kick_tasklet(guc);
 
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
@@ -1093,32 +1459,71 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 }
 
-static int steal_guc_id(struct intel_guc *guc)
+/*
+ * We have two lists for guc_ids available to steal. One list is for contexts
+ * that to have a zero guc_id_ref but are still pinned (scheduling enabled, only
+ * available inside tasklet) and the other is for contexts that are not pinned
+ * but still registered (available both outside and inside tasklet). Stealing
+ * from the latter only requires a deregister H2G, while the former requires a
+ * schedule disable H2G + a deregister H2G.
+ */
+static struct list_head *get_guc_id_list(struct intel_guc *guc,
+					 bool unpinned)
+{
+	if (unpinned)
+		return &guc->guc_id_list_unpinned;
+	else
+		return &guc->guc_id_list_no_ref;
+}
+
+static int steal_guc_id(struct intel_guc *guc, bool unpinned)
 {
 	struct intel_context *ce;
 	int guc_id;
+	struct list_head *guc_id_list = get_guc_id_list(guc, unpinned);
 
 	lockdep_assert_held(&guc->contexts_lock);
 
-	if (!list_empty(&guc->guc_id_list)) {
-		ce = list_first_entry(&guc->guc_id_list,
+	if (!list_empty(guc_id_list)) {
+		ce = list_first_entry(guc_id_list,
 				      struct intel_context,
 				      guc_id_link);
 
+		/* Ensure context getting stolen in expected state */
 		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
 		GEM_BUG_ON(context_guc_id_invalid(ce));
+		GEM_BUG_ON(context_guc_id_stolen(ce));
 
 		list_del_init(&ce->guc_id_link);
 		guc_id = ce->guc_id;
 		clr_context_registered(ce);
-		set_context_guc_id_invalid(ce);
+
+		/*
+		 * If stealing from the pinned list, defer invalidating
+		 * the guc_id until the retire workqueue processes this
+		 * context.
+		 */
+		if (!unpinned) {
+			GEM_BUG_ON(guc->stalled_context);
+			guc->stalled_context = intel_context_get(ce);
+			set_context_guc_id_stolen(ce);
+		} else {
+			set_context_guc_id_invalid(ce);
+		}
+
 		return guc_id;
 	} else {
 		return -EAGAIN;
 	}
 }
 
-static int assign_guc_id(struct intel_guc *guc, u16 *out)
+enum {	/* Return values for pin_guc_id / assign_guc_id */
+	SAME_GUC_ID		= 0,
+	NEW_GUC_ID_DISABLED	= 1,
+	NEW_GUC_ID_ENABLED	= 2,
+};
+
+static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
 {
 	int ret;
 
@@ -1126,17 +1531,33 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
 
 	ret = new_guc_id(guc);
 	if (unlikely(ret < 0)) {
-		ret = steal_guc_id(guc);
-		if (ret < 0)
-			return ret;
+		ret = steal_guc_id(guc, true);
+		if (ret >= 0) {
+			*out = ret;
+			ret = NEW_GUC_ID_DISABLED;
+		} else if (ret < 0 && tasklet) {
+			/*
+			 * We only steal a guc_id from a context with scheduling
+			 * enabled if guc_ids are exhausted and we are submitting
+			 * from the tasklet.
+			 */
+			ret = steal_guc_id(guc, false);
+			if (ret >= 0) {
+				*out = ret;
+				ret = NEW_GUC_ID_ENABLED;
+			}
+		}
+	} else {
+		*out = ret;
+		ret = SAME_GUC_ID;
 	}
 
-	*out = ret;
-	return 0;
+	return ret;
 }
 
 #define PIN_GUC_ID_TRIES	4
-static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
+static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
+		      bool tasklet)
 {
 	int ret = 0;
 	unsigned long flags, tries = PIN_GUC_ID_TRIES;
@@ -1146,11 +1567,15 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 try_again:
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 
+	if (!tasklet && guc_ids_exhausted(guc)) {
+		ret = -EAGAIN;
+		goto out_unlock;
+	}
+
 	if (context_guc_id_invalid(ce)) {
-		ret = assign_guc_id(guc, &ce->guc_id);
-		if (ret)
+		ret = assign_guc_id(guc, &ce->guc_id, tasklet);
+		if (unlikely(ret < 0))
 			goto out_unlock;
-		ret = 1;	/* Indidcates newly assigned guc_id */
 	}
 	if (!list_empty(&ce->guc_id_link))
 		list_del_init(&ce->guc_id_link);
@@ -1166,8 +1591,11 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	 * attempting to retire more requests. Double the sleep period each
 	 * subsequent pass before finally giving up. The sleep period has max of
 	 * 100ms and minimum of 1ms.
+	 *
+	 * We only try this if outside the tasklet, inside the tasklet we have a
+	 * (slower, more complex, blocking) different flow control algorithm.
 	 */
-	if (ret == -EAGAIN && --tries) {
+	if (ret == -EAGAIN && --tries && !tasklet) {
 		if (PIN_GUC_ID_TRIES - tries > 1) {
 			unsigned int timeslice_shifted =
 				ce->engine->props.timeslice_duration_ms <<
@@ -1184,7 +1612,9 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	return ret;
 }
 
-static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
+static void unpin_guc_id(struct intel_guc *guc,
+			 struct intel_context *ce,
+			 bool unpinned)
 {
 	unsigned long flags;
 
@@ -1194,9 +1624,17 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
 		return;
 
 	spin_lock_irqsave(&guc->contexts_lock, flags);
-	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id_link) &&
-	    !atomic_read(&ce->guc_id_ref))
-		list_add_tail(&ce->guc_id_link, &guc->guc_id_list);
+
+	if (!list_empty(&ce->guc_id_link))
+		list_del_init(&ce->guc_id_link);
+
+	if (!context_guc_id_invalid(ce) && !context_guc_id_stolen(ce) &&
+	    !atomic_read(&ce->guc_id_ref)) {
+		struct list_head *head = get_guc_id_list(guc, unpinned);
+
+		list_add_tail(&ce->guc_id_link, head);
+	}
+
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 }
 
@@ -1300,6 +1738,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	int ret = 0;
 
 	GEM_BUG_ON(!engine->mask);
+	GEM_BUG_ON(context_guc_id_invalid(ce));
 
 	/*
 	 * Ensure LRC + CT vmas are is same region as write barrier is done
@@ -1342,6 +1781,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 		trace_intel_context_steal_guc_id(ce);
 		if (!loop) {
 			set_context_wait_for_deregister_to_register(ce);
+			set_context_block_tasklet(ce);
 			intel_context_get(ce);
 		} else {
 			bool disabled;
@@ -1369,7 +1809,14 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 			ret = deregister_context(ce, ce->guc_id, loop);
 		if (unlikely(ret == -EBUSY)) {
 			clr_context_wait_for_deregister_to_register(ce);
+			clr_context_block_tasklet(ce);
 			intel_context_put(ce);
+		} else if (!loop && !ret) {
+			/*
+			 * A context de-registration has been issued from within
+			 * the tasklet. Need to block until it complete.
+			 * the tasklet. Need to block until it completes.
+			return -EINPROGRESS;
 		} else if (unlikely(ret == -ENODEV)) {
 			ret = 0;	/* Will get registered later */
 		}
@@ -1425,7 +1872,9 @@ static void guc_context_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
-	unpin_guc_id(guc, ce);
+	GEM_BUG_ON(context_enabled(ce));
+
+	unpin_guc_id(guc, ce, true);
 	lrc_unpin(ce);
 }
 
@@ -1764,6 +2213,8 @@ static void guc_context_destroy(struct kref *kref)
 	unsigned long flags;
 	bool disabled;
 
+	GEM_BUG_ON(context_guc_id_stolen(ce));
+
 	/*
 	 * If the guc_id is invalid this context has been stolen and we can free
 	 * it immediately. Also can be freed immediately if the context is not
@@ -1925,6 +2376,9 @@ static void add_to_context(struct i915_request *rq)
 	spin_lock(&ce->guc_active.lock);
 	list_move_tail(&rq->sched.link, &ce->guc_active.requests);
 
+	if (unlikely(request_has_no_guc_id(rq)))
+		++ce->guc_num_rq_submit_no_id;
+
 	if (rq->guc_prio == GUC_PRIO_INIT) {
 		rq->guc_prio = new_guc_prio;
 		add_context_inflight_prio(ce, rq->guc_prio);
@@ -1966,7 +2420,12 @@ static void remove_from_context(struct i915_request *rq)
 
 	spin_unlock_irq(&ce->guc_active.lock);
 
-	atomic_dec(&ce->guc_id_ref);
+	if (likely(!request_has_no_guc_id(rq)))
+		atomic_dec(&ce->guc_id_ref);
+	else
+		--ce_to_guc(rq->context)->total_num_rq_with_no_guc_id;
+	unpin_guc_id(ce_to_guc(ce), ce, false);
+
 	i915_request_notify_execute_cb_imm(rq);
 }
 
@@ -2018,13 +2477,144 @@ static void guc_signal_context_fence(struct intel_context *ce)
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 }
 
-static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
+static void invalidate_guc_id_sched_disable(struct intel_context *ce)
+{
+	set_context_guc_id_invalid(ce);
+	wmb();	/* Make sure guc_id invalidation visible first */
+	clr_context_guc_id_stolen(ce);
+}
+
+static void retire_worker_sched_disable(struct intel_guc *guc,
+					struct intel_context *ce)
+{
+	unsigned long flags;
+	bool disabled;
+
+	guc->stalled_context = NULL;
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	disabled = submission_disabled(guc);
+	if (!disabled && !context_pending_disable(ce) && context_enabled(ce)) {
+		/*
+		 * Still enabled, issue schedule disable + configure state so
+		 * when G2H returns tasklet is kicked.
+		 */
+
+		struct intel_runtime_pm *runtime_pm =
+			&ce->engine->gt->i915->runtime_pm;
+		intel_wakeref_t wakeref;
+		u16 guc_id;
+
+		/*
+		 * We add +2 here as the schedule disable complete CTB handler
+		 * calls intel_context_sched_disable_unpin (-2 to pin_count).
+		 */
+		GEM_BUG_ON(!atomic_read(&ce->pin_count));
+		atomic_add(2, &ce->pin_count);
+
+		set_context_block_tasklet(ce);
+		guc_id = prep_context_pending_disable(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+
+		with_intel_runtime_pm(runtime_pm, wakeref)
+			__guc_context_sched_disable(guc, ce, guc_id);
+
+		invalidate_guc_id_sched_disable(ce);
+	} else if (!disabled && context_pending_disable(ce)) {
+		/*
+		 * Schedule disable in flight, set bit to kick tasklet in G2H
+		 * handler and call it a day.
+		 */
+
+		set_context_block_tasklet(ce);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+
+		invalidate_guc_id_sched_disable(ce);
+	} else {
+		/* Schedule disable is done, kick tasklet */
+
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+
+		invalidate_guc_id_sched_disable(ce);
+
+		guc->submission_stall_reason = STALL_REGISTER_CONTEXT;
+		clr_tasklet_blocked(guc);
+
+		kick_tasklet(ce_to_guc(ce));
+	}
+
+	intel_context_put(ce);
+}
+
+static bool context_needs_lrc_desc_pin(struct intel_context *ce, bool new_guc_id)
 {
 	return (new_guc_id || test_bit(CONTEXT_LRCA_DIRTY, &ce->flags) ||
 		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id)) &&
 		!submission_disabled(ce_to_guc(ce));
 }
 
+static int tasklet_pin_guc_id(struct intel_guc *guc, struct i915_request *rq)
+{
+	struct intel_context *ce = rq->context;
+	int ret = 0;
+
+	lockdep_assert_held(&guc->sched_engine->lock);
+	GEM_BUG_ON(!ce->guc_num_rq_submit_no_id);
+
+	if (atomic_add_unless(&ce->guc_id_ref, ce->guc_num_rq_submit_no_id, 0))
+		goto out;
+
+	ret = pin_guc_id(guc, ce, true);
+	if (unlikely(ret < 0)) {
+		/*
+		 * No guc_ids available, disable the tasklet and kick the retire
+		 * workqueue hopefully freeing up some guc_ids.
+		 */
+		guc->stalled_rq = rq;
+		guc->submission_stall_reason = STALL_GUC_ID_WORKQUEUE;
+		return ret;
+	}
+
+	if (ce->guc_num_rq_submit_no_id - 1 > 0)
+		atomic_add(ce->guc_num_rq_submit_no_id - 1,
+			   &ce->guc_id_ref);
+
+	if (context_needs_lrc_desc_pin(ce, !!ret))
+		set_context_needs_register(ce);
+
+	if (ret == NEW_GUC_ID_ENABLED) {
+		guc->stalled_rq = rq;
+		guc->submission_stall_reason = STALL_SCHED_DISABLE;
+	}
+
+	clear_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
+out:
+	guc->total_num_rq_with_no_guc_id -= ce->guc_num_rq_submit_no_id;
+	GEM_BUG_ON(guc->total_num_rq_with_no_guc_id < 0);
+
+	list_for_each_entry_reverse(rq, &ce->guc_active.requests, sched.link)
+		if (request_has_no_guc_id(rq)) {
+			--ce->guc_num_rq_submit_no_id;
+			clear_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED,
+				  &rq->fence.flags);
+		} else if (!ce->guc_num_rq_submit_no_id) {
+			break;
+		}
+
+	GEM_BUG_ON(ce->guc_num_rq_submit_no_id);
+
+	/*
+	 * When NEW_GUC_ID_ENABLED is returned it means we are stealing a guc_id
+	 * from a context that has scheduling enabled. We have to disable
+	 * scheduling before deregistering the context and it isn't safe to do
+	 * in the tasklet because of lock inversion (ce->guc_state.lock must be
+	 * acquired before guc->sched_engine->lock). To work around this
+	 * we do the schedule disable in retire workqueue and block the tasklet
+	 * until the schedule done G2H returns. Returning non-zero here kicks
+	 * the workqueue.
+	 */
+	return (ret == NEW_GUC_ID_ENABLED) ? ret : 0;
+}
+
 static int guc_request_alloc(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
@@ -2056,6 +2646,15 @@ static int guc_request_alloc(struct i915_request *rq)
 
 	rq->reserved_space -= GUC_REQUEST_SIZE;
 
+	/*
+	 * guc_ids are exhausted, don't allocate one here, defer to submission
+	 * in the tasklet.
+	 */
+	if (test_and_update_guc_ids_exhausted(guc)) {
+		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
+		goto out;
+	}
+
 	/*
 	 * Call pin_guc_id here rather than in the pinning step as with
 	 * dma_resv, contexts can be repeatedly pinned / unpinned trashing the
@@ -2063,9 +2662,7 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * when guc_ids are being stolen due to over subscription. By the time
 	 * this function is reached, it is guaranteed that the guc_id will be
 	 * persistent until the generated request is retired. Thus, sealing these
-	 * race conditions. It is still safe to fail here if guc_ids are
-	 * exhausted and return -EAGAIN to the user indicating that they can try
-	 * again in the future.
+	 * race conditions.
 	 *
 	 * There is no need for a lock here as the timeline mutex ensures at
 	 * most one context can be executing this code path at once. The
@@ -2076,10 +2673,26 @@ static int guc_request_alloc(struct i915_request *rq)
 	if (atomic_add_unless(&ce->guc_id_ref, 1, 0))
 		goto out;
 
-	ret = pin_guc_id(guc, ce);	/* returns 1 if new guc_id assigned */
-	if (unlikely(ret < 0))
+	ret = pin_guc_id(guc, ce, false);	/* > 0 indicates new guc_id */
+	if (unlikely(ret == -EAGAIN)) {
+		/*
+		 * No guc_ids available, so we force this submission and all
+		 * future submissions to be serialized in the tasklet, sharing
+		 * the guc_ids on a per submission basis to ensure (more) fair
+		 * scheduling of submissions. Once the tasklet is flushed of
+		 * submissions we return to allocating guc_ids in this function.
+		 */
+		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
+		set_and_update_guc_ids_exhausted(guc);
+
+		return 0;
+	} else if (unlikely(ret < 0)) {
 		return ret;
-	if (context_needs_register(ce, !!ret)) {
+	}
+
+	GEM_BUG_ON(ret == NEW_GUC_ID_ENABLED);
+
+	if (context_needs_lrc_desc_pin(ce, !!ret)) {
 		ret = guc_lrc_desc_pin(ce, true);
 		if (unlikely(ret)) {	/* unwind */
 			if (ret == -EPIPE) {
@@ -2087,7 +2700,7 @@ static int guc_request_alloc(struct i915_request *rq)
 				goto out;	/* GPU will be reset */
 			}
 			atomic_dec(&ce->guc_id_ref);
-			unpin_guc_id(guc, ce);
+			unpin_guc_id(guc, ce, true);
 			return ret;
 		}
 	}
@@ -2358,7 +2971,7 @@ static inline void guc_kernel_context_pin(struct intel_guc *guc,
 					  struct intel_context *ce)
 {
 	if (context_guc_id_invalid(ce))
-		pin_guc_id(guc, ce);
+		pin_guc_id(guc, ce, false);
 	guc_lrc_desc_pin(ce, true);
 }
 
@@ -2625,6 +3238,16 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 		with_intel_runtime_pm(runtime_pm, wakeref)
 			register_context(ce, true);
 		guc_signal_context_fence(ce);
+		if (context_block_tasklet(ce)) {
+			GEM_BUG_ON(guc->submission_stall_reason !=
+				   STALL_DEREGISTER_CONTEXT);
+
+			clr_context_block_tasklet(ce);
+			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
+			clr_tasklet_blocked(guc);
+
+			kick_tasklet(ce_to_guc(ce));
+		}
 		intel_context_put(ce);
 	} else if (context_destroyed(ce)) {
 		/* Context has been destroyed */
@@ -2688,6 +3311,14 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 		guc_blocked_fence_complete(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
+		if (context_block_tasklet(ce)) {
+			clr_context_block_tasklet(ce);
+			guc->submission_stall_reason = STALL_REGISTER_CONTEXT;
+			clr_tasklet_blocked(guc);
+
+			kick_tasklet(ce_to_guc(ce));
+		}
+
 		if (banned) {
 			guc_cancel_context_requests(ce);
 			intel_engine_signal_breadcrumbs(ce->engine);
@@ -2716,10 +3347,8 @@ static void capture_error_state(struct intel_guc *guc,
 
 static void guc_context_replay(struct intel_context *ce)
 {
-	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
-
 	__guc_reset_context(ce, true);
-	tasklet_hi_schedule(&sched_engine->tasklet);
+	kick_tasklet(ce_to_guc(ce));
 }
 
 static void guc_handle_context_reset(struct intel_guc *guc,
@@ -2878,8 +3507,16 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 		   atomic_read(&guc->outstanding_submission_g2h));
 	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
 	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
-	drm_printf(p, "GuC tasklet count: %u\n\n",
+	drm_printf(p, "GuC tasklet count: %u\n",
 		   atomic_read(&sched_engine->tasklet.count));
+	drm_printf(p, "GuC submit flags: 0x%04lx\n", guc->flags);
+	drm_printf(p, "GuC total number request without guc_id: %d\n",
+		   guc->total_num_rq_with_no_guc_id);
+	drm_printf(p, "GuC stall reason: %d\n", guc->submission_stall_reason);
+	drm_printf(p, "GuC stalled request: %s\n",
+		   yesno(guc->stalled_rq));
+	drm_printf(p, "GuC stalled context: %s\n\n",
+		   yesno(guc->stalled_context));
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
 	drm_printf(p, "Requests in GuC submit tasklet:\n");
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 1bc1349ba3c2..807f76750cf4 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -139,6 +139,12 @@ enum {
 	 * the GPU. Here we track such boost requests on a per-request basis.
 	 */
 	I915_FENCE_FLAG_BOOST,
+
+	/*
+	 * I915_FENCE_FLAG_GUC_ID_NOT_PINNED - Set to signal the GuC submission
+	 * tasklet that the guc_id isn't pinned.
+	 */
+	I915_FENCE_FLAG_GUC_ID_NOT_PINNED,
 };
 
 /**
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 04/46] drm/i915/guc: Don't allow requests not ready to consume all guc_ids
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (2 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 03/46] drm/i915/guc: Don't return -EAGAIN to user when guc_ids exhausted Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-05  8:29   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 05/46] drm/i915/guc: Introduce guc_submit_engine object Matthew Brost
                   ` (46 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Add a heuristic which checks if over half of the available guc_ids are
currently consumed by requests not ready to be submitted. If this
heuristic fires at request creation time (the normal guc_id allocation
point), force all submissions and guc_id allocations into the tasklet.
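
A condensed sketch of the bookkeeping behind this heuristic (illustrative
only, mirroring the hunks below; it relies on the i915 types added in this
series): atomic_fetch_add() returns the value before the add, so the
0 -> 1 transition on a context is detected without taking a lock, and only
contexts with at least one not-ready request contribute to the global
counter.

  /* First not-ready request on a context bumps the global count */
  static void incr_num_rq_not_ready(struct intel_context *ce)
  {
  	struct intel_guc *guc = ce_to_guc(ce);

  	if (!atomic_fetch_add(1, &ce->guc_num_rq_not_ready))
  		atomic_inc(&guc->num_guc_ids_not_ready);
  }

  /* Over 50% of guc_ids held by not-ready requests => defer to the tasklet */
  static bool too_many_guc_ids_not_ready(struct intel_guc *guc)
  {
  	return atomic_read(&guc->num_guc_ids_not_ready) > guc->num_guc_ids / 2;
  }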

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  3 ++
 drivers/gpu/drm/i915/gt/intel_reset.c         |  9 ++++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 53 +++++++++++++++++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
 5 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 8ed964ef967b..c01530d7dc67 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -188,6 +188,9 @@ struct intel_context {
 	/* Number of rq submitted without a guc_id */
 	u16 guc_num_rq_submit_no_id;
 
+	/* GuC number of requests not ready */
+	atomic_t guc_num_rq_not_ready;
+
 	/*
 	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
 	 */
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 91200c43951f..ea763138197f 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -22,6 +22,7 @@
 #include "intel_reset.h"
 
 #include "uc/intel_guc.h"
+#include "uc/intel_guc_submission.h"
 
 #define RESET_MAX_RETRIES 3
 
@@ -850,6 +851,14 @@ static void nop_submit_request(struct i915_request *request)
 {
 	RQ_TRACE(request, "-EIO\n");
 
+	/*
+	 * XXX: Kinda ugly to check for GuC submission here but this function is
+	 * going away once we switch to the DRM scheduler so we can live with
+	 * this for now.
+	 */
+	if (intel_engine_uses_guc(request->engine))
+		intel_guc_decr_num_rq_not_ready(request->context);
+
 	request = i915_request_mark_eio(request);
 	if (request) {
 		i915_request_submit(request);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index e76579396efd..917352c9f323 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -76,6 +76,7 @@ struct intel_guc {
 	struct ida guc_ids;
 	u32 num_guc_ids;
 	u32 max_guc_ids;
+	atomic_t num_guc_ids_not_ready;
 	struct list_head guc_id_list_no_ref;
 	struct list_head guc_id_list_unpinned;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f42a707f60ca..ba750fc87af1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1384,6 +1384,41 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
 		kick_tasklet(&rq->engine->gt->uc.guc);
 }
 
+/* Macro to tweak heuristic, using a simple over 50% not ready for now */
+#define TOO_MANY_GUC_IDS_NOT_READY(avail, consumed) \
+	((consumed) > (avail) / 2)
+static bool too_many_guc_ids_not_ready(struct intel_guc *guc,
+				       struct intel_context *ce)
+{
+	u32 available_guc_ids, guc_ids_consumed;
+
+	available_guc_ids = guc->num_guc_ids;
+	guc_ids_consumed = atomic_read(&guc->num_guc_ids_not_ready);
+
+	if (TOO_MANY_GUC_IDS_NOT_READY(available_guc_ids, guc_ids_consumed)) {
+		set_and_update_guc_ids_exhausted(guc);
+		return true;
+	}
+
+	return false;
+}
+
+static void incr_num_rq_not_ready(struct intel_context *ce)
+{
+	struct intel_guc *guc = ce_to_guc(ce);
+
+	if (!atomic_fetch_add(1, &ce->guc_num_rq_not_ready))
+		atomic_inc(&guc->num_guc_ids_not_ready);
+}
+
+void intel_guc_decr_num_rq_not_ready(struct intel_context *ce)
+{
+	struct intel_guc *guc = ce_to_guc(ce);
+
+	if (atomic_fetch_add(-1, &ce->guc_num_rq_not_ready) == 1)
+		atomic_dec(&guc->num_guc_ids_not_ready);
+}
+
 static bool need_tasklet(struct intel_guc *guc, struct intel_context *ce)
 {
 	struct i915_sched_engine * const sched_engine =
@@ -1430,6 +1465,8 @@ static void guc_submit_request(struct i915_request *rq)
 		kick_tasklet(guc);
 
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
+
+	intel_guc_decr_num_rq_not_ready(rq->context);
 }
 
 static int new_guc_id(struct intel_guc *guc)
@@ -2647,10 +2684,13 @@ static int guc_request_alloc(struct i915_request *rq)
 	rq->reserved_space -= GUC_REQUEST_SIZE;
 
 	/*
-	 * guc_ids are exhausted, don't allocate one here, defer to submission
-	 * in the tasklet.
+	 * guc_ids are exhausted or a heuristic is met indicating too many
+	 * guc_ids are waiting on requests with submission dependencies (not
+	 * ready to submit). Don't allocate one here, defer to submission in the
+	 * tasklet.
 	 */
-	if (test_and_update_guc_ids_exhausted(guc)) {
+	if (test_and_update_guc_ids_exhausted(guc) ||
+	    too_many_guc_ids_not_ready(guc, ce)) {
 		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
 		goto out;
 	}
@@ -2684,6 +2724,7 @@ static int guc_request_alloc(struct i915_request *rq)
 		 */
 		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
 		set_and_update_guc_ids_exhausted(guc);
+		incr_num_rq_not_ready(ce);
 
 		return 0;
 	} else if (unlikely(ret < 0)) {
@@ -2708,6 +2749,8 @@ static int guc_request_alloc(struct i915_request *rq)
 	clear_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
 
 out:
+	incr_num_rq_not_ready(ce);
+
 	/*
 	 * We block all requests on this context if a G2H is pending for a
 	 * schedule disable or context deregistration as the GuC will fail a
@@ -3512,6 +3555,8 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 	drm_printf(p, "GuC submit flags: 0x%04lx\n", guc->flags);
 	drm_printf(p, "GuC total number request without guc_id: %d\n",
 		   guc->total_num_rq_with_no_guc_id);
+	drm_printf(p, "GuC Number GuC IDs not ready: %d\n",
+		   atomic_read(&guc->num_guc_ids_not_ready));
 	drm_printf(p, "GuC stall reason: %d\n", guc->submission_stall_reason);
 	drm_printf(p, "GuC stalled request: %s\n",
 		   yesno(guc->stalled_rq));
@@ -3567,6 +3612,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 			   atomic_read(&ce->pin_count));
 		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
 			   atomic_read(&ce->guc_id_ref));
+		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
+			   atomic_read(&ce->guc_num_rq_not_ready));
 		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
 			   ce->guc_state.sched_state,
 			   atomic_read(&ce->guc_sched_state_no_lock));
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
index c7ef44fa0c36..17af5e123b09 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
@@ -51,4 +51,6 @@ static inline bool intel_guc_submission_is_used(struct intel_guc *guc)
 	return intel_guc_is_used(guc) && intel_guc_submission_is_wanted(guc);
 }
 
+void intel_guc_decr_num_rq_not_ready(struct intel_context *ce);
+
 #endif
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 05/46] drm/i915/guc: Introduce guc_submit_engine object
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (3 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 04/46] drm/i915/guc: Don't allow requests not ready to consume all guc_ids Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 06/46] drm/i915/guc: Check return of __xa_store when registering a context Matthew Brost
                   ` (45 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Move the fields that control the GuC submission state machine into a
dedicated object (guc_submit_engine) rather than the global GuC state
(intel_guc). This encapsulation allows multiple submission objects to
operate in parallel: one instance can block on a flow control condition
while another continues to make forward progress. This is analogous to
how execlists mode assigns a scheduling object per physical engine,
except that in GuC mode the scheduling object is assigned based on the
blocking dependencies.

The guc_submit_engine object also encapsulates the i915_sched_engine
object.

Lots of find-replace.

Currently only one guc_submit_engine is instantiated; future patches
will instantiate more.
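
A condensed sketch of the embedding pattern this relies on (illustrative
only; the real struct is added in intel_guc_submission_types.h below, and
to_gse() is a made-up name for what the patch open-codes): the
i915_sched_engine is embedded in the guc_submit_engine, so code that only
holds a sched_engine pointer, e.g. the tasklet callback, recovers the
submission state machine with container_of().

  struct guc_submit_engine {
  	struct i915_sched_engine sched_engine;	/* embedded, not a pointer */
  	struct i915_request *stalled_rq;
  	unsigned long flags;
  	/* ... remaining flow control state ... */
  };

  static struct guc_submit_engine *
  to_gse(struct i915_sched_engine *sched_engine)
  {
  	return container_of(sched_engine, struct guc_submit_engine,
  			    sched_engine);
  }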

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  33 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 557 +++++++++++-------
 .../i915/gt/uc/intel_guc_submission_types.h   |  52 ++
 drivers/gpu/drm/i915/i915_scheduler.c         |  22 +-
 drivers/gpu/drm/i915/i915_scheduler.h         |   3 +
 5 files changed, 410 insertions(+), 257 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 917352c9f323..8ac016201658 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -21,6 +21,11 @@
 
 struct __guc_ads_blob;
 
+enum {
+	GUC_SUBMIT_ENGINE_SINGLE_LRC,
+	GUC_SUBMIT_ENGINE_MAX
+};
+
 /*
  * Top level structure of GuC. It handles firmware loading and manages client
  * pool. intel_guc owns a intel_guc_client to replace the legacy ExecList
@@ -31,31 +36,6 @@ struct intel_guc {
 	struct intel_guc_log log;
 	struct intel_guc_ct ct;
 
-	/* Global engine used to submit requests to GuC */
-	struct i915_sched_engine *sched_engine;
-
-	/* Global state related to submission tasklet */
-	struct i915_request *stalled_rq;
-	struct intel_context *stalled_context;
-	struct work_struct retire_worker;
-	unsigned long flags;
-	int total_num_rq_with_no_guc_id;
-
-	/*
-	 * Submisson stall reason. See intel_guc_submission.c for detailed
-	 * description.
-	 */
-	enum {
-		STALL_NONE,
-		STALL_GUC_ID_WORKQUEUE,
-		STALL_GUC_ID_TASKLET,
-		STALL_SCHED_DISABLE,
-		STALL_REGISTER_CONTEXT,
-		STALL_DEREGISTER_CONTEXT,
-		STALL_MOVE_LRC_TAIL,
-		STALL_ADD_REQUEST,
-	} submission_stall_reason;
-
 	/* intel_guc_recv interrupt related state */
 	spinlock_t irq_lock;
 	unsigned int msg_enabled_mask;
@@ -68,6 +48,8 @@ struct intel_guc {
 		void (*disable)(struct intel_guc *guc);
 	} interrupts;
 
+	struct guc_submit_engine *gse[GUC_SUBMIT_ENGINE_MAX];
+
 	/*
 	 * contexts_lock protects the pool of free guc ids and a linked list of
 	 * guc ids available to be stolen
@@ -76,7 +58,6 @@ struct intel_guc {
 	struct ida guc_ids;
 	u32 num_guc_ids;
 	u32 max_guc_ids;
-	atomic_t num_guc_ids_not_ready;
 	struct list_head guc_id_list_no_ref;
 	struct list_head guc_id_list_unpinned;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ba750fc87af1..842094de848d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -21,6 +21,7 @@
 #include "gt/intel_ring.h"
 
 #include "intel_guc_submission.h"
+#include "intel_guc_submission_types.h"
 
 #include "i915_drv.h"
 #include "i915_trace.h"
@@ -57,7 +58,7 @@
  * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
  * represents in-order queue. The kernel driver packs ring tail pointer and an
  * ELSP context descriptor dword into Work Item.
- * See guc_add_request()
+ * See gse_add_request()
  *
  * GuC flow control state machine:
  * The tasklet, workqueue (retire_worker), and the G2H handlers together more or
@@ -80,57 +81,57 @@
  *				context)
  */
 
-/* GuC Virtual Engine */
-struct guc_virtual_engine {
-	struct intel_engine_cs base;
-	struct intel_context context;
-};
-
 static struct intel_context *
 guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
+static inline struct guc_submit_engine *ce_to_gse(struct intel_context *ce)
+{
+	return container_of(ce->engine->sched_engine, struct guc_submit_engine,
+			    sched_engine);
+}
+
 /*
  * Global GuC flags helper functions
  */
 enum {
-	GUC_STATE_TASKLET_BLOCKED,
-	GUC_STATE_GUC_IDS_EXHAUSTED,
+	GSE_STATE_TASKLET_BLOCKED,
+	GSE_STATE_GUC_IDS_EXHAUSTED,
 };
 
-static bool tasklet_blocked(struct intel_guc *guc)
+static bool tasklet_blocked(struct guc_submit_engine *gse)
 {
-	return test_bit(GUC_STATE_TASKLET_BLOCKED, &guc->flags);
+	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
-static void set_tasklet_blocked(struct intel_guc *guc)
+static void set_tasklet_blocked(struct guc_submit_engine *gse)
 {
-	lockdep_assert_held(&guc->sched_engine->lock);
-	set_bit(GUC_STATE_TASKLET_BLOCKED, &guc->flags);
+	lockdep_assert_held(&gse->sched_engine.lock);
+	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
-static void __clr_tasklet_blocked(struct intel_guc *guc)
+static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
 {
-	lockdep_assert_held(&guc->sched_engine->lock);
-	clear_bit(GUC_STATE_TASKLET_BLOCKED, &guc->flags);
+	lockdep_assert_held(&gse->sched_engine.lock);
+	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
-static void clr_tasklet_blocked(struct intel_guc *guc)
+static void clr_tasklet_blocked(struct guc_submit_engine *gse)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&guc->sched_engine->lock, flags);
-	__clr_tasklet_blocked(guc);
-	spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
+	spin_lock_irqsave(&gse->sched_engine.lock, flags);
+	__clr_tasklet_blocked(gse);
+	spin_unlock_irqrestore(&gse->sched_engine.lock, flags);
 }
 
-static bool guc_ids_exhausted(struct intel_guc *guc)
+static bool guc_ids_exhausted(struct guc_submit_engine *gse)
 {
-	return test_bit(GUC_STATE_GUC_IDS_EXHAUSTED, &guc->flags);
+	return test_bit(GSE_STATE_GUC_IDS_EXHAUSTED, &gse->flags);
 }
 
-static bool test_and_update_guc_ids_exhausted(struct intel_guc *guc)
+static bool test_and_update_guc_ids_exhausted(struct guc_submit_engine *gse)
 {
 	unsigned long flags;
 	bool ret = false;
@@ -139,33 +140,33 @@ static bool test_and_update_guc_ids_exhausted(struct intel_guc *guc)
 	 * Strict ordering on checking if guc_ids are exhausted isn't required,
 	 * so let's avoid grabbing the submission lock if possible.
 	 */
-	if (guc_ids_exhausted(guc)) {
-		spin_lock_irqsave(&guc->sched_engine->lock, flags);
-		ret = guc_ids_exhausted(guc);
+	if (guc_ids_exhausted(gse)) {
+		spin_lock_irqsave(&gse->sched_engine.lock, flags);
+		ret = guc_ids_exhausted(gse);
 		if (ret)
-			++guc->total_num_rq_with_no_guc_id;
-		spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
+			++gse->total_num_rq_with_no_guc_id;
+		spin_unlock_irqrestore(&gse->sched_engine.lock, flags);
 	}
 
 	return ret;
 }
 
-static void set_and_update_guc_ids_exhausted(struct intel_guc *guc)
+static void set_and_update_guc_ids_exhausted(struct guc_submit_engine *gse)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&guc->sched_engine->lock, flags);
-	++guc->total_num_rq_with_no_guc_id;
-	set_bit(GUC_STATE_GUC_IDS_EXHAUSTED, &guc->flags);
-	spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
+	spin_lock_irqsave(&gse->sched_engine.lock, flags);
+	++gse->total_num_rq_with_no_guc_id;
+	set_bit(GSE_STATE_GUC_IDS_EXHAUSTED, &gse->flags);
+	spin_unlock_irqrestore(&gse->sched_engine.lock, flags);
 }
 
-static void clr_guc_ids_exhausted(struct intel_guc *guc)
+static void clr_guc_ids_exhausted(struct guc_submit_engine *gse)
 {
-	lockdep_assert_held(&guc->sched_engine->lock);
-	GEM_BUG_ON(guc->total_num_rq_with_no_guc_id);
+	lockdep_assert_held(&gse->sched_engine.lock);
+	GEM_BUG_ON(gse->total_num_rq_with_no_guc_id);
 
-	clear_bit(GUC_STATE_GUC_IDS_EXHAUSTED, &guc->flags);
+	clear_bit(GSE_STATE_GUC_IDS_EXHAUSTED, &gse->flags);
 }
 
 /*
@@ -419,6 +420,20 @@ static inline struct intel_guc *ce_to_guc(struct intel_context *ce)
 	return &ce->engine->gt->uc.guc;
 }
 
+static inline struct i915_sched_engine *
+ce_to_sched_engine(struct intel_context *ce)
+{
+	return ce->engine->sched_engine;
+}
+
+static inline struct i915_sched_engine *
+guc_to_sched_engine(struct intel_guc *guc, int index)
+{
+	GEM_BUG_ON(index < 0 || index >= GUC_SUBMIT_ENGINE_MAX);
+
+	return &guc->gse[index]->sched_engine;
+}
+
 static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 {
 	return rb_entry(rb, struct i915_priolist, node);
@@ -644,19 +659,20 @@ static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 	return err;
 }
 
-static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
+static int gse_add_request(struct guc_submit_engine *gse,
+			   struct i915_request *rq)
 {
 	int ret;
 
-	lockdep_assert_held(&guc->sched_engine->lock);
+	lockdep_assert_held(&gse->sched_engine.lock);
 
-	ret = __guc_add_request(guc, rq);
+	ret = __guc_add_request(gse->sched_engine.private_data, rq);
 	if (ret == -EBUSY) {
-		guc->stalled_rq = rq;
-		guc->submission_stall_reason = STALL_ADD_REQUEST;
+		gse->stalled_rq = rq;
+		gse->submission_stall_reason = STALL_ADD_REQUEST;
 	} else {
-		guc->stalled_rq = NULL;
-		guc->submission_stall_reason = STALL_NONE;
+		gse->stalled_rq = NULL;
+		gse->submission_stall_reason = STALL_NONE;
 	}
 
 	return ret;
@@ -664,14 +680,15 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 
 static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
 
-static int tasklet_register_context(struct intel_guc *guc,
+static int tasklet_register_context(struct guc_submit_engine *gse,
 				    struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
+	struct intel_guc *guc = gse->sched_engine.private_data;
 	int ret = 0;
 
 	/* Check state */
-	lockdep_assert_held(&guc->sched_engine->lock);
+	lockdep_assert_held(&gse->sched_engine.lock);
 	GEM_BUG_ON(ce->guc_num_rq_submit_no_id);
 	GEM_BUG_ON(request_has_no_guc_id(rq));
 	GEM_BUG_ON(context_guc_id_invalid(ce));
@@ -694,11 +711,11 @@ static int tasklet_register_context(struct intel_guc *guc,
 			clr_context_needs_register(ce);
 
 		if (unlikely(ret == -EBUSY)) {
-			guc->stalled_rq = rq;
-			guc->submission_stall_reason = STALL_REGISTER_CONTEXT;
+			gse->stalled_rq = rq;
+			gse->submission_stall_reason = STALL_REGISTER_CONTEXT;
 		} else if (unlikely(ret == -EINPROGRESS)) {
-			guc->stalled_rq = rq;
-			guc->submission_stall_reason = STALL_DEREGISTER_CONTEXT;
+			gse->stalled_rq = rq;
+			gse->submission_stall_reason = STALL_DEREGISTER_CONTEXT;
 		}
 	}
 
@@ -716,28 +733,29 @@ static inline int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
-static void kick_retire_wq(struct intel_guc *guc)
+static void kick_retire_wq(struct guc_submit_engine *gse)
 {
-	queue_work(system_unbound_wq, &guc->retire_worker);
+	queue_work(system_unbound_wq, &gse->retire_worker);
 }
 
-static int tasklet_pin_guc_id(struct intel_guc *guc, struct i915_request *rq);
+static int tasklet_pin_guc_id(struct guc_submit_engine *gse,
+			      struct i915_request *rq);
 
-static int guc_dequeue_one_context(struct intel_guc *guc)
+static int gse_dequeue_one_context(struct guc_submit_engine *gse)
 {
-	struct i915_sched_engine * const sched_engine = guc->sched_engine;
-	struct i915_request *last = guc->stalled_rq;
+	struct i915_sched_engine * const sched_engine = &gse->sched_engine;
+	struct i915_request *last = gse->stalled_rq;
 	bool submit = !!last;
 	struct rb_node *rb;
 	int ret;
 
 	lockdep_assert_held(&sched_engine->lock);
-	GEM_BUG_ON(guc->stalled_context);
-	GEM_BUG_ON(!submit && guc->submission_stall_reason);
+	GEM_BUG_ON(gse->stalled_context);
+	GEM_BUG_ON(!submit && gse->submission_stall_reason);
 
 	if (submit) {
 		/* Flow control conditions */
-		switch (guc->submission_stall_reason) {
+		switch (gse->submission_stall_reason) {
 		case STALL_GUC_ID_TASKLET:
 			goto done;
 		case STALL_REGISTER_CONTEXT:
@@ -750,8 +768,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 			GEM_BUG_ON("Invalid stall state");
 		}
 	} else {
-		GEM_BUG_ON(!guc->total_num_rq_with_no_guc_id &&
-			   guc_ids_exhausted(guc));
+		GEM_BUG_ON(!gse->total_num_rq_with_no_guc_id &&
+			   guc_ids_exhausted(gse));
 
 		while ((rb = rb_first_cached(&sched_engine->queue))) {
 			struct i915_priolist *p = to_priolist(rb);
@@ -780,13 +798,13 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 		struct intel_context *ce = last->context;
 
 		if (ce->guc_num_rq_submit_no_id) {
-			ret = tasklet_pin_guc_id(guc, last);
+			ret = tasklet_pin_guc_id(gse, last);
 			if (ret)
 				goto blk_tasklet_kick;
 		}
 
 register_context:
-		ret = tasklet_register_context(guc, last);
+		ret = tasklet_register_context(gse, last);
 		if (unlikely(ret == -EINPROGRESS)) {
 			goto blk_tasklet;
 		} else if (unlikely(ret == -EPIPE)) {
@@ -802,7 +820,7 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 		guc_set_lrc_tail(last);
 
 add_request:
-		ret = guc_add_request(guc, last);
+		ret = gse_add_request(gse, last);
 		if (unlikely(ret == -EPIPE)) {
 			goto deadlk;
 		} else if (ret == -EBUSY) {
@@ -817,8 +835,8 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 	 * No requests without a guc_id, enable guc_id allocation at request
 	 * creation time (guc_request_alloc).
 	 */
-	if (!guc->total_num_rq_with_no_guc_id)
-		clr_guc_ids_exhausted(guc);
+	if (!gse->total_num_rq_with_no_guc_id)
+		clr_guc_ids_exhausted(gse);
 
 	return submit;
 
@@ -832,25 +850,26 @@ static int guc_dequeue_one_context(struct intel_guc *guc)
 	return false;
 
 blk_tasklet_kick:
-	kick_retire_wq(guc);
+	kick_retire_wq(gse);
 blk_tasklet:
-	set_tasklet_blocked(guc);
+	set_tasklet_blocked(gse);
 	return false;
 }
 
-static void guc_submission_tasklet(struct tasklet_struct *t)
+static void gse_submission_tasklet(struct tasklet_struct *t)
 {
 	struct i915_sched_engine *sched_engine =
 		from_tasklet(sched_engine, t, tasklet);
-	struct intel_guc *guc = sched_engine->private_data;
+	struct guc_submit_engine *gse =
+		container_of(sched_engine, typeof(*gse), sched_engine);
 	unsigned long flags;
 	bool loop;
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
 
-	if (likely(!tasklet_blocked(guc)))
+	if (likely(!tasklet_blocked(gse)))
 		do {
-			loop = guc_dequeue_one_context(guc);
+			loop = gse_dequeue_one_context(gse);
 		} while (loop);
 
 	i915_sched_engine_reset_on_empty(sched_engine);
@@ -925,69 +944,99 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 	}
 }
 
-static inline bool
-submission_disabled(struct intel_guc *guc)
+static bool submission_disabled(struct intel_guc *guc)
 {
-	struct i915_sched_engine * const sched_engine = guc->sched_engine;
+	int i;
+
+	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i) {
+		struct i915_sched_engine *sched_engine;
+
+		if (unlikely(!guc->gse[i]))
+			return true;
+
+		sched_engine = guc_to_sched_engine(guc, i);
+
+		if (unlikely(!__tasklet_is_enabled(&sched_engine->tasklet)))
+			return true;
+	}
 
-	return unlikely(!sched_engine ||
-			!__tasklet_is_enabled(&sched_engine->tasklet));
+	return false;
 }
 
-static void kick_tasklet(struct intel_guc *guc)
+static void kick_tasklet(struct guc_submit_engine *gse)
 {
-	struct i915_sched_engine * const sched_engine = guc->sched_engine;
+	struct i915_sched_engine *sched_engine = &gse->sched_engine;
 
-	if (likely(!tasklet_blocked(guc)))
+	if (likely(!tasklet_blocked(gse)))
 		tasklet_hi_schedule(&sched_engine->tasklet);
 }
 
 static void disable_submission(struct intel_guc *guc)
 {
-	struct i915_sched_engine * const sched_engine = guc->sched_engine;
+	int i;
 
-	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
-		GEM_BUG_ON(!guc->ct.enabled);
-		__tasklet_disable_sync_once(&sched_engine->tasklet);
-		sched_engine->tasklet.callback = NULL;
+	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i) {
+		struct i915_sched_engine *sched_engine =
+			guc_to_sched_engine(guc, i);
+
+		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
+			GEM_BUG_ON(!guc->ct.enabled);
+			__tasklet_disable_sync_once(&sched_engine->tasklet);
+			sched_engine->tasklet.callback = NULL;
+		}
 	}
 }
 
 static void enable_submission(struct intel_guc *guc)
 {
-	struct i915_sched_engine * const sched_engine = guc->sched_engine;
 	unsigned long flags;
+	int i;
 
-	spin_lock_irqsave(&guc->sched_engine->lock, flags);
-	sched_engine->tasklet.callback = guc_submission_tasklet;
-	wmb();	/* Make sure callback visible */
-	if (!__tasklet_is_enabled(&sched_engine->tasklet) &&
-	    __tasklet_enable(&sched_engine->tasklet)) {
-		GEM_BUG_ON(!guc->ct.enabled);
-
-		/* Reset tasklet state */
-		guc->stalled_rq = NULL;
-		if (guc->stalled_context)
-			intel_context_put(guc->stalled_context);
-		guc->stalled_context = NULL;
-		guc->submission_stall_reason = STALL_NONE;
-		guc->flags = 0;
-
-		/* And kick in case we missed a new request submission. */
-		kick_tasklet(guc);
+	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i) {
+		struct i915_sched_engine *sched_engine =
+			guc_to_sched_engine(guc, i);
+		struct guc_submit_engine *gse = guc->gse[i];
+
+		spin_lock_irqsave(&sched_engine->lock, flags);
+		sched_engine->tasklet.callback = gse_submission_tasklet;
+		wmb();	/* Make sure callback is visible */
+		if (!__tasklet_is_enabled(&sched_engine->tasklet) &&
+		    __tasklet_enable(&sched_engine->tasklet)) {
+			GEM_BUG_ON(!guc->ct.enabled);
+
+			/* Reset GuC submit engine state */
+			gse->stalled_rq = NULL;
+			if (gse->stalled_context)
+				intel_context_put(gse->stalled_context);
+			gse->stalled_context = NULL;
+			gse->submission_stall_reason = STALL_NONE;
+			gse->flags = 0;
+
+			/* And kick in case we missed a new request submission. */
+			kick_tasklet(gse);
+		}
+		spin_unlock_irqrestore(&sched_engine->lock, flags);
 	}
-	spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
 }
 
-static void guc_flush_submissions(struct intel_guc *guc)
+static void gse_flush_submissions(struct guc_submit_engine *gse)
 {
-	struct i915_sched_engine * const sched_engine = guc->sched_engine;
+	struct i915_sched_engine * const sched_engine = &gse->sched_engine;
 	unsigned long flags;
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
+static void guc_flush_submissions(struct intel_guc *guc)
+{
+	int i;
+
+	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i)
+		if (likely(guc->gse[i]))
+			gse_flush_submissions(guc->gse[i]);
+}
+
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
 	int i;
@@ -1171,13 +1220,12 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
 		if (intel_context_is_pinned(ce))
 			__guc_reset_context(ce, stalled);
 
-	/* GuC is blown away, drop all references to contexts */
 	xa_destroy(&guc->context_lookup);
 }
 
 static void guc_cancel_context_requests(struct intel_context *ce)
 {
-	struct i915_sched_engine *sched_engine = ce_to_guc(ce)->sched_engine;
+	struct i915_sched_engine *sched_engine = ce_to_sched_engine(ce);
 	struct i915_request *rq;
 	unsigned long flags;
 
@@ -1192,8 +1240,9 @@ static void guc_cancel_context_requests(struct intel_context *ce)
 }
 
 static void
-guc_cancel_sched_engine_requests(struct i915_sched_engine *sched_engine)
+gse_cancel_requests(struct guc_submit_engine *gse)
 {
+	struct i915_sched_engine *sched_engine = &gse->sched_engine;
 	struct i915_request *rq, *rn;
 	struct rb_node *rb;
 	unsigned long flags;
@@ -1250,12 +1299,14 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 {
 	struct intel_context *ce;
 	unsigned long index;
+	int i;
 
 	xa_for_each(&guc->context_lookup, index, ce)
 		if (intel_context_is_pinned(ce))
 			guc_cancel_context_requests(ce);
 
-	guc_cancel_sched_engine_requests(guc->sched_engine);
+	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i)
+		gse_cancel_requests(guc->gse[i]);
 
 	/* GuC is blown away, drop all references to contexts */
 	xa_destroy(&guc->context_lookup);
@@ -1283,13 +1334,13 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 	intel_gt_unpark_heartbeats(guc_to_gt(guc));
 }
 
-static void retire_worker_sched_disable(struct intel_guc *guc,
+static void retire_worker_sched_disable(struct guc_submit_engine *gse,
 					struct intel_context *ce);
 
 static void retire_worker_func(struct work_struct *w)
 {
-	struct intel_guc *guc =
-		container_of(w, struct intel_guc, retire_worker);
+	struct guc_submit_engine *gse =
+		container_of(w, struct guc_submit_engine, retire_worker);
 
 	/*
 	 * It is possible that another thread issues the schedule disable + that
@@ -1297,17 +1348,17 @@ static void retire_worker_func(struct work_struct *w)
 	 * where nothing needs to be done here. Let's be paranoid and kick the
 	 * tasklet in that case.
 	 */
-	if (guc->submission_stall_reason != STALL_SCHED_DISABLE &&
-	    guc->submission_stall_reason != STALL_GUC_ID_WORKQUEUE) {
-		kick_tasklet(guc);
+	if (gse->submission_stall_reason != STALL_SCHED_DISABLE &&
+	    gse->submission_stall_reason != STALL_GUC_ID_WORKQUEUE) {
+		kick_tasklet(gse);
 		return;
 	}
 
-	if (guc->submission_stall_reason == STALL_SCHED_DISABLE) {
-		GEM_BUG_ON(!guc->stalled_context);
-		GEM_BUG_ON(context_guc_id_invalid(guc->stalled_context));
+	if (gse->submission_stall_reason == STALL_SCHED_DISABLE) {
+		GEM_BUG_ON(!gse->stalled_context);
+		GEM_BUG_ON(context_guc_id_invalid(gse->stalled_context));
 
-		retire_worker_sched_disable(guc, guc->stalled_context);
+		retire_worker_sched_disable(gse, gse->stalled_context);
 	}
 
 	/*
@@ -1315,16 +1366,16 @@ static void retire_worker_func(struct work_struct *w)
 	 * albeit after possibly issuing a schedule disable as that is async
 	 * operation.
 	 */
-	intel_gt_retire_requests(guc_to_gt(guc));
+	intel_gt_retire_requests(guc_to_gt(gse->sched_engine.private_data));
 
-	if (guc->submission_stall_reason == STALL_GUC_ID_WORKQUEUE) {
-		GEM_BUG_ON(guc->stalled_context);
+	if (gse->submission_stall_reason == STALL_GUC_ID_WORKQUEUE) {
+		GEM_BUG_ON(gse->stalled_context);
 
 		/* Hopefully guc_ids are now available, kick tasklet */
-		guc->submission_stall_reason = STALL_GUC_ID_TASKLET;
-		clr_tasklet_blocked(guc);
+		gse->submission_stall_reason = STALL_GUC_ID_TASKLET;
+		clr_tasklet_blocked(gse);
 
-		kick_tasklet(guc);
+		kick_tasklet(gse);
 	}
 }
 
@@ -1355,18 +1406,24 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->guc_id_list_unpinned);
 	ida_init(&guc->guc_ids);
 
-	INIT_WORK(&guc->retire_worker, retire_worker_func);
-
 	return 0;
 }
 
 void intel_guc_submission_fini(struct intel_guc *guc)
 {
+	int i;
+
 	if (!guc->lrc_desc_pool)
 		return;
 
 	guc_lrc_desc_pool_destroy(guc);
-	i915_sched_engine_put(guc->sched_engine);
+
+	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i) {
+		struct i915_sched_engine *sched_engine =
+			guc_to_sched_engine(guc, i);
+
+		i915_sched_engine_put(sched_engine);
+	}
 }
 
 static inline void queue_request(struct i915_sched_engine *sched_engine,
@@ -1381,22 +1438,23 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
 	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 
 	if (empty)
-		kick_tasklet(&rq->engine->gt->uc.guc);
+		kick_tasklet(ce_to_gse(rq->context));
 }
 
 /* Macro to tweak heuristic, using a simple over 50% not ready for now */
 #define TOO_MANY_GUC_IDS_NOT_READY(avail, consumed) \
 	((consumed) > (avail) / 2)
-static bool too_many_guc_ids_not_ready(struct intel_guc *guc,
+static bool too_many_guc_ids_not_ready(struct guc_submit_engine *gse,
 				       struct intel_context *ce)
 {
 	u32 available_guc_ids, guc_ids_consumed;
+	struct intel_guc *guc = gse->sched_engine.private_data;
 
 	available_guc_ids = guc->num_guc_ids;
-	guc_ids_consumed = atomic_read(&guc->num_guc_ids_not_ready);
+	guc_ids_consumed = atomic_read(&gse->num_guc_ids_not_ready);
 
 	if (TOO_MANY_GUC_IDS_NOT_READY(available_guc_ids, guc_ids_consumed)) {
-		set_and_update_guc_ids_exhausted(guc);
+		set_and_update_guc_ids_exhausted(gse);
 		return true;
 	}
 
@@ -1405,34 +1463,36 @@ static bool too_many_guc_ids_not_ready(struct intel_guc *guc,
 
 static void incr_num_rq_not_ready(struct intel_context *ce)
 {
-	struct intel_guc *guc = ce_to_guc(ce);
+	struct guc_submit_engine *gse = ce_to_gse(ce);
 
 	if (!atomic_fetch_add(1, &ce->guc_num_rq_not_ready))
-		atomic_inc(&guc->num_guc_ids_not_ready);
+		atomic_inc(&gse->num_guc_ids_not_ready);
 }
 
 void intel_guc_decr_num_rq_not_ready(struct intel_context *ce)
 {
-	struct intel_guc *guc = ce_to_guc(ce);
+	struct guc_submit_engine *gse = ce_to_gse(ce);
 
-	if (atomic_fetch_add(-1, &ce->guc_num_rq_not_ready) == 1)
-		atomic_dec(&guc->num_guc_ids_not_ready);
+	if (atomic_fetch_add(-1, &ce->guc_num_rq_not_ready) == 1) {
+		GEM_BUG_ON(!atomic_read(&gse->num_guc_ids_not_ready));
+		atomic_dec(&gse->num_guc_ids_not_ready);
+	}
 }
 
-static bool need_tasklet(struct intel_guc *guc, struct intel_context *ce)
+static bool need_tasklet(struct guc_submit_engine *gse, struct intel_context *ce)
 {
-	struct i915_sched_engine * const sched_engine =
-		ce->engine->sched_engine;
+	struct i915_sched_engine * const sched_engine = &gse->sched_engine;
+	struct intel_guc *guc = gse->sched_engine.private_data;
 
 	lockdep_assert_held(&sched_engine->lock);
 
-	return guc_ids_exhausted(guc) || submission_disabled(guc) ||
-		guc->stalled_rq || guc->stalled_context ||
+	return guc_ids_exhausted(gse) || submission_disabled(guc) ||
+		gse->stalled_rq || gse->stalled_context ||
 		!lrc_desc_registered(guc, ce->guc_id) ||
 		!i915_sched_engine_is_empty(sched_engine);
 }
 
-static int guc_bypass_tasklet_submit(struct intel_guc *guc,
+static int gse_bypass_tasklet_submit(struct guc_submit_engine *gse,
 				     struct i915_request *rq)
 {
 	int ret;
@@ -1442,27 +1502,27 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
 	trace_i915_request_in(rq, 0);
 
 	guc_set_lrc_tail(rq);
-	ret = guc_add_request(guc, rq);
+	ret = gse_add_request(gse, rq);
 
 	if (unlikely(ret == -EPIPE))
-		disable_submission(guc);
+		disable_submission(gse->sched_engine.private_data);
 
 	return ret;
 }
 
 static void guc_submit_request(struct i915_request *rq)
 {
-	struct i915_sched_engine *sched_engine = rq->engine->sched_engine;
-	struct intel_guc *guc = &rq->engine->gt->uc.guc;
+	struct guc_submit_engine *gse = ce_to_gse(rq->context);
+	struct i915_sched_engine *sched_engine = &gse->sched_engine;
 	unsigned long flags;
 
 	/* Will be called from irq-context when using foreign fences. */
 	spin_lock_irqsave(&sched_engine->lock, flags);
 
-	if (need_tasklet(guc, rq->context))
+	if (need_tasklet(gse, rq->context))
 		queue_request(sched_engine, rq, rq_prio(rq));
-	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
-		kick_tasklet(guc);
+	else if (gse_bypass_tasklet_submit(gse, rq) == -EBUSY)
+		kick_tasklet(gse);
 
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 
@@ -1541,8 +1601,9 @@ static int steal_guc_id(struct intel_guc *guc, bool unpinned)
 		 * context.
 		 */
 		if (!unpinned) {
-			GEM_BUG_ON(guc->stalled_context);
-			guc->stalled_context = intel_context_get(ce);
+			GEM_BUG_ON(ce_to_gse(ce)->stalled_context);
+
+			ce_to_gse(ce)->stalled_context = intel_context_get(ce);
 			set_context_guc_id_stolen(ce);
 		} else {
 			set_context_guc_id_invalid(ce);
@@ -1604,7 +1665,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
 try_again:
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 
-	if (!tasklet && guc_ids_exhausted(guc)) {
+	if (!tasklet && guc_ids_exhausted(ce_to_gse(ce))) {
 		ret = -EAGAIN;
 		goto out_unlock;
 	}
@@ -2111,7 +2172,7 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
 	intel_wakeref_t wakeref;
 	unsigned long flags;
 
-	guc_flush_submissions(guc);
+	gse_flush_submissions(ce_to_gse(ce));
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	set_context_banned(ce);
@@ -2199,7 +2260,7 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 	with_intel_runtime_pm(runtime_pm, wakeref)
-		__guc_context_sched_disable(guc, ce, guc_id);
+		__guc_context_sched_disable(ce_to_guc(ce), ce, guc_id);
 
 	return;
 unpin:
@@ -2460,7 +2521,7 @@ static void remove_from_context(struct i915_request *rq)
 	if (likely(!request_has_no_guc_id(rq)))
 		atomic_dec(&ce->guc_id_ref);
 	else
-		--ce_to_guc(rq->context)->total_num_rq_with_no_guc_id;
+		--ce_to_gse(rq->context)->total_num_rq_with_no_guc_id;
 	unpin_guc_id(ce_to_guc(ce), ce, false);
 
 	i915_request_notify_execute_cb_imm(rq);
@@ -2521,13 +2582,14 @@ static void invalidate_guc_id_sched_disable(struct intel_context *ce)
 	clr_context_guc_id_stolen(ce);
 }
 
-static void retire_worker_sched_disable(struct intel_guc *guc,
+static void retire_worker_sched_disable(struct guc_submit_engine *gse,
 					struct intel_context *ce)
 {
+	struct intel_guc *guc = gse->sched_engine.private_data;
 	unsigned long flags;
 	bool disabled;
 
-	guc->stalled_context = NULL;
+	gse->stalled_context = NULL;
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	disabled = submission_disabled(guc);
 	if (!disabled && !context_pending_disable(ce) && context_enabled(ce)) {
@@ -2573,10 +2635,10 @@ static void retire_worker_sched_disable(struct intel_guc *guc,
 
 		invalidate_guc_id_sched_disable(ce);
 
-		guc->submission_stall_reason = STALL_REGISTER_CONTEXT;
-		clr_tasklet_blocked(guc);
+		gse->submission_stall_reason = STALL_REGISTER_CONTEXT;
+		clr_tasklet_blocked(gse);
 
-		kick_tasklet(ce_to_guc(ce));
+		kick_tasklet(gse);
 	}
 
 	intel_context_put(ce);
@@ -2589,25 +2651,26 @@ static bool context_needs_lrc_desc_pin(struct intel_context *ce, bool new_guc_id
 		!submission_disabled(ce_to_guc(ce));
 }
 
-static int tasklet_pin_guc_id(struct intel_guc *guc, struct i915_request *rq)
+static int tasklet_pin_guc_id(struct guc_submit_engine *gse,
+			      struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
 	int ret = 0;
 
-	lockdep_assert_held(&guc->sched_engine->lock);
+	lockdep_assert_held(&gse->sched_engine.lock);
 	GEM_BUG_ON(!ce->guc_num_rq_submit_no_id);
 
 	if (atomic_add_unless(&ce->guc_id_ref, ce->guc_num_rq_submit_no_id, 0))
 		goto out;
 
-	ret = pin_guc_id(guc, ce, true);
+	ret = pin_guc_id(gse->sched_engine.private_data, ce, true);
 	if (unlikely(ret < 0)) {
 		/*
 		 * No guc_ids available, disable the tasklet and kick the retire
 		 * workqueue hopefully freeing up some guc_ids.
 		 */
-		guc->stalled_rq = rq;
-		guc->submission_stall_reason = STALL_GUC_ID_WORKQUEUE;
+		gse->stalled_rq = rq;
+		gse->submission_stall_reason = STALL_GUC_ID_WORKQUEUE;
 		return ret;
 	}
 
@@ -2619,14 +2682,14 @@ static int tasklet_pin_guc_id(struct intel_guc *guc, struct i915_request *rq)
 		set_context_needs_register(ce);
 
 	if (ret == NEW_GUC_ID_ENABLED) {
-		guc->stalled_rq = rq;
-		guc->submission_stall_reason = STALL_SCHED_DISABLE;
+		gse->stalled_rq = rq;
+		gse->submission_stall_reason = STALL_SCHED_DISABLE;
 	}
 
 	clear_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
 out:
-	guc->total_num_rq_with_no_guc_id -= ce->guc_num_rq_submit_no_id;
-	GEM_BUG_ON(guc->total_num_rq_with_no_guc_id < 0);
+	gse->total_num_rq_with_no_guc_id -= ce->guc_num_rq_submit_no_id;
+	GEM_BUG_ON(gse->total_num_rq_with_no_guc_id < 0);
 
 	list_for_each_entry_reverse(rq, &ce->guc_active.requests, sched.link)
 		if (request_has_no_guc_id(rq)) {
@@ -2644,7 +2707,7 @@ static int tasklet_pin_guc_id(struct intel_guc *guc, struct i915_request *rq)
 	 * from a context that has scheduling enabled. We have to disable
 	 * scheduling before deregistering the context and it isn't safe to do
 	 * in the tasklet because of lock inversion (ce->guc_state.lock must be
-	 * acquired before guc->sched_engine->lock). To work around this
+	 * acquired before gse->sched_engine.lock). To work around this
 	 * we do the schedule disable in retire workqueue and block the tasklet
 	 * until the schedule done G2H returns. Returning non-zero here kicks
 	 * the workqueue.
@@ -2656,6 +2719,7 @@ static int guc_request_alloc(struct i915_request *rq)
 {
 	struct intel_context *ce = rq->context;
 	struct intel_guc *guc = ce_to_guc(ce);
+	struct guc_submit_engine *gse = ce_to_gse(ce);
 	unsigned long flags;
 	int ret;
 
@@ -2689,8 +2753,8 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * ready to submit). Don't allocate one here, defer to submission in the
 	 * tasklet.
 	 */
-	if (test_and_update_guc_ids_exhausted(guc) ||
-	    too_many_guc_ids_not_ready(guc, ce)) {
+	if (test_and_update_guc_ids_exhausted(gse) ||
+	    too_many_guc_ids_not_ready(gse, ce)) {
 		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
 		goto out;
 	}
@@ -2723,7 +2787,7 @@ static int guc_request_alloc(struct i915_request *rq)
 		 * submissions we return to allocating guc_ids in this function.
 		 */
 		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
-		set_and_update_guc_ids_exhausted(guc);
+		set_and_update_guc_ids_exhausted(gse);
 		incr_num_rq_not_ready(ce);
 
 		return 0;
@@ -3131,17 +3195,41 @@ static void guc_sched_engine_destroy(struct kref *kref)
 {
 	struct i915_sched_engine *sched_engine =
 		container_of(kref, typeof(*sched_engine), ref);
-	struct intel_guc *guc = sched_engine->private_data;
+	struct guc_submit_engine *gse =
+		container_of(sched_engine, typeof(*gse), sched_engine);
+	struct intel_guc *guc = gse->sched_engine.private_data;
 
-	guc->sched_engine = NULL;
+	guc->gse[gse->id] = NULL;
 	tasklet_kill(&sched_engine->tasklet); /* flush the callback */
-	kfree(sched_engine);
+	kfree(gse);
+}
+
+static void guc_submit_engine_init(struct intel_guc *guc,
+				   struct guc_submit_engine *gse,
+				   int id)
+{
+	struct i915_sched_engine *sched_engine = &gse->sched_engine;
+
+	i915_sched_engine_init(sched_engine, ENGINE_VIRTUAL);
+	INIT_WORK(&gse->retire_worker, retire_worker_func);
+	tasklet_setup(&sched_engine->tasklet, gse_submission_tasklet);
+	sched_engine->schedule = i915_schedule;
+	sched_engine->disabled = guc_sched_engine_disabled;
+	sched_engine->destroy = guc_sched_engine_destroy;
+	sched_engine->bump_inflight_request_prio =
+		guc_bump_inflight_request_prio;
+	sched_engine->retire_inflight_request_prio =
+		guc_retire_inflight_request_prio;
+	sched_engine->private_data = guc;
+	gse->id = id;
 }
 
 int intel_guc_submission_setup(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *i915 = engine->i915;
 	struct intel_guc *guc = &engine->gt->uc.guc;
+	struct i915_sched_engine *sched_engine;
+	int ret, i;
 
 	/*
 	 * The setup relies on several assumptions (e.g. irqs always enabled)
@@ -3149,24 +3237,20 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
 	 */
 	GEM_BUG_ON(GRAPHICS_VER(i915) < 11);
 
-	if (!guc->sched_engine) {
-		guc->sched_engine = i915_sched_engine_create(ENGINE_VIRTUAL);
-		if (!guc->sched_engine)
-			return -ENOMEM;
-
-		guc->sched_engine->schedule = i915_schedule;
-		guc->sched_engine->disabled = guc_sched_engine_disabled;
-		guc->sched_engine->private_data = guc;
-		guc->sched_engine->destroy = guc_sched_engine_destroy;
-		guc->sched_engine->bump_inflight_request_prio =
-			guc_bump_inflight_request_prio;
-		guc->sched_engine->retire_inflight_request_prio =
-			guc_retire_inflight_request_prio;
-		tasklet_setup(&guc->sched_engine->tasklet,
-			      guc_submission_tasklet);
+	if (!guc->gse[0]) {
+		for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i) {
+			guc->gse[i] = kzalloc(sizeof(*guc->gse[i]), GFP_KERNEL);
+			if (!guc->gse[i]) {
+				ret = -ENOMEM;
+				goto put_sched_engine;
+			}
+			guc_submit_engine_init(guc, guc->gse[i], i);
+		}
 	}
+
+	sched_engine = guc_to_sched_engine(guc, GUC_SUBMIT_ENGINE_SINGLE_LRC);
 	i915_sched_engine_put(engine->sched_engine);
-	engine->sched_engine = i915_sched_engine_get(guc->sched_engine);
+	engine->sched_engine = i915_sched_engine_get(sched_engine);
 
 	guc_default_vfuncs(engine);
 	guc_default_irqs(engine);
@@ -3182,6 +3266,16 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
 	engine->release = guc_release;
 
 	return 0;
+
+put_sched_engine:
+	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i) {
+		struct i915_sched_engine *sched_engine =
+			guc_to_sched_engine(guc, i);
+
+		if (sched_engine)
+			i915_sched_engine_put(sched_engine);
+	}
+	return ret;
 }
 
 void intel_guc_submission_enable(struct intel_guc *guc)
@@ -3282,14 +3376,16 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 			register_context(ce, true);
 		guc_signal_context_fence(ce);
 		if (context_block_tasklet(ce)) {
-			GEM_BUG_ON(guc->submission_stall_reason !=
+			struct guc_submit_engine *gse = ce_to_gse(ce);
+
+			GEM_BUG_ON(gse->submission_stall_reason !=
 				   STALL_DEREGISTER_CONTEXT);
 
 			clr_context_block_tasklet(ce);
-			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
-			clr_tasklet_blocked(guc);
+			gse->submission_stall_reason = STALL_MOVE_LRC_TAIL;
+			clr_tasklet_blocked(gse);
 
-			kick_tasklet(ce_to_guc(ce));
+			kick_tasklet(gse);
 		}
 		intel_context_put(ce);
 	} else if (context_destroyed(ce)) {
@@ -3355,11 +3451,13 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
 		if (context_block_tasklet(ce)) {
+			struct guc_submit_engine *gse = ce_to_gse(ce);
+
 			clr_context_block_tasklet(ce);
-			guc->submission_stall_reason = STALL_REGISTER_CONTEXT;
-			clr_tasklet_blocked(guc);
+			gse->submission_stall_reason = STALL_REGISTER_CONTEXT;
+			clr_tasklet_blocked(gse);
 
-			kick_tasklet(ce_to_guc(ce));
+			kick_tasklet(gse);
 		}
 
 		if (banned) {
@@ -3391,7 +3489,7 @@ static void capture_error_state(struct intel_guc *guc,
 static void guc_context_replay(struct intel_context *ce)
 {
 	__guc_reset_context(ce, true);
-	kick_tasklet(ce_to_guc(ce));
+	kick_tasklet(ce_to_gse(ce));
 }
 
 static void guc_handle_context_reset(struct intel_guc *guc,
@@ -3536,35 +3634,32 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
 	}
 }
 
-void intel_guc_submission_print_info(struct intel_guc *guc,
-				     struct drm_printer *p)
+static void gse_log_submission_info(struct guc_submit_engine *gse,
+				    struct drm_printer *p, int id)
 {
-	struct i915_sched_engine *sched_engine = guc->sched_engine;
+	struct i915_sched_engine *sched_engine = &gse->sched_engine;
 	struct rb_node *rb;
 	unsigned long flags;
 
 	if (!sched_engine)
 		return;
 
-	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
-		   atomic_read(&guc->outstanding_submission_g2h));
-	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
-	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
-	drm_printf(p, "GuC tasklet count: %u\n",
+	drm_printf(p, "GSE[%d] tasklet count: %u\n", id,
 		   atomic_read(&sched_engine->tasklet.count));
-	drm_printf(p, "GuC submit flags: 0x%04lx\n", guc->flags);
-	drm_printf(p, "GuC total number request without guc_id: %d\n",
-		   guc->total_num_rq_with_no_guc_id);
-	drm_printf(p, "GuC Number GuC IDs not ready: %d\n",
-		   atomic_read(&guc->num_guc_ids_not_ready));
-	drm_printf(p, "GuC stall reason: %d\n", guc->submission_stall_reason);
-	drm_printf(p, "GuC stalled request: %s\n",
-		   yesno(guc->stalled_rq));
-	drm_printf(p, "GuC stalled context: %s\n\n",
-		   yesno(guc->stalled_context));
+	drm_printf(p, "GSE[%d] submit flags: 0x%04lx\n", id, gse->flags);
+	drm_printf(p, "GSE[%d] total number request without guc_id: %d\n",
+		   id, gse->total_num_rq_with_no_guc_id);
+	drm_printf(p, "GSE[%d] Number GuC IDs not ready: %d\n",
+		   id, atomic_read(&gse->num_guc_ids_not_ready));
+	drm_printf(p, "GSE[%d] stall reason: %d\n",
+		   id, gse->submission_stall_reason);
+	drm_printf(p, "GSE[%d] stalled request: %s\n",
+		   id, yesno(gse->stalled_rq));
+	drm_printf(p, "GSE[%d] stalled context: %s\n\n",
+		   id, yesno(gse->stalled_context));
 
 	spin_lock_irqsave(&sched_engine->lock, flags);
-	drm_printf(p, "Requests in GuC submit tasklet:\n");
+	drm_printf(p, "Requests in GSE[%d] submit tasklet:\n", id);
 	for (rb = rb_first_cached(&sched_engine->queue); rb; rb = rb_next(rb)) {
 		struct i915_priolist *pl = to_priolist(rb);
 		struct i915_request *rq;
@@ -3594,6 +3689,20 @@ static inline void guc_log_context_priority(struct drm_printer *p,
 	drm_printf(p, "\n");
 }
 
+void intel_guc_submission_print_info(struct intel_guc *guc,
+				     struct drm_printer *p)
+{
+	int i;
+
+	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
+		   atomic_read(&guc->outstanding_submission_g2h));
+	drm_printf(p, "GuC Number GuC IDs: %d\n", guc->num_guc_ids);
+	drm_printf(p, "GuC Max Number GuC IDs: %d\n\n", guc->max_guc_ids);
+
+	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i)
+		gse_log_submission_info(guc->gse[i], p, i);
+}
+
 void intel_guc_submission_print_context_info(struct intel_guc *guc,
 					     struct drm_printer *p)
 {
@@ -3627,6 +3736,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 {
 	struct guc_virtual_engine *ve;
 	struct intel_guc *guc;
+	struct i915_sched_engine *sched_engine;
 	unsigned int n;
 	int err;
 
@@ -3635,6 +3745,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 		return ERR_PTR(-ENOMEM);
 
 	guc = &siblings[0]->gt->uc.guc;
+	sched_engine = guc_to_sched_engine(guc, GUC_SUBMIT_ENGINE_SINGLE_LRC);
 
 	ve->base.i915 = siblings[0]->i915;
 	ve->base.gt = siblings[0]->gt;
@@ -3648,7 +3759,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 
 	snprintf(ve->base.name, sizeof(ve->base.name), "virtual");
 
-	ve->base.sched_engine = i915_sched_engine_get(guc->sched_engine);
+	ve->base.sched_engine = i915_sched_engine_get(sched_engine);
 
 	ve->base.cops = &virtual_guc_context_ops;
 	ve->base.request_alloc = guc_request_alloc;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
new file mode 100644
index 000000000000..0c224ab18c02
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2014-2019 Intel Corporation
+ */
+
+#ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
+#define _INTEL_GUC_SUBMISSION_TYPES_H_
+
+#include "gt/intel_engine_types.h"
+#include "gt/intel_context_types.h"
+#include "i915_scheduler_types.h"
+
+struct intel_guc;
+struct i915_request;
+
+/* GuC Virtual Engine */
+struct guc_virtual_engine {
+	struct intel_engine_cs base;
+	struct intel_context context;
+};
+
+/*
+ * Object which encapsulates the globally operated on i915_sched_engine +
+ * the GuC submission state machine described in intel_guc_submission.c.
+ */
+struct guc_submit_engine {
+	struct i915_sched_engine sched_engine;
+	struct work_struct retire_worker;
+	struct i915_request *stalled_rq;
+	struct intel_context *stalled_context;
+	unsigned long flags;
+	int total_num_rq_with_no_guc_id;
+	atomic_t num_guc_ids_not_ready;
+	int id;
+
+	/*
+	 * Submission stall reason. See intel_guc_submission.c for detailed
+	 * description.
+	 */
+	enum {
+		STALL_NONE,
+		STALL_GUC_ID_WORKQUEUE,
+		STALL_GUC_ID_TASKLET,
+		STALL_SCHED_DISABLE,
+		STALL_REGISTER_CONTEXT,
+		STALL_DEREGISTER_CONTEXT,
+		STALL_MOVE_LRC_TAIL,
+		STALL_ADD_REQUEST,
+	} submission_stall_reason;
+};
+
+#endif
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 762127dd56c5..1e7eb49e374c 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -448,15 +448,9 @@ static bool default_disabled(struct i915_sched_engine *sched_engine)
 	return false;
 }
 
-struct i915_sched_engine *
-i915_sched_engine_create(unsigned int subclass)
+void i915_sched_engine_init(struct i915_sched_engine *sched_engine,
+			    unsigned int subclass)
 {
-	struct i915_sched_engine *sched_engine;
-
-	sched_engine = kzalloc(sizeof(*sched_engine), GFP_KERNEL);
-	if (!sched_engine)
-		return NULL;
-
 	kref_init(&sched_engine->ref);
 
 	sched_engine->queue = RB_ROOT_CACHED;
@@ -481,6 +475,18 @@ i915_sched_engine_create(unsigned int subclass)
 	lock_map_release(&sched_engine->lock.dep_map);
 	local_irq_enable();
 #endif
+}
+
+struct i915_sched_engine *
+i915_sched_engine_create(unsigned int subclass)
+{
+	struct i915_sched_engine *sched_engine;
+
+	sched_engine = kzalloc(sizeof(*sched_engine), GFP_KERNEL);
+	if (!sched_engine)
+		return NULL;
+
+	i915_sched_engine_init(sched_engine, subclass);
 
 	return sched_engine;
 }
diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
index 0b9b86af6c7f..4e4ef32b2cbc 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.h
+++ b/drivers/gpu/drm/i915/i915_scheduler.h
@@ -48,6 +48,9 @@ static inline void i915_priolist_free(struct i915_priolist *p)
 		__i915_priolist_free(p);
 }
 
+void i915_sched_engine_init(struct i915_sched_engine *sched_engine,
+			    unsigned int subclass);
+
 struct i915_sched_engine *
 i915_sched_engine_create(unsigned int subclass);
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 06/46] drm/i915/guc: Check return of __xa_store when registering a context
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (4 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 05/46] drm/i915/guc: Introduce guc_submit_engine object Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 07/46] drm/i915/guc: Non-static lrc descriptor registration buffer Matthew Brost
                   ` (44 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Check the return value of __xa_store when registering a context, as it
can fail in the rare case that memory cannot be allocated. If this
occurs, fall back on the tasklet flow control mechanism and try again in
the future.
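
For context, the store uses GFP_ATOMIC because it runs under
xa_lock_irqsave(), so the xarray cannot sleep to allocate internal nodes
and the store can legitimately fail; __xa_store() reports this by
returning an error entry that is tested with xa_is_err(). A minimal
sketch of the check (condensing the hunk below):

  	void *ret;

  	xa_lock_irqsave(&guc->context_lookup, flags);
  	ret = __xa_store(&guc->context_lookup, id, ce, GFP_ATOMIC);
  	xa_unlock_irqrestore(&guc->context_lookup, flags);

  	/* An allocation failure is encoded in the returned entry */
  	if (unlikely(xa_is_err(ret)))
  		return -EBUSY;	/* defer to tasklet flow control, try later */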

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c    | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 842094de848d..a7f7174b5343 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -505,18 +505,24 @@ static inline bool lrc_desc_registered(struct intel_guc *guc, u32 id)
 	return __get_context(guc, id);
 }
 
-static inline void set_lrc_desc_registered(struct intel_guc *guc, u32 id,
-					   struct intel_context *ce)
+static inline int set_lrc_desc_registered(struct intel_guc *guc, u32 id,
+					  struct intel_context *ce)
 {
 	unsigned long flags;
+	void *ret;
 
 	/*
 	 * xarray API doesn't have xa_save_irqsave wrapper, so calling the
 	 * lower level functions directly.
 	 */
 	xa_lock_irqsave(&guc->context_lookup, flags);
-	__xa_store(&guc->context_lookup, id, ce, GFP_ATOMIC);
+	ret = __xa_store(&guc->context_lookup, id, ce, GFP_ATOMIC);
 	xa_unlock_irqrestore(&guc->context_lookup, flags);
+
+	if (unlikely(xa_is_err(ret)))
+		return -EBUSY;	/* Try again in future */
+
+	return 0;
 }
 
 static int guc_submission_send_busy_loop(struct intel_guc *guc,
@@ -1854,7 +1860,9 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	rcu_read_unlock();
 
 	reset_lrc_desc(guc, desc_idx);
-	set_lrc_desc_registered(guc, desc_idx, ce);
+	ret = set_lrc_desc_registered(guc, desc_idx, ce);
+	if (unlikely(ret))
+		return ret;
 
 	desc = __get_lrc_desc(guc, desc_idx);
 	desc->engine_class = engine_class_to_guc_class(engine->class);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 07/46] drm/i915/guc: Non-static lrc descriptor registration buffer
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (5 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 06/46] drm/i915/guc: Check return of __xa_store when registering a context Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 08/46] drm/i915/guc: Take GT PM ref when deregistering context Matthew Brost
                   ` (43 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Dynamically allocate space for lrc descriptor registration with the GuC
rather than using a large static buffer indexed by the guc_id. If no
space is available to register a context, fall back to the tasklet flow
control mechanism. Only allow 1/2 of the space to be allocated outside
the tasklet to prevent unready requests/contexts from consuming all
registration space.
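
The 1/2-space limit is simply an upper bound on the ida allocation;
roughly (condensed from alloc_lrcd_reg_idx() in the patch below):

    /*
     * Outside the tasklet (i.e. not under flow control) only hand out
     * slots from the lower half of the registration space so unready
     * requests / contexts cannot consume every slot.
     */
    max = tasklet ? guc->lrcd_reg.max_idx : guc->lrcd_reg.max_idx / 2;
    gfp = tasklet ? GFP_ATOMIC :
                    GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN;

    ret = ida_simple_get(&guc->lrcd_reg.ida, 0, max, gfp);
    if (ret < 0)
        return -EBUSY;  /* defer to the tasklet flow control */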

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   9 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 202 ++++++++++++------
 3 files changed, 150 insertions(+), 64 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index c01530d7dc67..2df79ba39867 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -179,6 +179,9 @@ struct intel_context {
 	/* GuC scheduling state flags that do not require a lock. */
 	atomic_t guc_sched_state_no_lock;
 
+	/* GuC lrc descriptor registration buffer */
+	unsigned int guc_lrcd_reg_idx;
+
 	/* GuC LRC descriptor ID */
 	u16 guc_id;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 8ac016201658..c0a12ae95ba5 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -69,8 +69,13 @@ struct intel_guc {
 	u32 ads_regset_size;
 	u32 ads_golden_ctxt_size;
 
-	struct i915_vma *lrc_desc_pool;
-	void *lrc_desc_pool_vaddr;
+	/* GuC LRC descriptor registration */
+	struct {
+		struct i915_vma *vma;
+		void *vaddr;
+		struct ida ida;
+		unsigned int max_idx;
+	} lrcd_reg;
 
 	/* guc_id to intel_context lookup */
 	struct xarray context_lookup;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a7f7174b5343..bfda15bf9182 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -439,65 +439,54 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 	return rb_entry(rb, struct i915_priolist, node);
 }
 
-static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, u32 index)
+static u32 __get_lrc_desc_offset(struct intel_guc *guc, int index)
 {
-	struct guc_lrc_desc *base = guc->lrc_desc_pool_vaddr;
-
+	GEM_BUG_ON(index >= guc->lrcd_reg.max_idx);
 	GEM_BUG_ON(index >= guc->max_guc_ids);
 
-	return &base[index];
+	return intel_guc_ggtt_offset(guc, guc->lrcd_reg.vma) +
+		(index * sizeof(struct guc_lrc_desc));
 }
 
-static inline struct intel_context *__get_context(struct intel_guc *guc, u32 id)
+static struct guc_lrc_desc *__get_lrc_desc(struct intel_guc *guc, int index)
 {
-	struct intel_context *ce = xa_load(&guc->context_lookup, id);
+	struct guc_lrc_desc *desc;
 
-	GEM_BUG_ON(id >= guc->max_guc_ids);
+	GEM_BUG_ON(index >= guc->lrcd_reg.max_idx);
+	GEM_BUG_ON(index >= guc->max_guc_ids);
 
-	return ce;
+	desc = guc->lrcd_reg.vaddr;
+	desc = &desc[index];
+	memset(desc, 0, sizeof(*desc));
+
+	return desc;
 }
 
-static int guc_lrc_desc_pool_create(struct intel_guc *guc)
+static inline struct intel_context *__get_context(struct intel_guc *guc, u32 id)
 {
-	u32 size;
-	int ret;
-
-	size = PAGE_ALIGN(sizeof(struct guc_lrc_desc) * guc->max_guc_ids);
-	ret = intel_guc_allocate_and_map_vma(guc, size, &guc->lrc_desc_pool,
-					     (void **)&guc->lrc_desc_pool_vaddr);
-	if (ret)
-		return ret;
+	struct intel_context *ce = xa_load(&guc->context_lookup, id);
 
-	return 0;
-}
+	GEM_BUG_ON(id >= guc->max_guc_ids);
 
-static void guc_lrc_desc_pool_destroy(struct intel_guc *guc)
-{
-	guc->lrc_desc_pool_vaddr = NULL;
-	i915_vma_unpin_and_release(&guc->lrc_desc_pool, I915_VMA_RELEASE_MAP);
+	return ce;
 }
 
 static inline bool guc_submission_initialized(struct intel_guc *guc)
 {
-	return !!guc->lrc_desc_pool_vaddr;
+	return !!guc->lrcd_reg.max_idx;
 }
 
-static inline void reset_lrc_desc(struct intel_guc *guc, u32 id)
+static inline void clr_lrc_desc_registered(struct intel_guc *guc, u32 id)
 {
-	if (likely(guc_submission_initialized(guc))) {
-		struct guc_lrc_desc *desc = __get_lrc_desc(guc, id);
-		unsigned long flags;
-
-		memset(desc, 0, sizeof(*desc));
+	unsigned long flags;
 
-		/*
-		 * xarray API doesn't have xa_erase_irqsave wrapper, so calling
-		 * the lower level functions directly.
-		 */
-		xa_lock_irqsave(&guc->context_lookup, flags);
-		__xa_erase(&guc->context_lookup, id);
-		xa_unlock_irqrestore(&guc->context_lookup, flags);
-	}
+	/*
+	 * xarray API doesn't have xa_erase_irqsave wrapper, so calling
+	 * the lower level functions directly.
+	 */
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	__xa_erase(&guc->context_lookup, id);
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
 }
 
 static inline bool lrc_desc_registered(struct intel_guc *guc, u32 id)
@@ -1385,6 +1374,9 @@ static void retire_worker_func(struct work_struct *w)
 	}
 }
 
+static int guc_lrcd_reg_init(struct intel_guc *guc);
+static void guc_lrcd_reg_fini(struct intel_guc *guc);
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
@@ -1393,17 +1385,12 @@ int intel_guc_submission_init(struct intel_guc *guc)
 {
 	int ret;
 
-	if (guc->lrc_desc_pool)
+	if (guc_submission_initialized(guc))
 		return 0;
 
-	ret = guc_lrc_desc_pool_create(guc);
+	ret = guc_lrcd_reg_init(guc);
 	if (ret)
 		return ret;
-	/*
-	 * Keep static analysers happy, let them know that we allocated the
-	 * vma after testing that it didn't exist earlier.
-	 */
-	GEM_BUG_ON(!guc->lrc_desc_pool);
 
 	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
 
@@ -1419,10 +1406,10 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 {
 	int i;
 
-	if (!guc->lrc_desc_pool)
+	if (!guc_submission_initialized(guc))
 		return;
 
-	guc_lrc_desc_pool_destroy(guc);
+	guc_lrcd_reg_fini(guc);
 
 	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i) {
 		struct i915_sched_engine *sched_engine =
@@ -1495,6 +1482,7 @@ static bool need_tasklet(struct guc_submit_engine *gse, struct intel_context *ce
 	return guc_ids_exhausted(gse) || submission_disabled(guc) ||
 		gse->stalled_rq || gse->stalled_context ||
 		!lrc_desc_registered(guc, ce->guc_id) ||
+		context_needs_register(ce) ||
 		!i915_sched_engine_is_empty(sched_engine);
 }
 
@@ -1546,7 +1534,7 @@ static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	if (!context_guc_id_invalid(ce)) {
 		ida_simple_remove(&guc->guc_ids, ce->guc_id);
-		reset_lrc_desc(guc, ce->guc_id);
+		clr_lrc_desc_registered(guc, ce->guc_id);
 		set_context_guc_id_invalid(ce);
 	}
 	if (!list_empty(&ce->guc_id_link))
@@ -1743,14 +1731,14 @@ static void unpin_guc_id(struct intel_guc *guc,
 }
 
 static int __guc_action_register_context(struct intel_guc *guc,
+					 struct intel_context *ce,
 					 u32 guc_id,
-					 u32 offset,
 					 bool loop)
 {
 	u32 action[] = {
 		INTEL_GUC_ACTION_REGISTER_CONTEXT,
 		guc_id,
-		offset,
+		__get_lrc_desc_offset(guc, ce->guc_lrcd_reg_idx),
 	};
 
 	return guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action),
@@ -1760,13 +1748,11 @@ static int __guc_action_register_context(struct intel_guc *guc,
 static int register_context(struct intel_context *ce, bool loop)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
-	u32 offset = intel_guc_ggtt_offset(guc, guc->lrc_desc_pool) +
-		ce->guc_id * sizeof(struct guc_lrc_desc);
 	int ret;
 
 	trace_intel_context_register(ce);
 
-	ret = __guc_action_register_context(guc, ce->guc_id, offset, loop);
+	ret = __guc_action_register_context(guc, ce, ce->guc_id, loop);
 	if (likely(!ret))
 		set_context_registered(ce);
 
@@ -1828,6 +1814,86 @@ static void guc_context_policy_init(struct intel_engine_cs *engine,
 
 static inline u8 map_i915_prio_to_guc_prio(int prio);
 
+static int alloc_lrcd_reg_idx_buffer(struct intel_guc *guc, int num_per_vma)
+{
+	u32 size = num_per_vma * sizeof(struct guc_lrc_desc);
+	struct i915_vma **vma = &guc->lrcd_reg.vma;
+	void **vaddr = &guc->lrcd_reg.vaddr;
+	int ret;
+
+	GEM_BUG_ON(!is_power_of_2(size));
+
+	ret = intel_guc_allocate_and_map_vma(guc, size, vma, vaddr);
+	if (unlikely(ret))
+		return ret;
+
+	guc->lrcd_reg.max_idx += num_per_vma;
+
+	return 0;
+}
+
+static int alloc_lrcd_reg_idx(struct intel_guc *guc, bool tasklet)
+{
+	int ret;
+	gfp_t gfp = tasklet ? GFP_ATOMIC :
+		GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
+
+	might_sleep_if(!tasklet);
+
+	/*
+	 * We only allow 1/2 of the space to be allocated outside of tasklet
+	 * (flow control) to ensure requests that are not ready don't consume
+	 * all context registration space.
+	 */
+	ret = ida_simple_get(&guc->lrcd_reg.ida, 0,
+			     tasklet ? guc->lrcd_reg.max_idx :
+			     guc->lrcd_reg.max_idx / 2, gfp);
+	if (unlikely(ret < 0))
+		return -EBUSY;
+
+	return ret;
+}
+
+static void __free_lrcd_reg_idx(struct intel_guc *guc, struct intel_context *ce)
+{
+	if (ce->guc_lrcd_reg_idx && guc->lrcd_reg.max_idx) {
+		ida_simple_remove(&guc->lrcd_reg.ida, ce->guc_lrcd_reg_idx);
+		ce->guc_lrcd_reg_idx = 0;
+	}
+}
+
+static void free_lrcd_reg_idx(struct intel_guc *guc, struct intel_context *ce)
+{
+	__free_lrcd_reg_idx(guc, ce);
+}
+
+static int guc_lrcd_reg_init(struct intel_guc *guc)
+{
+	unsigned int buffer_size = I915_GTT_PAGE_SIZE_4K * 16;
+	int ret;
+
+	ida_init(&guc->lrcd_reg.ida);
+
+	ret = alloc_lrcd_reg_idx_buffer(guc, buffer_size /
+					sizeof(struct guc_lrc_desc));
+	if (unlikely(ret))
+		return ret;
+
+	/* Zero is reserved */
+	ret = alloc_lrcd_reg_idx(guc, false);
+	GEM_BUG_ON(ret);
+
+	return ret;
+}
+
+static void guc_lrcd_reg_fini(struct intel_guc *guc)
+{
+	i915_vma_unpin_and_release(&guc->lrcd_reg.vma,
+				   I915_VMA_RELEASE_MAP);
+	ida_destroy(&guc->lrcd_reg.ida);
+	guc->lrcd_reg.max_idx = 0;
+}
+
 static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 {
 	struct intel_engine_cs *engine = ce->engine;
@@ -1851,6 +1917,14 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	GEM_BUG_ON(i915_gem_object_is_lmem(guc->ct.vma->obj) !=
 		   i915_gem_object_is_lmem(ce->ring->vma->obj));
 
+	/* Allocate space for registration */
+	if (likely(!ce->guc_lrcd_reg_idx)) {
+		ret = alloc_lrcd_reg_idx(guc, !loop);
+		if (unlikely(ret < 0))
+			return ret;
+		ce->guc_lrcd_reg_idx = ret;
+	}
+
 	context_registered = lrc_desc_registered(guc, desc_idx);
 
 	rcu_read_lock();
@@ -1859,12 +1933,11 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 		prio = ctx->sched.priority;
 	rcu_read_unlock();
 
-	reset_lrc_desc(guc, desc_idx);
 	ret = set_lrc_desc_registered(guc, desc_idx, ce);
 	if (unlikely(ret))
 		return ret;
 
-	desc = __get_lrc_desc(guc, desc_idx);
+	desc = __get_lrc_desc(guc, ce->guc_lrcd_reg_idx);
 	desc->engine_class = engine_class_to_guc_class(engine->class);
 	desc->engine_submit_mask = adjust_engine_mask(engine->class,
 						      engine->mask);
@@ -1902,7 +1975,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 			}
 			spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 			if (unlikely(disabled)) {
-				reset_lrc_desc(guc, desc_idx);
+				clr_lrc_desc_registered(guc, desc_idx);
 				return 0;	/* Will get registered later */
 			}
 		}
@@ -1930,7 +2003,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 		with_intel_runtime_pm(runtime_pm, wakeref)
 			ret = register_context(ce, loop);
 		if (unlikely(ret == -EBUSY))
-			reset_lrc_desc(guc, desc_idx);
+			clr_lrc_desc_registered(guc, desc_idx);
 		else if (unlikely(ret == -ENODEV))
 			ret = 0;	/* Will get registered later */
 	}
@@ -2296,6 +2369,7 @@ static void __guc_context_destroy(struct intel_context *ce)
 
 	lrc_fini(ce);
 	intel_context_fini(ce);
+	__free_lrcd_reg_idx(ce_to_guc(ce), ce);
 
 	if (intel_engine_is_virtual(ce->engine)) {
 		struct guc_virtual_engine *ve =
@@ -2807,11 +2881,11 @@ static int guc_request_alloc(struct i915_request *rq)
 
 	if (context_needs_lrc_desc_pin(ce, !!ret)) {
 		ret = guc_lrc_desc_pin(ce, true);
-		if (unlikely(ret)) {	/* unwind */
-			if (ret == -EPIPE) {
-				disable_submission(guc);
-				goto out;	/* GPU will be reset */
-			}
+		if (unlikely(ret == -EBUSY))
+			set_context_needs_register(ce);
+		else if (ret == -EPIPE)
+			disable_submission(guc); /* GPU will be reset */
+		else if (unlikely(ret)) {	/* unwind */
 			atomic_dec(&ce->guc_id_ref);
 			unpin_guc_id(guc, ce, true);
 			return ret;
@@ -3438,6 +3512,8 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
 
 	if (context_pending_enable(ce)) {
 		clr_context_pending_enable(ce);
+
+		free_lrcd_reg_idx(guc, ce);
 	} else if (context_pending_disable(ce)) {
 		bool banned;
 
@@ -3706,6 +3782,8 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 		   atomic_read(&guc->outstanding_submission_g2h));
 	drm_printf(p, "GuC Number GuC IDs: %d\n", guc->num_guc_ids);
 	drm_printf(p, "GuC Max Number GuC IDs: %d\n\n", guc->max_guc_ids);
+	drm_printf(p, "GuC max context registered: %u\n\n",
+		   guc->lrcd_reg.max_idx);
 
 	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i)
 		gse_log_submission_info(guc->gse[i], p, i);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 08/46] drm/i915/guc: Take GT PM ref when deregistering context
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (6 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 07/46] drm/i915/guc: Non-static lrc descriptor registration buffer Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 09/46] drm/i915: Add GT PM unpark worker Matthew Brost
                   ` (42 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Take a GT PM reference to prevent intel_gt_wait_for_idle from short
circuiting while a context deregistration H2G is in flight.
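
Roughly, the lifetime of the reference looks like this (a hand-wavy
sketch pieced together from the patch below; the real code also handles
the submission-disabled race under ce->guc_state.lock):

    /* when the deregister H2G is issued (guc_lrc_desc_unpin) */
    __intel_gt_pm_get(gt);      /* keeps intel_gt_wait_for_idle waiting */
    set_context_destroyed(ce);
    deregister_context(ce, ce->guc_id, true);

    /* when the matching G2H arrives (deregister done / reset scrub) */
    intel_gt_pm_put_async(guc_to_gt(guc));
    release_guc_id(guc, ce);
    __guc_context_destroy(ce);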

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_pm.h     |  5 +
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         | 13 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  4 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 99 +++++++++++++++----
 4 files changed, 102 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.h b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
index 70ea46d6cfb0..17a5028ea177 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.h
@@ -16,6 +16,11 @@ intel_engine_pm_is_awake(const struct intel_engine_cs *engine)
 	return intel_wakeref_is_active(&engine->wakeref);
 }
 
+static inline void __intel_engine_pm_get(struct intel_engine_cs *engine)
+{
+	__intel_wakeref_get(&engine->wakeref);
+}
+
 static inline void intel_engine_pm_get(struct intel_engine_cs *engine)
 {
 	intel_wakeref_get(&engine->wakeref);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index d0588d8aaa44..a17bf0d4592b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -41,6 +41,19 @@ static inline void intel_gt_pm_put_async(struct intel_gt *gt)
 	intel_wakeref_put_async(&gt->wakeref);
 }
 
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)
+#define with_intel_gt_pm_async(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)
+#define with_intel_gt_pm_if_awake(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)
+#define with_intel_gt_pm_if_awake_async(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)
+
 static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
 {
 	return intel_wakeref_wait_for_idle(&gt->wakeref);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index c0a12ae95ba5..72fdfa1f6ccd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -61,6 +61,10 @@ struct intel_guc {
 	struct list_head guc_id_list_no_ref;
 	struct list_head guc_id_list_unpinned;
 
+	spinlock_t destroy_lock;	/* protects list / worker */
+	struct list_head destroyed_contexts;
+	struct work_struct destroy_worker;
+
 	bool submission_supported;
 	bool submission_selected;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index bfda15bf9182..262fa77b56e2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -914,6 +914,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 			if (deregister)
 				guc_signal_context_fence(ce);
 			if (destroyed) {
+				intel_gt_pm_put_async(guc_to_gt(guc));
 				release_guc_id(guc, ce);
 				__guc_context_destroy(ce);
 			}
@@ -1032,6 +1033,8 @@ static void guc_flush_submissions(struct intel_guc *guc)
 			gse_flush_submissions(guc->gse[i]);
 }
 
+static void guc_flush_destroyed_contexts(struct intel_guc *guc);
+
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
 	int i;
@@ -1050,6 +1053,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	spin_unlock_irq(&guc_to_gt(guc)->irq_lock);
 
 	guc_flush_submissions(guc);
+	guc_flush_destroyed_contexts(guc);
 
 	/*
 	 * Handle any outstanding G2Hs before reset. Call IRQ handler directly
@@ -1377,6 +1381,8 @@ static void retire_worker_func(struct work_struct *w)
 static int guc_lrcd_reg_init(struct intel_guc *guc);
 static void guc_lrcd_reg_fini(struct intel_guc *guc);
 
+static void destroy_worker_func(struct work_struct *w);
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
@@ -1399,6 +1405,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->guc_id_list_unpinned);
 	ida_init(&guc->guc_ids);
 
+	spin_lock_init(&guc->destroy_lock);
+	INIT_LIST_HEAD(&guc->destroyed_contexts);
+	INIT_WORK(&guc->destroy_worker, destroy_worker_func);
+
 	return 0;
 }
 
@@ -1409,6 +1419,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 	if (!guc_submission_initialized(guc))
 		return;
 
+	guc_flush_destroyed_contexts(guc);
 	guc_lrcd_reg_fini(guc);
 
 	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i) {
@@ -2351,11 +2362,29 @@ static void guc_context_sched_disable(struct intel_context *ce)
 static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
+	struct intel_gt *gt = guc_to_gt(guc);
+	unsigned long flags;
+	bool disabled;
 
+	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
 	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id));
 	GEM_BUG_ON(ce != __get_context(guc, ce->guc_id));
 	GEM_BUG_ON(context_enabled(ce));
 
+	/* Seal race with Reset */
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	disabled = submission_disabled(guc);
+	if (likely(!disabled)) {
+		__intel_gt_pm_get(gt);
+		set_context_destroyed(ce);
+	}
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+	if (unlikely(disabled)) {
+		release_guc_id(guc, ce);
+		__guc_context_destroy(ce);
+		return;
+	}
+
 	clr_context_registered(ce);
 	deregister_context(ce, ce->guc_id, true);
 }
@@ -2384,12 +2413,52 @@ static void __guc_context_destroy(struct intel_context *ce)
 	}
 }
 
+static void guc_flush_destroyed_contexts(struct intel_guc *guc)
+{
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->destroy_lock, flags);
+	list_for_each_entry_safe(ce, cn,
+				 &guc->destroyed_contexts, guc_id_link) {
+		list_del_init(&ce->guc_id_link);
+		release_guc_id(guc, ce);
+		__guc_context_destroy(ce);
+	}
+	spin_unlock_irqrestore(&guc->destroy_lock, flags);
+}
+
+static void deregister_destroyed_contexts(struct intel_guc *guc)
+{
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->destroy_lock, flags);
+	list_for_each_entry_safe(ce, cn,
+				 &guc->destroyed_contexts, guc_id_link) {
+		list_del_init(&ce->guc_id_link);
+		spin_unlock_irqrestore(&guc->destroy_lock, flags);
+		guc_lrc_desc_unpin(ce);
+		spin_lock_irqsave(&guc->destroy_lock, flags);
+	}
+	spin_unlock_irqrestore(&guc->destroy_lock, flags);
+}
+
+static void destroy_worker_func(struct work_struct *w)
+{
+	struct intel_guc *guc =
+		container_of(w, struct intel_guc, destroy_worker);
+	struct intel_gt *gt = guc_to_gt(guc);
+	int tmp;
+
+	with_intel_gt_pm(gt, tmp)
+		deregister_destroyed_contexts(guc);
+}
+
 static void guc_context_destroy(struct kref *kref)
 {
 	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
-	struct intel_runtime_pm *runtime_pm = ce->engine->uncore->rpm;
 	struct intel_guc *guc = ce_to_guc(ce);
-	intel_wakeref_t wakeref;
 	unsigned long flags;
 	bool disabled;
 
@@ -2429,12 +2498,12 @@ static void guc_context_destroy(struct kref *kref)
 		list_del_init(&ce->guc_id_link);
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 
-	/* Seal race with Reset */
-	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	/* Seal race with reset */
+	spin_lock_irqsave(&guc->destroy_lock, flags);
 	disabled = submission_disabled(guc);
 	if (likely(!disabled))
-		set_context_destroyed(ce);
-	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		list_add_tail(&ce->guc_id_link, &guc->destroyed_contexts);
+	spin_unlock_irqrestore(&guc->destroy_lock, flags);
 	if (unlikely(disabled)) {
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
@@ -2442,20 +2511,11 @@ static void guc_context_destroy(struct kref *kref)
 	}
 
 	/*
-	 * We defer GuC context deregistration until the context is destroyed
-	 * in order to save on CTBs. With this optimization ideally we only need
-	 * 1 CTB to register the context during the first pin and 1 CTB to
-	 * deregister the context when the context is destroyed. Without this
-	 * optimization, a CTB would be needed every pin & unpin.
-	 *
-	 * XXX: Need to acqiure the runtime wakeref as this can be triggered
-	 * from context_free_worker when runtime wakeref is not held.
-	 * guc_lrc_desc_unpin requires the runtime as a GuC register is written
-	 * in H2G CTB to deregister the context. A future patch may defer this
-	 * H2G CTB if the runtime wakeref is zero.
+	 * We use a worker to issue the H2G to deregister the context as this
+	 * may take the GT PM for the first time, which isn't allowed from an
+	 * atomic context.
 	 */
-	with_intel_runtime_pm(runtime_pm, wakeref)
-		guc_lrc_desc_unpin(ce);
+	queue_work(system_unbound_wq, &guc->destroy_worker);
 }
 
 static int guc_context_alloc(struct intel_context *ce)
@@ -3472,6 +3532,7 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
 		intel_context_put(ce);
 	} else if (context_destroyed(ce)) {
 		/* Context has been destroyed */
+		intel_gt_pm_put_async(guc_to_gt(guc));
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
 	}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 09/46] drm/i915: Add GT PM unpark worker
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (7 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 08/46] drm/i915/guc: Take GT PM ref when deregistering context Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 10/46] drm/i915/guc: Take engine PM when a context is pinned with GuC submission Matthew Brost
                   ` (41 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Sometimes it is desirable to queue work up for later if the GT PM isn't
held, and to run that work on the next GT PM unpark.

Implemented with a list in the GT of all pending work items, a callback
to add a work item to that list, and a wakeref post_get callback that
drains the list and queues each pending work item once the GT is
unparked.

First user of this is deregistration of GuC contexts.
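
From a caller's point of view usage looks roughly like this (a sketch;
my_unpark_func is a placeholder name for the caller's work_func_t, the
GuC destroy_worker below is the real first user):

    static struct intel_gt_pm_unpark_work worker;

    static void my_unpark_func(struct work_struct *w)   /* placeholder */
    {
        /*
         * Runs from system_unbound_wq, queued either immediately or on
         * the next GT unpark; grab GT PM here only if still awake.
         */
    }

    ...
    intel_gt_pm_unpark_work_init(&worker, my_unpark_func);

    /* queue now if the GT is awake, otherwise defer to the next unpark */
    intel_gt_pm_unpark_work_add(gt, &worker);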

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  8 +++++
 .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.c | 35 +++++++++++++++++++
 .../gpu/drm/i915/gt/intel_gt_pm_unpark_work.h | 32 +++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |  3 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  3 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +++++---
 drivers/gpu/drm/i915/intel_wakeref.c          |  5 +++
 drivers/gpu/drm/i915/intel_wakeref.h          |  1 +
 9 files changed, 99 insertions(+), 5 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
 create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index a64aa43f7cd9..405558c08d6c 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -29,6 +29,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
 
 	spin_lock_init(&gt->irq_lock);
 
+	spin_lock_init(&gt->pm_unpark_work_lock);
+	INIT_LIST_HEAD(&gt->pm_unpark_work_list);
+
 	INIT_LIST_HEAD(&gt->closed_vma);
 	spin_lock_init(&gt->closed_lock);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index dea8e2479897..564c11a3748b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -90,6 +90,13 @@ static int __gt_unpark(struct intel_wakeref *wf)
 	return 0;
 }
 
+static void __gt_unpark_work_queue(struct intel_wakeref *wf)
+{
+	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
+
+	intel_gt_pm_unpark_work_queue(gt);
+}
+
 static int __gt_park(struct intel_wakeref *wf)
 {
 	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);
@@ -118,6 +125,7 @@ static int __gt_park(struct intel_wakeref *wf)
 
 static const struct intel_wakeref_ops wf_ops = {
 	.get = __gt_unpark,
+	.post_get = __gt_unpark_work_queue,
 	.put = __gt_park,
 };
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
new file mode 100644
index 000000000000..23162dbd0c35
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include "i915_drv.h"
+#include "intel_runtime_pm.h"
+#include "intel_gt_pm.h"
+
+void intel_gt_pm_unpark_work_queue(struct intel_gt *gt)
+{
+	struct intel_gt_pm_unpark_work *work, *next;
+	unsigned long flags;
+
+	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
+	list_for_each_entry_safe(work, next,
+				 &gt->pm_unpark_work_list, link) {
+		list_del_init(&work->link);
+		queue_work(system_unbound_wq, &work->worker);
+	}
+	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
+}
+
+void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
+				 struct intel_gt_pm_unpark_work *work)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&gt->pm_unpark_work_lock, flags);
+	if (intel_gt_pm_is_awake(gt))
+		queue_work(system_unbound_wq, &work->worker);
+	else if (list_empty(&work->link))
+		list_add_tail(&work->link, &gt->pm_unpark_work_list);
+	spin_unlock_irqrestore(&gt->pm_unpark_work_lock, flags);
+}
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
new file mode 100644
index 000000000000..08e9011be023
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm_unpark_work.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#ifndef INTEL_GT_PM_UNPARK_WORK_H
+#define INTEL_GT_PM_UNPARK_WORK_H
+
+#include <linux/list.h>
+#include <linux/workqueue.h>
+
+struct intel_gt;
+
+struct intel_gt_pm_unpark_work {
+	struct list_head link;
+	struct work_struct worker;
+};
+
+void intel_gt_pm_unpark_work_queue(struct intel_gt *gt);
+
+void intel_gt_pm_unpark_work_add(struct intel_gt *gt,
+				 struct intel_gt_pm_unpark_work *work);
+
+static inline void
+intel_gt_pm_unpark_work_init(struct intel_gt_pm_unpark_work *work,
+			     work_func_t fn)
+{
+	INIT_LIST_HEAD(&work->link);
+	INIT_WORK(&work->worker, fn);
+}
+
+#endif /* INTEL_GT_PM_UNPARK_WORK_H */
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 97a5075288d2..8d8a946561fa 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -91,6 +91,9 @@ struct intel_gt {
 	struct intel_wakeref wakeref;
 	atomic_t user_wakeref;
 
+	struct list_head pm_unpark_work_list;
+	spinlock_t pm_unpark_work_lock;	/* protect list */
+
 	struct list_head closed_vma;
 	spinlock_t closed_lock; /* guards the list of closed_vma */
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 72fdfa1f6ccd..aedd5a4281b8 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -18,6 +18,7 @@
 #include "intel_uc_fw.h"
 #include "i915_utils.h"
 #include "i915_vma.h"
+#include "gt/intel_gt_pm_unpark_work.h"
 
 struct __guc_ads_blob;
 
@@ -63,7 +64,7 @@ struct intel_guc {
 
 	spinlock_t destroy_lock;	/* protects list / worker */
 	struct list_head destroyed_contexts;
-	struct work_struct destroy_worker;
+	struct intel_gt_pm_unpark_work destroy_worker;
 
 	bool submission_supported;
 	bool submission_selected;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 262fa77b56e2..7fe4d1559a81 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1406,8 +1406,9 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	ida_init(&guc->guc_ids);
 
 	spin_lock_init(&guc->destroy_lock);
+
 	INIT_LIST_HEAD(&guc->destroyed_contexts);
-	INIT_WORK(&guc->destroy_worker, destroy_worker_func);
+	intel_gt_pm_unpark_work_init(&guc->destroy_worker, destroy_worker_func);
 
 	return 0;
 }
@@ -2446,13 +2447,18 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
 
 static void destroy_worker_func(struct work_struct *w)
 {
+	struct intel_gt_pm_unpark_work *destroy_worker =
+		container_of(w, struct intel_gt_pm_unpark_work, worker);
 	struct intel_guc *guc =
-		container_of(w, struct intel_guc, destroy_worker);
+		container_of(destroy_worker, struct intel_guc, destroy_worker);
 	struct intel_gt *gt = guc_to_gt(guc);
 	int tmp;
 
-	with_intel_gt_pm(gt, tmp)
+	with_intel_gt_pm_if_awake(gt, tmp)
 		deregister_destroyed_contexts(guc);
+
+	if (!list_empty(&guc->destroyed_contexts))
+		intel_gt_pm_unpark_work_add(gt, destroy_worker);
 }
 
 static void guc_context_destroy(struct kref *kref)
@@ -2515,7 +2521,7 @@ static void guc_context_destroy(struct kref *kref)
 	 * take the GT PM for the first time which isn't allowed from an atomic
 	 * context.
 	 */
-	queue_work(system_unbound_wq, &guc->destroy_worker);
+	intel_gt_pm_unpark_work_add(guc_to_gt(guc), &guc->destroy_worker);
 }
 
 static int guc_context_alloc(struct intel_context *ce)
diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
index dfd87d082218..282fc4f312e3 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.c
+++ b/drivers/gpu/drm/i915/intel_wakeref.c
@@ -24,6 +24,8 @@ static void rpm_put(struct intel_wakeref *wf)
 
 int __intel_wakeref_get_first(struct intel_wakeref *wf)
 {
+	bool do_post = false;
+
 	/*
 	 * Treat get/put as different subclasses, as we may need to run
 	 * the put callback from under the shrinker and do not want to
@@ -44,8 +46,11 @@ int __intel_wakeref_get_first(struct intel_wakeref *wf)
 		}
 
 		smp_mb__before_atomic(); /* release wf->count */
+		do_post = true;
 	}
 	atomic_inc(&wf->count);
+	if (do_post && wf->ops->post_get)
+		wf->ops->post_get(wf);
 	mutex_unlock(&wf->mutex);
 
 	INTEL_WAKEREF_BUG_ON(atomic_read(&wf->count) <= 0);
diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
index 545c8f277c46..ef7e6a698e8a 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.h
+++ b/drivers/gpu/drm/i915/intel_wakeref.h
@@ -30,6 +30,7 @@ typedef depot_stack_handle_t intel_wakeref_t;
 
 struct intel_wakeref_ops {
 	int (*get)(struct intel_wakeref *wf);
+	void (*post_get)(struct intel_wakeref *wf);
 	int (*put)(struct intel_wakeref *wf);
 };
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 10/46] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (8 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 09/46] drm/i915: Add GT PM unpark worker Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 14:23   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 11/46] drm/i915/guc: Don't call switch_to_kernel_context " Matthew Brost
                   ` (40 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Take an engine PM reference to prevent intel_gt_wait_for_idle from short
circuiting while scheduling of a user context could still be enabled.
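
In short, the GuC backend now brackets every non-barrier context pin
with an engine PM reference (condensed from the pin / unpin hooks in the
patch below; barrier contexts are skipped, presumably to avoid keeping
the engine awake via its own kernel context):

    /* pin: take the engine PM ref once the context is pinned */
    ret = __guc_context_pin(ce, ce->engine, vaddr);
    if (!ret && !intel_context_is_barrier(ce))
        intel_engine_pm_get(ce->engine);

    /* unpin: drop it again after lrc_unpin() */
    if (!intel_context_is_barrier(ce))
        intel_engine_pm_put(ce->engine);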

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/Makefile                 |  1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 903de270f2db..5e3a1e2095b0 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -103,6 +103,7 @@ gt-y += \
 	gt/intel_gt_clock_utils.o \
 	gt/intel_gt_irq.o \
 	gt/intel_gt_pm.o \
+	gt/intel_gt_pm_unpark_work.o \
 	gt/intel_gt_pm_irq.o \
 	gt/intel_gt_requests.o \
 	gt/intel_gtt.o \
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 7fe4d1559a81..c5d9548bfd00 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2056,7 +2056,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
 
 static int guc_context_pin(struct intel_context *ce, void *vaddr)
 {
-	return __guc_context_pin(ce, ce->engine, vaddr);
+	int ret = __guc_context_pin(ce, ce->engine, vaddr);
+
+	if (likely(!ret && !intel_context_is_barrier(ce)))
+		intel_engine_pm_get(ce->engine);
+
+	return ret;
 }
 
 static void guc_context_unpin(struct intel_context *ce)
@@ -2067,6 +2072,9 @@ static void guc_context_unpin(struct intel_context *ce)
 
 	unpin_guc_id(guc, ce, true);
 	lrc_unpin(ce);
+
+	if (likely(!intel_context_is_barrier(ce)))
+		intel_engine_pm_put(ce->engine);
 }
 
 static void guc_context_post_unpin(struct intel_context *ce)
@@ -3002,8 +3010,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
 static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
+	int ret = __guc_context_pin(ce, engine, vaddr);
+	intel_engine_mask_t tmp, mask = ce->engine->mask;
+
+	if (likely(!ret))
+		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
+			intel_engine_pm_get(engine);
 
-	return __guc_context_pin(ce, engine, vaddr);
+	return ret;
+}
+
+static void guc_virtual_context_unpin(struct intel_context *ce)
+{
+	intel_engine_mask_t tmp, mask = ce->engine->mask;
+	struct intel_engine_cs *engine;
+	struct intel_guc *guc = ce_to_guc(ce);
+
+	GEM_BUG_ON(context_enabled(ce));
+	GEM_BUG_ON(intel_context_is_barrier(ce));
+
+	unpin_guc_id(guc, ce, true);
+	lrc_unpin(ce);
+
+	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
+		intel_engine_pm_put(engine);
 }
 
 static void guc_virtual_context_enter(struct intel_context *ce)
@@ -3040,7 +3070,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 
 	.pre_pin = guc_virtual_context_pre_pin,
 	.pin = guc_virtual_context_pin,
-	.unpin = guc_context_unpin,
+	.unpin = guc_virtual_context_unpin,
 	.post_unpin = guc_context_post_unpin,
 
 	.ban = guc_context_ban,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 11/46] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (9 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 10/46] drm/i915/guc: Take engine PM when a context is pinned with GuC submission Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 14:27   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 12/46] drm/i915/guc: Selftest for GuC flow control Matthew Brost
                   ` (39 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Calling switch_to_kernel_context isn't needed if an engine PM reference
is held whenever a context is pinned. By not calling
switch_to_kernel_context we avoid issuing a request to the engine.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_pm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index 1f07ac4e0672..58099de6bf07 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -162,6 +162,10 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
 	unsigned long flags;
 	bool result = true;
 
+	/* No need to switch_to_kernel_context if GuC submission */
+	if (intel_engine_uses_guc(engine))
+		return true;
+
 	/* GPU is pointing to the void, as good as in the kernel context. */
 	if (intel_gt_is_wedged(engine->gt))
 		return true;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 12/46] drm/i915/guc: Selftest for GuC flow control
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (10 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 11/46] drm/i915/guc: Don't call switch_to_kernel_context " Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 13/46] drm/i915: Add logical engine mapping Matthew Brost
                   ` (38 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Add 5 selftests for flow control conditions that are hard to recreate
from user space. The tests are listed below:

1. A test to verify that the number of guc_ids can be exhausted and all
submissions still complete.

2. A test to verify that the flow control state machine can recover from
a full GPU reset.

3. A test to verify that the lrcd registration slots can be exhausted
and all submissions still complete.

4. A test to verify that the H2G channel can deadlock and a full GPU
reset recovers the system.

5. A test to stress the CTB channel by submitting to lots of contexts
and then immediately destroying the contexts.

Tests 1, 2, and 3 also ensure that when flow control is triggered by
unready requests, those unready requests do not DoS ready requests.
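
The "no DoS" check amounts to submitting a fresh nop request while the
flood of unready requests is still blocked behind a spinner and making
sure it completes via the flow control path (condensed from the
nop_request_wait() call in the test below):

    /* flooded contexts are still blocked behind the spinner here */
    ret = nop_request_wait(engine, false /* user ctx */, true /* flow control */);
    if (ret < 0)
        pr_err("User NOP request DoS: %d\n", ret);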

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   6 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  43 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h     |   9 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  16 +
 .../i915/gt/uc/intel_guc_submission_types.h   |   2 +
 .../i915/gt/uc/selftest_guc_flow_control.c    | 581 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 .../i915/selftests/intel_scheduler_helpers.c  |  12 +
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 9 files changed, 662 insertions(+), 10 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index aedd5a4281b8..c0c60ccabfa4 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -103,6 +103,12 @@ struct intel_guc {
 
 	/* To serialize the intel_guc_send actions */
 	struct mutex send_mutex;
+
+	I915_SELFTEST_DECLARE(bool gse_hang_expected;)
+	I915_SELFTEST_DECLARE(bool deadlock_expected;)
+	I915_SELFTEST_DECLARE(bool bad_desc_expected;)
+	I915_SELFTEST_DECLARE(bool inject_bad_sched_disable;)
+	I915_SELFTEST_DECLARE(bool inject_corrupt_h2g;)
 };
 
 static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 22b4733b55e2..ab1ce8901c15 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -3,7 +3,6 @@
  * Copyright © 2016-2019 Intel Corporation
  */
 
-#include <linux/circ_buf.h>
 #include <linux/ktime.h>
 #include <linux/time64.h>
 #include <linux/timekeeping.h>
@@ -414,8 +413,9 @@ static int ct_write(struct intel_guc_ct *ct,
 	u32 *cmds = ctb->cmds;
 	unsigned int i;
 
-	if (unlikely(desc->status))
-		goto corrupted;
+	if (!I915_SELFTEST_ONLY(ct_to_guc(ct)->deadlock_expected))
+		if (unlikely(desc->status))
+			goto corrupted;
 
 	GEM_BUG_ON(tail > size);
 
@@ -443,6 +443,15 @@ static int ct_write(struct intel_guc_ct *ct,
 		 FIELD_PREP(GUC_CTB_MSG_0_NUM_DWORDS, len) |
 		 FIELD_PREP(GUC_CTB_MSG_0_FENCE, fence);
 
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+	if (ct_to_guc(ct)->inject_corrupt_h2g) {
+		header = FIELD_PREP(GUC_CTB_MSG_0_FORMAT, 3) |
+			 FIELD_PREP(GUC_CTB_MSG_0_NUM_DWORDS, len + 5) |
+			 FIELD_PREP(GUC_CTB_MSG_0_FENCE, 0xdead);
+		ct_to_guc(ct)->inject_corrupt_h2g = false;
+	}
+#endif
+
 	type = (flags & INTEL_GUC_CT_SEND_NB) ? GUC_HXG_TYPE_EVENT :
 		GUC_HXG_TYPE_REQUEST;
 	hxg = FIELD_PREP(GUC_HXG_MSG_0_TYPE, type) |
@@ -481,8 +490,12 @@ static int ct_write(struct intel_guc_ct *ct,
 	return 0;
 
 corrupted:
-	CT_ERROR(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n",
-		 desc->head, desc->tail, desc->status);
+	if (I915_SELFTEST_ONLY(ct_to_guc(ct)->bad_desc_expected))
+		CT_DEBUG(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n",
+			 desc->head, desc->tail, desc->status);
+	else
+		CT_ERROR(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n",
+			 desc->head, desc->tail, desc->status);
 	ctb->broken = true;
 	return -EPIPE;
 }
@@ -539,9 +552,18 @@ static inline bool ct_deadlocked(struct intel_guc_ct *ct)
 		struct guc_ct_buffer_desc *send = ct->ctbs.send.desc;
 		struct guc_ct_buffer_desc *recv = ct->ctbs.send.desc;
 
-		CT_ERROR(ct, "Communication stalled for %lld ms, desc status=%#x,%#x\n",
-			 ktime_ms_delta(ktime_get(), ct->stall_time),
-			 send->status, recv->status);
+		/*
+		 * CI doesn't like error messages, demote to debug if deadlock was
+		 * intentionally hit.
+		 */
+		if (I915_SELFTEST_ONLY(ct_to_guc(ct)->deadlock_expected))
+			CT_DEBUG(ct, "Communication stalled for %lld ms, desc status=%#x,%#x\n",
+				 ktime_ms_delta(ktime_get(), ct->stall_time),
+				 send->status, recv->status);
+		else
+			CT_ERROR(ct, "Communication stalled for %lld ms, desc status=%#x,%#x\n",
+				 ktime_ms_delta(ktime_get(), ct->stall_time),
+				 send->status, recv->status);
 		ct->ctbs.send.broken = true;
 	}
 
@@ -767,8 +789,9 @@ int intel_guc_ct_send(struct intel_guc_ct *ct, const u32 *action, u32 len,
 		return -ENODEV;
 	}
 
-	if (unlikely(ct->ctbs.send.broken))
-		return -EPIPE;
+	if (!I915_SELFTEST_ONLY(ct_to_guc(ct)->deadlock_expected))
+		if (unlikely(ct->ctbs.send.broken))
+			return -EPIPE;
 
 	if (flags & INTEL_GUC_CT_SEND_NB)
 		return ct_send_nb(ct, action, len, flags);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h
index f709a19c7e21..5963eda95022 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h
@@ -6,6 +6,7 @@
 #ifndef _INTEL_GUC_CT_H_
 #define _INTEL_GUC_CT_H_
 
+#include <linux/circ_buf.h>
 #include <linux/interrupt.h>
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
@@ -117,4 +118,12 @@ void intel_guc_ct_event_handler(struct intel_guc_ct *ct);
 
 void intel_guc_ct_print_info(struct intel_guc_ct *ct, struct drm_printer *p);
 
+static inline bool intel_guc_ct_is_recv_buffer_empty(struct intel_guc_ct *ct)
+{
+	struct intel_guc_ct_buffer *ctb = &ct->ctbs.recv;
+
+	return atomic_read(&ctb->space) ==
+		(CIRC_SPACE(0, 0, ctb->size) - ctb->resv_space);
+}
+
 #endif /* _INTEL_GUC_CT_H_ */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index c5d9548bfd00..310116f40509 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -824,6 +824,7 @@ static int gse_dequeue_one_context(struct guc_submit_engine *gse)
 			GEM_WARN_ON(ret);	/* Unexpected */
 			goto deadlk;
 		}
+		I915_SELFTEST_DECLARE(++gse->tasklets_submit_count;)
 	}
 
 	/*
@@ -2107,7 +2108,15 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
 		GUC_CONTEXT_DISABLE
 	};
 
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+	if (guc->inject_bad_sched_disable &&
+	    guc_id == GUC_INVALID_LRC_ID)
+		guc->inject_bad_sched_disable = false;
+	else
+		GEM_BUG_ON(guc_id == GUC_INVALID_LRC_ID);
+#else
 	GEM_BUG_ON(guc_id == GUC_INVALID_LRC_ID);
+#endif
 
 	trace_intel_context_sched_disable(ce);
 
@@ -2770,6 +2779,9 @@ static void retire_worker_sched_disable(struct guc_submit_engine *gse,
 		guc_id = prep_context_pending_disable(ce);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
+		if (I915_SELFTEST_ONLY(guc->inject_bad_sched_disable))
+			guc_id = GUC_INVALID_LRC_ID;
+
 		with_intel_runtime_pm(runtime_pm, wakeref)
 			__guc_context_sched_disable(guc, ce, guc_id);
 
@@ -4021,3 +4033,7 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 
 	return false;
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftest_guc_flow_control.c"
+#endif
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
index 0c224ab18c02..7069b7248f55 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
@@ -47,6 +47,8 @@ struct guc_submit_engine {
 		STALL_MOVE_LRC_TAIL,
 		STALL_ADD_REQUEST,
 	} submission_stall_reason;
+
+	I915_SELFTEST_DECLARE(u64 tasklets_submit_count;)
 };
 
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
new file mode 100644
index 000000000000..f31ab2674b2b
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
@@ -0,0 +1,581 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include "selftests/igt_spinner.h"
+#include "selftests/igt_reset.h"
+#include "selftests/intel_scheduler_helpers.h"
+#include "gt/intel_engine_heartbeat.h"
+#include "gem/selftests/mock_context.h"
+
+static int __request_add_spin(struct i915_request *rq, struct igt_spinner *spin)
+{
+	int err = 0;
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+	if (spin && !igt_wait_for_spinner(spin, rq))
+		err = -ETIMEDOUT;
+
+	return err;
+}
+
+static struct i915_request *nop_kernel_request(struct intel_engine_cs *engine)
+{
+	struct i915_request *rq;
+
+	rq = intel_engine_create_kernel_request(engine);
+	if (IS_ERR(rq))
+		return rq;
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static struct i915_request *nop_user_request(struct intel_context *ce,
+					     struct i915_request *from)
+{
+	struct i915_request *rq;
+	int ret;
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	if (from) {
+		ret = i915_sw_fence_await_dma_fence(&rq->submit,
+						    &from->fence, 0,
+						    I915_FENCE_GFP);
+		if (ret < 0) {
+			i915_request_put(rq);
+			return ERR_PTR(ret);
+		}
+	}
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	return rq;
+}
+
+static int nop_request_wait(struct intel_engine_cs *engine, bool kernel,
+			    bool flow_control)
+{
+	struct i915_gpu_error *global = &engine->gt->i915->gpu_error;
+	unsigned int reset_count = i915_reset_count(global);
+	struct intel_guc *guc = &engine->gt->uc.guc;
+	struct guc_submit_engine *gse = guc->gse[GUC_SUBMIT_ENGINE_SINGLE_LRC];
+	u64 tasklets_submit_count = gse->tasklets_submit_count;
+	struct intel_context *ce;
+	struct i915_request *nop;
+	int ret;
+
+	if (kernel) {
+		nop = nop_kernel_request(engine);
+	} else {
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce))
+			return PTR_ERR(ce);
+		nop = nop_user_request(ce, NULL);
+		intel_context_put(ce);
+	}
+	if (IS_ERR(nop))
+		return PTR_ERR(nop);
+
+	ret = intel_selftest_wait_for_rq(nop);
+	i915_request_put(nop);
+	if (ret)
+		return ret;
+
+	if (!flow_control &&
+	    gse->tasklets_submit_count != tasklets_submit_count) {
+		pr_err("Flow control for single-lrc unexpectedly kicked in\n");
+		ret = -EINVAL;
+	}
+
+	if (flow_control &&
+	    gse->tasklets_submit_count == tasklets_submit_count) {
+		pr_err("Flow control for single-lrc did not kick in\n");
+		ret = -EINVAL;
+	}
+
+	if (i915_reset_count(global) != reset_count) {
+		pr_err("Unexpected GPU reset during single-lrc submit\n");
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+
+#define NUM_GUC_ID		256
+#define NUM_CONTEXT		1024
+#define NUM_RQ_PER_CONTEXT	2
+#define HEARTBEAT_INTERVAL	1500
+
+static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang)
+{
+	struct intel_gt *gt = arg;
+	struct intel_guc *guc = &gt->uc.guc;
+	struct guc_submit_engine *gse = guc->gse[GUC_SUBMIT_ENGINE_SINGLE_LRC];
+	struct intel_context **contexts;
+	int ret = 0;
+	int i, j, k;
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *spin_rq = NULL, *last = NULL;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct i915_gpu_error *global = &gt->i915->gpu_error;
+	unsigned int reset_count;
+	u64 tasklets_submit_count = gse->tasklets_submit_count;
+	u32 old_beat;
+
+	contexts = kmalloc(sizeof(*contexts) * NUM_CONTEXT, GFP_KERNEL);
+	if (!contexts) {
+		pr_err("Context array allocation failed\n");
+		return -ENOMEM;
+	}
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+
+	if (limit_guc_ids)
+		guc->num_guc_ids = NUM_GUC_ID;
+
+	ce = intel_context_create(intel_selftest_find_any_engine(gt));
+	if (IS_ERR(ce)) {
+		ret = PTR_ERR(ce);
+		pr_err("Failed to create context: %d\n", ret);
+		goto err;
+	}
+
+	reset_count = i915_reset_count(global);
+	engine = ce->engine;
+
+	old_beat = engine->props.heartbeat_interval_ms;
+	if (hang) {
+		ret = intel_engine_set_heartbeat(engine, HEARTBEAT_INTERVAL);
+		if (ret) {
+			pr_err("Failed to boost heartbeat interval: %d\n", ret);
+			goto err;
+		}
+	}
+
+	/* Create spinner to block requests in below loop */
+	ret = igt_spinner_init(&spin, engine->gt);
+	if (ret) {
+		pr_err("Failed to create spinner: %d\n", ret);
+		goto err_heartbeat;
+	}
+	spin_rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	intel_context_put(ce);
+	if (IS_ERR(spin_rq)) {
+		ret = PTR_ERR(spin_rq);
+		pr_err("Failed to create spinner request: %d\n", ret);
+		goto err_heartbeat;
+	}
+	ret = __request_add_spin(spin_rq, &spin);
+	if (ret) {
+		pr_err("Failed to add Spinner request: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	/*
+	 * Create a lot of requests in a loop to trigger the flow control state
+	 * machine. Using a three level loop as it is interesting to hit flow
+	 * control with more than 1 request on each context in a row and also
+	 * interleave requests with other contexts.
+	 */
+	for (i = 0; i < NUM_RQ_PER_CONTEXT; ++i) {
+		for (j = 0; j < NUM_CONTEXT; ++j) {
+			for (k = 0; k < NUM_RQ_PER_CONTEXT; ++k) {
+				bool first_pass = !i && !k;
+
+				if (last)
+					i915_request_put(last);
+				last = NULL;
+
+				if (first_pass)
+					contexts[j] = intel_context_create(engine);
+				ce = contexts[j];
+
+				if (IS_ERR(ce)) {
+					ret = PTR_ERR(ce);
+					pr_err("Failed to create context, %d,%d,%d: %d\n",
+					       i, j, k, ret);
+					goto err_spin_rq;
+				}
+
+				last = nop_user_request(ce, spin_rq);
+				if (first_pass)
+					intel_context_put(ce);
+				if (IS_ERR(last)) {
+					ret = PTR_ERR(last);
+					pr_err("Failed to create request, %d,%d,%d: %d\n",
+					       i, j, k, ret);
+					goto err_spin_rq;
+				}
+			}
+		}
+	}
+
+	/* Verify GuC submit engine state */
+	if (limit_guc_ids && !guc_ids_exhausted(gse)) {
+		pr_err("guc_ids not exhausted\n");
+		ret = -EINVAL;
+		goto err_spin_rq;
+	}
+	if (!limit_guc_ids && guc_ids_exhausted(gse)) {
+		pr_err("guc_ids exhausted\n");
+		ret = -EINVAL;
+		goto err_spin_rq;
+	}
+
+	/* Ensure no DoS from unready requests */
+	ret = nop_request_wait(engine, false, true);
+	if (ret < 0) {
+		pr_err("User NOP request DoS: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	/* Inject hang in flow control state machine */
+	if (hang) {
+		guc->gse_hang_expected = true;
+		guc->inject_bad_sched_disable = true;
+	}
+
+	/* Release blocked requests */
+	igt_spinner_end(&spin);
+	ret = intel_selftest_wait_for_rq(spin_rq);
+	if (ret) {
+		pr_err("Spin request failed to complete: %d\n", ret);
+		goto err_spin_rq;
+	}
+	i915_request_put(spin_rq);
+	igt_spinner_fini(&spin);
+	spin_rq = NULL;
+
+	/* Wait for last request / GT to idle */
+	ret = i915_request_wait(last, 0, hang ? HZ * 30 : HZ * 10);
+	if (ret < 0) {
+		pr_err("Last request failed to complete: %d\n", ret);
+		goto err_spin_rq;
+	}
+	i915_request_put(last);
+	last = NULL;
+	ret = intel_gt_wait_for_idle(gt, HZ * 5);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	/* Check state after idle */
+	if (guc_ids_exhausted(gse)) {
+		pr_err("guc_ids exhausted after last request signaled\n");
+		ret = -EINVAL;
+		goto err_spin_rq;
+	}
+	if (hang) {
+		if (i915_reset_count(global) == reset_count) {
+			pr_err("Failed to record a GPU reset\n");
+			ret = -EINVAL;
+			goto err_spin_rq;
+		}
+	} else {
+		if (i915_reset_count(global) != reset_count) {
+			pr_err("Unexpected GPU reset\n");
+			ret = -EINVAL;
+			goto err_spin_rq;
+		}
+		if (gse->tasklets_submit_count == tasklets_submit_count) {
+			pr_err("Flow control failed to kick in\n");
+			ret = -EINVAL;
+			goto err_spin_rq;
+		}
+	}
+
+	/* Verify requests can be submitted after flow control */
+	ret = nop_request_wait(engine, true, false);
+	if (ret < 0) {
+		pr_err("Kernel NOP failed to complete: %d\n", ret);
+		goto err_spin_rq;
+	}
+	ret = nop_request_wait(engine, false, false);
+	if (ret < 0) {
+		pr_err("User NOP failed to complete: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+err_spin_rq:
+	if (spin_rq) {
+		igt_spinner_end(&spin);
+		intel_selftest_wait_for_rq(spin_rq);
+		i915_request_put(spin_rq);
+		igt_spinner_fini(&spin);
+		intel_gt_wait_for_idle(gt, HZ * 5);
+	}
+err_heartbeat:
+	if (last)
+		i915_request_put(last);
+	intel_engine_set_heartbeat(engine, old_beat);
+err:
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+	guc->num_guc_ids = guc->max_guc_ids;
+	guc->gse_hang_expected = false;
+	guc->inject_bad_sched_disable = false;
+	kfree(contexts);
+
+	return ret;
+}
+
+static int intel_guc_flow_control_guc_ids(void *arg)
+{
+	return __intel_guc_flow_control_guc(arg, true, false);
+}
+
+static int intel_guc_flow_control_lrcd_reg(void *arg)
+{
+	return __intel_guc_flow_control_guc(arg, false, false);
+}
+
+static int intel_guc_flow_control_hang_state_machine(void *arg)
+{
+	return __intel_guc_flow_control_guc(arg, true, true);
+}
+
+#define NUM_RQ_STRESS_CTBS	0x4000
+static int intel_guc_flow_control_stress_ctbs(void *arg)
+{
+	struct intel_gt *gt = arg;
+	int ret = 0;
+	int i;
+	struct intel_context *ce;
+	struct i915_request *last = NULL, *rq;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct i915_gpu_error *global = &gt->i915->gpu_error;
+	unsigned int reset_count;
+	struct intel_guc *guc = &gt->uc.guc;
+	struct intel_guc_ct_buffer *ctb = &guc->ct.ctbs.recv;
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+
+	reset_count = i915_reset_count(global);
+	engine = intel_selftest_find_any_engine(gt);
+
+	/*
+	 * Create a bunch of requests, and then idle the GT which will create a
+	 * lot of H2G / G2H traffic.
+	 */
+	for (i = 0; i < NUM_RQ_STRESS_CTBS; ++i) {
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce)) {
+			ret = PTR_ERR(ce);
+			pr_err("Failed to create context, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		rq = nop_user_request(ce, NULL);
+		intel_context_put(ce);
+
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			pr_err("Failed to create request, %d: %d\n", i, ret);
+			goto err;
+		}
+
+		if (last)
+			i915_request_put(last);
+		last = rq;
+	}
+
+	ret = i915_request_wait(last, 0, HZ * 10);
+	if (ret < 0) {
+		pr_err("Last request failed to complete: %d\n", ret);
+		goto err;
+	}
+	i915_request_put(last);
+	last = NULL;
+
+	ret = intel_gt_wait_for_idle(gt, HZ * 10);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err;
+	}
+
+	if (i915_reset_count(global) != reset_count) {
+		pr_err("Unexpected GPU reset\n");
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ret = nop_request_wait(engine, true, false);
+	if (ret < 0) {
+		pr_err("Kernel NOP failed to complete: %d\n", ret);
+		goto err;
+	}
+
+	ret = nop_request_wait(engine, false, false);
+	if (ret < 0) {
+		pr_err("User NOP failed to complete: %d\n", ret);
+		goto err;
+	}
+
+	ret = intel_gt_wait_for_idle(gt, HZ);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err;
+	}
+
+	ret = wait_for(intel_guc_ct_is_recv_buffer_empty(&guc->ct), HZ);
+	if (ret) {
+		pr_err("Recv CTB not expected value=%d,%d outstanding_ctb=%d\n",
+		       atomic_read(&ctb->space),
+		       CIRC_SPACE(0, 0, ctb->size) - ctb->resv_space,
+		       atomic_read(&guc->outstanding_submission_g2h));
+		ret = -EINVAL;
+		goto err;
+	}
+
+err:
+	if (last)
+		i915_request_put(last);
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+
+	return ret;
+}
+
+#define NUM_RQ_DEADLOCK		2048
+static int __intel_guc_flow_control_deadlock_h2g(void *arg, bool bad_desc)
+{
+	struct intel_gt *gt = arg;
+	struct intel_guc *guc = &gt->uc.guc;
+	int ret = 0;
+	int i;
+	struct intel_context *ce;
+	struct i915_request *last = NULL, *rq;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct i915_gpu_error *global = &gt->i915->gpu_error;
+	unsigned int reset_count;
+	u32 old_beat;
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+
+	reset_count = i915_reset_count(global);
+	engine = intel_selftest_find_any_engine(gt);
+
+	old_beat = engine->props.heartbeat_interval_ms;
+	ret = intel_engine_set_heartbeat(engine, HEARTBEAT_INTERVAL);
+	if (ret) {
+		pr_err("Failed to boost heartbeat interval: %d\n", ret);
+		goto err;
+	}
+
+	guc->inject_corrupt_h2g = true;
+	if (bad_desc)
+		guc->bad_desc_expected = true;
+	else
+		guc->deadlock_expected = true;
+
+	for (i = 0; i < NUM_RQ_DEADLOCK; ++i) {
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce)) {
+			ret = PTR_ERR(ce);
+			pr_err("Failed to create context, %d: %d\n", i, ret);
+			goto err_heartbeat;
+		}
+
+		rq = nop_user_request(ce, NULL);
+		intel_context_put(ce);
+
+		if (IS_ERR(rq)) {
+			ret = PTR_ERR(rq);
+			pr_err("Failed to create request, %d: %d\n", i, ret);
+			goto err_heartbeat;
+		}
+
+		if (last)
+			i915_request_put(last);
+		last = rq;
+	}
+
+	pr_debug("Number of requests before deadlock: %d\n", i);
+
+	ret = i915_request_wait(last, 0, HZ * 5);
+	if (ret < 0) {
+		pr_err("Last request failed to complete: %d\n", ret);
+		goto err_heartbeat;
+	}
+	i915_request_put(last);
+	last = NULL;
+
+	ret = intel_gt_wait_for_idle(gt, HZ * 10);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err_heartbeat;
+	}
+
+	if (i915_reset_count(global) == reset_count) {
+		pr_err("Failed to record a GPU reset\n");
+		ret = -EINVAL;
+		goto err_heartbeat;
+	}
+
+	ret = nop_request_wait(engine, true, false);
+	if (ret < 0) {
+		pr_err("Kernel NOP failed to complete: %d\n", ret);
+		goto err_heartbeat;
+	}
+
+	ret = nop_request_wait(engine, false, false);
+	if (ret < 0) {
+		pr_err("User NOP failed to complete: %d\n", ret);
+		goto err_heartbeat;
+	}
+
+err_heartbeat:
+	if (last)
+		i915_request_put(last);
+	intel_engine_set_heartbeat(engine, old_beat);
+err:
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+	guc->inject_corrupt_h2g = false;
+	guc->deadlock_expected = false;
+	guc->bad_desc_expected = false;
+
+	return ret;
+}
+
+static int intel_guc_flow_control_deadlock_h2g(void *arg)
+{
+	return __intel_guc_flow_control_deadlock_h2g(arg, false);
+}
+
+static int intel_guc_flow_control_bad_desc_h2g(void *arg)
+{
+	return __intel_guc_flow_control_deadlock_h2g(arg, true);
+}
+
+int intel_guc_flow_control(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_flow_control_stress_ctbs),
+		SUBTEST(intel_guc_flow_control_guc_ids),
+		SUBTEST(intel_guc_flow_control_lrcd_reg),
+		SUBTEST(intel_guc_flow_control_hang_state_machine),
+		SUBTEST(intel_guc_flow_control_deadlock_h2g),
+		SUBTEST(intel_guc_flow_control_bad_desc_h2g),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index e2fd1b61af71..d9bd732b741a 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -47,5 +47,6 @@ selftest(hangcheck, intel_hangcheck_live_selftests)
 selftest(execlists, intel_execlists_live_selftests)
 selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
+selftest(guc_flow_control, intel_guc_flow_control)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
index 4b328346b48a..310fb83c527e 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
@@ -14,6 +14,18 @@
 #define REDUCED_PREEMPT		10
 #define WAIT_FOR_RESET_TIME	10000
 
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, gt, id)
+		return engine;
+
+	pr_err("No valid engine found!\n");
+	return NULL;
+}
+
 int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 				 struct intel_selftest_saved_policy *saved,
 				 u32 modify_type)
diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
index 35c098601ac0..6c776345f75c 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.h
@@ -10,6 +10,7 @@
 
 struct i915_request;
 struct intel_engine_cs;
+struct intel_gt;
 
 struct intel_selftest_saved_policy {
 	u32 flags;
@@ -29,5 +30,6 @@ int intel_selftest_modify_policy(struct intel_engine_cs *engine,
 int intel_selftest_restore_policy(struct intel_engine_cs *engine,
 				  struct intel_selftest_saved_policy *saved);
 int intel_selftest_wait_for_rq(struct i915_request *rq);
+struct intel_engine_cs *intel_selftest_find_any_engine(struct intel_gt *gt);
 
 #endif
-- 
2.28.0



* [Intel-gfx] [PATCH 13/46] drm/i915: Add logical engine mapping
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (11 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 12/46] drm/i915/guc: Selftest for GuC flow control Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 14:28   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 14/46] drm/i915: Expose logical engine instance to user Matthew Brost
                   ` (37 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Add logical engine mapping. This is required for split-frame workloads,
as these need to be placed on engines in a logically contiguous manner.
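For illustration (not part of the patch): with the default identity map,
populate_logical_ids() below simply compacts the instances of a class
that survive fusing into contiguous logical IDs, in instance order. A
condensed sketch of that effect, where instance_present() is a
hypothetical helper standing in for the HAS_ENGINE() / class checks:

	u8 next_logical_id = 0;

	/* instance:  0  1  2  3  4  5  6  7   (1/3/5/7 fused off) */
	/* logical:   0  -  1  -  2  -  3  -                       */
	for (instance = 0; instance <= MAX_ENGINE_INSTANCE; ++instance)
		if (instance_present(gt, class, instance))
			logical_ids[instance] = next_logical_id++;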

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  1 +
 .../drm/i915/gt/intel_execlists_submission.c  |  1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
 5 files changed, 56 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 0d9105a31d84..4d790f9a65dd 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
 	GEM_DEBUG_WARN_ON(iir);
 }
 
-static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
+static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
+			      u8 logical_instance)
 {
 	const struct engine_info *info = &intel_engines[id];
 	struct drm_i915_private *i915 = gt->i915;
@@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
 
 	engine->class = info->class;
 	engine->instance = info->instance;
+	engine->logical_mask = BIT(logical_instance);
 	__sprint_engine_name(engine);
 
 	engine->props.heartbeat_interval_ms =
@@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
 	return info->engine_mask;
 }
 
+static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
+				 u8 class, const u8 *map, u8 num_instances)
+{
+	int i, j;
+	u8 current_logical_id = 0;
+
+	for (j = 0; j < num_instances; ++j) {
+		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
+			if (!HAS_ENGINE(gt, i) ||
+			    intel_engines[i].class != class)
+				continue;
+
+			if (intel_engines[i].instance == map[j]) {
+				logical_ids[intel_engines[i].instance] =
+					current_logical_id++;
+				break;
+			}
+		}
+	}
+}
+
+static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
+{
+	int i;
+	u8 map[MAX_ENGINE_INSTANCE + 1];
+
+	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
+		map[i] = i;
+	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
+}
+
 /**
  * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
  * @gt: pointer to struct intel_gt
@@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
 	struct drm_i915_private *i915 = gt->i915;
 	const unsigned int engine_mask = init_engine_mask(gt);
 	unsigned int mask = 0;
-	unsigned int i;
+	unsigned int i, class;
+	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
 	int err;
 
 	drm_WARN_ON(&i915->drm, engine_mask == 0);
@@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
 	if (i915_inject_probe_failure(i915))
 		return -ENODEV;
 
-	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
-		if (!HAS_ENGINE(gt, i))
-			continue;
+	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
+		setup_logical_ids(gt, logical_ids, class);
 
-		err = intel_engine_setup(gt, i);
-		if (err)
-			goto cleanup;
+		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
+			u8 instance = intel_engines[i].instance;
+
+			if (intel_engines[i].class != class ||
+			    !HAS_ENGINE(gt, i))
+				continue;
 
-		mask |= BIT(i);
+			err = intel_engine_setup(gt, i,
+						 logical_ids[instance]);
+			if (err)
+				goto cleanup;
+
+			mask |= BIT(i);
+		}
 	}
 
 	/*
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index ed91bcff20eb..85e5c9a9e502 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -266,6 +266,7 @@ struct intel_engine_cs {
 	unsigned int guc_id;
 
 	intel_engine_mask_t mask;
+	intel_engine_mask_t logical_mask;
 
 	u8 class;
 	u8 instance;
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index de5f9c86b9a4..baa1797af1c8 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -3879,6 +3879,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 
 		ve->siblings[ve->num_siblings++] = sibling;
 		ve->base.mask |= sibling->mask;
+		ve->base.logical_mask |= sibling->logical_mask;
 
 		/*
 		 * All physical engines must be compatible for their emission
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 6926919bcac6..9f5f43a16182 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
 	for_each_engine(engine, gt, id) {
 		u8 guc_class = engine_class_to_guc_class(engine->class);
 
-		system_info->mapping_table[guc_class][engine->instance] =
+		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
 			engine->instance;
 	}
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 310116f40509..dec757d319a2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1795,23 +1795,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
 	return __guc_action_deregister_context(guc, guc_id, loop);
 }
 
-static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
-{
-	switch (class) {
-	case RENDER_CLASS:
-		return mask >> RCS0;
-	case VIDEO_ENHANCEMENT_CLASS:
-		return mask >> VECS0;
-	case VIDEO_DECODE_CLASS:
-		return mask >> VCS0;
-	case COPY_ENGINE_CLASS:
-		return mask >> BCS0;
-	default:
-		MISSING_CASE(class);
-		return 0;
-	}
-}
-
 static void guc_context_policy_init(struct intel_engine_cs *engine,
 				    struct guc_lrc_desc *desc)
 {
@@ -1952,8 +1935,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 
 	desc = __get_lrc_desc(guc, ce->guc_lrcd_reg_idx);
 	desc->engine_class = engine_class_to_guc_class(engine->class);
-	desc->engine_submit_mask = adjust_engine_mask(engine->class,
-						      engine->mask);
+	desc->engine_submit_mask = engine->logical_mask;
 	desc->hw_context_desc = ce->lrc.lrca;
 	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
 	desc->priority = ce->guc_prio;
@@ -3978,6 +3960,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 		}
 
 		ve->base.mask |= sibling->mask;
+		ve->base.logical_mask |= sibling->logical_mask;
 
 		if (n != 0 && ve->base.class != sibling->class) {
 			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
-- 
2.28.0



* [Intel-gfx] [PATCH 14/46] drm/i915: Expose logical engine instance to user
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (12 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 13/46] drm/i915: Add logical engine mapping Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 14:30   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship Matthew Brost
                   ` (36 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Expose the logical engine instance to the user via the query engine info
IOCTL. This is required for split-frame workloads as these need to be
placed on engines in a logically contiguous order. The logical mapping
can change based on fusing; rather than requiring the user to have
knowledge of the fusing, we simply expose the logical mapping via the
existing query engine info IOCTL.
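A rough sketch of how a UMD might consume the new field (illustrative
only, error handling omitted; assumes libdrm's drmIoctl() and the uapi
header from this patch):

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <xf86drm.h>
	#include <drm/i915_drm.h>

	static void print_logical_map(int fd)
	{
		struct drm_i915_query_item item = {
			.query_id = DRM_I915_QUERY_ENGINE_INFO,
		};
		struct drm_i915_query query = {
			.num_items = 1,
			.items_ptr = (uintptr_t)&item,
		};
		struct drm_i915_query_engine_info *info;
		unsigned int i;

		/* First pass with length == 0 returns the required size */
		drmIoctl(fd, DRM_IOCTL_I915_QUERY, &query);
		info = calloc(1, item.length);
		item.data_ptr = (uintptr_t)info;
		drmIoctl(fd, DRM_IOCTL_I915_QUERY, &query);

		for (i = 0; i < info->num_engines; i++) {
			const struct drm_i915_engine_info *e = &info->engines[i];

			if (e->flags & I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE)
				printf("class %u instance %u -> logical %u\n",
				       e->engine.engine_class,
				       e->engine.engine_instance,
				       e->logical_instance);
		}
		free(info);
	}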

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/i915_query.c | 2 ++
 include/uapi/drm/i915_drm.h       | 8 +++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
index e49da36c62fb..8a72923fbdba 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -124,7 +124,9 @@ query_engine_info(struct drm_i915_private *i915,
 	for_each_uabi_engine(engine, i915) {
 		info.engine.engine_class = engine->uabi_class;
 		info.engine.engine_instance = engine->uabi_instance;
+		info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;
 		info.capabilities = engine->uabi_capabilities;
+		info.logical_instance = ilog2(engine->logical_mask);
 
 		if (copy_to_user(info_ptr, &info, sizeof(info)))
 			return -EFAULT;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f13d241417f..ef72e07fe08c 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -2706,14 +2706,20 @@ struct drm_i915_engine_info {
 
 	/** @flags: Engine flags. */
 	__u64 flags;
+#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
 
 	/** @capabilities: Capabilities of this engine. */
 	__u64 capabilities;
 #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
 #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
 
+	/** @logical_instance: Logical instance of engine */
+	__u16 logical_instance;
+
 	/** @rsvd1: Reserved fields. */
-	__u64 rsvd1[4];
+	__u16 rsvd1[3];
+	/** @rsvd2: Reserved fields. */
+	__u64 rsvd2[3];
 };
 
 /**
-- 
2.28.0



* [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (13 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 14/46] drm/i915: Expose logical engine instance to user Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 14:37   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions Matthew Brost
                   ` (35 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Introduce a context parent-child relationship. Once this relationship is
created, all pinning / unpinning operations are directed to the parent
context. The parent context is responsible for pinning all of its
children and itself.

This is a precursor to the full GuC multi-lrc implementation but aligns
with how the GuC multi-lrc interface is defined - a single H2G is used to
register / deregister all of the contexts simultaneously.

Subsequent patches in the series will implement the pinning / unpinning
operations for parent / child contexts.
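As an illustration of the intended use (not part of this patch; error
handling omitted and 'siblings' / 'width' are hypothetical), a later
patch could build a parallel group of one parent plus width - 1 children
roughly like this:

	parent = intel_context_create(siblings[0]);
	for (i = 1; i < width; ++i) {
		child = intel_context_create(siblings[i]);
		intel_context_bind_parent_child(parent, child);
	}

	/* From here on every pin / unpin is funnelled through 'parent' */
	for_each_child(parent, child)
		GEM_BUG_ON(child->parent != parent);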

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 29 +++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h       | 18 ++++++++++++
 drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++++++++
 3 files changed, 59 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 745e84c72c90..8cb92b10b547 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -395,6 +395,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 	spin_lock_init(&ce->guc_state.lock);
 	INIT_LIST_HEAD(&ce->guc_state.fences);
 
+	INIT_LIST_HEAD(&ce->guc_child_list);
+
 	spin_lock_init(&ce->guc_active.lock);
 	INIT_LIST_HEAD(&ce->guc_active.requests);
 
@@ -414,10 +416,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 
 void intel_context_fini(struct intel_context *ce)
 {
+	struct intel_context *child, *next;
+
 	if (ce->timeline)
 		intel_timeline_put(ce->timeline);
 	i915_vm_put(ce->vm);
 
+	/* Need to put the creation ref for the children */
+	if (intel_context_is_parent(ce))
+		for_each_child_safe(ce, child, next)
+			intel_context_put(child);
+
 	mutex_destroy(&ce->pin_mutex);
 	i915_active_fini(&ce->active);
 }
@@ -533,6 +542,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 	return active;
 }
 
+void intel_context_bind_parent_child(struct intel_context *parent,
+				     struct intel_context *child)
+{
+	/*
+	 * It is the caller's responsibility to validate that this function is
+	 * used correctly, but we use GEM_BUG_ON here to ensure that they do.
+	 */
+	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
+	GEM_BUG_ON(intel_context_is_pinned(parent));
+	GEM_BUG_ON(intel_context_is_child(parent));
+	GEM_BUG_ON(intel_context_is_pinned(child));
+	GEM_BUG_ON(intel_context_is_child(child));
+	GEM_BUG_ON(intel_context_is_parent(child));
+
+	parent->guc_number_children++;
+	list_add_tail(&child->guc_child_link,
+		      &parent->guc_child_list);
+	child->parent = parent;
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_context.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index c41098950746..ad6ce5ac4824 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -44,6 +44,24 @@ void intel_context_free(struct intel_context *ce);
 int intel_context_reconfigure_sseu(struct intel_context *ce,
 				   const struct intel_sseu sseu);
 
+static inline bool intel_context_is_child(struct intel_context *ce)
+{
+	return !!ce->parent;
+}
+
+static inline bool intel_context_is_parent(struct intel_context *ce)
+{
+	return !!ce->guc_number_children;
+}
+
+void intel_context_bind_parent_child(struct intel_context *parent,
+				     struct intel_context *child);
+
+#define for_each_child(parent, ce)\
+	list_for_each_entry(ce, &(parent)->guc_child_list, guc_child_link)
+#define for_each_child_safe(parent, ce, cn)\
+	list_for_each_entry_safe(ce, cn, &(parent)->guc_child_list, guc_child_link)
+
 /**
  * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
  * @ce - the context
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 2df79ba39867..66b22b370a72 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -202,6 +202,18 @@ struct intel_context {
 	/* GuC context blocked fence */
 	struct i915_sw_fence guc_blocked;
 
+	/* Head of children list or link in parent's children list */
+	union {
+		struct list_head guc_child_list;	/* parent */
+		struct list_head guc_child_link;	/* child */
+	};
+
+	/* Pointer to parent */
+	struct intel_context *parent;
+
+	/* Number of children if parent */
+	u8 guc_number_children;
+
 	/*
 	 * GuC priority management
 	 */
-- 
2.28.0



* [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (14 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 15:17   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 17/46] drm/i915/guc: Add multi-lrc context registration Matthew Brost
                   ` (34 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Implement GuC parent-child context pin / unpin functions in which, if any
context in the relationship is pinned, all the contexts are pinned. The
parent owns most of the pinning / unpinning process and the children
direct any pins / unpins to the parent.

This patch implements a number of currently unused functions that will be
connected later in the series.
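From a caller's perspective the semantics are unchanged; the forwarding
happens underneath. A small usage sketch (error handling omitted, 'child'
is assumed to have been bound to a parent already):

	err = intel_context_pin(child);	/* first pin forwards to the parent */
	if (err)
		return err;

	/* parent and all children are now pinned as one unit */
	GEM_BUG_ON(!intel_context_is_pinned(child->parent));

	intel_context_unpin(child);	/* last unpin releases the group */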

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 187 ++++++++++++++++--
 drivers/gpu/drm/i915/gt/intel_context.h       |  43 +---
 drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
 .../drm/i915/gt/intel_execlists_submission.c  |  25 ++-
 drivers/gpu/drm/i915/gt/intel_lrc.c           |  26 +--
 drivers/gpu/drm/i915/gt/intel_lrc.h           |   6 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c   |   5 +-
 drivers/gpu/drm/i915/gt/mock_engine.c         |   4 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 183 +++++++++++++++--
 9 files changed, 371 insertions(+), 112 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 8cb92b10b547..bb4c14656067 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -158,8 +158,8 @@ static void __ring_retire(struct intel_ring *ring)
 	intel_ring_unpin(ring);
 }
 
-static int intel_context_pre_pin(struct intel_context *ce,
-				 struct i915_gem_ww_ctx *ww)
+static int __intel_context_pre_pin(struct intel_context *ce,
+				   struct i915_gem_ww_ctx *ww)
 {
 	int err;
 
@@ -190,7 +190,7 @@ static int intel_context_pre_pin(struct intel_context *ce,
 	return err;
 }
 
-static void intel_context_post_unpin(struct intel_context *ce)
+static void __intel_context_post_unpin(struct intel_context *ce)
 {
 	if (ce->state)
 		__context_unpin_state(ce->state);
@@ -199,13 +199,85 @@ static void intel_context_post_unpin(struct intel_context *ce)
 	__ring_retire(ce->ring);
 }
 
-int __intel_context_do_pin_ww(struct intel_context *ce,
-			      struct i915_gem_ww_ctx *ww)
+static int intel_context_pre_pin(struct intel_context *ce,
+				 struct i915_gem_ww_ctx *ww)
 {
-	bool handoff = false;
-	void *vaddr;
+	struct intel_context *child;
+	int err, i = 0;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	for_each_child(ce, child) {
+		err = __intel_context_pre_pin(child, ww);
+		if (unlikely(err))
+			goto unwind;
+		++i;
+	}
+
+	err = __intel_context_pre_pin(ce, ww);
+	if (unlikely(err))
+		goto unwind;
+
+	return 0;
+
+unwind:
+	for_each_child(ce, child) {
+		if (!i--)
+			break;
+		__intel_context_post_unpin(child);
+	}
+
+	return err;
+}
+
+static void intel_context_post_unpin(struct intel_context *ce)
+{
+	struct intel_context *child;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	for_each_child(ce, child)
+		__intel_context_post_unpin(child);
+
+	__intel_context_post_unpin(ce);
+}
+
+static int __do_ww_lock(struct intel_context *ce,
+			struct i915_gem_ww_ctx *ww)
+{
+	int err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
+
+	if (!err && ce->ring->vma->obj)
+		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
+	if (!err && ce->state)
+		err = i915_gem_object_lock(ce->state->obj, ww);
+
+	return err;
+}
+
+static int do_ww_lock(struct intel_context *ce,
+		      struct i915_gem_ww_ctx *ww)
+{
+	struct intel_context *child;
 	int err = 0;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	for_each_child(ce, child) {
+		err = __do_ww_lock(child, ww);
+		if (unlikely(err))
+			return err;
+	}
+
+	return __do_ww_lock(ce, ww);
+}
+
+static int __intel_context_do_pin_ww(struct intel_context *ce,
+				     struct i915_gem_ww_ctx *ww)
+{
+	bool handoff = false;
+	int err;
+
 	if (unlikely(!test_bit(CONTEXT_ALLOC_BIT, &ce->flags))) {
 		err = intel_context_alloc_state(ce);
 		if (err)
@@ -217,14 +289,11 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
 	 * refcount for __intel_context_active(), which prevent a lock
 	 * inversion of ce->pin_mutex vs dma_resv_lock().
 	 */
+	err = do_ww_lock(ce, ww);
+	if (err)
+		return err;
 
-	err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
-	if (!err && ce->ring->vma->obj)
-		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
-	if (!err && ce->state)
-		err = i915_gem_object_lock(ce->state->obj, ww);
-	if (!err)
-		err = intel_context_pre_pin(ce, ww);
+	err = intel_context_pre_pin(ce, ww);
 	if (err)
 		return err;
 
@@ -232,7 +301,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
 	if (err)
 		goto err_ctx_unpin;
 
-	err = ce->ops->pre_pin(ce, ww, &vaddr);
+	err = ce->ops->pre_pin(ce, ww);
 	if (err)
 		goto err_release;
 
@@ -250,7 +319,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
 		if (unlikely(err))
 			goto err_unlock;
 
-		err = ce->ops->pin(ce, vaddr);
+		err = ce->ops->pin(ce);
 		if (err) {
 			intel_context_active_release(ce);
 			goto err_unlock;
@@ -290,7 +359,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
 	return err;
 }
 
-int __intel_context_do_pin(struct intel_context *ce)
+static int __intel_context_do_pin(struct intel_context *ce)
 {
 	struct i915_gem_ww_ctx ww;
 	int err;
@@ -337,7 +406,7 @@ static void __intel_context_retire(struct i915_active *active)
 		 intel_context_get_avg_runtime_ns(ce));
 
 	set_bit(CONTEXT_VALID_BIT, &ce->flags);
-	intel_context_post_unpin(ce);
+	__intel_context_post_unpin(ce);
 	intel_context_put(ce);
 }
 
@@ -562,6 +631,88 @@ void intel_context_bind_parent_child(struct intel_context *parent,
 	child->parent = parent;
 }
 
+static inline int ____intel_context_pin(struct intel_context *ce)
+{
+	if (likely(intel_context_pin_if_active(ce)))
+		return 0;
+
+	return __intel_context_do_pin(ce);
+}
+
+static inline int __intel_context_pin_ww(struct intel_context *ce,
+					 struct i915_gem_ww_ctx *ww)
+{
+	if (likely(intel_context_pin_if_active(ce)))
+		return 0;
+
+	return __intel_context_do_pin_ww(ce, ww);
+}
+
+static inline void __intel_context_unpin(struct intel_context *ce)
+{
+	if (!ce->ops->sched_disable) {
+		__intel_context_do_unpin(ce, 1);
+	} else {
+		/*
+		 * Move ownership of this pin to the scheduling disable which is
+		 * an async operation. When that operation completes the above
+		 * intel_context_sched_disable_unpin is called potentially
+		 * unpinning the context.
+		 */
+		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
+			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
+				ce->ops->sched_disable(ce);
+				break;
+			}
+		}
+	}
+}
+
+/*
+ * FIXME: This is ugly; these branches are only needed for parallel contexts in
+ * GuC submission. The idea is that if any of the contexts configured for
+ * parallel submission is pinned, all the contexts need to be pinned in order
+ * to register them with the GuC. We are adding the layer here while it should
+ * probably be pushed to the backend via a vfunc, but since we already have
+ * ce->pin plus a layer on top of it this is confusing. This definitely needs
+ * some rework on how to properly layer / structure this code path; what is in
+ * place works but is not ideal.
+ */
+int intel_context_pin(struct intel_context *ce)
+{
+	if (intel_context_is_child(ce)) {
+		if (!atomic_fetch_add(1, &ce->pin_count))
+			return ____intel_context_pin(ce->parent);
+		else
+			return 0;
+	} else {
+		return ____intel_context_pin(ce);
+	}
+}
+
+int intel_context_pin_ww(struct intel_context *ce,
+			 struct i915_gem_ww_ctx *ww)
+{
+	if (intel_context_is_child(ce)) {
+		if (!atomic_fetch_add(1, &ce->pin_count))
+			return __intel_context_pin_ww(ce->parent, ww);
+		else
+			return 0;
+	} else {
+		return __intel_context_pin_ww(ce, ww);
+	}
+}
+
+void intel_context_unpin(struct intel_context *ce)
+{
+	if (intel_context_is_child(ce)) {
+		if (atomic_fetch_add(-1, &ce->pin_count) == 1)
+			__intel_context_unpin(ce->parent);
+	} else {
+		__intel_context_unpin(ce);
+	}
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_context.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index ad6ce5ac4824..c208691fc87d 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -110,31 +110,15 @@ static inline void intel_context_unlock_pinned(struct intel_context *ce)
 	mutex_unlock(&ce->pin_mutex);
 }
 
-int __intel_context_do_pin(struct intel_context *ce);
-int __intel_context_do_pin_ww(struct intel_context *ce,
-			      struct i915_gem_ww_ctx *ww);
-
 static inline bool intel_context_pin_if_active(struct intel_context *ce)
 {
 	return atomic_inc_not_zero(&ce->pin_count);
 }
 
-static inline int intel_context_pin(struct intel_context *ce)
-{
-	if (likely(intel_context_pin_if_active(ce)))
-		return 0;
-
-	return __intel_context_do_pin(ce);
-}
-
-static inline int intel_context_pin_ww(struct intel_context *ce,
-				       struct i915_gem_ww_ctx *ww)
-{
-	if (likely(intel_context_pin_if_active(ce)))
-		return 0;
+int intel_context_pin(struct intel_context *ce);
 
-	return __intel_context_do_pin_ww(ce, ww);
-}
+int intel_context_pin_ww(struct intel_context *ce,
+			 struct i915_gem_ww_ctx *ww);
 
 static inline void __intel_context_pin(struct intel_context *ce)
 {
@@ -146,28 +130,11 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub);
 
 static inline void intel_context_sched_disable_unpin(struct intel_context *ce)
 {
+	GEM_BUG_ON(intel_context_is_child(ce));
 	__intel_context_do_unpin(ce, 2);
 }
 
-static inline void intel_context_unpin(struct intel_context *ce)
-{
-	if (!ce->ops->sched_disable) {
-		__intel_context_do_unpin(ce, 1);
-	} else {
-		/*
-		 * Move ownership of this pin to the scheduling disable which is
-		 * an async operation. When that operation completes the above
-		 * intel_context_sched_disable_unpin is called potentially
-		 * unpinning the context.
-		 */
-		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
-			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
-				ce->ops->sched_disable(ce);
-				break;
-			}
-		}
-	}
-}
+void intel_context_unpin(struct intel_context *ce);
 
 void intel_context_enter_engine(struct intel_context *ce);
 void intel_context_exit_engine(struct intel_context *ce);
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 66b22b370a72..eb82be15b7a2 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -39,8 +39,8 @@ struct intel_context_ops {
 
 	void (*ban)(struct intel_context *ce, struct i915_request *rq);
 
-	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
-	int (*pin)(struct intel_context *ce, void *vaddr);
+	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
+	int (*pin)(struct intel_context *ce);
 	void (*unpin)(struct intel_context *ce);
 	void (*post_unpin)(struct intel_context *ce);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index baa1797af1c8..fc74ca28f245 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2554,16 +2554,17 @@ static void execlists_submit_request(struct i915_request *request)
 static int
 __execlists_context_pre_pin(struct intel_context *ce,
 			    struct intel_engine_cs *engine,
-			    struct i915_gem_ww_ctx *ww, void **vaddr)
+			    struct i915_gem_ww_ctx *ww)
 {
 	int err;
 
-	err = lrc_pre_pin(ce, engine, ww, vaddr);
+	err = lrc_pre_pin(ce, engine, ww);
 	if (err)
 		return err;
 
 	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags)) {
-		lrc_init_state(ce, engine, *vaddr);
+		lrc_init_state(ce, engine, ce->lrc_reg_state -
+			       LRC_STATE_OFFSET / sizeof(*ce->lrc_reg_state));
 
 		 __i915_gem_object_flush_map(ce->state->obj, 0, engine->context_size);
 	}
@@ -2572,15 +2573,14 @@ __execlists_context_pre_pin(struct intel_context *ce,
 }
 
 static int execlists_context_pre_pin(struct intel_context *ce,
-				     struct i915_gem_ww_ctx *ww,
-				     void **vaddr)
+				     struct i915_gem_ww_ctx *ww)
 {
-	return __execlists_context_pre_pin(ce, ce->engine, ww, vaddr);
+	return __execlists_context_pre_pin(ce, ce->engine, ww);
 }
 
-static int execlists_context_pin(struct intel_context *ce, void *vaddr)
+static int execlists_context_pin(struct intel_context *ce)
 {
-	return lrc_pin(ce, ce->engine, vaddr);
+	return lrc_pin(ce, ce->engine);
 }
 
 static int execlists_context_alloc(struct intel_context *ce)
@@ -3570,20 +3570,19 @@ static int virtual_context_alloc(struct intel_context *ce)
 }
 
 static int virtual_context_pre_pin(struct intel_context *ce,
-				   struct i915_gem_ww_ctx *ww,
-				   void **vaddr)
+				   struct i915_gem_ww_ctx *ww)
 {
 	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
 
 	 /* Note: we must use a real engine class for setting up reg state */
-	return __execlists_context_pre_pin(ce, ve->siblings[0], ww, vaddr);
+	return __execlists_context_pre_pin(ce, ve->siblings[0], ww);
 }
 
-static int virtual_context_pin(struct intel_context *ce, void *vaddr)
+static int virtual_context_pin(struct intel_context *ce)
 {
 	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
 
-	return lrc_pin(ce, ve->siblings[0], vaddr);
+	return lrc_pin(ce, ve->siblings[0]);
 }
 
 static void virtual_context_enter(struct intel_context *ce)
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index bb4af4977920..c466fc966005 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -947,30 +947,30 @@ void lrc_reset(struct intel_context *ce)
 int
 lrc_pre_pin(struct intel_context *ce,
 	    struct intel_engine_cs *engine,
-	    struct i915_gem_ww_ctx *ww,
-	    void **vaddr)
+	    struct i915_gem_ww_ctx *ww)
 {
+	void *vaddr;
 	GEM_BUG_ON(!ce->state);
 	GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
 
-	*vaddr = i915_gem_object_pin_map(ce->state->obj,
-					 i915_coherent_map_type(ce->engine->i915,
-								ce->state->obj,
-								false) |
-					 I915_MAP_OVERRIDE);
+	vaddr = i915_gem_object_pin_map(ce->state->obj,
+					i915_coherent_map_type(ce->engine->i915,
+							       ce->state->obj,
+							       false) |
+					I915_MAP_OVERRIDE);
 
-	return PTR_ERR_OR_ZERO(*vaddr);
+	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
+
+	return PTR_ERR_OR_ZERO(vaddr);
 }
 
 int
 lrc_pin(struct intel_context *ce,
-	struct intel_engine_cs *engine,
-	void *vaddr)
+	struct intel_engine_cs *engine)
 {
-	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
-
 	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags))
-		lrc_init_state(ce, engine, vaddr);
+		lrc_init_state(ce, engine,
+			       (void *)ce->lrc_reg_state - LRC_STATE_OFFSET);
 
 	ce->lrc.lrca = lrc_update_regs(ce, engine, ce->ring->tail);
 	return 0;
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h b/drivers/gpu/drm/i915/gt/intel_lrc.h
index 7f697845c4cf..837fcf00270d 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.h
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
@@ -38,12 +38,10 @@ void lrc_destroy(struct kref *kref);
 int
 lrc_pre_pin(struct intel_context *ce,
 	    struct intel_engine_cs *engine,
-	    struct i915_gem_ww_ctx *ww,
-	    void **vaddr);
+	    struct i915_gem_ww_ctx *ww);
 int
 lrc_pin(struct intel_context *ce,
-	struct intel_engine_cs *engine,
-	void *vaddr);
+	struct intel_engine_cs *engine);
 void lrc_unpin(struct intel_context *ce);
 void lrc_post_unpin(struct intel_context *ce);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index 2958e2fae380..f4f301bfb9f7 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -472,8 +472,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
 }
 
 static int ring_context_pre_pin(struct intel_context *ce,
-				struct i915_gem_ww_ctx *ww,
-				void **unused)
+				struct i915_gem_ww_ctx *ww)
 {
 	struct i915_address_space *vm;
 	int err = 0;
@@ -576,7 +575,7 @@ static int ring_context_alloc(struct intel_context *ce)
 	return 0;
 }
 
-static int ring_context_pin(struct intel_context *ce, void *unused)
+static int ring_context_pin(struct intel_context *ce)
 {
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
index 2c1af030310c..826b5d7a4573 100644
--- a/drivers/gpu/drm/i915/gt/mock_engine.c
+++ b/drivers/gpu/drm/i915/gt/mock_engine.c
@@ -167,12 +167,12 @@ static int mock_context_alloc(struct intel_context *ce)
 }
 
 static int mock_context_pre_pin(struct intel_context *ce,
-				struct i915_gem_ww_ctx *ww, void **unused)
+				struct i915_gem_ww_ctx *ww)
 {
 	return 0;
 }
 
-static int mock_context_pin(struct intel_context *ce, void *unused)
+static int mock_context_pin(struct intel_context *ce)
 {
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index dec757d319a2..c5c73c42bcf7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1905,6 +1905,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 
 	GEM_BUG_ON(!engine->mask);
 	GEM_BUG_ON(context_guc_id_invalid(ce));
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	/*
 	 * Ensure LRC + CT vmas are is same region as write barrier is done
@@ -2008,15 +2009,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 
 static int __guc_context_pre_pin(struct intel_context *ce,
 				 struct intel_engine_cs *engine,
-				 struct i915_gem_ww_ctx *ww,
-				 void **vaddr)
+				 struct i915_gem_ww_ctx *ww)
 {
-	return lrc_pre_pin(ce, engine, ww, vaddr);
+	return lrc_pre_pin(ce, engine, ww);
 }
 
 static int __guc_context_pin(struct intel_context *ce,
-			     struct intel_engine_cs *engine,
-			     void *vaddr)
+			     struct intel_engine_cs *engine)
 {
 	if (i915_ggtt_offset(ce->state) !=
 	    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK))
@@ -2027,20 +2026,33 @@ static int __guc_context_pin(struct intel_context *ce,
 	 * explaination of why.
 	 */
 
-	return lrc_pin(ce, engine, vaddr);
+	return lrc_pin(ce, engine);
+}
+
+static void __guc_context_unpin(struct intel_context *ce)
+{
+	lrc_unpin(ce);
+}
+
+static void __guc_context_post_unpin(struct intel_context *ce)
+{
+	lrc_post_unpin(ce);
 }
 
 static int guc_context_pre_pin(struct intel_context *ce,
-			       struct i915_gem_ww_ctx *ww,
-			       void **vaddr)
+			       struct i915_gem_ww_ctx *ww)
 {
-	return __guc_context_pre_pin(ce, ce->engine, ww, vaddr);
+	return __guc_context_pre_pin(ce, ce->engine, ww);
 }
 
-static int guc_context_pin(struct intel_context *ce, void *vaddr)
+static int guc_context_pin(struct intel_context *ce)
 {
-	int ret = __guc_context_pin(ce, ce->engine, vaddr);
+	int ret;
 
+	GEM_BUG_ON(intel_context_is_parent(ce) ||
+		   intel_context_is_child(ce));
+
+	ret = __guc_context_pin(ce, ce->engine);
 	if (likely(!ret && !intel_context_is_barrier(ce)))
 		intel_engine_pm_get(ce->engine);
 
@@ -2054,7 +2066,7 @@ static void guc_context_unpin(struct intel_context *ce)
 	GEM_BUG_ON(context_enabled(ce));
 
 	unpin_guc_id(guc, ce, true);
-	lrc_unpin(ce);
+	__guc_context_unpin(ce);
 
 	if (likely(!intel_context_is_barrier(ce)))
 		intel_engine_pm_put(ce->engine);
@@ -2062,7 +2074,141 @@ static void guc_context_unpin(struct intel_context *ce)
 
 static void guc_context_post_unpin(struct intel_context *ce)
 {
-	lrc_post_unpin(ce);
+	__guc_context_post_unpin(ce);
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static int guc_parent_context_pre_pin(struct intel_context *ce,
+				      struct i915_gem_ww_ctx *ww)
+{
+	struct intel_context *child;
+	int err, i = 0, j = 0;
+
+	for_each_child(ce, child) {
+		err = i915_active_acquire(&child->active);
+		if (unlikely(err))
+			goto unwind_active;
+		++i;
+	}
+
+	for_each_child(ce, child) {
+		err = __guc_context_pre_pin(child, child->engine, ww);
+		if (unlikely(err))
+			goto unwind_pre_pin;
+		++j;
+	}
+
+	err = __guc_context_pre_pin(ce, ce->engine, ww);
+	if (unlikely(err))
+		goto unwind_pre_pin;
+
+	return 0;
+
+unwind_pre_pin:
+	for_each_child(ce, child) {
+		if (!j--)
+			break;
+		__guc_context_post_unpin(child);
+	}
+
+unwind_active:
+	for_each_child(ce, child) {
+		if (!i--)
+			break;
+		i915_active_release(&child->active);
+	}
+
+	return err;
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static void guc_parent_context_post_unpin(struct intel_context *ce)
+{
+	struct intel_context *child;
+
+	for_each_child(ce, child)
+		__guc_context_post_unpin(child);
+	__guc_context_post_unpin(ce);
+
+	for_each_child(ce, child) {
+		intel_context_get(child);
+		i915_active_release(&child->active);
+		intel_context_put(child);
+	}
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static int guc_parent_context_pin(struct intel_context *ce)
+{
+	int ret, i = 0, j = 0;
+	struct intel_context *child;
+	struct intel_engine_cs *engine;
+	intel_engine_mask_t tmp;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	for_each_child(ce, child) {
+		ret = __guc_context_pin(child, child->engine);
+		if (unlikely(ret))
+			goto unwind_pin;
+		++i;
+	}
+	ret = __guc_context_pin(ce, ce->engine);
+	if (unlikely(ret))
+		goto unwind_pin;
+
+	for_each_child(ce, child)
+		if (test_bit(CONTEXT_LRCA_DIRTY, &child->flags)) {
+			set_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
+			break;
+		}
+
+	for_each_engine_masked(engine, ce->engine->gt,
+			       ce->engine->mask, tmp)
+		intel_engine_pm_get(engine);
+	for_each_child(ce, child)
+		for_each_engine_masked(engine, child->engine->gt,
+				       child->engine->mask, tmp)
+			intel_engine_pm_get(engine);
+
+	return 0;
+
+unwind_pin:
+	for_each_child(ce, child) {
+		if (++j > i)
+			break;
+		__guc_context_unpin(child);
+	}
+
+	return ret;
+}
+
+/* Future patches will use this function */
+__maybe_unused
+static void guc_parent_context_unpin(struct intel_context *ce)
+{
+	struct intel_context *child;
+	struct intel_engine_cs *engine;
+	intel_engine_mask_t tmp;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+	GEM_BUG_ON(context_enabled(ce));
+
+	unpin_guc_id(ce_to_guc(ce), ce, true);
+	for_each_child(ce, child)
+		__guc_context_unpin(child);
+	__guc_context_unpin(ce);
+
+	for_each_engine_masked(engine, ce->engine->gt,
+			       ce->engine->mask, tmp)
+		intel_engine_pm_put(engine);
+	for_each_child(ce, child)
+		for_each_engine_masked(engine, child->engine->gt,
+				       child->engine->mask, tmp)
+			intel_engine_pm_put(engine);
 }
 
 static void __guc_context_sched_enable(struct intel_guc *guc,
@@ -2993,18 +3139,17 @@ static int guc_request_alloc(struct i915_request *rq)
 }
 
 static int guc_virtual_context_pre_pin(struct intel_context *ce,
-				       struct i915_gem_ww_ctx *ww,
-				       void **vaddr)
+				       struct i915_gem_ww_ctx *ww)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
 
-	return __guc_context_pre_pin(ce, engine, ww, vaddr);
+	return __guc_context_pre_pin(ce, engine, ww);
 }
 
-static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
+static int guc_virtual_context_pin(struct intel_context *ce)
 {
 	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
-	int ret = __guc_context_pin(ce, engine, vaddr);
+	int ret = __guc_context_pin(ce, engine);
 	intel_engine_mask_t tmp, mask = ce->engine->mask;
 
 	if (likely(!ret))
@@ -3024,7 +3169,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
 	GEM_BUG_ON(intel_context_is_barrier(ce));
 
 	unpin_guc_id(guc, ce, true);
-	lrc_unpin(ce);
+	__guc_context_unpin(ce);
 
 	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
 		intel_engine_pm_put(engine);
-- 
2.28.0



* [Intel-gfx] [PATCH 17/46] drm/i915/guc: Add multi-lrc context registration
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (15 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 18/46] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts Matthew Brost
                   ` (33 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Add the multi-lrc context registration H2G. In addition, a workqueue and
process descriptor are set up during multi-lrc context registration, as
these data structures are needed for multi-lrc submission.
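For reference, the H2G built by __guc_action_register_multi_lrc() in this
patch has roughly the following shape (sketch derived from the code
below, not from the GuC ABI documentation):

	/*
	 *   action[0]  = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC
	 *   action[1]  = parent guc_id
	 *   action[2]  = guc_number_children + 1   (total contexts)
	 *   action[3]  = LRC descriptor offset of the parent
	 *   action[4+] = LRC descriptor offset of each child, in order
	 */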

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |   6 +
 drivers/gpu/drm/i915/gt/intel_lrc.c           |   5 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 117 +++++++++++++++++-
 5 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index eb82be15b7a2..9665cb31bab0 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -211,9 +211,15 @@ struct intel_context {
 	/* Pointer to parent */
 	struct intel_context *parent;
 
+	/* GuC workqueue head / tail - only applies to parent */
+	u16 guc_wqi_tail;
+	u16 guc_wqi_head;
+
 	/* Number of children if parent */
 	u8 guc_number_children;
 
+	u8 parent_page; /* if set, page num reserved for parent context */
+
 	/*
 	 * GuC priority management
 	 */
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index c466fc966005..4b65c3a98331 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -861,6 +861,11 @@ __lrc_alloc_state(struct intel_context *ce, struct intel_engine_cs *engine)
 		context_size += PAGE_SIZE;
 	}
 
+	if (intel_context_is_parent(ce)) {
+		ce->parent_page = context_size / PAGE_SIZE;
+		context_size += PAGE_SIZE;
+	}
+
 	obj = i915_gem_object_create_lmem(engine->i915, context_size, 0);
 	if (IS_ERR(obj))
 		obj = i915_gem_object_create_shmem(engine->i915, context_size);
diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index d832c8f11c11..0a496acec213 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -142,6 +142,7 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_REGISTER_COMMAND_TRANSPORT_BUFFER = 0x4505,
 	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
 	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
+	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
 	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
 	INTEL_GUC_ACTION_LIMIT
 };
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 82534259b7ad..e08fbd40281c 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -51,7 +51,7 @@
 
 #define GUC_DOORBELL_INVALID		256
 
-#define GUC_WQ_SIZE			(PAGE_SIZE * 2)
+#define GUC_WQ_SIZE			(PAGE_SIZE / 2)
 
 /* Work queue item header definitions */
 #define WQ_STATUS_ACTIVE		1
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index c5c73c42bcf7..98c1c0b7b087 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -439,6 +439,39 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 	return rb_entry(rb, struct i915_priolist, node);
 }
 
+/*
+ * When using multi-lrc submission an extra page in the context state is
+ * reserved for the process descriptor and work queue.
+ *
+ * The layout of this page is below:
+ * 0						guc_process_desc
+ * ...						unused
+ * PAGE_SIZE / 2				work queue start
+ * ...						work queue
+ * PAGE_SIZE - 1				work queue end
+ */
+#define WQ_OFFSET	(PAGE_SIZE / 2)
+static inline u32 __get_process_desc_offset(struct intel_context *ce)
+{
+	GEM_BUG_ON(!ce->parent_page);
+
+	return ce->parent_page * PAGE_SIZE;
+}
+
+static inline u32 __get_wq_offset(struct intel_context *ce)
+{
+	return __get_process_desc_offset(ce) + WQ_OFFSET;
+}
+
+static inline struct guc_process_desc *
+__get_process_desc(struct intel_context *ce)
+{
+	return (struct guc_process_desc *)
+		(ce->lrc_reg_state +
+		 ((__get_process_desc_offset(ce) -
+		   LRC_STATE_OFFSET) / sizeof(u32)));
+}
+
 static u32 __get_lrc_desc_offset(struct intel_guc *guc, int index)
 {
 	GEM_BUG_ON(index >= guc->lrcd_reg.max_idx);
@@ -1743,6 +1776,28 @@ static void unpin_guc_id(struct intel_guc *guc,
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
 }
 
+static int __guc_action_register_multi_lrc(struct intel_guc *guc,
+					   struct intel_context *ce,
+					   u32 guc_id,
+					   bool loop)
+{
+	struct intel_context *child;
+	u32 action[4 + MAX_ENGINE_INSTANCE];
+	int len = 0;
+
+	GEM_BUG_ON(ce->guc_number_children > MAX_ENGINE_INSTANCE);
+
+	action[len++] = INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC;
+	action[len++] = guc_id;
+	action[len++] = ce->guc_number_children + 1;
+	action[len++] = __get_lrc_desc_offset(guc, ce->guc_lrcd_reg_idx);
+	for_each_child(ce, child)
+		action[len++] =
+			__get_lrc_desc_offset(guc, child->guc_lrcd_reg_idx);
+
+	return guc_submission_send_busy_loop(guc, action, len, 0, loop);
+}
+
 static int __guc_action_register_context(struct intel_guc *guc,
 					 struct intel_context *ce,
 					 u32 guc_id,
@@ -1763,9 +1818,14 @@ static int register_context(struct intel_context *ce, bool loop)
 	struct intel_guc *guc = ce_to_guc(ce);
 	int ret;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	trace_intel_context_register(ce);
 
-	ret = __guc_action_register_context(guc, ce, ce->guc_id, loop);
+	if (intel_context_is_parent(ce))
+		ret =  __guc_action_register_multi_lrc(guc, ce, ce->guc_id,
+						       loop);
+	else
+		ret = __guc_action_register_context(guc, ce, ce->guc_id, loop);
 	if (likely(!ret))
 		set_context_registered(ce);
 
@@ -1790,6 +1850,7 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	trace_intel_context_deregister(ce);
 
 	return __guc_action_deregister_context(guc, guc_id, loop);
@@ -1860,7 +1921,11 @@ static void __free_lrcd_reg_idx(struct intel_guc *guc, struct intel_context *ce)
 
 static void free_lrcd_reg_idx(struct intel_guc *guc, struct intel_context *ce)
 {
+	struct intel_context *child;
+
 	__free_lrcd_reg_idx(guc, ce);
+	for_each_child(ce, child)
+		__free_lrcd_reg_idx(guc, child);
 }
 
 static int guc_lrcd_reg_init(struct intel_guc *guc)
@@ -1901,6 +1966,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	int prio = I915_CONTEXT_DEFAULT_PRIORITY;
 	bool context_registered;
 	intel_wakeref_t wakeref;
+	struct intel_context *child;
 	int ret = 0;
 
 	GEM_BUG_ON(!engine->mask);
@@ -1921,6 +1987,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 			return ret;
 		ce->guc_lrcd_reg_idx = ret;
 	}
+	for_each_child(ce, child)
+		if (likely(!child->guc_lrcd_reg_idx)) {
+			ret = alloc_lrcd_reg_idx(guc, !loop);
+			if (unlikely(ret < 0))
+				return ret;
+			child->guc_lrcd_reg_idx = ret;
+		}
 
 	context_registered = lrc_desc_registered(guc, desc_idx);
 
@@ -1944,6 +2017,42 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 	guc_context_policy_init(engine, desc);
 	init_sched_state(ce);
 
+	/*
+	 * The context is a parent; we need to register a process descriptor
+	 * describing a work queue and register all child contexts.
+	 */
+	if (intel_context_is_parent(ce)) {
+		struct guc_process_desc *pdesc;
+
+		ce->guc_wqi_tail = 0;
+		ce->guc_wqi_head = 0;
+
+		desc->process_desc = i915_ggtt_offset(ce->state) +
+			__get_process_desc_offset(ce);
+		desc->wq_addr = i915_ggtt_offset(ce->state) +
+			__get_wq_offset(ce);
+		desc->wq_size = GUC_WQ_SIZE;
+
+		pdesc = __get_process_desc(ce);
+		memset(pdesc, 0, sizeof(*(pdesc)));
+		pdesc->stage_id = ce->guc_id;
+		pdesc->wq_base_addr = desc->wq_addr;
+		pdesc->wq_size_bytes = desc->wq_size;
+		pdesc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
+		pdesc->wq_status = WQ_STATUS_ACTIVE;
+
+		for_each_child(ce, child) {
+			desc = __get_lrc_desc(guc, child->guc_lrcd_reg_idx);
+
+			desc->engine_class =
+				engine_class_to_guc_class(engine->class);
+			desc->hw_context_desc = child->lrc.lrca;
+			desc->priority = GUC_CLIENT_PRIORITY_KMD_NORMAL;
+			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
+			guc_context_policy_init(engine, desc);
+		}
+	}
+
 	/*
 	 * The context_lookup xarray is used to determine if the hardware
 	 * context is currently registered. There are two cases in which it
@@ -3653,6 +3762,12 @@ g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 		return NULL;
 	}
 
+	if (unlikely(intel_context_is_child(ce))) {
+		drm_err(&guc_to_gt(guc)->i915->drm,
+			"Context is child, desc_idx %u", desc_idx);
+		return NULL;
+	}
+
 	return ce;
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 18/46] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (16 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 17/46] drm/i915/guc: Add multi-lrc context registration Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 19/46] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids Matthew Brost
                   ` (32 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

In GuC parent-child contexts the parent context controls the scheduling,
so ensure that only the parent performs the scheduling operations.

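For illustration, the gist of that redirection can be modelled in a few lines
of userspace C. The struct and helper names below (struct ctx,
to_scheduling_ctx) are made up for this sketch and are not the i915 code:

/* Toy userspace model: scheduling state lives on the parent; children only
 * point at it, so every operation is redirected to the parent first. */
#include <stdio.h>

struct ctx {
	struct ctx *parent;	/* NULL for a parent or standalone context */
	int sched_enabled;
};

static struct ctx *to_scheduling_ctx(struct ctx *ce)
{
	return ce->parent ? ce->parent : ce;
}

int main(void)
{
	struct ctx parent = { 0 };
	struct ctx child = { .parent = &parent };

	/* A scheduling operation issued against the child lands on the parent. */
	to_scheduling_ctx(&child)->sched_enabled = 1;
	printf("parent enabled: %d, child enabled: %d\n",
	       parent.sched_enabled, child.sched_enabled);
	return 0;
}
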
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 52 +++++++++++++++----
 1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 98c1c0b7b087..f23dd716723f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -405,6 +405,18 @@ static inline void decr_context_blocked(struct intel_context *ce)
 	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
 }
 
+static inline struct intel_context *
+to_parent(struct intel_context *ce)
+{
+	return intel_context_is_child(ce) ? ce->parent : ce;
+}
+
+static inline struct intel_context *
+request_to_scheduling_context(struct i915_request *rq)
+{
+	return to_parent(rq->context);
+}
+
 static inline bool context_guc_id_invalid(struct intel_context *ce)
 {
 	return ce->guc_id == GUC_INVALID_LRC_ID;
@@ -711,7 +723,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
 static int tasklet_register_context(struct guc_submit_engine *gse,
 				    struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	struct intel_guc *guc = gse->sched_engine.private_data;
 	int ret = 0;
 
@@ -720,6 +732,7 @@ static int tasklet_register_context(struct guc_submit_engine *gse,
 	GEM_BUG_ON(ce->guc_num_rq_submit_no_id);
 	GEM_BUG_ON(request_has_no_guc_id(rq));
 	GEM_BUG_ON(context_guc_id_invalid(ce));
+	GEM_BUG_ON(intel_context_is_child(ce));
 	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
 
 	/*
@@ -2355,6 +2368,7 @@ static void __guc_context_sched_disable(struct intel_guc *guc,
 	GEM_BUG_ON(guc_id == GUC_INVALID_LRC_ID);
 #endif
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	trace_intel_context_sched_disable(ce);
 
 	guc_submission_send_busy_loop(guc, action, ARRAY_SIZE(action),
@@ -2570,6 +2584,8 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	u16 guc_id;
 	bool enabled;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
 	    !lrc_desc_registered(guc, ce->guc_id)) {
 		clr_context_enabled(ce);
@@ -2971,6 +2987,8 @@ static void guc_signal_context_fence(struct intel_context *ce)
 {
 	unsigned long flags;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	clr_context_wait_for_deregister_to_register(ce);
 	__guc_signal_context_fence(ce);
@@ -3056,14 +3074,26 @@ static bool context_needs_lrc_desc_pin(struct intel_context *ce, bool new_guc_id
 		!submission_disabled(ce_to_guc(ce));
 }
 
+static void clear_lrca_dirty(struct intel_context *ce)
+{
+	struct intel_context *child;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	clear_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
+	for_each_child(ce, child)
+		clear_bit(CONTEXT_LRCA_DIRTY, &child->flags);
+}
+
 static int tasklet_pin_guc_id(struct guc_submit_engine *gse,
 			      struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	int ret = 0;
 
 	lockdep_assert_held(&gse->sched_engine.lock);
 	GEM_BUG_ON(!ce->guc_num_rq_submit_no_id);
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	if (atomic_add_unless(&ce->guc_id_ref, ce->guc_num_rq_submit_no_id, 0))
 		goto out;
@@ -3091,7 +3121,7 @@ static int tasklet_pin_guc_id(struct guc_submit_engine *gse,
 		gse->submission_stall_reason = STALL_SCHED_DISABLE;
 	}
 
-	clear_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
+	clear_lrca_dirty(ce);
 out:
 	gse->total_num_rq_with_no_guc_id -= ce->guc_num_rq_submit_no_id;
 	GEM_BUG_ON(gse->total_num_rq_with_no_guc_id < 0);
@@ -3122,7 +3152,7 @@ static int tasklet_pin_guc_id(struct guc_submit_engine *gse,
 
 static int guc_request_alloc(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	struct intel_guc *guc = ce_to_guc(ce);
 	struct guc_submit_engine *gse = ce_to_gse(ce);
 	unsigned long flags;
@@ -3173,11 +3203,12 @@ static int guc_request_alloc(struct i915_request *rq)
 	 * persistent until the generated request is retired. Thus, sealing these
 	 * race conditions.
 	 *
-	 * There is no need for a lock here as the timeline mutex ensures at
-	 * most one context can be executing this code path at once. The
-	 * guc_id_ref is incremented once for every request in flight and
-	 * decremented on each retire. When it is zero, a lock around the
-	 * increment (in pin_guc_id) is needed to seal a race with unpin_guc_id.
+	 * There is no need for a lock here as the timeline mutex (or
+	 * parallel_submit mutex in the case of multi-lrc) ensures at most one
+	 * context can be executing this code path at once. The guc_id_ref is
+	 * incremented once for every request in flight and decremented on each
+	 * retire. When it is zero, a lock around the increment (in pin_guc_id)
+	 * is needed to seal a race with unpin_guc_id.
 	 */
 	if (atomic_add_unless(&ce->guc_id_ref, 1, 0))
 		goto out;
@@ -3215,8 +3246,7 @@ static int guc_request_alloc(struct i915_request *rq)
 		}
 	}
 
-	clear_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
-
+	clear_lrca_dirty(ce);
 out:
 	incr_num_rq_not_ready(ce);
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 19/46] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (17 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 18/46] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 15:31   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine Matthew Brost
                   ` (31 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Assign contexts in a parent-child relationship consecutive guc_ids. This
is accomplished by partitioning the guc_id space between ids that need to
be consecutive (1/16 of the available guc_ids) and ids that do not (15/16
of the available guc_ids). The consecutive search is implemented via the
bitmap API.

This is a precursor to the full GuC multi-lrc implementation but aligns
with how the GuC multi-lrc interface is defined - guc_ids must be
consecutive when using the GuC multi-lrc interface.

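As a rough illustration of the consecutive allocation, here is a standalone
userspace sketch. The names MULTI_LRC_IDS and find_free_region() are invented
for this toy; the driver itself uses bitmap_find_free_region() /
bitmap_release_region() as in the diff below:

/*
 * Toy userspace model of the multi-lrc guc_id scheme: parents grab a
 * naturally aligned power-of-two block of consecutive ids out of a small
 * pool, so a parent and its children always end up with consecutive ids.
 */
#include <stdio.h>
#include <stdbool.h>

#define MULTI_LRC_IDS	32	/* stand-in for NUMBER_MULTI_LRC_GUC_ID() */

static bool used[MULTI_LRC_IDS];

/* Find and claim a free, naturally aligned block of (1 << order) ids. */
static int find_free_region(int order)
{
	int size = 1 << order;

	for (int base = 0; base + size <= MULTI_LRC_IDS; base += size) {
		bool avail = true;

		for (int i = base; i < base + size; i++)
			if (used[i])
				avail = false;
		if (avail) {
			for (int i = base; i < base + size; i++)
				used[i] = true;
			return base;
		}
	}
	return -1;	/* caller would fall back to stealing an id block */
}

int main(void)
{
	/* A parent with 3 children needs 4 consecutive ids: order_base_2(4) = 2. */
	int parent_id = find_free_region(2);

	printf("parent guc_id %d, children use %d..%d\n",
	       parent_id, parent_id + 1, parent_id + 3);
	return 0;
}
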
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.h       |   6 +
 drivers/gpu/drm/i915/gt/intel_reset.c         |   3 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 222 ++++++++++++------
 .../i915/gt/uc/intel_guc_submission_types.h   |  10 +
 5 files changed, 179 insertions(+), 69 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index c208691fc87d..7ce3b3d2edb7 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -54,6 +54,12 @@ static inline bool intel_context_is_parent(struct intel_context *ce)
 	return !!ce->guc_number_children;
 }
 
+static inline struct intel_context *
+intel_context_to_parent(struct intel_context *ce)
+{
+	return intel_context_is_child(ce) ? ce->parent : ce;
+}
+
 void intel_context_bind_parent_child(struct intel_context *parent,
 				     struct intel_context *child);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index ea763138197f..c3d4baa1b2b8 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -849,6 +849,7 @@ static void reset_finish(struct intel_gt *gt, intel_engine_mask_t awake)
 
 static void nop_submit_request(struct i915_request *request)
 {
+	struct intel_context *ce = intel_context_to_parent(request->context);
 	RQ_TRACE(request, "-EIO\n");
 
 	/*
@@ -857,7 +858,7 @@ static void nop_submit_request(struct i915_request *request)
 	 * this for now.
 	 */
 	if (intel_engine_uses_guc(request->engine))
-		intel_guc_decr_num_rq_not_ready(request->context);
+		intel_guc_decr_num_rq_not_ready(ce);
 
 	request = i915_request_mark_eio(request);
 	if (request) {
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index c0c60ccabfa4..30a0f364db8f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -24,6 +24,7 @@ struct __guc_ads_blob;
 
 enum {
 	GUC_SUBMIT_ENGINE_SINGLE_LRC,
+	GUC_SUBMIT_ENGINE_MULTI_LRC,
 	GUC_SUBMIT_ENGINE_MAX
 };
 
@@ -59,8 +60,10 @@ struct intel_guc {
 	struct ida guc_ids;
 	u32 num_guc_ids;
 	u32 max_guc_ids;
-	struct list_head guc_id_list_no_ref;
-	struct list_head guc_id_list_unpinned;
+	unsigned long *guc_ids_bitmap;
+#define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
+	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
+	struct list_head guc_id_list_unpinned[MAX_GUC_ID_ORDER + 1];
 
 	spinlock_t destroy_lock;	/* protects list / worker */
 	struct list_head destroyed_contexts;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f23dd716723f..afb9b4bb8971 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -169,6 +169,15 @@ static void clr_guc_ids_exhausted(struct guc_submit_engine *gse)
 	clear_bit(GSE_STATE_GUC_IDS_EXHAUSTED, &gse->flags);
 }
 
+/*
+ * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
+ * and a different allocation algorithm is used (bitmap vs. ida). We believe the
+ * number of multi-lrc contexts in use should be low and 1/16 should be
+ * sufficient. Minimum of 32 ids for multi-lrc.
+ */
+#define NUMBER_MULTI_LRC_GUC_ID(guc) \
+	((guc)->num_guc_ids / 16 > 32 ? (guc)->num_guc_ids / 16 : 32)
+
 /*
  * Below is a set of functions which control the GuC scheduling state which do
  * not require a lock as all state transitions are mutually exclusive. i.e. It
@@ -405,16 +414,10 @@ static inline void decr_context_blocked(struct intel_context *ce)
 	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
 }
 
-static inline struct intel_context *
-to_parent(struct intel_context *ce)
-{
-	return intel_context_is_child(ce) ? ce->parent : ce;
-}
-
 static inline struct intel_context *
 request_to_scheduling_context(struct i915_request *rq)
 {
-	return to_parent(rq->context);
+	return intel_context_to_parent(rq->context);
 }
 
 static inline bool context_guc_id_invalid(struct intel_context *ce)
@@ -1436,7 +1439,7 @@ static void destroy_worker_func(struct work_struct *w);
  */
 int intel_guc_submission_init(struct intel_guc *guc)
 {
-	int ret;
+	int ret, i;
 
 	if (guc_submission_initialized(guc))
 		return 0;
@@ -1448,9 +1451,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
 
 	spin_lock_init(&guc->contexts_lock);
-	INIT_LIST_HEAD(&guc->guc_id_list_no_ref);
-	INIT_LIST_HEAD(&guc->guc_id_list_unpinned);
+	for (i = 0; i < MAX_GUC_ID_ORDER + 1; ++i) {
+		INIT_LIST_HEAD(&guc->guc_id_list_no_ref[i]);
+		INIT_LIST_HEAD(&guc->guc_id_list_unpinned[i]);
+	}
 	ida_init(&guc->guc_ids);
+	guc->guc_ids_bitmap =
+		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
 
 	spin_lock_init(&guc->destroy_lock);
 
@@ -1476,6 +1483,8 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 
 		i915_sched_engine_put(sched_engine);
 	}
+
+	bitmap_free(guc->guc_ids_bitmap);
 }
 
 static inline void queue_request(struct i915_sched_engine *sched_engine,
@@ -1499,11 +1508,13 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
 static bool too_many_guc_ids_not_ready(struct guc_submit_engine *gse,
 				       struct intel_context *ce)
 {
-	u32 available_guc_ids, guc_ids_consumed;
 	struct intel_guc *guc = gse->sched_engine.private_data;
+	u32 available_guc_ids = intel_context_is_parent(ce) ?
+		NUMBER_MULTI_LRC_GUC_ID(guc) :
+		guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
+	u32 guc_ids_consumed = atomic_read(&gse->num_guc_ids_not_ready);
 
-	available_guc_ids = guc->num_guc_ids;
-	guc_ids_consumed = atomic_read(&gse->num_guc_ids_not_ready);
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	if (TOO_MANY_GUC_IDS_NOT_READY(available_guc_ids, guc_ids_consumed)) {
 		set_and_update_guc_ids_exhausted(gse);
@@ -1517,17 +1528,26 @@ static void incr_num_rq_not_ready(struct intel_context *ce)
 {
 	struct guc_submit_engine *gse = ce_to_gse(ce);
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+	GEM_BUG_ON(!intel_context_is_parent(ce) &&
+		   ce->guc_number_children);
+
 	if (!atomic_fetch_add(1, &ce->guc_num_rq_not_ready))
-		atomic_inc(&gse->num_guc_ids_not_ready);
+		atomic_add(ce->guc_number_children + 1,
+			   &gse->num_guc_ids_not_ready);
 }
 
 void intel_guc_decr_num_rq_not_ready(struct intel_context *ce)
 {
 	struct guc_submit_engine *gse = ce_to_gse(ce);
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	if (atomic_fetch_add(-1, &ce->guc_num_rq_not_ready) == 1) {
 		GEM_BUG_ON(!atomic_read(&gse->num_guc_ids_not_ready));
-		atomic_dec(&gse->num_guc_ids_not_ready);
+
+		atomic_sub(ce->guc_number_children + 1,
+			   &gse->num_guc_ids_not_ready);
 	}
 }
 
@@ -1579,20 +1599,42 @@ static void guc_submit_request(struct i915_request *rq)
 
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 
-	intel_guc_decr_num_rq_not_ready(rq->context);
+	intel_guc_decr_num_rq_not_ready(request_to_scheduling_context(rq));
 }
 
-static int new_guc_id(struct intel_guc *guc)
+static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
-	return ida_simple_get(&guc->guc_ids, 0,
-			      guc->num_guc_ids, GFP_KERNEL |
-			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+	int ret;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	if (intel_context_is_parent(ce))
+		ret = bitmap_find_free_region(guc->guc_ids_bitmap,
+					      NUMBER_MULTI_LRC_GUC_ID(guc),
+					      order_base_2(ce->guc_number_children
+							   + 1));
+	else
+		ret = ida_simple_get(&guc->guc_ids,
+				     NUMBER_MULTI_LRC_GUC_ID(guc),
+				     guc->num_guc_ids, GFP_KERNEL |
+				     __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+	if (unlikely(ret < 0))
+		return ret;
+
+	ce->guc_id = ret;
+	return 0;
 }
 
 static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
+	GEM_BUG_ON(intel_context_is_child(ce));
 	if (!context_guc_id_invalid(ce)) {
-		ida_simple_remove(&guc->guc_ids, ce->guc_id);
+		if (intel_context_is_parent(ce))
+			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
+					      order_base_2(ce->guc_number_children
+							   + 1));
+		else
+			ida_simple_remove(&guc->guc_ids, ce->guc_id);
 		clr_lrc_desc_registered(guc, ce->guc_id);
 		set_context_guc_id_invalid(ce);
 	}
@@ -1604,6 +1646,8 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	unsigned long flags;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	spin_lock_irqsave(&guc->contexts_lock, flags);
 	__release_guc_id(guc, ce);
 	spin_unlock_irqrestore(&guc->contexts_lock, flags);
@@ -1618,54 +1662,93 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
  * schedule disable H2G + a deregister H2G.
  */
 static struct list_head *get_guc_id_list(struct intel_guc *guc,
+					 u8 number_children,
 					 bool unpinned)
 {
+	GEM_BUG_ON(order_base_2(number_children + 1) > MAX_GUC_ID_ORDER);
+
 	if (unpinned)
-		return &guc->guc_id_list_unpinned;
+		return &guc->guc_id_list_unpinned[order_base_2(number_children + 1)];
 	else
-		return &guc->guc_id_list_no_ref;
+		return &guc->guc_id_list_no_ref[order_base_2(number_children + 1)];
 }
 
-static int steal_guc_id(struct intel_guc *guc, bool unpinned)
+static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
+			bool unpinned)
 {
-	struct intel_context *ce;
-	int guc_id;
-	struct list_head *guc_id_list = get_guc_id_list(guc, unpinned);
+	struct intel_context *cn;
+	u8 number_children = ce->guc_number_children;
 
 	lockdep_assert_held(&guc->contexts_lock);
+	GEM_BUG_ON(intel_context_is_child(ce));
 
-	if (!list_empty(guc_id_list)) {
-		ce = list_first_entry(guc_id_list,
-				      struct intel_context,
-				      guc_id_link);
+	do {
+		struct list_head *guc_id_list =
+			get_guc_id_list(guc, number_children, unpinned);
 
-		/* Ensure context getting stolen in expected state */
-		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
-		GEM_BUG_ON(context_guc_id_invalid(ce));
-		GEM_BUG_ON(context_guc_id_stolen(ce));
+		if (!list_empty(guc_id_list)) {
+			u8 cn_o2, ce_o2 =
+				order_base_2(ce->guc_number_children + 1);
 
-		list_del_init(&ce->guc_id_link);
-		guc_id = ce->guc_id;
-		clr_context_registered(ce);
+			cn = list_first_entry(guc_id_list,
+					      struct intel_context,
+					      guc_id_link);
+			cn_o2 = order_base_2(cn->guc_number_children + 1);
+
+			/*
+			 * Corner case where a multi-lrc context steals a guc_id
+			 * from another context that has more guc_ids than itself.
+			 */
+			if (cn_o2 != ce_o2) {
+				bitmap_release_region(guc->guc_ids_bitmap,
+						      cn->guc_id,
+						      cn_o2);
+				bitmap_allocate_region(guc->guc_ids_bitmap,
+						       ce->guc_id,
+						       ce_o2);
+			}
+
+			/* Ensure context getting stolen in expected state */
+			GEM_BUG_ON(atomic_read(&cn->guc_id_ref));
+			GEM_BUG_ON(context_guc_id_invalid(cn));
+			GEM_BUG_ON(context_guc_id_stolen(cn));
+			GEM_BUG_ON(ce_to_gse(ce) != ce_to_gse(cn));
+
+			list_del_init(&cn->guc_id_link);
+			ce->guc_id = cn->guc_id;
+
+			/*
+			 * If stealing from the pinned list, defer invalidating
+			 * the guc_id until the retire workqueue processes this
+			 * context.
+			 */
+			clr_context_registered(cn);
+			if (!unpinned) {
+				GEM_BUG_ON(ce_to_gse(cn)->stalled_context);
+				ce_to_gse(cn)->stalled_context =
+					intel_context_get(cn);
+				set_context_guc_id_stolen(cn);
+			} else {
+				set_context_guc_id_invalid(cn);
+			}
+
+			return 0;
+		}
 
 		/*
-		 * If stealing from the pinned list, defer invalidating
-		 * the guc_id until the retire workqueue processes this
-		 * context.
+		 * When using multi-lrc we search the guc_id_lists with the
+		 * least number of guc_ids required first but will consume a
+		 * larger block of guc_ids if necessary. 2x the children always
+		 * moves you to the next list.
 		 */
-		if (!unpinned) {
-			GEM_BUG_ON(ce_to_gse(ce)->stalled_context);
+		if (!number_children ||
+		    order_base_2(number_children + 1) == MAX_GUC_ID_ORDER)
+			break;
 
-			ce_to_gse(ce)->stalled_context = intel_context_get(ce);
-			set_context_guc_id_stolen(ce);
-		} else {
-			set_context_guc_id_invalid(ce);
-		}
+		number_children *= 2;
+	} while (true);
 
-		return guc_id;
-	} else {
-		return -EAGAIN;
-	}
+	return -EAGAIN;
 }
 
 enum {	/* Return values for pin_guc_id / assign_guc_id */
@@ -1674,17 +1757,18 @@ enum {	/* Return values for pin_guc_id / assign_guc_id */
 	NEW_GUC_ID_ENABLED	= 2,
 };
 
-static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
+static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce,
+			 bool tasklet)
 {
 	int ret;
 
 	lockdep_assert_held(&guc->contexts_lock);
+	GEM_BUG_ON(intel_context_is_child(ce));
 
-	ret = new_guc_id(guc);
+	ret = new_guc_id(guc, ce);
 	if (unlikely(ret < 0)) {
-		ret = steal_guc_id(guc, true);
-		if (ret >= 0) {
-			*out = ret;
+		ret = steal_guc_id(guc, ce, true);
+		if (!ret) {
 			ret = NEW_GUC_ID_DISABLED;
 		} else if (ret < 0 && tasklet) {
 			/*
@@ -1692,15 +1776,18 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
 			 * enabled if guc_ids are exhausted and we are submitting
 			 * from the tasklet.
 			 */
-			ret = steal_guc_id(guc, false);
-			if (ret >= 0) {
-				*out = ret;
+			ret = steal_guc_id(guc, ce, false);
+			if (!ret)
 				ret = NEW_GUC_ID_ENABLED;
-			}
 		}
-	} else {
-		*out = ret;
-		ret = SAME_GUC_ID;
+	}
+
+	if (!(ret < 0) && intel_context_is_parent(ce)) {
+		struct intel_context *child;
+		int i = 1;
+
+		for_each_child(ce, child)
+			child->guc_id = ce->guc_id + i++;
 	}
 
 	return ret;
@@ -1713,6 +1800,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
 	int ret = 0;
 	unsigned long flags, tries = PIN_GUC_ID_TRIES;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
 
 try_again:
@@ -1724,7 +1812,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
 	}
 
 	if (context_guc_id_invalid(ce)) {
-		ret = assign_guc_id(guc, &ce->guc_id, tasklet);
+		ret = assign_guc_id(guc, ce, tasklet);
 		if (unlikely(ret < 0))
 			goto out_unlock;
 	}
@@ -1770,6 +1858,7 @@ static void unpin_guc_id(struct intel_guc *guc,
 	unsigned long flags;
 
 	GEM_BUG_ON(atomic_read(&ce->guc_id_ref) < 0);
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	if (unlikely(context_guc_id_invalid(ce)))
 		return;
@@ -1781,7 +1870,8 @@ static void unpin_guc_id(struct intel_guc *guc,
 
 	if (!context_guc_id_invalid(ce) && !context_guc_id_stolen(ce) &&
 	    !atomic_read(&ce->guc_id_ref)) {
-		struct list_head *head = get_guc_id_list(guc, unpinned);
+		struct list_head *head =
+			get_guc_id_list(guc, ce->guc_number_children, unpinned);
 
 		list_add_tail(&ce->guc_id_link, head);
 	}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
index 7069b7248f55..a5933e07bdd2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
@@ -22,6 +22,16 @@ struct guc_virtual_engine {
 /*
  * Object which encapsulates the globally operated on i915_sched_engine +
  * the GuC submission state machine described in intel_guc_submission.c.
+ *
+ * Currently we have two instances of these per GuC. One for single-lrc and one
+ * for multi-lrc submission. We split these into two submission engines as they
+ * can operate in parallel, so that a blocking condition on one does not affect
+ * the other. i.e. guc_ids are statically allocated between these two submission
+ * modes. One mode may have its guc_ids exhausted, which requires blocking, while
+ * the other has plenty of guc_ids and can make forward progress.
+ *
+ * In the future if different submission use cases arise we can simply
+ * instantiate another of these objects and assign it to the context.
  */
 struct guc_submit_engine {
 	struct i915_sched_engine sched_engine;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (18 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 19/46] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 15:35   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 21/46] drm/i915/guc: Add guc_child_context_destroy Matthew Brost
                   ` (30 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

The heartbeat uses a single instance of a GuC submit engine (GSE) to do
the hang check. As such, if a different GSE's state machine hangs, the
heartbeat cannot detect this hang. Add a timer to each GSE which in turn
can disable all submissions if that GSE is hung.

Cc: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++++
 .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
 2 files changed, 39 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index afb9b4bb8971..2d8296bcc583 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine *gse)
 	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
+/* 2 seconds seems like a reasonable timeout waiting for a G2H */
+#define MAX_TASKLET_BLOCKED_NS	2000000000
 static void set_tasklet_blocked(struct guc_submit_engine *gse)
 {
 	lockdep_assert_held(&gse->sched_engine.lock);
+	hrtimer_start_range_ns(&gse->hang_timer,
+			       ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
+			       HRTIMER_MODE_REL_PINNED);
 	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
 static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
 {
 	lockdep_assert_held(&gse->sched_engine.lock);
+	hrtimer_cancel(&gse->hang_timer);
 	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
@@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc *guc)
 		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
 			GEM_BUG_ON(!guc->ct.enabled);
 			__tasklet_disable_sync_once(&sched_engine->tasklet);
+			hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
 			sched_engine->tasklet.callback = NULL;
 		}
 	}
@@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref *kref)
 	kfree(gse);
 }
 
+static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
+{
+	struct guc_submit_engine *gse =
+		container_of(hrtimer, struct guc_submit_engine, hang_timer);
+	struct intel_guc *guc = gse->sched_engine.private_data;
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+	if (guc->gse_hang_expected)
+		drm_dbg(&guc_to_gt(guc)->i915->drm,
+			"GSE[%i] hung, disabling submission", gse->id);
+	else
+		drm_err(&guc_to_gt(guc)->i915->drm,
+			"GSE[%i] hung, disabling submission", gse->id);
+#else
+	drm_err(&guc_to_gt(guc)->i915->drm,
+		"GSE[%i] hung, disabling submission", gse->id);
+#endif
+
+	/*
+	 * Tasklet not making forward progress, disable submission which in turn
+	 * will kick in the heartbeat to do a full GPU reset.
+	 */
+	disable_submission(guc);
+
+	return HRTIMER_NORESTART;
+}
+
 static void guc_submit_engine_init(struct intel_guc *guc,
 				   struct guc_submit_engine *gse,
 				   int id)
@@ -3767,6 +3801,8 @@ static void guc_submit_engine_init(struct intel_guc *guc,
 	sched_engine->retire_inflight_request_prio =
 		guc_retire_inflight_request_prio;
 	sched_engine->private_data = guc;
+	hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	gse->hang_timer.function = gse_hang;
 	gse->id = id;
 }
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
index a5933e07bdd2..eae2e9725ede 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
@@ -6,6 +6,8 @@
 #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
 #define _INTEL_GUC_SUBMISSION_TYPES_H_
 
+#include <linux/xarray.h>
+
 #include "gt/intel_engine_types.h"
 #include "gt/intel_context_types.h"
 #include "i915_scheduler_types.h"
@@ -41,6 +43,7 @@ struct guc_submit_engine {
 	unsigned long flags;
 	int total_num_rq_with_no_guc_id;
 	atomic_t num_guc_ids_not_ready;
+	struct hrtimer hang_timer;
 	int id;
 
 	/*
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 21/46] drm/i915/guc: Add guc_child_context_destroy
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (19 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 15:36   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 22/46] drm/i915/guc: Implement multi-lrc submission Matthew Brost
                   ` (29 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Since child contexts do not own the guc_ids or GuC context registration,
child contexts can simply be freed on destroy. Add a
guc_child_context_destroy context operation to do this.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 2d8296bcc583..850edeff9230 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2828,6 +2828,13 @@ static void destroy_worker_func(struct work_struct *w)
 		intel_gt_pm_unpark_work_add(gt, destroy_worker);
 }
 
+/* Future patches will use this function */
+__maybe_unused
+static void guc_child_context_destroy(struct kref *kref)
+{
+	__guc_context_destroy(container_of(kref, struct intel_context, ref));
+}
+
 static void guc_context_destroy(struct kref *kref)
 {
 	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 22/46] drm/i915/guc: Implement multi-lrc submission
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (20 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 21/46] drm/i915/guc: Add guc_child_context_destroy Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 23/46] drm/i915/guc: Insert submit fences between requests in parent-child relationship Matthew Brost
                   ` (28 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Implement multi-lrc submission via a single workqueue entry and a single
H2G. The workqueue entry contains an updated tail value for each context
in the multi-lrc submission, so all of the ring tails are updated
simultaneously. As such, the tasklet and bypass path have been updated to
coalesce requests into a single submission.

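To make the work queue item layout concrete, here is a small standalone
userspace model. It is not the driver code: the constants and the byte-offset
head/tail handling are simplified stand-ins for GUC_WQ_SIZE, the WQ_TYPE_*
encoding, and the CIRC_SPACE() check used in the diff below:

/*
 * Toy userspace model of a multi-lrc work queue item: one item carries the
 * parent's ring tail plus one tail per child, and a NOOP item pads out the
 * queue whenever an item would wrap past the end.
 */
#include <stdio.h>
#include <stdint.h>

#define WQ_SIZE			0x1000	/* stand-in for GUC_WQ_SIZE */
#define WQ_LEN_SHIFT		16
#define WQ_TYPE_NOOP		0x4
#define WQ_TYPE_MULTI_LRC	0x5
#define WQ_RING_TAIL_SHIFT	18

static uint32_t wq[WQ_SIZE / sizeof(uint32_t)];
static uint32_t head, tail;		/* byte offsets into the queue */

static uint32_t space_until_wrap(void) { return WQ_SIZE - tail; }

static uint32_t circ_space(void)	/* CIRC_SPACE()-style free space */
{
	return (head - tail - 1) & (WQ_SIZE - 1);
}

static int emit_multi_lrc_wqi(uint32_t guc_id, uint32_t parent_tail,
			      const uint32_t *child_tails, int nr_children)
{
	uint32_t wqi_size = (nr_children + 4) * sizeof(uint32_t);
	uint32_t *wqi;

	/* Pad to the start of the queue if this item would wrap. */
	if (wqi_size > space_until_wrap()) {
		wq[tail / sizeof(uint32_t)] = WQ_TYPE_NOOP |
			((space_until_wrap() / sizeof(uint32_t) - 1) << WQ_LEN_SHIFT);
		tail = 0;
	}

	if (wqi_size > circ_space())
		return -1;		/* no space; caller retries later */

	wqi = &wq[tail / sizeof(uint32_t)];
	*wqi++ = WQ_TYPE_MULTI_LRC | ((wqi_size / sizeof(uint32_t) - 1) << WQ_LEN_SHIFT);
	*wqi++ = 0;			/* parent lrca (omitted in the toy) */
	*wqi++ = guc_id | (parent_tail << WQ_RING_TAIL_SHIFT);
	*wqi++ = 0;			/* fence_id */
	for (int i = 0; i < nr_children; i++)
		*wqi++ = child_tails[i];

	tail = (tail + wqi_size) & (WQ_SIZE - 1);
	return 0;
}

int main(void)
{
	uint32_t child_tails[] = { 0x20, 0x30 };	/* ring tails in qwords */

	emit_multi_lrc_wqi(1, 0x10, child_tails, 2);
	printf("work queue tail now at %u bytes\n", tail);
	return 0;
}
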
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   6 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 224 +++++++++++++++++-
 drivers/gpu/drm/i915/i915_request.h           |   8 +
 3 files changed, 223 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index e08fbd40281c..6910a0cdb8c8 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -64,12 +64,14 @@
 #define   WQ_TYPE_PSEUDO		(0x2 << WQ_TYPE_SHIFT)
 #define   WQ_TYPE_INORDER		(0x3 << WQ_TYPE_SHIFT)
 #define   WQ_TYPE_NOOP			(0x4 << WQ_TYPE_SHIFT)
-#define WQ_TARGET_SHIFT			10
+#define   WQ_TYPE_MULTI_LRC		(0x5 << WQ_TYPE_SHIFT)
+#define WQ_TARGET_SHIFT			8
 #define WQ_LEN_SHIFT			16
 #define WQ_NO_WCFLUSH_WAIT		(1 << 27)
 #define WQ_PRESENT_WORKLOAD		(1 << 28)
 
-#define WQ_RING_TAIL_SHIFT		20
+#define WQ_GUC_ID_SHIFT			0
+#define WQ_RING_TAIL_SHIFT		18
 #define WQ_RING_TAIL_MAX		0x7FF	/* 2^11 QWords */
 #define WQ_RING_TAIL_MASK		(WQ_RING_TAIL_MAX << WQ_RING_TAIL_SHIFT)
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 850edeff9230..d1d4a1e59e8d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -493,6 +493,29 @@ __get_process_desc(struct intel_context *ce)
 		   LRC_STATE_OFFSET) / sizeof(u32)));
 }
 
+static inline u32 *get_wq_pointer(struct guc_process_desc *desc,
+				  struct intel_context *ce,
+				  u32 wqi_size)
+{
+	/*
+	 * Check for space in the work queue. We cache the head pointer in the
+	 * intel_context structure in order to reduce the number of accesses to
+	 * shared GPU memory, which may be across a PCIe bus.
+	 */
+#define AVAILABLE_SPACE	\
+	CIRC_SPACE(ce->guc_wqi_tail, ce->guc_wqi_head, GUC_WQ_SIZE)
+	if (wqi_size > AVAILABLE_SPACE) {
+		ce->guc_wqi_head = READ_ONCE(desc->head);
+
+		if (wqi_size > AVAILABLE_SPACE)
+			return NULL;
+	}
+#undef AVAILABLE_SPACE
+
+	return ((u32 *)__get_process_desc(ce)) +
+		((WQ_OFFSET + ce->guc_wqi_tail) / sizeof(u32));
+}
+
 static u32 __get_lrc_desc_offset(struct intel_guc *guc, int index)
 {
 	GEM_BUG_ON(index >= guc->lrcd_reg.max_idx);
@@ -643,7 +666,7 @@ static inline bool request_has_no_guc_id(struct i915_request *rq)
 static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 {
 	int err = 0;
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u32 action[3];
 	int len = 0;
 	u32 g2h_len_dw = 0;
@@ -697,6 +720,18 @@ static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
 		trace_intel_context_sched_enable(ce);
 		atomic_inc(&guc->outstanding_submission_g2h);
 		set_context_enabled(ce);
+
+		/*
+		 * Without multi-lrc, the KMD does the submission step (moving the
+		 * lrc tail) so enabling scheduling is sufficient to submit the
+		 * context. This isn't the case in multi-lrc submission as the
+		 * GuC needs to move the tails, hence the need for another H2G
+		 * to submit a multi-lrc context after enabling scheduling.
+		 */
+		if (intel_context_is_parent(ce)) {
+			action[0] = INTEL_GUC_ACTION_SCHED_CONTEXT;
+			err = intel_guc_send_nb(guc, action, len - 1, 0);
+		}
 	} else if (!enabled) {
 		clr_context_pending_enable(ce);
 		intel_context_put(ce);
@@ -783,6 +818,132 @@ static inline int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
+static inline bool is_multi_lrc_rq(struct i915_request *rq)
+{
+	return intel_context_is_child(rq->context) ||
+		intel_context_is_parent(rq->context);
+}
+
+/*
+ * Multi-lrc requests are not submitted to the GuC until all requests in
+ * the set are ready. With the exception of the last request in the set,
+ * submitting a multi-lrc request is therefore just a status update on
+ * the driver-side and can be safely merged with other requests. When the
+ * last multi-lrc request in a set is detected, we break out of the
+ * submission loop and submit the whole set, thus we never attempt to
+ * merge that one with other requests.
+ */
+static inline bool can_merge_rq(struct i915_request *rq,
+				struct i915_request *last)
+{
+	return is_multi_lrc_rq(last) || rq->context == last->context;
+}
+
+static inline u32 wq_space_until_wrap(struct intel_context *ce)
+{
+	return (GUC_WQ_SIZE - ce->guc_wqi_tail);
+}
+
+static inline void write_wqi(struct guc_process_desc *desc,
+			     struct intel_context *ce,
+			     u32 wqi_size)
+{
+	ce->guc_wqi_tail = (ce->guc_wqi_tail + wqi_size) & (GUC_WQ_SIZE - 1);
+	WRITE_ONCE(desc->tail, ce->guc_wqi_tail);
+}
+
+static inline int guc_wq_noop_append(struct intel_context *ce)
+{
+	struct guc_process_desc *desc = __get_process_desc(ce);
+	u32 *wqi = get_wq_pointer(desc, ce, wq_space_until_wrap(ce));
+
+	if (!wqi)
+		return -EBUSY;
+
+	*wqi = WQ_TYPE_NOOP |
+		((wq_space_until_wrap(ce) / sizeof(u32) - 1) << WQ_LEN_SHIFT);
+	ce->guc_wqi_tail = 0;
+
+	return 0;
+}
+
+static int __guc_wq_item_append(struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+	struct intel_context *child;
+	struct guc_process_desc *desc = __get_process_desc(ce);
+	unsigned int wqi_size = (ce->guc_number_children + 4) *
+		sizeof(u32);
+	u32 *wqi;
+	int ret;
+
+	/* Ensure context is in correct state updating work queue */
+	GEM_BUG_ON(ce->guc_num_rq_submit_no_id);
+	GEM_BUG_ON(request_has_no_guc_id(rq));
+	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
+	GEM_BUG_ON(context_guc_id_invalid(ce));
+	GEM_BUG_ON(context_pending_disable(ce));
+	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
+
+	/* Insert NOOP if this work queue item will wrap the tail pointer. */
+	if (wqi_size > wq_space_until_wrap(ce)) {
+		ret = guc_wq_noop_append(ce);
+		if (ret)
+			return ret;
+	}
+
+	wqi = get_wq_pointer(desc, ce, wqi_size);
+	if (!wqi)
+		return -EBUSY;
+
+	*wqi++ = WQ_TYPE_MULTI_LRC |
+		((wqi_size / sizeof(u32) - 1) << WQ_LEN_SHIFT);
+	*wqi++ = ce->lrc.lrca;
+	*wqi++ = (ce->guc_id << WQ_GUC_ID_SHIFT) |
+		 ((ce->ring->tail / sizeof(u64)) << WQ_RING_TAIL_SHIFT);
+	*wqi++ = 0;	/* fence_id */
+	for_each_child(ce, child)
+		*wqi++ = child->ring->tail / sizeof(u64);
+
+	write_wqi(desc, ce, wqi_size);
+
+	return 0;
+}
+
+static int gse_wq_item_append(struct guc_submit_engine *gse,
+			      struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+	int ret = 0;
+
+	if (likely(!intel_context_is_banned(ce))) {
+		ret = __guc_wq_item_append(rq);
+
+		if (unlikely(ret == -EBUSY)) {
+			gse->stalled_rq = rq;
+			gse->submission_stall_reason = STALL_MOVE_LRC_TAIL;
+		}
+	}
+
+	return ret;
+}
+
+static inline bool multi_lrc_submit(struct i915_request *rq)
+{
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	intel_ring_set_tail(rq->ring, rq->tail);
+
+	/*
+	 * We expect the front end (execbuf IOCTL) to set this flag on the last
+	 * request generated from a multi-BB submission. This indicates to the
+	 * backend (GuC interface) that we should submit this context, thus
+	 * submitting all the requests generated in parallel.
+	 */
+	return test_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL, &rq->fence.flags) ||
+		intel_context_is_banned(ce);
+}
+
 static void kick_retire_wq(struct guc_submit_engine *gse)
 {
 	queue_work(system_unbound_wq, &gse->retire_worker);
@@ -826,7 +987,7 @@ static int gse_dequeue_one_context(struct guc_submit_engine *gse)
 			struct i915_request *rq, *rn;
 
 			priolist_for_each_request_consume(rq, rn, p) {
-				if (last && rq->context != last->context)
+				if (last && !can_merge_rq(rq, last))
 					goto done;
 
 				list_del_init(&rq->sched.link);
@@ -835,7 +996,22 @@ static int gse_dequeue_one_context(struct guc_submit_engine *gse)
 
 				trace_i915_request_in(rq, 0);
 				last = rq;
-				submit = true;
+
+				if (is_multi_lrc_rq(rq)) {
+					/*
+					 * We need to coalesce all multi-lrc
+					 * requests in a relationship into a
+					 * single H2G. We are guaranteed that
+					 * all of these requests will be
+					 * submitted sequentially.
+					 */
+					if (multi_lrc_submit(rq)) {
+						submit = true;
+						goto done;
+					}
+				} else {
+					submit = true;
+				}
 			}
 
 			rb_erase_cached(&p->node, &sched_engine->queue);
@@ -845,7 +1021,7 @@ static int gse_dequeue_one_context(struct guc_submit_engine *gse)
 
 done:
 	if (submit) {
-		struct intel_context *ce = last->context;
+		struct intel_context *ce = request_to_scheduling_context(last);
 
 		if (ce->guc_num_rq_submit_no_id) {
 			ret = tasklet_pin_guc_id(gse, last);
@@ -867,7 +1043,17 @@ static int gse_dequeue_one_context(struct guc_submit_engine *gse)
 		}
 
 move_lrc_tail:
-		guc_set_lrc_tail(last);
+		if (is_multi_lrc_rq(last)) {
+			ret = gse_wq_item_append(gse, last);
+			if (ret == -EBUSY) {
+				goto schedule_tasklet;
+			} else if (ret != 0) {
+				GEM_WARN_ON(ret);	/* Unexpected */
+				goto deadlk;
+			}
+		} else {
+			guc_set_lrc_tail(last);
+		}
 
 add_request:
 		ret = gse_add_request(gse, last);
@@ -1575,14 +1761,22 @@ static bool need_tasklet(struct guc_submit_engine *gse, struct intel_context *ce
 static int gse_bypass_tasklet_submit(struct guc_submit_engine *gse,
 				     struct i915_request *rq)
 {
-	int ret;
+	int ret = 0;
 
 	__i915_request_submit(rq);
 
 	trace_i915_request_in(rq, 0);
 
-	guc_set_lrc_tail(rq);
-	ret = gse_add_request(gse, rq);
+	if (is_multi_lrc_rq(rq)) {
+		if (multi_lrc_submit(rq)) {
+			ret = gse_wq_item_append(gse, rq);
+			if (!ret)
+				ret = gse_add_request(gse, rq);
+		}
+	} else {
+		guc_set_lrc_tail(rq);
+		ret = gse_add_request(gse, rq);
+	}
 
 	if (unlikely(ret == -EPIPE))
 		disable_submission(gse->sched_engine.private_data);
@@ -1599,7 +1793,7 @@ static void guc_submit_request(struct i915_request *rq)
 	/* Will be called from irq-context when using foreign fences. */
 	spin_lock_irqsave(&sched_engine->lock, flags);
 
-	if (need_tasklet(gse, rq->context))
+	if (need_tasklet(gse, request_to_scheduling_context(rq)))
 		queue_request(sched_engine, rq, rq_prio(rq));
 	else if (gse_bypass_tasklet_submit(gse, rq) == -EBUSY)
 		kick_tasklet(gse);
@@ -2988,9 +3182,10 @@ static inline bool new_guc_prio_higher(u8 old_guc_prio, u8 new_guc_prio)
 
 static void add_to_context(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 	u8 new_guc_prio = map_i915_prio_to_guc_prio(rq_prio(rq));
 
+	GEM_BUG_ON(intel_context_is_child(ce));
 	GEM_BUG_ON(rq->guc_prio == GUC_PRIO_FINI);
 
 	spin_lock(&ce->guc_active.lock);
@@ -3026,7 +3221,9 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
 
 static void remove_from_context(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
+
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	spin_lock_irq(&ce->guc_active.lock);
 
@@ -3231,7 +3428,8 @@ static int tasklet_pin_guc_id(struct guc_submit_engine *gse,
 	GEM_BUG_ON(gse->total_num_rq_with_no_guc_id < 0);
 
 	list_for_each_entry_reverse(rq, &ce->guc_active.requests, sched.link)
-		if (request_has_no_guc_id(rq)) {
+		if (request_has_no_guc_id(rq) &&
+		    request_to_scheduling_context(rq) == ce) {
 			--ce->guc_num_rq_submit_no_id;
 			clear_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED,
 				  &rq->fence.flags);
@@ -3551,7 +3749,7 @@ static void guc_bump_inflight_request_prio(struct i915_request *rq,
 
 static void guc_retire_inflight_request_prio(struct i915_request *rq)
 {
-	struct intel_context *ce = rq->context;
+	struct intel_context *ce = request_to_scheduling_context(rq);
 
 	spin_lock(&ce->guc_active.lock);
 	guc_prio_fini(rq, ce);
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 807f76750cf4..d6d5bf0a5eb5 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -145,6 +145,14 @@ enum {
 	 * tasklet that the guc_id isn't pinned.
 	 */
 	I915_FENCE_FLAG_GUC_ID_NOT_PINNED,
+
+	/*
+	 * I915_FENCE_FLAG_SUBMIT_PARALLEL - request with a context in a
+	 * parent-child relationship (parallel submission, multi-lrc) should
+	 * trigger a submission to the GuC rather than just moving the context
+	 * tail.
+	 */
+	I915_FENCE_FLAG_SUBMIT_PARALLEL,
 };
 
 /**
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 23/46] drm/i915/guc: Insert submit fences between requests in parent-child relationship
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (21 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 22/46] drm/i915/guc: Implement multi-lrc submission Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 16:32   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 24/46] drm/i915/guc: Implement multi-lrc reset Matthew Brost
                   ` (27 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

For contexts in a parent-child relationship to function correctly, the
GuC must receive requests in the order they were submitted. To ensure
this, insert a submit fence between the current request and the last
request submitted for requests / contexts in a parent-child relationship.
This is conceptually similar to a single timeline.

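The ordering rule can be pictured with a tiny userspace model. The types and
names below are toys for this sketch, not the driver code: the parent keeps a
pointer to the last submitted request and every new request in the set waits
on it, forming a FIFO chain across the whole parent-child set.

/* Toy userspace model of chaining requests off the parent's last_rq. */
#include <stdio.h>
#include <stdlib.h>

struct request {
	int seqno;
	struct request *waits_on;	/* stand-in for the submit fence */
};

struct parent {
	struct request *last_rq;
};

static struct request *submit(struct parent *p, int seqno)
{
	struct request *rq = calloc(1, sizeof(*rq));

	rq->seqno = seqno;
	rq->waits_on = p->last_rq;	/* wait on the previously submitted request */
	p->last_rq = rq;
	return rq;
}

int main(void)
{
	struct parent p = { 0 };
	struct request *rq;

	submit(&p, 1);
	submit(&p, 2);
	rq = submit(&p, 3);

	for (; rq; rq = rq->waits_on)
		printf("request %d waits on %d\n", rq->seqno,
		       rq->waits_on ? rq->waits_on->seqno : 0);
	return 0;
}
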
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
 drivers/gpu/drm/i915/gt/intel_context.h       |   5 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   3 +-
 drivers/gpu/drm/i915/i915_request.c           | 120 ++++++++++++++----
 5 files changed, 105 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index bb4c14656067..98ef2d0f7a39 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -487,6 +487,8 @@ void intel_context_fini(struct intel_context *ce)
 {
 	struct intel_context *child, *next;
 
+	if (ce->last_rq)
+		i915_request_put(ce->last_rq);
 	if (ce->timeline)
 		intel_timeline_put(ce->timeline);
 	i915_vm_put(ce->vm);
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 7ce3b3d2edb7..a302599e436a 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -60,6 +60,11 @@ intel_context_to_parent(struct intel_context *ce)
 	return intel_context_is_child(ce) ? ce->parent : ce;
 }
 
+static inline bool intel_context_is_parallel(struct intel_context *ce)
+{
+	return intel_context_is_child(ce) || intel_context_is_parent(ce);
+}
+
 void intel_context_bind_parent_child(struct intel_context *parent,
 				     struct intel_context *child);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 9665cb31bab0..f4fc81f64921 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -225,6 +225,9 @@ struct intel_context {
 	 */
 	u8 guc_prio;
 	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
+
+	/* Last request submitted on a parent */
+	struct i915_request *last_rq;
 };
 
 #endif /* __INTEL_CONTEXT_TYPES__ */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index d1d4a1e59e8d..1cb382f7d79d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -820,8 +820,7 @@ static inline int rq_prio(const struct i915_request *rq)
 
 static inline bool is_multi_lrc_rq(struct i915_request *rq)
 {
-	return intel_context_is_child(rq->context) ||
-		intel_context_is_parent(rq->context);
+	return intel_context_is_parallel(rq->context);
 }
 
 /*
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index ce446716d092..2e51c8999088 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1546,36 +1546,62 @@ i915_request_await_object(struct i915_request *to,
 	return ret;
 }
 
+static inline bool is_parallel_rq(struct i915_request *rq)
+{
+	return intel_context_is_parallel(rq->context);
+}
+
+static inline struct intel_context *request_to_parent(struct i915_request *rq)
+{
+	return intel_context_to_parent(rq->context);
+}
+
 static struct i915_request *
-__i915_request_add_to_timeline(struct i915_request *rq)
+__i915_request_ensure_parallel_ordering(struct i915_request *rq,
+					struct intel_timeline *timeline)
 {
-	struct intel_timeline *timeline = i915_request_timeline(rq);
 	struct i915_request *prev;
 
-	/*
-	 * Dependency tracking and request ordering along the timeline
-	 * is special cased so that we can eliminate redundant ordering
-	 * operations while building the request (we know that the timeline
-	 * itself is ordered, and here we guarantee it).
-	 *
-	 * As we know we will need to emit tracking along the timeline,
-	 * we embed the hooks into our request struct -- at the cost of
-	 * having to have specialised no-allocation interfaces (which will
-	 * be beneficial elsewhere).
-	 *
-	 * A second benefit to open-coding i915_request_await_request is
-	 * that we can apply a slight variant of the rules specialised
-	 * for timelines that jump between engines (such as virtual engines).
-	 * If we consider the case of virtual engine, we must emit a dma-fence
-	 * to prevent scheduling of the second request until the first is
-	 * complete (to maximise our greedy late load balancing) and this
-	 * precludes optimising to use semaphores serialisation of a single
-	 * timeline across engines.
-	 */
+	GEM_BUG_ON(!is_parallel_rq(rq));
+
+	prev = request_to_parent(rq)->last_rq;
+	if (prev) {
+		if (!__i915_request_is_complete(prev)) {
+			i915_sw_fence_await_sw_fence(&rq->submit,
+						     &prev->submit,
+						     &rq->submitq);
+
+			if (rq->engine->sched_engine->schedule)
+				__i915_sched_node_add_dependency(&rq->sched,
+								 &prev->sched,
+								 &rq->dep,
+								 0);
+		}
+		i915_request_put(prev);
+	}
+
+	request_to_parent(rq)->last_rq = i915_request_get(rq);
+
+	return to_request(__i915_active_fence_set(&timeline->last_request,
+						  &rq->fence));
+}
+
+static struct i915_request *
+__i915_request_ensure_ordering(struct i915_request *rq,
+			       struct intel_timeline *timeline)
+{
+	struct i915_request *prev;
+
+	GEM_BUG_ON(is_parallel_rq(rq));
+
 	prev = to_request(__i915_active_fence_set(&timeline->last_request,
 						  &rq->fence));
+
 	if (prev && !__i915_request_is_complete(prev)) {
 		bool uses_guc = intel_engine_uses_guc(rq->engine);
+		bool pow2 = is_power_of_2(READ_ONCE(prev->engine)->mask |
+					  rq->engine->mask);
+		bool same_context = prev->context == rq->context;
 
 		/*
 		 * The requests are supposed to be kept in order. However,
@@ -1583,13 +1609,11 @@ __i915_request_add_to_timeline(struct i915_request *rq)
 		 * is used as a barrier for external modification to this
 		 * context.
 		 */
-		GEM_BUG_ON(prev->context == rq->context &&
+		GEM_BUG_ON(same_context &&
 			   i915_seqno_passed(prev->fence.seqno,
 					     rq->fence.seqno));
 
-		if ((!uses_guc &&
-		     is_power_of_2(READ_ONCE(prev->engine)->mask | rq->engine->mask)) ||
-		    (uses_guc && prev->context == rq->context))
+		if ((same_context && uses_guc) || (!uses_guc && pow2))
 			i915_sw_fence_await_sw_fence(&rq->submit,
 						     &prev->submit,
 						     &rq->submitq);
@@ -1604,6 +1628,50 @@ __i915_request_add_to_timeline(struct i915_request *rq)
 							 0);
 	}
 
+	return prev;
+}
+
+static struct i915_request *
+__i915_request_add_to_timeline(struct i915_request *rq)
+{
+	struct intel_timeline *timeline = i915_request_timeline(rq);
+	struct i915_request *prev;
+
+	/*
+	 * Dependency tracking and request ordering along the timeline
+	 * is special cased so that we can eliminate redundant ordering
+	 * operations while building the request (we know that the timeline
+	 * itself is ordered, and here we guarantee it).
+	 *
+	 * As we know we will need to emit tracking along the timeline,
+	 * we embed the hooks into our request struct -- at the cost of
+	 * having to have specialised no-allocation interfaces (which will
+	 * be beneficial elsewhere).
+	 *
+	 * A second benefit to open-coding i915_request_await_request is
+	 * that we can apply a slight variant of the rules specialised
+	 * for timelines that jump between engines (such as virtual engines).
+	 * If we consider the case of virtual engine, we must emit a dma-fence
+	 * to prevent scheduling of the second request until the first is
+	 * complete (to maximise our greedy late load balancing) and this
+	 * precludes optimising to use semaphores serialisation of a single
+	 * timeline across engines.
+	 *
+	 * We do not order parallel submission requests on the timeline as each
+	 * parallel submission context has its own timeline and the ordering
+	 * rules for parallel requests are that they must be submitted in the
+	 * order received from the execbuf IOCTL. So rather than using the
+	 * timeline, we store a pointer to the last request submitted in the
+	 * relationship in the gem context and insert a submission fence
+	 * between that request and the request passed into this function, or
+	 * alternatively we use the completion fence if the gem context has a
+	 * single timeline and this is the first submission of an execbuf IOCTL.
+	 */
+	if (likely(!is_parallel_rq(rq)))
+		prev = __i915_request_ensure_ordering(rq, timeline);
+	else
+		prev = __i915_request_ensure_parallel_ordering(rq, timeline);
+
 	/*
 	 * Make sure that no request gazumped us - if it was allocated after
 	 * our i915_request_alloc() and called __i915_request_add() before
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 24/46] drm/i915/guc: Implement multi-lrc reset
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (22 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 23/46] drm/i915/guc: Insert submit fences between requests in parent-child relationship Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc Matthew Brost
                   ` (26 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Update context reset and full GPU reset to work with multi-lrc. The idea is
that the parent context tracks all the active requests in flight for itself
and its children. The parent context owns the reset, replaying / canceling
requests as needed.

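As a rough sketch of why the parent's request list is enough, here is a
standalone userspace model (toy data, not the driver code) of the reverse
walk that intel_context_find_active_request() now does over the parent's
list while filtering on the owning context:

/*
 * Toy userspace model: the parent's request list holds requests from the
 * whole parent-child set, so finding the active request for one context
 * means scanning the parent's list newest-to-oldest and filtering on the
 * owning context.
 */
#include <stdio.h>

struct request {
	int context_id;
	int completed;
};

int main(void)
{
	/* Parent's request list, oldest first: ctx 0 = parent, 1 and 2 = children. */
	struct request reqs[] = {
		{ .context_id = 0, .completed = 1 },
		{ .context_id = 1, .completed = 1 },
		{ .context_id = 1, .completed = 0 },
		{ .context_id = 2, .completed = 0 },
	};
	int target = 1, active = -1;

	/* Walk newest to oldest, stopping at the first completed request of target. */
	for (int i = (int)(sizeof(reqs) / sizeof(reqs[0])) - 1; i >= 0; i--) {
		if (reqs[i].context_id != target)
			continue;
		if (reqs[i].completed)
			break;
		active = i;
	}

	printf("active request index for context %d: %d\n", target, active);
	return 0;
}
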
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 12 ++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 57 ++++++++++++++-----
 2 files changed, 49 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 98ef2d0f7a39..c327fd1c24c2 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -595,20 +595,22 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
 
 struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 {
+	struct intel_context *parent = intel_context_is_child(ce) ?
+		ce->parent : ce;
 	struct i915_request *rq, *active = NULL;
 	unsigned long flags;
 
 	GEM_BUG_ON(!intel_engine_uses_guc(ce->engine));
 
-	spin_lock_irqsave(&ce->guc_active.lock, flags);
-	list_for_each_entry_reverse(rq, &ce->guc_active.requests,
+	spin_lock_irqsave(&parent->guc_active.lock, flags);
+	list_for_each_entry_reverse(rq, &parent->guc_active.requests,
 				    sched.link) {
-		if (i915_request_completed(rq))
+		if (i915_request_completed(rq) && rq->context == ce)
 			break;
 
-		active = rq;
+		active = (rq->context == ce) ? rq : active;
 	}
-	spin_unlock_irqrestore(&ce->guc_active.lock, flags);
+	spin_unlock_irqrestore(&parent->guc_active.lock, flags);
 
 	return active;
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 1cb382f7d79d..30df1c8db491 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -818,6 +818,11 @@ static inline int rq_prio(const struct i915_request *rq)
 	return rq->sched.attr.priority;
 }
 
+static inline bool is_multi_lrc(struct intel_context *ce)
+{
+	return intel_context_is_parallel(ce);
+}
+
 static inline bool is_multi_lrc_rq(struct i915_request *rq)
 {
 	return intel_context_is_parallel(rq->context);
@@ -1381,6 +1386,8 @@ __unwind_incomplete_requests(struct intel_context *ce)
 		ce->engine->sched_engine;
 	unsigned long flags;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	spin_lock_irqsave(&sched_engine->lock, flags);
 	spin_lock(&ce->guc_active.lock);
 	list_for_each_entry_safe(rq, rn,
@@ -1413,8 +1420,12 @@ __unwind_incomplete_requests(struct intel_context *ce)
 
 static void __guc_reset_context(struct intel_context *ce, bool stalled)
 {
+	bool local_stalled;
 	struct i915_request *rq;
 	u32 head;
+	int i, number_children = ce->guc_number_children;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
 
 	intel_context_get(ce);
 
@@ -1426,22 +1437,32 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
 	 */
 	clr_context_enabled(ce);
 
-	rq = intel_context_find_active_request(ce);
-	if (!rq) {
-		head = ce->ring->tail;
-		stalled = false;
-		goto out_replay;
-	}
+	for (i = 0; i < number_children + 1; ++i) {
+		if (!intel_context_is_pinned(ce))
+			goto next_context;
+
+		local_stalled = false;
+		rq = intel_context_find_active_request(ce);
+		if (!rq) {
+			head = ce->ring->tail;
+			goto out_replay;
+		}
 
-	if (!i915_request_started(rq))
-		stalled = false;
+		GEM_BUG_ON(i915_active_is_idle(&ce->active));
+		head = intel_ring_wrap(ce->ring, rq->head);
 
-	GEM_BUG_ON(i915_active_is_idle(&ce->active));
-	head = intel_ring_wrap(ce->ring, rq->head);
-	__i915_request_reset(rq, stalled);
+		if (i915_request_started(rq))
+			local_stalled = true;
 
+		__i915_request_reset(rq, local_stalled && stalled);
 out_replay:
-	guc_reset_state(ce, head, stalled);
+		guc_reset_state(ce, head, local_stalled && stalled);
+next_context:
+		if (i != number_children)
+			ce = list_next_entry(ce, guc_child_link);
+	}
+	ce = intel_context_to_parent(ce);
+
 	__unwind_incomplete_requests(ce);
 	ce->guc_num_rq_submit_no_id = 0;
 	intel_context_put(ce);
@@ -1458,9 +1479,11 @@ void intel_guc_submission_reset(struct intel_guc *guc, bool stalled)
 	}
 
 	xa_for_each(&guc->context_lookup, index, ce)
-		if (intel_context_is_pinned(ce))
+		if (intel_context_is_pinned(ce) &&
+		    !intel_context_is_child(ce))
 			__guc_reset_context(ce, stalled);
 
+	/* GuC is blown away, drop all references to contexts */
 	xa_destroy(&guc->context_lookup);
 }
 
@@ -1513,7 +1536,8 @@ gse_cancel_requests(struct guc_submit_engine *gse)
 		struct i915_priolist *p = to_priolist(rb);
 
 		priolist_for_each_request_consume(rq, rn, p) {
-			struct intel_context *ce = rq->context;
+			struct intel_context *ce =
+				request_to_scheduling_context(rq);
 
 			list_del_init(&rq->sched.link);
 
@@ -1543,7 +1567,8 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 	int i;
 
 	xa_for_each(&guc->context_lookup, index, ce)
-		if (intel_context_is_pinned(ce))
+		if (intel_context_is_pinned(ce) &&
+		    !intel_context_is_child(ce))
 			guc_cancel_context_requests(ce);
 
 	for (i = 0; i < GUC_SUBMIT_ENGINE_MAX; ++i)
@@ -2823,6 +2848,8 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
 	intel_wakeref_t wakeref;
 	unsigned long flags;
 
+	GEM_BUG_ON(intel_context_is_child(ce));
+
 	gse_flush_submissions(ce_to_gse(ce));
 
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (23 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 24/46] drm/i915/guc: Implement multi-lrc reset Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 16:36   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 26/46] drm/i915: Connect UAPI to GuC multi-lrc interface Matthew Brost
                   ` (25 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Display the workqueue status in debugfs for GuC contexts that are in a
parent-child relationship.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
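
For illustration, the output for a parent context would then look roughly
like the following (values are made up; the format comes straight from the
drm_printf() calls in this patch), with the usual guc_log_context() block
repeated afterwards for each child:

	GuC lrc descriptor 1:
		HW Context Desc: 0x00001000
		...
		Schedule State: 0x0, 0x0

		WQI Head: 0
		WQI Tail: 64
		WQI Status: 1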
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 +++++++++++++------
 1 file changed, 39 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 30df1c8db491..44a7582c9aed 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4527,31 +4527,53 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 		gse_log_submission_info(guc->gse[i], p, i);
 }
 
+static inline void guc_log_context(struct drm_printer *p,
+				   struct intel_context *ce)
+{
+	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
+	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
+	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
+		   ce->ring->head,
+		   ce->lrc_reg_state[CTX_RING_HEAD]);
+	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
+		   ce->ring->tail,
+		   ce->lrc_reg_state[CTX_RING_TAIL]);
+	drm_printf(p, "\t\tContext Pin Count: %u\n",
+		   atomic_read(&ce->pin_count));
+	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
+		   atomic_read(&ce->guc_id_ref));
+	drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
+		   atomic_read(&ce->guc_num_rq_not_ready));
+	drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
+		   ce->guc_state.sched_state,
+		   atomic_read(&ce->guc_sched_state_no_lock));
+}
+
 void intel_guc_submission_print_context_info(struct intel_guc *guc,
 					     struct drm_printer *p)
 {
 	struct intel_context *ce;
 	unsigned long index;
 	xa_for_each(&guc->context_lookup, index, ce) {
-		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
-		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
-		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
-			   ce->ring->head,
-			   ce->lrc_reg_state[CTX_RING_HEAD]);
-		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
-			   ce->ring->tail,
-			   ce->lrc_reg_state[CTX_RING_TAIL]);
-		drm_printf(p, "\t\tContext Pin Count: %u\n",
-			   atomic_read(&ce->pin_count));
-		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
-			   atomic_read(&ce->guc_id_ref));
-		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
-			   atomic_read(&ce->guc_num_rq_not_ready));
-		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
-			   ce->guc_state.sched_state,
-			   atomic_read(&ce->guc_sched_state_no_lock));
+		GEM_BUG_ON(intel_context_is_child(ce));
 
+		guc_log_context(p, ce);
 		guc_log_context_priority(p, ce);
+
+		if (intel_context_is_parent(ce)) {
+			struct guc_process_desc *desc = __get_process_desc(ce);
+			struct intel_context *child;
+
+			drm_printf(p, "\t\tWQI Head: %u\n",
+				   READ_ONCE(desc->head));
+			drm_printf(p, "\t\tWQI Tail: %u\n",
+				   READ_ONCE(desc->tail));
+			drm_printf(p, "\t\tWQI Status: %u\n\n",
+				   READ_ONCE(desc->wq_status));
+
+			for_each_child(ce, child)
+				guc_log_context(p, child);
+		}
 	}
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 26/46] drm/i915: Connect UAPI to GuC multi-lrc interface
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (24 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 16:37   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 27/46] drm/i915/doc: Update parallel submit doc to point to i915_drm.h Matthew Brost
                   ` (24 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Introduce a 'set parallel submit' extension to connect the uAPI to the
GuC multi-lrc interface. The kernel doc on the new uAPI should explain
it all.

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
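
A minimal sketch of how userspace could use this extension once it is
enabled (engine instances, fd and ctx_id are hypothetical; error handling
omitted; only the uAPI added in this patch plus the existing context-param
plumbing is assumed):

	/*
	 * Slot 0 becomes a parallel engine: width=2 contexts with
	 * num_siblings=1 placement each, i.e. BB0 -> VCS0, BB1 -> VCS1.
	 */
	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 2) = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.engine_index = 0,
		.width = 2,
		.num_siblings = 1,
		.engines = {
			{ I915_ENGINE_CLASS_VIDEO, 0 },
			{ I915_ENGINE_CLASS_VIDEO, 1 },
		},
	};
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
		.extensions = (__u64)(uintptr_t)&parallel,
		.engines = { { I915_ENGINE_CLASS_INVALID,
			       I915_ENGINE_CLASS_INVALID_NONE } },
	};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_ENGINES,
		.size = sizeof(engines),
		.value = (__u64)(uintptr_t)&engines,
	};

	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);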
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 157 +++++++++++++++++-
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   8 +-
 drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
 .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 111 +++++++++++--
 include/uapi/drm/i915_drm.h                   | 128 ++++++++++++++
 9 files changed, 417 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index cff72679ad7c..2b0dd3ff4db8 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -515,9 +515,149 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
 	return 0;
 }
 
+static int
+set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
+				      void *data)
+{
+	struct i915_context_engines_parallel_submit __user *ext =
+		container_of_user(base, typeof(*ext), base);
+	const struct set_proto_ctx_engines *set = data;
+	struct drm_i915_private *i915 = set->i915;
+	u64 flags;
+	int err = 0, n, i, j;
+	u16 slot, width, num_siblings;
+	struct intel_engine_cs **siblings = NULL;
+	intel_engine_mask_t prev_mask;
+
+	/* Disabling for now */
+	return -ENODEV;
+
+	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
+		return -ENODEV;
+
+	if (get_user(slot, &ext->engine_index))
+		return -EFAULT;
+
+	if (get_user(width, &ext->width))
+		return -EFAULT;
+
+	if (get_user(num_siblings, &ext->num_siblings))
+		return -EFAULT;
+
+	if (slot >= set->num_engines) {
+		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
+			slot, set->num_engines);
+		return -EINVAL;
+	}
+
+	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
+		drm_dbg(&i915->drm,
+			"Invalid placement[%d], already occupied\n", slot);
+		return -EINVAL;
+	}
+
+	if (get_user(flags, &ext->flags))
+		return -EFAULT;
+
+	if (flags) {
+		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
+		return -EINVAL;
+	}
+
+	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
+		err = check_user_mbz(&ext->mbz64[n]);
+		if (err)
+			return err;
+	}
+
+	if (width < 2) {
+		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
+		return -EINVAL;
+	}
+
+	if (num_siblings < 1) {
+		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
+			num_siblings);
+		return -EINVAL;
+	}
+
+	siblings = kmalloc_array(num_siblings * width,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return -ENOMEM;
+
+	/* Create contexts / engines */
+	for (i = 0; i < width; ++i) {
+		intel_engine_mask_t current_mask = 0;
+		struct i915_engine_class_instance prev_engine;
+
+		for (j = 0; j < num_siblings; ++j) {
+			struct i915_engine_class_instance ci;
+
+			n = i * num_siblings + j;
+			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
+				err = -EFAULT;
+				goto out_err;
+			}
+
+			siblings[n] =
+				intel_engine_lookup_user(i915, ci.engine_class,
+							 ci.engine_instance);
+			if (!siblings[n]) {
+				drm_dbg(&i915->drm,
+					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
+					n, ci.engine_class, ci.engine_instance);
+				err = -EINVAL;
+				goto out_err;
+			}
+
+			if (n) {
+				if (prev_engine.engine_class !=
+				    ci.engine_class) {
+					drm_dbg(&i915->drm,
+						"Mismatched class %d, %d\n",
+						prev_engine.engine_class,
+						ci.engine_class);
+					err = -EINVAL;
+					goto out_err;
+				}
+			}
+
+			prev_engine = ci;
+			current_mask |= siblings[n]->logical_mask;
+		}
+
+		if (i > 0) {
+			if (current_mask != prev_mask << 1) {
+				drm_dbg(&i915->drm,
+					"Non contiguous logical mask 0x%x, 0x%x\n",
+					prev_mask, current_mask);
+				err = -EINVAL;
+				goto out_err;
+			}
+		}
+		prev_mask = current_mask;
+	}
+
+	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
+	set->engines[slot].num_siblings = num_siblings;
+	set->engines[slot].width = width;
+	set->engines[slot].siblings = siblings;
+
+	return 0;
+
+out_err:
+	kfree(siblings);
+
+	return err;
+}
+
 static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
 	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
 	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
+	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
+		set_proto_ctx_engines_parallel_submit,
 };
 
 static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
@@ -938,7 +1078,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 
 	e = alloc_engines(num_engines);
 	for (n = 0; n < num_engines; n++) {
-		struct intel_context *ce;
+		struct intel_context *ce, *child;
 		int ret;
 
 		switch (pe[n].type) {
@@ -948,7 +1088,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 
 		case I915_GEM_ENGINE_TYPE_BALANCED:
 			ce = intel_engine_create_virtual(pe[n].siblings,
-							 pe[n].num_siblings);
+							 pe[n].num_siblings, 0);
+			break;
+
+		case I915_GEM_ENGINE_TYPE_PARALLEL:
+			ce = intel_engine_create_parallel(pe[n].siblings,
+							  pe[n].num_siblings,
+							  pe[n].width);
 			break;
 
 		case I915_GEM_ENGINE_TYPE_INVALID:
@@ -969,6 +1115,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
 			err = ERR_PTR(ret);
 			goto free_engines;
 		}
+		for_each_child(ce, child) {
+			ret = intel_context_set_gem(child, ctx, pe->sseu);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto free_engines;
+			}
+		}
 	}
 	e->num_engines = num_engines;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 94c03a97cb77..7b096d83bca1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -78,6 +78,9 @@ enum i915_gem_engine_type {
 
 	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
 	I915_GEM_ENGINE_TYPE_BALANCED,
+
+	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
+	I915_GEM_ENGINE_TYPE_PARALLEL,
 };
 
 /**
@@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
 	/** @num_siblings: Number of balanced siblings */
 	unsigned int num_siblings;
 
+	/** @width: Width of each sibling */
+	unsigned int width;
+
 	/** @siblings: Balanced siblings */
 	struct intel_engine_cs **siblings;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index f4fc81f64921..9cdbea752014 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -55,9 +55,13 @@ struct intel_context_ops {
 	void (*reset)(struct intel_context *ce);
 	void (*destroy)(struct kref *kref);
 
-	/* virtual engine/context interface */
+	/* virtual/parallel engine/context interface */
 	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
-						unsigned int count);
+						unsigned int count,
+						unsigned long flags);
+	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
+						 unsigned int num_siblings,
+						 unsigned int width);
 	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
 					       unsigned int sibling);
 };
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 87579affb952..43f16a8347ee 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
 	return intel_engine_has_preemption(engine);
 }
 
+#define FORCE_VIRTUAL	BIT(0)
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count);
+			    unsigned int count, unsigned long flags);
+
+static inline struct intel_context *
+intel_engine_create_parallel(struct intel_engine_cs **engines,
+			     unsigned int num_engines,
+			     unsigned int width)
+{
+	GEM_BUG_ON(!engines[0]->cops->create_parallel);
+	return engines[0]->cops->create_parallel(engines, num_engines, width);
+}
 
 static inline bool
 intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 4d790f9a65dd..f66c75c77584 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1923,16 +1923,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
-			    unsigned int count)
+			    unsigned int count, unsigned long flags)
 {
 	if (count == 0)
 		return ERR_PTR(-EINVAL);
 
-	if (count == 1)
+	if (count == 1 && !(flags & FORCE_VIRTUAL))
 		return intel_context_create(siblings[0]);
 
 	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
-	return siblings[0]->cops->create_virtual(siblings, count);
+	return siblings[0]->cops->create_virtual(siblings, count, flags);
 }
 
 struct i915_request *
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index fc74ca28f245..769480e026bb 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags);
 
 static struct i915_request *
 __active_request(const struct intel_timeline * const tl,
@@ -3785,7 +3786,8 @@ static void virtual_submit_request(struct i915_request *rq)
 }
 
 static struct intel_context *
-execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+			 unsigned long flags)
 {
 	struct virtual_engine *ve;
 	unsigned int n;
diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index f12ffe797639..e876a9d88a5c 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
 	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
 
 	for (n = 0; n < nctx; n++) {
-		ve[n] = intel_engine_create_virtual(siblings, nsibling);
+		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ve[n])) {
 			err = PTR_ERR(ve[n]);
 			nctx = n;
@@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
 	 * restrict it to our desired engine within the virtual engine.
 	 */
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_close;
@@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
 		i915_request_add(rq);
 	}
 
-	ce = intel_engine_create_virtual(siblings, nsibling);
+	ce = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ce)) {
 		err = PTR_ERR(ce);
 		goto out;
@@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
 
 	/* XXX We do not handle oversubscription and fairness with normal rq */
 	for (n = 0; n < nsibling; n++) {
-		ce = intel_engine_create_virtual(siblings, nsibling);
+		ce = intel_engine_create_virtual(siblings, nsibling, 0);
 		if (IS_ERR(ce)) {
 			err = PTR_ERR(ce);
 			goto out;
@@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
 	if (err)
 		goto out_scratch;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_scratch;
@@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
 	if (igt_spinner_init(&spin, gt))
 		return -ENOMEM;
 
-	ve = intel_engine_create_virtual(siblings, nsibling);
+	ve = intel_engine_create_virtual(siblings, nsibling, 0);
 	if (IS_ERR(ve)) {
 		err = PTR_ERR(ve);
 		goto out_spin;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 44a7582c9aed..89528624710a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -82,7 +82,8 @@
  */
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags);
 
 #define GUC_REQUEST_SIZE 64 /* bytes */
 
@@ -2514,8 +2515,6 @@ static void guc_context_post_unpin(struct intel_context *ce)
 	__guc_context_post_unpin(ce);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_parent_context_pre_pin(struct intel_context *ce,
 				      struct i915_gem_ww_ctx *ww)
 {
@@ -2559,8 +2558,6 @@ static int guc_parent_context_pre_pin(struct intel_context *ce,
 	return err;
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_parent_context_post_unpin(struct intel_context *ce)
 {
 	struct intel_context *child;
@@ -2576,8 +2573,6 @@ static void guc_parent_context_post_unpin(struct intel_context *ce)
 	}
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static int guc_parent_context_pin(struct intel_context *ce)
 {
 	int ret, i = 0, j = 0;
@@ -2623,8 +2618,6 @@ static int guc_parent_context_pin(struct intel_context *ce)
 	return ret;
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_parent_context_unpin(struct intel_context *ce)
 {
 	struct intel_context *child;
@@ -3048,8 +3041,6 @@ static void destroy_worker_func(struct work_struct *w)
 		intel_gt_pm_unpark_work_add(gt, destroy_worker);
 }
 
-/* Future patches will use this function */
-__maybe_unused
 static void guc_child_context_destroy(struct kref *kref)
 {
 	__guc_context_destroy(container_of(kref, struct intel_context, ref));
@@ -3272,6 +3263,11 @@ static void remove_from_context(struct i915_request *rq)
 	i915_request_notify_execute_cb_imm(rq);
 }
 
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width);
+
 static const struct intel_context_ops guc_context_ops = {
 	.alloc = guc_context_alloc,
 
@@ -3293,6 +3289,7 @@ static const struct intel_context_ops guc_context_ops = {
 	.destroy = guc_context_destroy,
 
 	.create_virtual = guc_create_virtual,
+	.create_parallel = guc_create_parallel,
 };
 
 static void __guc_signal_context_fence(struct intel_context *ce)
@@ -3782,6 +3779,91 @@ static void guc_retire_inflight_request_prio(struct i915_request *rq)
 	spin_unlock(&ce->guc_active.lock);
 }
 
+static const struct intel_context_ops virtual_parent_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.pre_pin = guc_parent_context_pre_pin,
+	.pin = guc_parent_context_pin,
+	.unpin = guc_parent_context_unpin,
+	.post_unpin = guc_parent_context_post_unpin,
+
+	.ban = guc_context_ban,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.sched_disable = guc_context_sched_disable,
+
+	.destroy = guc_context_destroy,
+
+	.get_sibling = guc_virtual_get_sibling,
+};
+
+static const struct intel_context_ops virtual_child_context_ops = {
+	.alloc = guc_virtual_context_alloc,
+
+	.enter = guc_virtual_context_enter,
+	.exit = guc_virtual_context_exit,
+
+	.destroy = guc_child_context_destroy,
+};
+
+static struct intel_context *
+guc_create_parallel(struct intel_engine_cs **engines,
+		    unsigned int num_siblings,
+		    unsigned int width)
+{
+	struct intel_engine_cs **siblings = NULL;
+	struct intel_context *parent = NULL, *ce, *err;
+	int i, j;
+	int ret;
+
+	siblings = kmalloc_array(num_siblings,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < width; ++i) {
+		for (j = 0; j < num_siblings; ++j)
+			siblings[j] = engines[i * num_siblings + j];
+
+		ce = intel_engine_create_virtual(siblings, num_siblings,
+						 FORCE_VIRTUAL);
+		if (IS_ERR(ce)) {
+			err = ce;
+			goto unwind;
+		}
+
+		if (i == 0) {
+			parent = ce;
+		} else {
+			intel_context_bind_parent_child(parent, ce);
+			ret = intel_context_alloc_state(ce);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto unwind;
+			}
+		}
+	}
+
+	parent->ops = &virtual_parent_context_ops;
+	for_each_child(parent, ce)
+		ce->ops = &virtual_child_context_ops;
+
+	kfree(siblings);
+	return parent;
+
+unwind:
+	if (parent) {
+		for_each_child(parent, ce)
+			intel_context_put(ce);
+		intel_context_put(parent);
+	}
+	kfree(siblings);
+	return err;
+}
+
 static void sanitize_hwsp(struct intel_engine_cs *engine)
 {
 	struct intel_timeline *tl;
@@ -4578,7 +4660,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 }
 
 static struct intel_context *
-guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
+guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
+		   unsigned long flags)
 {
 	struct guc_virtual_engine *ve;
 	struct intel_guc *guc;
@@ -4591,7 +4674,9 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
 		return ERR_PTR(-ENOMEM);
 
 	guc = &siblings[0]->gt->uc.guc;
-	sched_engine = guc_to_sched_engine(guc, GUC_SUBMIT_ENGINE_SINGLE_LRC);
+	sched_engine = guc_to_sched_engine(guc, (flags & FORCE_VIRTUAL) ?
+					   GUC_SUBMIT_ENGINE_MULTI_LRC :
+					   GUC_SUBMIT_ENGINE_SINGLE_LRC);
 
 	ve->base.i915 = siblings[0]->i915;
 	ve->base.gt = siblings[0]->gt;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index ef72e07fe08c..a16f0f8908de 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1821,6 +1821,7 @@ struct drm_i915_gem_context_param {
  * Extensions:
  *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
  *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
+ *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
  */
 #define I915_CONTEXT_PARAM_ENGINES	0xa
 
@@ -2046,6 +2047,132 @@ struct i915_context_engines_bond {
 	struct i915_engine_class_instance engines[N__]; \
 } __attribute__((packed)) name__
 
+/**
+ * struct i915_context_engines_parallel_submit - Configure engine for
+ * parallel submission.
+ *
+ * Setup a slot in the context engine map to allow multiple BBs to be submitted
+ * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
+ * in parallel. Multiple hardware contexts are created internally in the i915
+ * to run these BBs. Once a slot is configured for N BBs only N BBs can be
+ * submitted in each execbuf IOCTL and this is implicit behavior, e.g. the user
+ * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
+ * many BBs there are based on the slot's configuration. The N BBs are the last
+ * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
+ *
+ * The default placement behavior is to create implicit bonds between each
+ * context if each context maps to more than 1 physical engine (e.g. context is
+ * a virtual engine). Also we only allow contexts of the same engine class and
+ * these contexts must be in logically contiguous order. Examples of the
+ * placement behavior are described below. Lastly, the default is to not allow
+ * BBs to be preempted mid BB; rather, coordinated preemption is inserted on
+ * all hardware contexts between each set of BBs. Flags may be added in the
+ * future to change both of these default behaviors.
+ *
+ * Returns -EINVAL if hardware context placement configuration is invalid or if
+ * the placement configuration isn't supported on the platform / submission
+ * interface.
+ * Returns -ENODEV if extension isn't supported on the platform / submission
+ * interface.
+ *
+ * .. code-block:: none
+ *
+ *	Example 1 pseudo code:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=1,
+ *		     engines=CS[0],CS[1])
+ *
+ *	Results in the following valid placement:
+ *	CS[0], CS[1]
+ *
+ *	Example 2 pseudo code:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[2],CS[1],CS[3])
+ *
+ *	Results in the following valid placements:
+ *	CS[0], CS[1]
+ *	CS[2], CS[3]
+ *
+ *	This can also be thought of as 2 virtual engines described by a 2-D
+ *	array in the engines field with bonds placed between each index of the
+ *	virtual engines, e.g. CS[0] is bonded to CS[1] and CS[2] is bonded to
+ *	CS[3].
+ *	VE[0] = CS[0], CS[2]
+ *	VE[1] = CS[1], CS[3]
+ *
+ *	Example 3 pseudo code:
+ *	CS[X] = generic engine of same class, logical instance X
+ *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ *	set_engines(INVALID)
+ *	set_parallel(engine_index=0, width=2, num_siblings=2,
+ *		     engines=CS[0],CS[1],CS[1],CS[3])
+ *
+ *	Results in the following valid and invalid placements:
+ *	CS[0], CS[1]
+ *	CS[1], CS[3] - Not logically contiguous, return -EINVAL
+ */
+struct i915_context_engines_parallel_submit {
+	/**
+	 * @base: base user extension.
+	 */
+	struct i915_user_extension base;
+
+	/**
+	 * @engine_index: slot for parallel engine
+	 */
+	__u16 engine_index;
+
+	/**
+	 * @width: number of contexts per parallel engine
+	 */
+	__u16 width;
+
+	/**
+	 * @num_siblings: number of siblings per context
+	 */
+	__u16 num_siblings;
+
+	/**
+	 * @mbz16: reserved for future use; must be zero
+	 */
+	__u16 mbz16;
+
+	/**
+	 * @flags: all undefined flags must be zero; currently no flags are defined
+	 */
+	__u64 flags;
+
+	/**
+	 * @mbz64: reserved for future use; must be zero
+	 */
+	__u64 mbz64[3];
+
+	/**
+	 * @engines: 2-d array of engine instances to configure parallel engine
+	 *
+	 * length = width (i) * num_siblings (j)
+	 * index = j + i * num_siblings
+	 */
+	struct i915_engine_class_instance engines[0];
+
+} __packed;
+
+#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
+	struct i915_user_extension base; \
+	__u16 engine_index; \
+	__u16 width; \
+	__u16 num_siblings; \
+	__u16 mbz16; \
+	__u64 flags; \
+	__u64 mbz64[3]; \
+	struct i915_engine_class_instance engines[N__]; \
+} __attribute__((packed)) name__
+
 /**
  * DOC: Context Engine Map uAPI
  *
@@ -2105,6 +2232,7 @@ struct i915_context_param_engines {
 	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
 #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
 #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
+#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
 	struct i915_engine_class_instance engines[0];
 } __attribute__((packed));
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 27/46] drm/i915/doc: Update parallel submit doc to point to i915_drm.h
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (25 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 26/46] drm/i915: Connect UAPI to GuC multi-lrc interface Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 28/46] drm/i915/guc: Add basic GuC multi-lrc selftest Matthew Brost
                   ` (23 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Update parallel submit doc to point to i915_drm.h

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 Documentation/gpu/rfc/i915_parallel_execbuf.h | 122 ------------------
 Documentation/gpu/rfc/i915_scheduler.rst      |   4 +-
 2 files changed, 2 insertions(+), 124 deletions(-)
 delete mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h

diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h b/Documentation/gpu/rfc/i915_parallel_execbuf.h
deleted file mode 100644
index 8cbe2c4e0172..000000000000
--- a/Documentation/gpu/rfc/i915_parallel_execbuf.h
+++ /dev/null
@@ -1,122 +0,0 @@
-/* SPDX-License-Identifier: MIT */
-/*
- * Copyright © 2021 Intel Corporation
- */
-
-#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
-
-/**
- * struct drm_i915_context_engines_parallel_submit - Configure engine for
- * parallel submission.
- *
- * Setup a slot in the context engine map to allow multiple BBs to be submitted
- * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
- * in parallel. Multiple hardware contexts are created internally in the i915
- * run these BBs. Once a slot is configured for N BBs only N BBs can be
- * submitted in each execbuf IOCTL and this is implicit behavior e.g. The user
- * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
- * many BBs there are based on the slot's configuration. The N BBs are the last
- * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
- *
- * The default placement behavior is to create implicit bonds between each
- * context if each context maps to more than 1 physical engine (e.g. context is
- * a virtual engine). Also we only allow contexts of same engine class and these
- * contexts must be in logically contiguous order. Examples of the placement
- * behavior described below. Lastly, the default is to not allow BBs to
- * preempted mid BB rather insert coordinated preemption on all hardware
- * contexts between each set of BBs. Flags may be added in the future to change
- * both of these default behaviors.
- *
- * Returns -EINVAL if hardware context placement configuration is invalid or if
- * the placement configuration isn't supported on the platform / submission
- * interface.
- * Returns -ENODEV if extension isn't supported on the platform / submission
- * interface.
- *
- * .. code-block:: none
- *
- *	Example 1 pseudo code:
- *	CS[X] = generic engine of same class, logical instance X
- *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
- *	set_engines(INVALID)
- *	set_parallel(engine_index=0, width=2, num_siblings=1,
- *		     engines=CS[0],CS[1])
- *
- *	Results in the following valid placement:
- *	CS[0], CS[1]
- *
- *	Example 2 pseudo code:
- *	CS[X] = generic engine of same class, logical instance X
- *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
- *	set_engines(INVALID)
- *	set_parallel(engine_index=0, width=2, num_siblings=2,
- *		     engines=CS[0],CS[2],CS[1],CS[3])
- *
- *	Results in the following valid placements:
- *	CS[0], CS[1]
- *	CS[2], CS[3]
- *
- *	This can also be thought of as 2 virtual engines described by 2-D array
- *	in the engines the field with bonds placed between each index of the
- *	virtual engines. e.g. CS[0] is bonded to CS[1], CS[2] is bonded to
- *	CS[3].
- *	VE[0] = CS[0], CS[2]
- *	VE[1] = CS[1], CS[3]
- *
- *	Example 3 pseudo code:
- *	CS[X] = generic engine of same class, logical instance X
- *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
- *	set_engines(INVALID)
- *	set_parallel(engine_index=0, width=2, num_siblings=2,
- *		     engines=CS[0],CS[1],CS[1],CS[3])
- *
- *	Results in the following valid and invalid placements:
- *	CS[0], CS[1]
- *	CS[1], CS[3] - Not logical contiguous, return -EINVAL
- */
-struct drm_i915_context_engines_parallel_submit {
-	/**
-	 * @base: base user extension.
-	 */
-	struct i915_user_extension base;
-
-	/**
-	 * @engine_index: slot for parallel engine
-	 */
-	__u16 engine_index;
-
-	/**
-	 * @width: number of contexts per parallel engine
-	 */
-	__u16 width;
-
-	/**
-	 * @num_siblings: number of siblings per context
-	 */
-	__u16 num_siblings;
-
-	/**
-	 * @mbz16: reserved for future use; must be zero
-	 */
-	__u16 mbz16;
-
-	/**
-	 * @flags: all undefined flags must be zero, currently not defined flags
-	 */
-	__u64 flags;
-
-	/**
-	 * @mbz64: reserved for future use; must be zero
-	 */
-	__u64 mbz64[3];
-
-	/**
-	 * @engines: 2-d array of engine instances to configure parallel engine
-	 *
-	 * length = width (i) * num_siblings (j)
-	 * index = j + i * num_siblings
-	 */
-	struct i915_engine_class_instance engines[0];
-
-} __packed;
-
diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
index cbda75065dad..d630f15ab795 100644
--- a/Documentation/gpu/rfc/i915_scheduler.rst
+++ b/Documentation/gpu/rfc/i915_scheduler.rst
@@ -135,8 +135,8 @@ Add I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT and
 drm_i915_context_engines_parallel_submit to the uAPI to implement this
 extension.
 
-.. kernel-doc:: Documentation/gpu/rfc/i915_parallel_execbuf.h
-        :functions: drm_i915_context_engines_parallel_submit
+.. kernel-doc:: include/uapi/drm/i915_drm.h
+        :functions: i915_context_engines_parallel_submit
 
 Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
 -------------------------------------------------------------------
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 28/46] drm/i915/guc: Add basic GuC multi-lrc selftest
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (26 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 27/46] drm/i915/doc: Update parallel submit doc to point to i915_drm.h Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 29/46] drm/i915/guc: Extend GuC flow control selftest for multi-lrc Matthew Brost
                   ` (22 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Add a very basic (single submission) multi-lrc selftest.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
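
The submission pattern the test exercises, in simplified form (taken from
multi_lrc_nop_request() below; error handling omitted):

	struct i915_request *rq, *child_rq;
	struct intel_context *child;
	int i = 0;

	/* One request on the parent ... */
	rq = intel_context_create_request(parent);
	i915_request_get(rq);
	i915_request_add(rq);

	/* ... plus one per child, tagging the last as the parallel trigger */
	for_each_child(parent, child) {
		child_rq = intel_context_create_request(child);
		if (++i == parent->guc_number_children)
			set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
				&child_rq->fence.flags);
		i915_request_add(child_rq);
	}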
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   1 +
 .../drm/i915/gt/uc/selftest_guc_multi_lrc.c   | 168 ++++++++++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 3 files changed, 170 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 89528624710a..bc6cb9adca92 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4772,5 +4772,6 @@ bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
 }
 
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftest_guc_multi_lrc.c"
 #include "selftest_guc_flow_control.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
new file mode 100644
index 000000000000..82eb666bba51
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2019 Intel Corporation
+ */
+
+#include "selftests/igt_spinner.h"
+#include "selftests/igt_reset.h"
+#include "selftests/intel_scheduler_helpers.h"
+#include "gt/intel_engine_heartbeat.h"
+#include "gem/selftests/mock_context.h"
+
+static void logical_sort(struct intel_engine_cs **engines, int num_engines)
+{
+	struct intel_engine_cs *sorted[MAX_ENGINE_INSTANCE + 1];
+	int i, j;
+
+	for (i = 0; i < num_engines; ++i)
+		for (j = 0; j < MAX_ENGINE_INSTANCE + 1; ++j) {
+			if (engines[j]->logical_mask & BIT(i)) {
+				sorted[i] = engines[j];
+				break;
+			}
+		}
+
+	memcpy(*engines, *sorted,
+	       sizeof(struct intel_engine_cs *) * num_engines);
+}
+
+static struct intel_context *
+multi_lrc_create_parent(struct intel_gt *gt, u8 class,
+			unsigned long flags)
+{
+	struct intel_engine_cs *siblings[MAX_ENGINE_INSTANCE + 1];
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	int i = 0;
+
+	for_each_engine(engine, gt, id) {
+		if (engine->class != class)
+			continue;
+
+		siblings[i++] = engine;
+	}
+
+	if (i <= 1)
+		return ERR_PTR(0);
+
+	logical_sort(siblings, i);
+
+	return intel_engine_create_parallel(siblings, 1, i);
+}
+
+static void multi_lrc_context_put(struct intel_context *ce)
+{
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	/*
+	 * Only the parent gets the creation ref put in the uAPI, the parent
+	 * itself is responsible for creation ref put on the children.
+	 */
+	intel_context_put(ce);
+}
+
+static struct i915_request *
+multi_lrc_nop_request(struct intel_context *ce)
+{
+	struct intel_context *child;
+	struct i915_request *rq, *child_rq;
+	int i = 0;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	rq = intel_context_create_request(ce);
+	if (IS_ERR(rq))
+		return rq;
+
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	for_each_child(ce, child) {
+		child_rq = intel_context_create_request(child);
+		if (IS_ERR(child_rq))
+			goto child_error;
+
+		if (++i == ce->guc_number_children)
+			set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
+				&child_rq->fence.flags);
+		i915_request_add(child_rq);
+	}
+
+	return rq;
+
+child_error:
+	i915_request_put(rq);
+
+	return ERR_PTR(-ENOMEM);
+}
+
+static int __intel_guc_multi_lrc_basic(struct intel_gt *gt, unsigned int class)
+{
+	struct intel_context *parent;
+	struct i915_request *rq;
+	int ret;
+
+	parent = multi_lrc_create_parent(gt, class, 0);
+	if (IS_ERR(parent)) {
+		pr_err("Failed creating contexts: %ld", PTR_ERR(parent));
+		return PTR_ERR(parent);
+	} else if (!parent) {
+		pr_debug("Not enough engines in class: %d",
+			 VIDEO_DECODE_CLASS);
+		return 0;
+	}
+
+	rq = multi_lrc_nop_request(parent);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		pr_err("Failed creating requests: %d", ret);
+		goto out;
+	}
+
+	ret = intel_selftest_wait_for_rq(rq);
+	if (ret)
+		pr_err("Failed waiting on request: %d", ret);
+
+	i915_request_put(rq);
+
+	if (ret >= 0) {
+		ret = intel_gt_wait_for_idle(gt, HZ * 5);
+		if (ret < 0)
+			pr_err("GT failed to idle: %d\n", ret);
+	}
+
+out:
+	multi_lrc_context_put(parent);
+	return ret;
+}
+
+static int intel_guc_multi_lrc_basic(void *arg)
+{
+	struct intel_gt *gt = arg;
+	unsigned int class;
+	int ret;
+
+	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
+		ret = __intel_guc_multi_lrc_basic(gt, class);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+int intel_guc_multi_lrc(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(intel_guc_multi_lrc_basic),
+	};
+	struct intel_gt *gt = &i915->gt;
+
+	if (intel_gt_is_wedged(gt))
+		return 0;
+
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		return 0;
+
+	return intel_gt_live_subtests(tests, gt);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index d9bd732b741a..2ddb72bbab69 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -48,5 +48,6 @@ selftest(execlists, intel_execlists_live_selftests)
 selftest(ring_submission, intel_ring_submission_live_selftests)
 selftest(perf, i915_perf_live_selftests)
 selftest(guc_flow_control, intel_guc_flow_control)
+selftest(guc_multi_lrc, intel_guc_multi_lrc)
 /* Here be dragons: keep last to run last! */
 selftest(late_gt_pm, intel_gt_pm_late_selftests)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 29/46] drm/i915/guc: Extend GuC flow control selftest for multi-lrc
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (27 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 28/46] drm/i915/guc: Add basic GuC multi-lrc selftest Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 30/46] drm/i915/guc: Implement no mid batch preemption " Matthew Brost
                   ` (21 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Prove multi-lrc and single-lrc submission are independent.
Prove multi-lrc guc_ids flow control works.
Prove a multi-lrc hang in the tasklet can be recovered from via a GPU
reset.

Cc: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../i915/gt/uc/selftest_guc_flow_control.c    | 299 ++++++++++++++++++
 .../drm/i915/gt/uc/selftest_guc_multi_lrc.c   |  15 +-
 2 files changed, 312 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
index f31ab2674b2b..9cfecf9d368e 100644
--- a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
@@ -110,6 +110,65 @@ static int nop_request_wait(struct intel_engine_cs *engine, bool kernel,
 	return ret;
 }
 
+static int multi_lrc_not_blocked(struct intel_gt *gt, bool flow_control)
+{
+	struct intel_guc *guc = &gt->uc.guc;
+	struct i915_gpu_error *global = &gt->i915->gpu_error;
+	struct guc_submit_engine *gse = guc->gse[GUC_SUBMIT_ENGINE_MULTI_LRC];
+	unsigned int reset_count = i915_reset_count(global);
+	u64 tasklets_submit_count = gse->tasklets_submit_count;
+	struct intel_context *parent;
+	struct i915_request *rq;
+	int ret;
+
+	parent = multi_lrc_create_parent(gt, VIDEO_DECODE_CLASS, 0);
+	if (IS_ERR(parent)) {
+		pr_err("Failed creating multi-lrc contexts: %ld",
+		       PTR_ERR(parent));
+		return PTR_ERR(parent);
+	} else if (!parent) {
+		pr_debug("Not enough engines in class: %d",
+			 VIDEO_DECODE_CLASS);
+		return 0;
+	}
+
+	rq = multi_lrc_nop_request(parent, NULL);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		pr_err("Failed creating multi-lrc requests: %d", ret);
+		goto out;
+	}
+
+	ret = intel_selftest_wait_for_rq(rq);
+	if (ret)
+		pr_err("Failed waiting on multi-lrc request: %d", ret);
+
+	i915_request_put(rq);
+	if (ret)
+		goto out;
+
+	if (!flow_control &&
+	    gse->tasklets_submit_count != tasklets_submit_count) {
+		pr_err("Flow control for multi-lrc unexpectedly kicked in\n");
+		ret = -EINVAL;
+	}
+
+	if (flow_control &&
+	    gse->tasklets_submit_count == tasklets_submit_count) {
+		pr_err("Flow control for multi-lrc did not kick in\n");
+		ret = -EINVAL;
+	}
+
+	if (i915_reset_count(global) != reset_count) {
+		pr_err("Unexpected GPU reset during multi-lrc submit\n");
+		ret = -EINVAL;
+	}
+
+out:
+	multi_lrc_context_put(parent);
+	return ret;
+}
+
 #define NUM_GUC_ID		256
 #define NUM_CONTEXT		1024
 #define NUM_RQ_PER_CONTEXT	2
@@ -240,6 +299,13 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
 		goto err_spin_rq;
 	}
 
+	/* Ensure Multi-LRC not blocked */
+	ret = multi_lrc_not_blocked(gt, !limit_guc_ids);
+	if (ret < 0) {
+		pr_err("Multi-lrc can't make progress: %d\n", ret);
+		goto err_spin_rq;
+	}
+
 	/* Inject hang in flow control state machine */
 	if (hang) {
 		guc->gse_hang_expected = true;
@@ -559,6 +625,237 @@ static int intel_guc_flow_control_bad_desc_h2g(void *arg)
 	return __intel_guc_flow_control_deadlock_h2g(arg, true);
 }
 
+#define NUM_CONTEXT_MULTI_LRC	256
+
+static int
+__intel_guc_flow_control_multi_lrc_guc(void *arg, bool limit_guc_ids, bool hang)
+{
+	struct intel_gt *gt = arg;
+	struct intel_guc *guc = &gt->uc.guc;
+	struct guc_submit_engine *gse = guc->gse[GUC_SUBMIT_ENGINE_MULTI_LRC];
+	struct intel_context **contexts;
+	int ret = 0;
+	int i, j, k;
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *spin_rq, *last = NULL;
+	intel_wakeref_t wakeref;
+	struct intel_engine_cs *engine;
+	struct i915_gpu_error *global = &gt->i915->gpu_error;
+	unsigned int reset_count;
+	u64 tasklets_submit_count = gse->tasklets_submit_count;
+	u32 old_beat;
+
+	if (limit_guc_ids)
+		guc->num_guc_ids = NUM_GUC_ID;
+
+	contexts = kmalloc(sizeof(*contexts) * NUM_CONTEXT, GFP_KERNEL);
+	if (!contexts) {
+		pr_err("Context array allocation failed\n");
+		return -ENOMEM;
+	}
+
+	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
+
+	ce = intel_context_create(intel_selftest_find_any_engine(gt));
+	if (IS_ERR(ce)) {
+		ret = PTR_ERR(ce);
+		pr_err("Failed to create context: %d\n", ret);
+		goto err;
+	}
+
+	reset_count = i915_reset_count(global);
+	engine = ce->engine;
+
+	old_beat = engine->props.heartbeat_interval_ms;
+	if (hang) {
+		ret = intel_engine_set_heartbeat(engine, HEARTBEAT_INTERVAL);
+		if (ret) {
+			pr_err("Failed to boost heartbeat interval: %d\n", ret);
+			intel_context_put(ce);
+			goto err;
+		}
+	}
+
+	/* Create spinner to block requests in below loop */
+	ret = igt_spinner_init(&spin, engine->gt);
+	if (ret) {
+		pr_err("Failed to create spinner: %d\n", ret);
+		intel_context_put(ce);
+		goto err_heartbeat;
+	}
+	spin_rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	intel_context_put(ce);
+	if (IS_ERR(spin_rq)) {
+		ret = PTR_ERR(spin_rq);
+		pr_err("Failed to create spinner request: %d\n", ret);
+		goto err_spin_rq;
+	}
+	ret = __request_add_spin(spin_rq, &spin);
+	if (ret) {
+		pr_err("Failed to add Spinner request: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	for (i = 0; i < NUM_RQ_PER_CONTEXT; ++i) {
+		for (j = 0; j < NUM_CONTEXT_MULTI_LRC; ++j) {
+			for (k = 0; k < NUM_RQ_PER_CONTEXT; ++k) {
+				bool first_pass = !i && !k;
+
+				if (last)
+					i915_request_put(last);
+				last = NULL;
+				if (first_pass)
+					contexts[j] = multi_lrc_create_parent(gt, VIDEO_DECODE_CLASS, 0);
+				ce = contexts[j];
+
+				if (IS_ERR(ce)) {
+					ret = PTR_ERR(ce);
+					pr_err("Failed to create context: %d\n", ret);
+					goto err_spin_rq;
+				} else if (!ce) {
+					ret = 0;
+					goto err_spin_rq;
+				}
+
+				last = multi_lrc_nop_request(ce, spin_rq);
+				if (first_pass)
+					multi_lrc_context_put(ce);
+				if (IS_ERR(last)) {
+					ret = PTR_ERR(last);
+					pr_err("Failed to create request: %d\n", ret);
+					goto err_spin_rq;
+				}
+			}
+		}
+	}
+
+	/* Verify GuC submit engine state */
+	if (limit_guc_ids && !guc_ids_exhausted(gse)) {
+		pr_err("guc_ids not exhausted\n");
+		ret = -EINVAL;
+		goto err_spin_rq;
+	}
+	if (!limit_guc_ids && guc_ids_exhausted(gse)) {
+		pr_err("guc_ids exhausted\n");
+		ret = -EINVAL;
+		goto err_spin_rq;
+	}
+
+	/* Ensure no DoS from unready requests */
+	ret = multi_lrc_not_blocked(gt, true);
+	if (ret < 0) {
+		pr_err("Multi-lrc DoS: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	/* Ensure Single-LRC not blocked, not in flow control */
+	ret = nop_request_wait(engine, false, !limit_guc_ids);
+	if (ret < 0) {
+		pr_err("User NOP request DoS: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	/* Inject hang in flow control state machine */
+	if (hang) {
+		guc->gse_hang_expected = true;
+		guc->inject_bad_sched_disable = true;
+	}
+
+	/* Release blocked requests */
+	igt_spinner_end(&spin);
+	ret = intel_selftest_wait_for_rq(spin_rq);
+	if (ret) {
+		pr_err("Spin request failed to complete: %d\n", ret);
+		goto err_spin_rq;
+	}
+	i915_request_put(spin_rq);
+	igt_spinner_fini(&spin);
+	spin_rq = NULL;
+
+	/* Wait for last request / GT to idle */
+	ret = i915_request_wait(last, 0, hang ? HZ * 30 : HZ * 5);
+	if (ret < 0) {
+		pr_err("Last request failed to complete: %d\n", ret);
+		goto err_spin_rq;
+	}
+	i915_request_put(last);
+	last = NULL;
+	ret = intel_gt_wait_for_idle(gt, HZ * 5);
+	if (ret < 0) {
+		pr_err("GT failed to idle: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+	/* Check state after idle */
+	if (guc_ids_exhausted(gse)) {
+		pr_err("guc_ids exhausted after last request signaled\n");
+		ret = -EINVAL;
+		goto err_spin_rq;
+	}
+	if (hang) {
+		if (i915_reset_count(global) == reset_count) {
+			pr_err("Failed to record a GPU reset\n");
+			ret = -EINVAL;
+			goto err_spin_rq;
+		}
+	} else {
+		if (i915_reset_count(global) != reset_count) {
+			pr_err("Unexpected GPU reset\n");
+			ret = -EINVAL;
+			goto err_spin_rq;
+		}
+		if (gse->tasklets_submit_count == tasklets_submit_count) {
+			pr_err("Flow control failed to kick in\n");
+			ret = -EINVAL;
+			goto err_spin_rq;
+		}
+	}
+
+	/* Verify requests can be submitted after flow control */
+	ret = nop_request_wait(engine, true, false);
+	if (ret < 0) {
+		pr_err("Kernel NOP failed to complete: %d\n", ret);
+		goto err_spin_rq;
+	}
+	ret = nop_request_wait(engine, false, false);
+	if (ret < 0) {
+		pr_err("User NOP failed to complete: %d\n", ret);
+		goto err_spin_rq;
+	}
+
+err_spin_rq:
+	if (spin_rq) {
+		igt_spinner_end(&spin);
+		intel_selftest_wait_for_rq(spin_rq);
+		i915_request_put(spin_rq);
+		igt_spinner_fini(&spin);
+		intel_gt_wait_for_idle(gt, HZ * 5);
+	}
+err_heartbeat:
+	if (last)
+		i915_request_put(last);
+	intel_engine_set_heartbeat(engine, old_beat);
+err:
+	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
+	guc->num_guc_ids = guc->max_guc_ids;
+	guc->gse_hang_expected = false;
+	guc->inject_bad_sched_disable = false;
+	kfree(contexts);
+
+	return ret;
+}
+
+static int intel_guc_flow_control_multi_lrc_guc_ids(void *arg)
+{
+	return __intel_guc_flow_control_multi_lrc_guc(arg, true, false);
+}
+
+static int intel_guc_flow_control_multi_lrc_hang(void *arg)
+{
+	return __intel_guc_flow_control_multi_lrc_guc(arg, true, true);
+}
+
 int intel_guc_flow_control(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
@@ -566,6 +863,8 @@ int intel_guc_flow_control(struct drm_i915_private *i915)
 		SUBTEST(intel_guc_flow_control_guc_ids),
 		SUBTEST(intel_guc_flow_control_lrcd_reg),
 		SUBTEST(intel_guc_flow_control_hang_state_machine),
+		SUBTEST(intel_guc_flow_control_multi_lrc_guc_ids),
+		SUBTEST(intel_guc_flow_control_multi_lrc_hang),
 		SUBTEST(intel_guc_flow_control_deadlock_h2g),
 		SUBTEST(intel_guc_flow_control_bad_desc_h2g),
 	};
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
index 82eb666bba51..21b4a79778ef 100644
--- a/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_multi_lrc.c
@@ -62,11 +62,12 @@ static void multi_lrc_context_put(struct intel_context *ce)
 }
 
 static struct i915_request *
-multi_lrc_nop_request(struct intel_context *ce)
+multi_lrc_nop_request(struct intel_context *ce, struct i915_request *from)
 {
 	struct intel_context *child;
 	struct i915_request *rq, *child_rq;
 	int i = 0;
+	int ret;
 
 	GEM_BUG_ON(!intel_context_is_parent(ce));
 
@@ -74,6 +75,16 @@ multi_lrc_nop_request(struct intel_context *ce)
 	if (IS_ERR(rq))
 		return rq;
 
+	if (from) {
+		ret = i915_sw_fence_await_dma_fence(&rq->submit,
+						    &from->fence, 0,
+						    I915_FENCE_GFP);
+		if (ret < 0) {
+			i915_request_put(rq);
+			return ERR_PTR(ret);
+		}
+	}
+
 	i915_request_get(rq);
 	i915_request_add(rq);
 
@@ -112,7 +123,7 @@ static int __intel_guc_multi_lrc_basic(struct intel_gt *gt, unsigned int class)
 		return 0;
 	}
 
-	rq = multi_lrc_nop_request(parent);
+	rq = multi_lrc_nop_request(parent, NULL);
 	if (IS_ERR(rq)) {
 		ret = PTR_ERR(rq);
 		pr_err("Failed creating requests: %d", ret);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 30/46] drm/i915/guc: Implement no mid batch preemption for multi-lrc
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (28 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 29/46] drm/i915/guc: Extend GuC flow control selftest for multi-lrc Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 31/46] drm/i915: Move secure execbuf check to execbuf2 Matthew Brost
                   ` (20 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
mid BB. To safely enable preemption at the BB boundary, a handshake
between the parent and child contexts is needed. This is implemented via
custom emit_bb_start & emit_fini_breadcrumb functions and is enabled by
default if a context is configured via the set parallel extension.
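
For reference, the ordering implemented by the new emit_bb_start /
emit_fini_breadcrumb hooks is roughly the following (pseudocode sketch
only; the join/go slots live in the extra multi-lrc context state page
and the exact emission is in the patch below):

  Parent emit_bb_start:
    wait until every child has written join[i] = PARENT_GO_BB
    MI_ARB_ON_OFF (disable preemption)
    write go = CHILD_GO_BB
    jump to parent batch

  Child emit_bb_start:
    write join[child_index] = PARENT_GO_BB
    wait until parent writes go = CHILD_GO_BB
    MI_ARB_ON_OFF (disable preemption)
    jump to child batch

  Parent emit_fini_breadcrumb:
    wait until every child has written join[i] = PARENT_GO_FINI_BREADCRUMB
    MI_ARB_ON_OFF (enable preemption)
    write go = CHILD_GO_FINI_BREADCRUMB
    write seqno breadcrumb + MI_USER_INTERRUPT

  Child emit_fini_breadcrumb:
    MI_ARB_ON_OFF (enable preemption)
    write join[child_index] = PARENT_GO_FINI_BREADCRUMB
    wait until parent writes go = CHILD_GO_FINI_BREADCRUMB
    write seqno breadcrumb + MI_USER_INTERRUPT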

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 283 +++++++++++++++++-
 4 files changed, 287 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index c327fd1c24c2..f396993374da 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -629,7 +629,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
 	GEM_BUG_ON(intel_context_is_child(child));
 	GEM_BUG_ON(intel_context_is_parent(child));
 
-	parent->guc_number_children++;
+	child->guc_child_index = parent->guc_number_children++;
 	list_add_tail(&child->guc_child_link,
 		      &parent->guc_child_list);
 	child->parent = parent;
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 9cdbea752014..fdc4890335b7 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -222,6 +222,9 @@ struct intel_context {
 	/* Number of children if parent */
 	u8 guc_number_children;
 
+	/* Child index if child */
+	u8 guc_child_index;
+
 	u8 parent_page; /* if set, page num reserved for parent context */
 
 	/*
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 6910a0cdb8c8..a2fa0e9b9559 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -181,7 +181,7 @@ struct guc_process_desc {
 	u32 wq_status;
 	u32 engine_presence;
 	u32 priority;
-	u32 reserved[30];
+	u32 reserved[36];
 } __packed;
 
 #define CONTEXT_REGISTRATION_FLAG_KMD	BIT(0)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index bc6cb9adca92..d61c45d1ac2c 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -11,6 +11,7 @@
 #include "gt/intel_context.h"
 #include "gt/intel_engine_pm.h"
 #include "gt/intel_engine_heartbeat.h"
+#include "gt/intel_gpu_commands.h"
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_irq.h"
 #include "gt/intel_gt_pm.h"
@@ -463,10 +464,14 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 
 /*
  * When using multi-lrc submission an extra page in the context state is
- * reserved for the process descriptor and work queue.
+ * reserved for the process descriptor, work queue, and preempt BB boundary
+ * handshake between the parent + children contexts.
  *
  * The layout of this page is below:
  * 0						guc_process_desc
+ * + sizeof(struct guc_process_desc)		child go
+ * + CACHELINE_BYTES				child join ...
+ * + CACHELINE_BYTES ...
  * ...						unused
  * PAGE_SIZE / 2				work queue start
  * ...						work queue
@@ -2185,6 +2190,30 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
 	return __guc_action_deregister_context(guc, guc_id, loop);
 }
 
+static inline void clear_children_join_go_memory(struct intel_context *ce)
+{
+	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
+	u8 i;
+
+	for (i = 0; i < ce->guc_number_children + 1; ++i)
+		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
+}
+
+static inline u32 get_children_go_value(struct intel_context *ce)
+{
+	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
+
+	return mem[0];
+}
+
+static inline u32 get_children_join_value(struct intel_context *ce,
+					  u8 child_index)
+{
+	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
+
+	return mem[(child_index + 1) * (CACHELINE_BYTES / sizeof(u32))];
+}
+
 static void guc_context_policy_init(struct intel_engine_cs *engine,
 				    struct guc_lrc_desc *desc)
 {
@@ -2380,6 +2409,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
 			desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
 			guc_context_policy_init(engine, desc);
 		}
+
+		clear_children_join_go_memory(ce);
 	}
 
 	/*
@@ -3808,6 +3839,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
 	.destroy = guc_child_context_destroy,
 };
 
+/*
+ * The below override of the breadcrumbs is enabled when the user configures a
+ * context for parallel submission (multi-lrc, parent-child).
+ *
+ * The overridden breadcrumbs implement an algorithm which allows the GuC to
+ * safely preempt all the hw contexts configured for parallel submission
+ * between each BB. The contract between the i915 and GuC is if the parent
+ * context can be preempted, all the children can be preempted, and the GuC will
+ * always try to preempt the parent before the children. A handshake between the
+ * parent / children breadcrumbs ensures the i915 holds up its end of the deal
+ * creating a window to preempt between each set of BBs.
+ */
+static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
+						     u64 offset, u32 len,
+						     const unsigned int flags);
+static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
+						    u64 offset, u32 len,
+						    const unsigned int flags);
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs);
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
+						u32 *cs);
+
 static struct intel_context *
 guc_create_parallel(struct intel_engine_cs **engines,
 		    unsigned int num_siblings,
@@ -3851,6 +3907,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
 	for_each_child(parent, ce)
 		ce->ops = &virtual_child_context_ops;
 
+	parent->engine->emit_bb_start =
+		emit_bb_start_parent_no_preempt_mid_batch;
+	parent->engine->emit_fini_breadcrumb =
+		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
+	parent->engine->emit_fini_breadcrumb_dw =
+		12 + 4 * parent->guc_number_children;
+	for_each_child(parent, ce) {
+		ce->engine->emit_bb_start =
+			emit_bb_start_child_no_preempt_mid_batch;
+		ce->engine->emit_fini_breadcrumb =
+			emit_fini_breadcrumb_child_no_preempt_mid_batch;
+		ce->engine->emit_fini_breadcrumb_dw = 16;
+	}
+
 	kfree(siblings);
 	return parent;
 
@@ -4212,6 +4282,204 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
 	guc->submission_selected = __guc_submission_selected(guc);
 }
 
+static inline u32 get_children_go_addr(struct intel_context *ce)
+{
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	return i915_ggtt_offset(ce->state) +
+		__get_process_desc_offset(ce) +
+		sizeof(struct guc_process_desc);
+}
+
+static inline u32 get_children_join_addr(struct intel_context *ce,
+					 u8 child_index)
+{
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
+}
+
+#define PARENT_GO_BB			1
+#define PARENT_GO_FINI_BREADCRUMB	0
+#define CHILD_GO_BB			1
+#define CHILD_GO_FINI_BREADCRUMB	0
+static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
+						     u64 offset, u32 len,
+						     const unsigned int flags)
+{
+	struct intel_context *ce = rq->context;
+	u32 *cs;
+	u8 i;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	cs = intel_ring_begin(rq, 10 + 4 * ce->guc_number_children);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/* Wait on children */
+	for (i = 0; i < ce->guc_number_children; ++i) {
+		*cs++ = (MI_SEMAPHORE_WAIT |
+			 MI_SEMAPHORE_GLOBAL_GTT |
+			 MI_SEMAPHORE_POLL |
+			 MI_SEMAPHORE_SAD_EQ_SDD);
+		*cs++ = PARENT_GO_BB;
+		*cs++ = get_children_join_addr(ce, i);
+		*cs++ = 0;
+	}
+
+	/* Turn off preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+	*cs++ = MI_NOOP;
+
+	/* Tell children go */
+	cs = gen8_emit_ggtt_write(cs,
+				  CHILD_GO_BB,
+				  get_children_go_addr(ce),
+				  0);
+
+	/* Jump to batch */
+	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
+		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
+	*cs++ = lower_32_bits(offset);
+	*cs++ = upper_32_bits(offset);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
+						    u64 offset, u32 len,
+						    const unsigned int flags)
+{
+	struct intel_context *ce = rq->context;
+	u32 *cs;
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	cs = intel_ring_begin(rq, 12);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/* Signal parent */
+	cs = gen8_emit_ggtt_write(cs,
+				  PARENT_GO_BB,
+				  get_children_join_addr(ce->parent,
+							 ce->guc_child_index),
+				  0);
+
+	/* Wait on parent for go */
+	*cs++ = (MI_SEMAPHORE_WAIT |
+		 MI_SEMAPHORE_GLOBAL_GTT |
+		 MI_SEMAPHORE_POLL |
+		 MI_SEMAPHORE_SAD_EQ_SDD);
+	*cs++ = CHILD_GO_BB;
+	*cs++ = get_children_go_addr(ce->parent);
+	*cs++ = 0;
+
+	/* Turn off preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+
+	/* Jump to batch */
+	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
+		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
+	*cs++ = lower_32_bits(offset);
+	*cs++ = upper_32_bits(offset);
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+	u8 i;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	/* Wait on children */
+	for (i = 0; i < ce->guc_number_children; ++i) {
+		*cs++ = (MI_SEMAPHORE_WAIT |
+			 MI_SEMAPHORE_GLOBAL_GTT |
+			 MI_SEMAPHORE_POLL |
+			 MI_SEMAPHORE_SAD_EQ_SDD);
+		*cs++ = PARENT_GO_FINI_BREADCRUMB;
+		*cs++ = get_children_join_addr(ce, i);
+		*cs++ = 0;
+	}
+
+	/* Turn on preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+	*cs++ = MI_NOOP;
+
+	/* Tell children go */
+	cs = gen8_emit_ggtt_write(cs,
+				  CHILD_GO_FINI_BREADCRUMB,
+				  get_children_go_addr(ce),
+				  0);
+
+	/* Emit fini breadcrumb */
+	cs = gen8_emit_ggtt_write(cs,
+				  rq->fence.seqno,
+				  i915_request_active_timeline(rq)->hwsp_offset,
+				  0);
+
+	/* User interrupt */
+	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
+
+	rq->tail = intel_ring_offset(rq, cs);
+
+	return cs;
+}
+
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	/* Turn on preemption */
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+	*cs++ = MI_NOOP;
+
+	/* Signal parent */
+	cs = gen8_emit_ggtt_write(cs,
+				  PARENT_GO_FINI_BREADCRUMB,
+				  get_children_join_addr(ce->parent,
+							 ce->guc_child_index),
+				  0);
+
+	/* Wait on parent for go */
+	*cs++ = (MI_SEMAPHORE_WAIT |
+		 MI_SEMAPHORE_GLOBAL_GTT |
+		 MI_SEMAPHORE_POLL |
+		 MI_SEMAPHORE_SAD_EQ_SDD);
+	*cs++ = CHILD_GO_FINI_BREADCRUMB;
+	*cs++ = get_children_go_addr(ce->parent);
+	*cs++ = 0;
+
+	/* Emit fini breadcrumb */
+	cs = gen8_emit_ggtt_write(cs,
+				  rq->fence.seqno,
+				  i915_request_active_timeline(rq)->hwsp_offset,
+				  0);
+
+	/* User interrupt */
+	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
+
+	rq->tail = intel_ring_offset(rq, cs);
+
+	return cs;
+}
+
 static inline struct intel_context *
 g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
 {
@@ -4653,6 +4921,19 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 			drm_printf(p, "\t\tWQI Status: %u\n\n",
 				   READ_ONCE(desc->wq_status));
 
+			drm_printf(p, "\t\tNumber Children: %u\n\n",
+				   ce->guc_number_children);
+			if (ce->engine->emit_bb_start ==
+			    emit_bb_start_parent_no_preempt_mid_batch) {
+				u8 i;
+
+				drm_printf(p, "\t\tChildren Go: %u\n\n",
+					   get_children_go_value(ce));
+				for (i = 0; i < ce->guc_number_children; ++i)
+					drm_printf(p, "\t\tChildren Join: %u\n",
+						   get_children_join_value(ce, i));
+			}
+
 			for_each_child(ce, child)
 				guc_log_context(p, child);
 		}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 31/46] drm/i915: Move secure execbuf check to execbuf2
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (29 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 30/46] drm/i915/guc: Implement no mid batch preemption " Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 32/46] drm/i915: Move input/exec fence handling to i915_gem_execbuffer2 Matthew Brost
                   ` (19 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

The goal is to remove all input sanity checks from core submission.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 35 +++++++++++--------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 1ed7475de454..70d352fc543f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3184,19 +3184,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	eb.num_fences = 0;
 
 	eb.batch_flags = 0;
-	if (args->flags & I915_EXEC_SECURE) {
-		if (GRAPHICS_VER(i915) >= 11)
-			return -ENODEV;
-
-		/* Return -EPERM to trigger fallback code on old binaries. */
-		if (!HAS_SECURE_BATCHES(i915))
-			return -EPERM;
-
-		if (!drm_is_current_master(file) || !capable(CAP_SYS_ADMIN))
-			return -EPERM;
-
+	if (args->flags & I915_EXEC_SECURE)
 		eb.batch_flags |= I915_DISPATCH_SECURE;
-	}
 	if (args->flags & I915_EXEC_IS_PINNED)
 		eb.batch_flags |= I915_DISPATCH_PINNED;
 
@@ -3414,6 +3403,18 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 		return -EINVAL;
 	}
 
+	if (args->flags & I915_EXEC_SECURE) {
+		if (GRAPHICS_VER(i915) >= 11)
+			return -ENODEV;
+
+		/* Return -EPERM to trigger fallback code on old binaries. */
+		if (!HAS_SECURE_BATCHES(i915))
+			return -EPERM;
+
+		if (!drm_is_current_master(file) || !capable(CAP_SYS_ADMIN))
+			return -EPERM;
+	}
+
 	err = i915_gem_check_execbuffer(args);
 	if (err)
 		return err;
@@ -3430,8 +3431,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 			   u64_to_user_ptr(args->buffers_ptr),
 			   sizeof(*exec2_list) * count)) {
 		drm_dbg(&i915->drm, "copy %zd exec entries failed\n", count);
-		kvfree(exec2_list);
-		return -EFAULT;
+		err = -EFAULT;
+		goto err_copy;
 	}
 
 	err = i915_gem_do_execbuffer(dev, file, args, exec2_list);
@@ -3476,6 +3477,12 @@ end:;
 
 	args->flags &= ~__I915_EXEC_UNKNOWN_FLAGS;
 	kvfree(exec2_list);
+
+	return err;
+
+err_copy:
+	kvfree(exec2_list);
+
 	return err;
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 32/46] drm/i915: Move input/exec fence handling to i915_gem_execbuffer2
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (30 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 31/46] drm/i915: Move secure execbuf check to execbuf2 Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 33/46] drm/i915: Move output " Matthew Brost
                   ` (18 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Move the job of creating the input/exec fences (from file descriptors)
out of i915_gem_do_execbuffer.
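
The resulting split is roughly the following (illustrative call flow
only; the mutually exclusive flag check and error handling are omitted):

  i915_gem_execbuffer2_ioctl()
    in_fence   = sync_file_get_fence(rsvd2)   /* I915_EXEC_FENCE_IN */
    exec_fence = sync_file_get_fence(rsvd2)   /* I915_EXEC_FENCE_SUBMIT */
    i915_gem_do_execbuffer(..., in_fence, exec_fence)
      i915_request_await_execution(eb.request, exec_fence)
      i915_request_await_dma_fence(eb.request, in_fence)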

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 75 +++++++++++--------
 1 file changed, 43 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 70d352fc543f..0416bcb551b0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3146,11 +3146,12 @@ static int
 i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_file *file,
 		       struct drm_i915_gem_execbuffer2 *args,
-		       struct drm_i915_gem_exec_object2 *exec)
+		       struct drm_i915_gem_exec_object2 *exec,
+		       struct dma_fence *in_fence,
+		       struct dma_fence *exec_fence)
 {
 	struct drm_i915_private *i915 = to_i915(dev);
 	struct i915_execbuffer eb;
-	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
 	struct i915_vma *batch;
 	int out_fence_fd = -1;
@@ -3197,25 +3198,10 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (err)
 		goto err_ext;
 
-#define IN_FENCES (I915_EXEC_FENCE_IN | I915_EXEC_FENCE_SUBMIT)
-	if (args->flags & IN_FENCES) {
-		if ((args->flags & IN_FENCES) == IN_FENCES)
-			return -EINVAL;
-
-		in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
-		if (!in_fence) {
-			err = -EINVAL;
-			goto err_ext;
-		}
-	}
-#undef IN_FENCES
-
 	if (args->flags & I915_EXEC_FENCE_OUT) {
 		out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
-		if (out_fence_fd < 0) {
-			err = out_fence_fd;
-			goto err_in_fence;
-		}
+		if (out_fence_fd < 0)
+			goto err_ext;
 	}
 
 	err = eb_create(&eb);
@@ -3277,13 +3263,16 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			goto err_ext;
 	}
 
+	if (exec_fence) {
+		err = i915_request_await_execution(eb.request,
+						   exec_fence);
+		if (err < 0)
+			goto err_request;
+	}
+
 	if (in_fence) {
-		if (args->flags & I915_EXEC_FENCE_SUBMIT)
-			err = i915_request_await_execution(eb.request,
-							   in_fence);
-		else
-			err = i915_request_await_dma_fence(eb.request,
-							   in_fence);
+		err = i915_request_await_dma_fence(eb.request,
+						   in_fence);
 		if (err < 0)
 			goto err_request;
 	}
@@ -3363,8 +3352,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 err_out_fence:
 	if (out_fence_fd != -1)
 		put_unused_fd(out_fence_fd);
-err_in_fence:
-	dma_fence_put(in_fence);
 err_ext:
 	put_fence_array(eb.fences, eb.num_fences);
 	return err;
@@ -3395,6 +3382,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_gem_execbuffer2 *args = data;
 	struct drm_i915_gem_exec_object2 *exec2_list;
+	struct dma_fence *in_fence = NULL;
+	struct dma_fence *exec_fence = NULL;
 	const size_t count = args->buffer_count;
 	int err;
 
@@ -3419,13 +3408,33 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	if (err)
 		return err;
 
+	if (args->flags & I915_EXEC_FENCE_IN) {
+		in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
+		if (!in_fence)
+			return -EINVAL;
+	}
+
+	if (args->flags & I915_EXEC_FENCE_SUBMIT) {
+		if (in_fence) {
+			err = -EINVAL;
+			goto err_exec_fence;
+		}
+
+		exec_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
+		if (!exec_fence) {
+			err = -EINVAL;
+			goto err_exec_fence;
+		}
+	}
+
 	/* Allocate extra slots for use by the command parser */
 	exec2_list = kvmalloc_array(count + 2, eb_element_size(),
 				    __GFP_NOWARN | GFP_KERNEL);
 	if (exec2_list == NULL) {
 		drm_dbg(&i915->drm, "Failed to allocate exec list for %zd buffers\n",
 			count);
-		return -ENOMEM;
+		err = -ENOMEM;
+		goto err_alloc;
 	}
 	if (copy_from_user(exec2_list,
 			   u64_to_user_ptr(args->buffers_ptr),
@@ -3435,7 +3444,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 		goto err_copy;
 	}
 
-	err = i915_gem_do_execbuffer(dev, file, args, exec2_list);
+	err = i915_gem_do_execbuffer(dev, file, args, exec2_list, in_fence,
+				     exec_fence);
 
 	/*
 	 * Now that we have begun execution of the batchbuffer, we ignore
@@ -3476,12 +3486,13 @@ end:;
 	}
 
 	args->flags &= ~__I915_EXEC_UNKNOWN_FLAGS;
-	kvfree(exec2_list);
-
-	return err;
 
 err_copy:
 	kvfree(exec2_list);
+err_alloc:
+	dma_fence_put(exec_fence);
+err_exec_fence:
+	dma_fence_put(in_fence);
 
 	return err;
 }
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 33/46] drm/i915: Move output fence handling to i915_gem_execbuffer2
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (31 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 32/46] drm/i915: Move input/exec fence handling to i915_gem_execbuffer2 Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 34/46] drm/i915: Return output fence from i915_gem_do_execbuffer Matthew Brost
                   ` (17 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Move the job of creating a new file descriptor and passing it back to
userspace to i915_gem_execbuffer2.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 45 ++++++++++---------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 0416bcb551b0..66f1819fcebc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3148,13 +3148,13 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_i915_gem_execbuffer2 *args,
 		       struct drm_i915_gem_exec_object2 *exec,
 		       struct dma_fence *in_fence,
-		       struct dma_fence *exec_fence)
+		       struct dma_fence *exec_fence,
+		       int out_fence_fd)
 {
 	struct drm_i915_private *i915 = to_i915(dev);
 	struct i915_execbuffer eb;
 	struct sync_file *out_fence = NULL;
 	struct i915_vma *batch;
-	int out_fence_fd = -1;
 	int err;
 
 	BUILD_BUG_ON(__EXEC_INTERNAL_FLAGS & ~__I915_EXEC_ILLEGAL_FLAGS);
@@ -3198,15 +3198,9 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (err)
 		goto err_ext;
 
-	if (args->flags & I915_EXEC_FENCE_OUT) {
-		out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
-		if (out_fence_fd < 0)
-			goto err_ext;
-	}
-
 	err = eb_create(&eb);
 	if (err)
-		goto err_out_fence;
+		goto err_ext;
 
 	GEM_BUG_ON(!eb.lut_size);
 
@@ -3283,7 +3277,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			goto err_request;
 	}
 
-	if (out_fence_fd != -1) {
+	if (out_fence_fd >= 0) {
 		out_fence = sync_file_create(&eb.request->fence);
 		if (!out_fence) {
 			err = -ENOMEM;
@@ -3313,14 +3307,10 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		signal_fence_array(&eb);
 
 	if (out_fence) {
-		if (err == 0) {
+		if (err == 0)
 			fd_install(out_fence_fd, out_fence->file);
-			args->rsvd2 &= GENMASK_ULL(31, 0); /* keep in-fence */
-			args->rsvd2 |= (u64)out_fence_fd << 32;
-			out_fence_fd = -1;
-		} else {
+		else
 			fput(out_fence->file);
-		}
 	}
 
 	if (unlikely(eb.gem_context->syncobj)) {
@@ -3349,9 +3339,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	i915_gem_context_put(eb.gem_context);
 err_destroy:
 	eb_destroy(&eb);
-err_out_fence:
-	if (out_fence_fd != -1)
-		put_unused_fd(out_fence_fd);
 err_ext:
 	put_fence_array(eb.fences, eb.num_fences);
 	return err;
@@ -3384,6 +3371,7 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	struct drm_i915_gem_exec_object2 *exec2_list;
 	struct dma_fence *in_fence = NULL;
 	struct dma_fence *exec_fence = NULL;
+	int out_fence_fd = -1;
 	const size_t count = args->buffer_count;
 	int err;
 
@@ -3427,6 +3415,14 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 		}
 	}
 
+	if (args->flags & I915_EXEC_FENCE_OUT) {
+		out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
+		if (out_fence_fd < 0) {
+			err = out_fence_fd;
+			goto err_out_fence;
+		}
+	}
+
 	/* Allocate extra slots for use by the command parser */
 	exec2_list = kvmalloc_array(count + 2, eb_element_size(),
 				    __GFP_NOWARN | GFP_KERNEL);
@@ -3445,7 +3441,7 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	}
 
 	err = i915_gem_do_execbuffer(dev, file, args, exec2_list, in_fence,
-				     exec_fence);
+				     exec_fence, out_fence_fd);
 
 	/*
 	 * Now that we have begun execution of the batchbuffer, we ignore
@@ -3485,11 +3481,20 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 end:;
 	}
 
+	if (!err && out_fence_fd >= 0) {
+		args->rsvd2 &= GENMASK_ULL(31, 0); /* keep in-fence */
+		args->rsvd2 |= (u64)out_fence_fd << 32;
+		out_fence_fd = -1;
+	}
+
 	args->flags &= ~__I915_EXEC_UNKNOWN_FLAGS;
 
 err_copy:
 	kvfree(exec2_list);
 err_alloc:
+	if (out_fence_fd >= 0)
+		put_unused_fd(out_fence_fd);
+err_out_fence:
 	dma_fence_put(exec_fence);
 err_exec_fence:
 	dma_fence_put(in_fence);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 34/46] drm/i915: Return output fence from i915_gem_do_execbuffer
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (32 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 33/46] drm/i915: Move output " Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 35/46] drm/i915: Store batch index in struct i915_execbuffer Matthew Brost
                   ` (16 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Move the job of creating a new sync fence and installing it onto a file
descriptor to i915_gem_execbuffer2.
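
The resulting flow is roughly (illustrative only; error paths omitted):

  i915_gem_do_execbuffer(..., &out_fence)
    *out_fence = dma_fence_get(&eb.request->fence)
  i915_gem_execbuffer2_ioctl()
    sync_fence = sync_file_create(out_fence)
    fd_install(out_fence_fd, sync_fence->file)
    args->rsvd2 |= (u64)out_fence_fd << 32
    dma_fence_put(out_fence)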

Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 39 +++++++++----------
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 66f1819fcebc..40311583f03d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3149,11 +3149,10 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_i915_gem_exec_object2 *exec,
 		       struct dma_fence *in_fence,
 		       struct dma_fence *exec_fence,
-		       int out_fence_fd)
+		       struct dma_fence **out_fence)
 {
 	struct drm_i915_private *i915 = to_i915(dev);
 	struct i915_execbuffer eb;
-	struct sync_file *out_fence = NULL;
 	struct i915_vma *batch;
 	int err;
 
@@ -3277,14 +3276,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			goto err_request;
 	}
 
-	if (out_fence_fd >= 0) {
-		out_fence = sync_file_create(&eb.request->fence);
-		if (!out_fence) {
-			err = -ENOMEM;
-			goto err_request;
-		}
-	}
-
 	/*
 	 * Whilst this request exists, batch_obj will be on the
 	 * active_list, and so will hold the active reference. Only when this
@@ -3306,12 +3297,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (eb.fences)
 		signal_fence_array(&eb);
 
-	if (out_fence) {
-		if (err == 0)
-			fd_install(out_fence_fd, out_fence->file);
-		else
-			fput(out_fence->file);
-	}
+	if (!err && out_fence)
+		*out_fence = dma_fence_get(&eb.request->fence);
 
 	if (unlikely(eb.gem_context->syncobj)) {
 		drm_syncobj_replace_fence(eb.gem_context->syncobj,
@@ -3369,6 +3356,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_gem_execbuffer2 *args = data;
 	struct drm_i915_gem_exec_object2 *exec2_list;
+	struct dma_fence **out_fence_p = NULL;
+	struct dma_fence *out_fence = NULL;
 	struct dma_fence *in_fence = NULL;
 	struct dma_fence *exec_fence = NULL;
 	int out_fence_fd = -1;
@@ -3421,6 +3410,7 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 			err = out_fence_fd;
 			goto err_out_fence;
 		}
+		out_fence_p = &out_fence;
 	}
 
 	/* Allocate extra slots for use by the command parser */
@@ -3441,7 +3431,7 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	}
 
 	err = i915_gem_do_execbuffer(dev, file, args, exec2_list, in_fence,
-				     exec_fence, out_fence_fd);
+				     exec_fence, out_fence_p);
 
 	/*
 	 * Now that we have begun execution of the batchbuffer, we ignore
@@ -3482,9 +3472,18 @@ end:;
 	}
 
 	if (!err && out_fence_fd >= 0) {
-		args->rsvd2 &= GENMASK_ULL(31, 0); /* keep in-fence */
-		args->rsvd2 |= (u64)out_fence_fd << 32;
-		out_fence_fd = -1;
+		struct sync_file *sync_fence;
+
+		sync_fence = sync_file_create(out_fence);
+		if (sync_fence) {
+			fd_install(out_fence_fd, sync_fence->file);
+			args->rsvd2 &= GENMASK_ULL(31, 0); /* keep in-fence */
+			args->rsvd2 |= (u64)out_fence_fd << 32;
+			out_fence_fd = -1;
+		}
+		dma_fence_put(out_fence);
+	} else if (out_fence) {
+		dma_fence_put(out_fence);
 	}
 
 	args->flags &= ~__I915_EXEC_UNKNOWN_FLAGS;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 35/46] drm/i915: Store batch index in struct i915_execbuffer
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (33 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 34/46] drm/i915: Return output fence from i915_gem_do_execbuffer Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 36/46] drm/i915: Allow callers of i915_gem_do_execbuffer to override the batch index Matthew Brost
                   ` (15 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

This will help with upcoming extensions where more than 1 batch can be
submitted in a single execbuf IOCTL.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 40311583f03d..1f1f477e46b4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -252,6 +252,9 @@ struct i915_execbuffer {
 	struct eb_vma *batch; /** identity of the batch obj/vma */
 	struct i915_vma *trampoline; /** trampoline used for chaining */
 
+	/* batch_index in vma list */
+	unsigned int batch_index;
+
 	/** actual size of execobj[] as we may extend it for the cmdparser */
 	unsigned int buffer_count;
 
@@ -361,6 +364,11 @@ static int eb_create(struct i915_execbuffer *eb)
 		eb->lut_size = -eb->buffer_count;
 	}
 
+	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
+		eb->batch_index = 0;
+	else
+		eb->batch_index = eb->args->buffer_count - 1;
+
 	return 0;
 }
 
@@ -735,14 +743,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 	} while (1);
 }
 
-static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
-{
-	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
-		return 0;
-	else
-		return eb->buffer_count - 1;
-}
-
 static int eb_select_context(struct i915_execbuffer *eb)
 {
 	struct i915_gem_context *ctx;
@@ -852,7 +852,6 @@ static struct i915_vma *eb_lookup_vma(struct i915_execbuffer *eb, u32 handle)
 static int eb_lookup_vmas(struct i915_execbuffer *eb)
 {
 	struct drm_i915_private *i915 = eb->i915;
-	unsigned int batch = eb_batch_index(eb);
 	unsigned int i;
 	int err = 0;
 
@@ -873,7 +872,7 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 			goto err;
 		}
 
-		eb_add_vma(eb, i, batch, vma);
+		eb_add_vma(eb, i, eb->batch_index, vma);
 
 		if (i915_gem_object_is_userptr(vma->obj)) {
 			err = i915_gem_object_userptr_submit_init(vma->obj);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 36/46] drm/i915: Allow callers of i915_gem_do_execbuffer to override the batch index
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (34 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 35/46] drm/i915: Store batch index in struct i915_execbuffer Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 37/46] drm/i915: Teach execbuf there can be more than one batch in the objects list Matthew Brost
                   ` (14 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Allow specifying the batch index directly, overriding what is inferred
from the passed-in execbuf flags.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 1f1f477e46b4..707e12725f74 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3146,6 +3146,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_file *file,
 		       struct drm_i915_gem_execbuffer2 *args,
 		       struct drm_i915_gem_exec_object2 *exec,
+		       int batch_index,
 		       struct dma_fence *in_fence,
 		       struct dma_fence *exec_fence,
 		       struct dma_fence **out_fence)
@@ -3202,6 +3203,9 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	GEM_BUG_ON(!eb.lut_size);
 
+	if (batch_index >= 0)
+		eb.batch_index = batch_index;
+
 	err = eb_select_context(&eb);
 	if (unlikely(err))
 		goto err_destroy;
@@ -3429,7 +3433,7 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 		goto err_copy;
 	}
 
-	err = i915_gem_do_execbuffer(dev, file, args, exec2_list, in_fence,
+	err = i915_gem_do_execbuffer(dev, file, args, exec2_list, -1, in_fence,
 				     exec_fence, out_fence_p);
 
 	/*
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 37/46] drm/i915: Teach execbuf there can be more than one batch in the objects list
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (35 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 36/46] drm/i915: Allow callers of i915_gem_do_execbuffer to override the batch index Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 38/46] drm/i915: Only track object dependencies on first request Matthew Brost
                   ` (13 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

In case of multiple batches, all batches will be either at the beginning
or at the end of the exec objects array, as controlled by the existing
execbuffer2 flag.

Batches not executed in the current execbuf call will not be processed
for relocations but will be pinned in the same manner as the current
batch.

This will enable multiple do_execbuf calls with a single exec object
array in a later patch.
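
As an illustration (not part of the patch), with five exec objects and
two batches the intended placement is:

  I915_EXEC_BATCH_FIRST set:    [BB0][BB1][obj][obj][obj]
  I915_EXEC_BATCH_FIRST clear:  [obj][obj][obj][BB0][BB1]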

Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 31 +++++++++++++------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 707e12725f74..2835ef8734e5 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -255,6 +255,9 @@ struct i915_execbuffer {
 	/* batch_index in vma list */
 	unsigned int batch_index;
 
+	/* number of batches in execbuf IOCTL */
+	unsigned int num_batches;
+
 	/** actual size of execobj[] as we may extend it for the cmdparser */
 	unsigned int buffer_count;
 
@@ -511,6 +514,14 @@ static bool platform_has_relocs_enabled(const struct i915_execbuffer *eb)
 	return false;
 }
 
+static inline bool
+is_batch_buffer(struct i915_execbuffer *eb, unsigned int buffer_idx)
+{
+	return eb->args->flags & I915_EXEC_BATCH_FIRST ?
+		buffer_idx <= eb->num_batches :
+		buffer_idx >= eb->args->buffer_count - eb->num_batches;
+}
+
 static int
 eb_validate_vma(struct i915_execbuffer *eb,
 		struct drm_i915_gem_exec_object2 *entry,
@@ -562,11 +573,10 @@ eb_validate_vma(struct i915_execbuffer *eb,
 
 static void
 eb_add_vma(struct i915_execbuffer *eb,
-	   unsigned int i, unsigned batch_idx,
-	   struct i915_vma *vma)
+	   unsigned int buffer_idx, struct i915_vma *vma)
 {
-	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
-	struct eb_vma *ev = &eb->vma[i];
+	struct drm_i915_gem_exec_object2 *entry = &eb->exec[buffer_idx];
+	struct eb_vma *ev = &eb->vma[buffer_idx];
 
 	ev->vma = vma;
 	ev->exec = entry;
@@ -591,14 +601,15 @@ eb_add_vma(struct i915_execbuffer *eb,
 	 * Note that actual hangs have only been observed on gen7, but for
 	 * paranoia do it everywhere.
 	 */
-	if (i == batch_idx) {
+	if (is_batch_buffer(eb, buffer_idx)) {
 		if (entry->relocation_count &&
 		    !(ev->flags & EXEC_OBJECT_PINNED))
 			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
 		if (eb->reloc_cache.has_fence)
 			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
 
-		eb->batch = ev;
+		if (buffer_idx == eb->batch_index)
+			eb->batch = ev;
 	}
 }
 
@@ -872,7 +883,7 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 			goto err;
 		}
 
-		eb_add_vma(eb, i, eb->batch_index, vma);
+		eb_add_vma(eb, i, vma);
 
 		if (i915_gem_object_is_userptr(vma->obj)) {
 			err = i915_gem_object_userptr_submit_init(vma->obj);
@@ -3147,6 +3158,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_i915_gem_execbuffer2 *args,
 		       struct drm_i915_gem_exec_object2 *exec,
 		       int batch_index,
+		       unsigned int num_batches,
 		       struct dma_fence *in_fence,
 		       struct dma_fence *exec_fence,
 		       struct dma_fence **out_fence)
@@ -3203,6 +3215,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	GEM_BUG_ON(!eb.lut_size);
 
+	eb.num_batches = num_batches;
 	if (batch_index >= 0)
 		eb.batch_index = batch_index;
 
@@ -3433,8 +3446,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 		goto err_copy;
 	}
 
-	err = i915_gem_do_execbuffer(dev, file, args, exec2_list, -1, in_fence,
-				     exec_fence, out_fence_p);
+	err = i915_gem_do_execbuffer(dev, file, args, exec2_list, -1, 1,
+				     in_fence, exec_fence, out_fence_p);
 
 	/*
 	 * Now that we have begun execution of the batchbuffer, we ignore
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 38/46] drm/i915: Only track object dependencies on first request
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (36 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 37/46] drm/i915: Teach execbuf there can be more than one batch in the objects list Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 39/46] drm/i915: Force parallel contexts to use copy engine for reloc Matthew Brost
                   ` (12 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Only track object dependencies on the first request generated from the
execbuf; this helps with the upcoming multi-bb execbuf extension.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 2835ef8734e5..b224b28530d1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -2239,7 +2239,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
-static int eb_move_to_gpu(struct i915_execbuffer *eb)
+static int eb_move_to_gpu(struct i915_execbuffer *eb, bool first)
 {
 	const unsigned int count = eb->buffer_count;
 	unsigned int i = count;
@@ -2281,7 +2281,7 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 				flags &= ~EXEC_OBJECT_ASYNC;
 		}
 
-		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
+		if (err == 0 && first && !(flags & EXEC_OBJECT_ASYNC)) {
 			err = i915_request_await_object
 				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
 		}
@@ -2525,14 +2525,15 @@ static int eb_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
-static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
+static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch,
+		     bool first)
 {
 	int err;
 
 	if (intel_context_nopreempt(eb->context))
 		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
 
-	err = eb_move_to_gpu(eb);
+	err = eb_move_to_gpu(eb, first);
 	if (err)
 		return err;
 
@@ -3304,7 +3305,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
 
 	trace_i915_request_queue(eb.request, eb.batch_flags);
-	err = eb_submit(&eb, batch);
+	err = eb_submit(&eb, batch, true);
 
 err_request:
 	i915_request_get(eb.request);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 39/46] drm/i915: Force parallel contexts to use copy engine for reloc
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (37 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 38/46] drm/i915: Only track object dependencies on first request Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 16:39   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 40/46] drm/i915: Multi-batch execbuffer2 Matthew Brost
                   ` (11 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Submitting to a subset of hardware contexts is not allowed, so use the
copy engine for GPU relocations when using a parallel context.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index b224b28530d1..b6143973ac67 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1386,7 +1386,8 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	if (err)
 		goto err_unmap;
 
-	if (engine == eb->context->engine) {
+	if (engine == eb->context->engine &&
+	    !intel_context_is_parallel(eb->context)) {
 		rq = i915_request_create(eb->context);
 	} else {
 		struct intel_context *ce = eb->reloc_context;
@@ -1483,7 +1484,8 @@ static u32 *reloc_gpu(struct i915_execbuffer *eb,
 		if (eb_use_cmdparser(eb))
 			return ERR_PTR(-EWOULDBLOCK);
 
-		if (!reloc_can_use_engine(engine)) {
+		if (!reloc_can_use_engine(engine) ||
+		    intel_context_is_parallel(eb->context)) {
 			engine = engine->gt->engine_class[COPY_ENGINE_CLASS][0];
 			if (!engine)
 				return ERR_PTR(-ENODEV);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 40/46] drm/i915: Multi-batch execbuffer2
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (38 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 39/46] drm/i915: Force parallel contexts to use copy engine for reloc Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 17:02   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 41/46] drm/i915: Eliminate unnecessary VMA calls for multi-BB submission Matthew Brost
                   ` (10 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

For contexts with width set to two or more, we add a mode to execbuf2
which implies there are N batch buffers in the buffer list, each of
which will be sent to one of the engines from the engine map array
(I915_CONTEXT_PARAM_ENGINES, I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT).

Those N batches can either be the first N or the last N objects in the
list, as controlled by the existing execbuffer2 flag.

The N batches will be submitted to consecutive engines from the previously
configured allowed engine array starting at index 0.

Input and output fences are fully supported, with the latter getting
signalled when all batch buffers have completed.

Lastly, it isn't safe for subsequent batches to touch any objects written
to by a multi-BB submission until all the batches in that submission
complete. As such, all batches in a multi-BB submission must be combined
into a single composite fence and put into the dma reservation excl
fence slot.
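
For completeness, a rough userspace sketch of the new mode (hypothetical
fd, handles and ctx_id; assumes a context whose engine map was created
with I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT at slot 0 and width 2,
error handling omitted):

  struct drm_i915_gem_exec_object2 objects[3] = {
          { .handle = bb0 },    /* batch for engine 0 (parent) */
          { .handle = bb1 },    /* batch for engine 1 (child)  */
          { .handle = data },   /* object both batches may write */
  };
  struct drm_i915_gem_execbuffer2 execbuf = {
          .buffers_ptr = (uintptr_t)objects,
          .buffer_count = 3,
          /* batch_len must be 0 for a parallel context */
          .flags = I915_EXEC_BATCH_FIRST | I915_EXEC_FENCE_OUT,
          .rsvd1 = ctx_id,      /* context with the parallel engine map */
  };
  int out_fence_fd;

  drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, &execbuf);
  /* upper 32 bits of rsvd2 hold an fd for the composite out fence */
  out_fence_fd = execbuf.rsvd2 >> 32;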

Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 262 +++++++++++++++---
 drivers/gpu/drm/i915/gt/intel_context.c       |   5 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +
 drivers/gpu/drm/i915/i915_vma.c               |  13 +-
 drivers/gpu/drm/i915/i915_vma.h               |  16 +-
 5 files changed, 266 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index b6143973ac67..ecdb583cc2eb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -252,6 +252,9 @@ struct i915_execbuffer {
 	struct eb_vma *batch; /** identity of the batch obj/vma */
 	struct i915_vma *trampoline; /** trampoline used for chaining */
 
+	/** used for excl fence in dma_resv objects when > 1 BB submitted */
+	struct dma_fence *composite_fence;
+
 	/* batch_index in vma list */
 	unsigned int batch_index;
 
@@ -367,11 +370,6 @@ static int eb_create(struct i915_execbuffer *eb)
 		eb->lut_size = -eb->buffer_count;
 	}
 
-	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
-		eb->batch_index = 0;
-	else
-		eb->batch_index = eb->args->buffer_count - 1;
-
 	return 0;
 }
 
@@ -2241,7 +2239,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	return err;
 }
 
-static int eb_move_to_gpu(struct i915_execbuffer *eb, bool first)
+static int eb_move_to_gpu(struct i915_execbuffer *eb, bool first, bool last)
 {
 	const unsigned int count = eb->buffer_count;
 	unsigned int i = count;
@@ -2289,8 +2287,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb, bool first)
 		}
 
 		if (err == 0)
-			err = i915_vma_move_to_active(vma, eb->request,
-						      flags | __EXEC_OBJECT_NO_RESERVE);
+			err = _i915_vma_move_to_active(vma, eb->request,
+						       flags | __EXEC_OBJECT_NO_RESERVE,
+						       !last ?
+						       NULL :
+						       eb->composite_fence ?
+						       eb->composite_fence :
+						       &eb->request->fence,
+						       eb->composite_fence ?
+						       eb->composite_fence :
+						       &eb->request->fence);
 	}
 
 #ifdef CONFIG_MMU_NOTIFIER
@@ -2528,14 +2534,14 @@ static int eb_parse(struct i915_execbuffer *eb)
 }
 
 static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch,
-		     bool first)
+		     bool first, bool last)
 {
 	int err;
 
 	if (intel_context_nopreempt(eb->context))
 		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
 
-	err = eb_move_to_gpu(eb, first);
+	err = eb_move_to_gpu(eb, first, last);
 	if (err)
 		return err;
 
@@ -2748,7 +2754,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
 }
 
 static int
-eb_select_engine(struct i915_execbuffer *eb)
+eb_select_engine(struct i915_execbuffer *eb, unsigned int batch_number)
 {
 	struct intel_context *ce;
 	unsigned int idx;
@@ -2763,6 +2769,18 @@ eb_select_engine(struct i915_execbuffer *eb)
 	if (IS_ERR(ce))
 		return PTR_ERR(ce);
 
+	if (batch_number > 0) {
+		struct intel_context *parent = ce;
+
+		GEM_BUG_ON(!intel_context_is_parent(parent));
+
+		for_each_child(parent, ce)
+			if (!--batch_number)
+				break;
+		intel_context_put(parent);
+		intel_context_get(ce);
+	}
+
 	intel_gt_pm_get(ce->engine->gt);
 
 	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
@@ -3155,13 +3173,49 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
 				    eb);
 }
 
+static int setup_composite_fence(struct i915_execbuffer *eb,
+				 struct dma_fence **out_fence,
+				 unsigned int num_batches)
+{
+	struct dma_fence_array *fence_array;
+	struct dma_fence **fences = kmalloc(num_batches * sizeof(*fences),
+					    GFP_KERNEL);
+	struct intel_context *parent = intel_context_to_parent(eb->context);
+	int i;
+
+	if (!fences)
+		return -ENOMEM;
+
+	for (i = 0; i < num_batches; ++i)
+		fences[i] = out_fence[i];
+
+	fence_array = dma_fence_array_create(num_batches,
+					     fences,
+					     parent->fence_context,
+					     ++parent->seqno,
+					     false);
+	if (!fence_array) {
+		kfree(fences);
+		return -ENOMEM;
+	}
+
+	/* Move ownership to the dma_fence_array created above */
+	for (i = 0; i < num_batches; ++i)
+		dma_fence_get(fences[i]);
+
+	eb->composite_fence = &fence_array->base;
+
+	return 0;
+}
+
 static int
 i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_file *file,
 		       struct drm_i915_gem_execbuffer2 *args,
 		       struct drm_i915_gem_exec_object2 *exec,
-		       int batch_index,
+		       unsigned int batch_index,
 		       unsigned int num_batches,
+		       unsigned int batch_number,
 		       struct dma_fence *in_fence,
 		       struct dma_fence *exec_fence,
 		       struct dma_fence **out_fence)
@@ -3170,6 +3224,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	struct i915_execbuffer eb;
 	struct i915_vma *batch;
 	int err;
+	bool first = batch_number == 0;
+	bool last = batch_number + 1 == num_batches;
 
 	BUILD_BUG_ON(__EXEC_INTERNAL_FLAGS & ~__I915_EXEC_ILLEGAL_FLAGS);
 	BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS &
@@ -3194,6 +3250,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	eb.batch_start_offset = args->batch_start_offset;
 	eb.batch_len = args->batch_len;
 	eb.trampoline = NULL;
+	eb.composite_fence = NULL;
 
 	eb.fences = NULL;
 	eb.num_fences = 0;
@@ -3219,14 +3276,13 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	GEM_BUG_ON(!eb.lut_size);
 
 	eb.num_batches = num_batches;
-	if (batch_index >= 0)
-		eb.batch_index = batch_index;
+	eb.batch_index = batch_index;
 
 	err = eb_select_context(&eb);
 	if (unlikely(err))
 		goto err_destroy;
 
-	err = eb_select_engine(&eb);
+	err = eb_select_engine(&eb, batch_number);
 	if (unlikely(err))
 		goto err_context;
 
@@ -3275,6 +3331,23 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			goto err_ext;
 	}
 
+	if (out_fence) {
+		/* Move ownership to caller (i915_gem_execbuffer2_ioctl) */
+		out_fence[batch_number] = dma_fence_get(&eb.request->fence);
+
+		/*
+		 * Need to create a composite fence (dma_fence_array,
+		 * eb.composite_fence) for the excl fence of the dma_resv
+		 * objects as each BB can write to the object.
+		 */
+		if (num_batches > 1 && last) {
+			err = setup_composite_fence(&eb, out_fence,
+						    num_batches);
+			if (err < 0)
+				goto err_request;
+		}
+	}
+
 	if (exec_fence) {
 		err = i915_request_await_execution(eb.request,
 						   exec_fence);
@@ -3307,17 +3380,27 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
 
 	trace_i915_request_queue(eb.request, eb.batch_flags);
-	err = eb_submit(&eb, batch, true);
+	err = eb_submit(&eb, batch, first, last);
 
 err_request:
+	if (last)
+		set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
+			&eb.request->fence.flags);
+
 	i915_request_get(eb.request);
 	err = eb_request_add(&eb, err);
 
 	if (eb.fences)
 		signal_fence_array(&eb);
 
-	if (!err && out_fence)
-		*out_fence = dma_fence_get(&eb.request->fence);
+	/*
+	 * Ownership of the composite fence (dma_fence_array,
+	 * eb.composite_fence) has been moved to the dma_resv objects these BB
+	 * write to in i915_vma_move_to_active. It is ok to release the creation
+	 * reference of this fence now.
+	 */
+	if (eb.composite_fence)
+		dma_fence_put(eb.composite_fence);
 
 	if (unlikely(eb.gem_context->syncobj)) {
 		drm_syncobj_replace_fence(eb.gem_context->syncobj,
@@ -3368,6 +3451,17 @@ static bool check_buffer_count(size_t count)
 	return !(count < 1 || count > INT_MAX || count > SIZE_MAX / sz - 1);
 }
 
+/* Release fences from the dma_fence_get in i915_gem_do_execbuffer. */
+static inline void put_out_fences(struct dma_fence **out_fences,
+				  unsigned int num_batches)
+{
+	int i;
+
+	for (i = 0; i < num_batches; ++i)
+		if (out_fences[i])
+			dma_fence_put(out_fences[i]);
+}
+
 int
 i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 			   struct drm_file *file)
@@ -3375,13 +3469,16 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_gem_execbuffer2 *args = data;
 	struct drm_i915_gem_exec_object2 *exec2_list;
-	struct dma_fence **out_fence_p = NULL;
-	struct dma_fence *out_fence = NULL;
+	struct dma_fence **out_fences = NULL;
 	struct dma_fence *in_fence = NULL;
 	struct dma_fence *exec_fence = NULL;
 	int out_fence_fd = -1;
 	const size_t count = args->buffer_count;
 	int err;
+	struct i915_gem_context *ctx;
+	struct intel_context *parent = NULL;
+	unsigned int num_batches = 1, i;
+	bool is_parallel = false;
 
 	if (!check_buffer_count(count)) {
 		drm_dbg(&i915->drm, "execbuf2 with %zd buffers\n", count);
@@ -3404,10 +3501,39 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	if (err)
 		return err;
 
+	ctx = i915_gem_context_lookup(file->driver_priv, args->rsvd1);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	if (i915_gem_context_user_engines(ctx)) {
+		parent = i915_gem_context_get_engine(ctx, args->flags &
+						     I915_EXEC_RING_MASK);
+		if (IS_ERR(parent)) {
+			err = PTR_ERR(parent);
+			goto err_context;
+		}
+
+		if (intel_context_is_parent(parent)) {
+			if (args->batch_len) {
+				err = -EINVAL;
+				goto err_context;
+			}
+
+			num_batches = parent->guc_number_children + 1;
+			if (num_batches > count) {
+				i915_gem_context_put(ctx);
+				goto err_parent;
+			}
+			is_parallel = true;
+		}
+	}
+
 	if (args->flags & I915_EXEC_FENCE_IN) {
 		in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
-		if (!in_fence)
-			return -EINVAL;
+		if (!in_fence) {
+			err = -EINVAL;
+			goto err_parent;
+		}
 	}
 
 	if (args->flags & I915_EXEC_FENCE_SUBMIT) {
@@ -3423,13 +3549,25 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 		}
 	}
 
-	if (args->flags & I915_EXEC_FENCE_OUT) {
-		out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
-		if (out_fence_fd < 0) {
-			err = out_fence_fd;
+	/*
+	 * We always allocate out fences when doing multi-BB submission as
+	 * this is required to create an excl fence for any dma buf objects
+	 * these BBs touch.
+	 */
+	if (args->flags & I915_EXEC_FENCE_OUT || is_parallel) {
+		out_fences = kcalloc(num_batches, sizeof(*out_fences),
+				     GFP_KERNEL);
+		if (!out_fences) {
+			err = -ENOMEM;
 			goto err_out_fence;
 		}
-		out_fence_p = &out_fence;
+		if (args->flags & I915_EXEC_FENCE_OUT) {
+			out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
+			if (out_fence_fd < 0) {
+				err = out_fence_fd;
+				goto err_out_fence;
+			}
+		}
 	}
 
 	/* Allocate extra slots for use by the command parser */
@@ -3449,8 +3587,35 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 		goto err_copy;
 	}
 
-	err = i915_gem_do_execbuffer(dev, file, args, exec2_list, -1, 1,
-				     in_fence, exec_fence, out_fence_p);
+	/*
+	 * Downstream submission code expects all parallel submissions to occur
+	 * in intel_context sequence, thus only 1 submission can happen at a
+	 * time.
+	 */
+	if (is_parallel)
+		mutex_lock(&parent->parallel_submit);
+
+	err = i915_gem_do_execbuffer(dev, file, args, exec2_list,
+				     args->flags & I915_EXEC_BATCH_FIRST ?
+				     0 : count - num_batches,
+				     num_batches,
+				     0,
+				     in_fence,
+				     exec_fence,
+				     out_fences);
+
+	for (i = 1; err == 0 && i < num_batches; i++)
+		err = i915_gem_do_execbuffer(dev, file, args, exec2_list,
+					     args->flags & I915_EXEC_BATCH_FIRST ?
+					     i : count - num_batches + i,
+					     num_batches,
+					     i,
+					     NULL,
+					     NULL,
+					     out_fences);
+
+	if (is_parallel)
+		mutex_unlock(&parent->parallel_submit);
 
 	/*
 	 * Now that we have begun execution of the batchbuffer, we ignore
@@ -3491,8 +3656,31 @@ end:;
 	}
 
 	if (!err && out_fence_fd >= 0) {
+		struct dma_fence *out_fence = NULL;
 		struct sync_file *sync_fence;
 
+		if (is_parallel) {
+			struct dma_fence_array *fence_array;
+
+			/*
+			 * The dma_fence_array now owns out_fences (from
+			 * dma_fence_get in i915_gem_do_execbuffer) assuming
+			 * successful creation of dma_fence_array.
+			 */
+			fence_array = dma_fence_array_create(num_batches,
+							     out_fences,
+							     parent->fence_context,
+							     ++parent->seqno,
+							     false);
+			if (!fence_array)
+				goto put_out_fences;
+
+			out_fence = &fence_array->base;
+			out_fences = NULL;
+		} else {
+			out_fence = out_fences[0];
+		}
+
 		sync_fence = sync_file_create(out_fence);
 		if (sync_fence) {
 			fd_install(out_fence_fd, sync_fence->file);
@@ -3500,9 +3688,15 @@ end:;
 			args->rsvd2 |= (u64)out_fence_fd << 32;
 			out_fence_fd = -1;
 		}
+
+		/*
+		 * The sync_file now owns out_fence, drop the creation
+		 * reference.
+		 */
 		dma_fence_put(out_fence);
-	} else if (out_fence) {
-		dma_fence_put(out_fence);
+	} else if (out_fences) {
+put_out_fences:
+		put_out_fences(out_fences, num_batches);
 	}
 
 	args->flags &= ~__I915_EXEC_UNKNOWN_FLAGS;
@@ -3513,9 +3707,15 @@ end:;
 	if (out_fence_fd >= 0)
 		put_unused_fd(out_fence_fd);
 err_out_fence:
+	kfree(out_fences);
 	dma_fence_put(exec_fence);
 err_exec_fence:
 	dma_fence_put(in_fence);
+err_parent:
+	if (parent)
+		intel_context_put(parent);
+err_context:
+	i915_gem_context_put(ctx);
 
 	return err;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index f396993374da..2c07f5f22c94 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -472,6 +472,9 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 	ce->guc_id = GUC_INVALID_LRC_ID;
 	INIT_LIST_HEAD(&ce->guc_id_link);
 
+	mutex_init(&ce->parallel_submit);
+	ce->fence_context = dma_fence_context_alloc(1);
+
 	/*
 	 * Initialize fence to be complete as this is expected to be complete
 	 * unless there is a pending schedule disable outstanding.
@@ -498,6 +501,8 @@ void intel_context_fini(struct intel_context *ce)
 		for_each_child_safe(ce, child, next)
 			intel_context_put(child);
 
+	mutex_destroy(&ce->parallel_submit);
+
 	mutex_destroy(&ce->pin_mutex);
 	i915_active_fini(&ce->active);
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index fdc4890335b7..8af9ace4c052 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -235,6 +235,15 @@ struct intel_context {
 
 	/* Last request submitted on a parent */
 	struct i915_request *last_rq;
+
+	/* Parallel submission mutex */
+	struct mutex parallel_submit;
+
+	/* Fence context for parallel submission */
+	u64 fence_context;
+
+	/* Seqno for parallel submission */
+	u32 seqno;
 };
 
 #endif /* __INTEL_CONTEXT_TYPES__ */
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 4b7fc4647e46..ed4e790276a9 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1234,9 +1234,11 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
 	return i915_active_add_request(&vma->active, rq);
 }
 
-int i915_vma_move_to_active(struct i915_vma *vma,
-			    struct i915_request *rq,
-			    unsigned int flags)
+int _i915_vma_move_to_active(struct i915_vma *vma,
+			     struct i915_request *rq,
+			     unsigned int flags,
+			     struct dma_fence *shared_fence,
+			     struct dma_fence *excl_fence)
 {
 	struct drm_i915_gem_object *obj = vma->obj;
 	int err;
@@ -1257,7 +1259,7 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 			intel_frontbuffer_put(front);
 		}
 
-		dma_resv_add_excl_fence(vma->resv, &rq->fence);
+		dma_resv_add_excl_fence(vma->resv, excl_fence);
 		obj->write_domain = I915_GEM_DOMAIN_RENDER;
 		obj->read_domains = 0;
 	} else {
@@ -1267,7 +1269,8 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 				return err;
 		}
 
-		dma_resv_add_shared_fence(vma->resv, &rq->fence);
+		if (shared_fence)
+			dma_resv_add_shared_fence(vma->resv, shared_fence);
 		obj->write_domain = 0;
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index ed69f66c7ab0..a36da651dbff 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -57,9 +57,19 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
 
 int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
 					   struct i915_request *rq);
-int __must_check i915_vma_move_to_active(struct i915_vma *vma,
-					 struct i915_request *rq,
-					 unsigned int flags);
+
+int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
+					  struct i915_request *rq,
+					  unsigned int flags,
+					  struct dma_fence *shared_fence,
+					  struct dma_fence *excl_fence);
+static inline int __must_check
+i915_vma_move_to_active(struct i915_vma *vma,
+			struct i915_request *rq,
+			unsigned int flags)
+{
+	return _i915_vma_move_to_active(vma, rq, flags, &rq->fence, &rq->fence);
+}
 
 #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 41/46] drm/i915: Eliminate unnecessary VMA calls for multi-BB submission
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (39 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 40/46] drm/i915: Multi-batch execbuffer2 Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 17:07   ` Daniel Vetter
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 42/46] drm/i915: Hold all parallel requests until last request, properly handle error Matthew Brost
                   ` (9 subsequent siblings)
  50 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Certain VMA functions in the execbuf IOCTL only need to be called on the
first or last BB of a multi-BB submission: eb_relocate_parse() on the first
and eb_release_vmas() on the last. Doing so saves CPU / GPU cycles.
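
For reference, a condensed view of the per-BB flow after this patch. This is
an illustrative sketch only (do_one_bb() is a hypothetical wrapper used just
for this write-up; the real gating is spread across the hunks below, and most
error handling is omitted). The key point is that a single ww context now
lives on the ioctl's stack and is shared by every BB in the submission:

	static int do_one_bb(struct i915_execbuffer *eb, bool first, bool last)
	{
		int err;

		if (first) {
			/* Only the first BB looks up and reserves the VMAs. */
			err = eb_lookup_vmas(eb);
			if (err) {
				eb_release_vmas(eb, true, true);
				return err;
			}
			i915_gem_ww_ctx_init(eb->ww, true);
		} else {
			/* Later BBs reuse the VMAs pinned by the first BB. */
			eb->batch = &eb->vma[eb->batch_index];
		}

		/* Relocations / cmd parsing only run for the first BB. */
		err = eb_relocate_parse(eb, first);

		/* Unreserve and drop the ww context on error or the last BB. */
		eb_release_vmas(eb, true, err || last);
		if (err || last)
			i915_gem_ww_ctx_fini(eb->ww);

		return err;
	}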

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 127 +++++++++++-------
 .../i915/gem/selftests/i915_gem_execbuffer.c  |  14 +-
 2 files changed, 83 insertions(+), 58 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index ecdb583cc2eb..70784779872a 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -270,7 +270,7 @@ struct i915_execbuffer {
 	/** list of vma that have execobj.relocation_count */
 	struct list_head relocs;
 
-	struct i915_gem_ww_ctx ww;
+	struct i915_gem_ww_ctx *ww;
 
 	/**
 	 * Track the most recently used object for relocations, as we
@@ -448,7 +448,7 @@ eb_pin_vma(struct i915_execbuffer *eb,
 		pin_flags |= PIN_GLOBAL;
 
 	/* Attempt to reuse the current location if available */
-	err = i915_vma_pin_ww(vma, &eb->ww, 0, 0, pin_flags);
+	err = i915_vma_pin_ww(vma, eb->ww, 0, 0, pin_flags);
 	if (err == -EDEADLK)
 		return err;
 
@@ -457,11 +457,11 @@ eb_pin_vma(struct i915_execbuffer *eb,
 			return err;
 
 		/* Failing that pick any _free_ space if suitable */
-		err = i915_vma_pin_ww(vma, &eb->ww,
-					     entry->pad_to_size,
-					     entry->alignment,
-					     eb_pin_flags(entry, ev->flags) |
-					     PIN_USER | PIN_NOEVICT);
+		err = i915_vma_pin_ww(vma, eb->ww,
+				      entry->pad_to_size,
+				      entry->alignment,
+				      eb_pin_flags(entry, ev->flags) |
+				      PIN_USER | PIN_NOEVICT);
 		if (unlikely(err))
 			return err;
 	}
@@ -643,9 +643,9 @@ static int eb_reserve_vma(struct i915_execbuffer *eb,
 			return err;
 	}
 
-	err = i915_vma_pin_ww(vma, &eb->ww,
-			   entry->pad_to_size, entry->alignment,
-			   eb_pin_flags(entry, ev->flags) | pin_flags);
+	err = i915_vma_pin_ww(vma, eb->ww,
+			      entry->pad_to_size, entry->alignment,
+			      eb_pin_flags(entry, ev->flags) | pin_flags);
 	if (err)
 		return err;
 
@@ -940,7 +940,7 @@ static int eb_lock_vmas(struct i915_execbuffer *eb)
 		struct eb_vma *ev = &eb->vma[i];
 		struct i915_vma *vma = ev->vma;
 
-		err = i915_gem_object_lock(vma->obj, &eb->ww);
+		err = i915_gem_object_lock(vma->obj, eb->ww);
 		if (err)
 			return err;
 	}
@@ -1020,12 +1020,13 @@ eb_get_vma(const struct i915_execbuffer *eb, unsigned long handle)
 	}
 }
 
-static void eb_release_vmas(struct i915_execbuffer *eb, bool final)
+static void eb_release_vmas(struct i915_execbuffer *eb, bool final,
+			    bool unreserve)
 {
 	const unsigned int count = eb->buffer_count;
 	unsigned int i;
 
-	for (i = 0; i < count; i++) {
+	for (i = 0; unreserve && i < count; i++) {
 		struct eb_vma *ev = &eb->vma[i];
 		struct i915_vma *vma = ev->vma;
 
@@ -1237,7 +1238,7 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
 		if (err)
 			return ERR_PTR(err);
 
-		vma = i915_gem_object_ggtt_pin_ww(obj, &eb->ww, NULL, 0, 0,
+		vma = i915_gem_object_ggtt_pin_ww(obj, eb->ww, NULL, 0, 0,
 						  PIN_MAPPABLE |
 						  PIN_NONBLOCK /* NOWARN */ |
 						  PIN_NOEVICT);
@@ -1361,7 +1362,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	}
 	eb->reloc_pool = NULL;
 
-	err = i915_gem_object_lock(pool->obj, &eb->ww);
+	err = i915_gem_object_lock(pool->obj, eb->ww);
 	if (err)
 		goto err_pool;
 
@@ -1380,7 +1381,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 		goto err_unmap;
 	}
 
-	err = i915_vma_pin_ww(batch, &eb->ww, 0, 0, PIN_USER | PIN_NONBLOCK);
+	err = i915_vma_pin_ww(batch, eb->ww, 0, 0, PIN_USER | PIN_NONBLOCK);
 	if (err)
 		goto err_unmap;
 
@@ -1402,7 +1403,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 			eb->reloc_context = ce;
 		}
 
-		err = intel_context_pin_ww(ce, &eb->ww);
+		err = intel_context_pin_ww(ce, eb->ww);
 		if (err)
 			goto err_unpin;
 
@@ -2017,8 +2018,8 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	}
 
 	/* We may process another execbuffer during the unlock... */
-	eb_release_vmas(eb, false);
-	i915_gem_ww_ctx_fini(&eb->ww);
+	eb_release_vmas(eb, false, true);
+	i915_gem_ww_ctx_fini(eb->ww);
 
 	if (rq) {
 		/* nonblocking is always false */
@@ -2062,7 +2063,7 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 		err = eb_reinit_userptr(eb);
 
 err_relock:
-	i915_gem_ww_ctx_init(&eb->ww, true);
+	i915_gem_ww_ctx_init(eb->ww, true);
 	if (err)
 		goto out;
 
@@ -2119,8 +2120,8 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 
 err:
 	if (err == -EDEADLK) {
-		eb_release_vmas(eb, false);
-		err = i915_gem_ww_ctx_backoff(&eb->ww);
+		eb_release_vmas(eb, false, true);
+		err = i915_gem_ww_ctx_backoff(eb->ww);
 		if (!err)
 			goto repeat_validate;
 	}
@@ -2152,7 +2153,7 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
 	return err;
 }
 
-static int eb_relocate_parse(struct i915_execbuffer *eb)
+static int eb_relocate_parse(struct i915_execbuffer *eb, bool first)
 {
 	int err;
 	struct i915_request *rq = NULL;
@@ -2189,14 +2190,16 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 	/* only throttle once, even if we didn't need to throttle */
 	throttle = false;
 
-	err = eb_validate_vmas(eb);
-	if (err == -EAGAIN)
-		goto slow;
-	else if (err)
-		goto err;
+	if (first) {
+		err = eb_validate_vmas(eb);
+		if (err == -EAGAIN)
+			goto slow;
+		else if (err)
+			goto err;
+	}
 
 	/* The objects are in their final locations, apply the relocations. */
-	if (eb->args->flags & __EXEC_HAS_RELOC) {
+	if (eb->args->flags & __EXEC_HAS_RELOC && first) {
 		struct eb_vma *ev;
 
 		list_for_each_entry(ev, &eb->relocs, reloc_link) {
@@ -2211,13 +2214,13 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
 			goto slow;
 	}
 
-	if (!err)
+	if (!err && first)
 		err = eb_parse(eb);
 
 err:
 	if (err == -EDEADLK) {
-		eb_release_vmas(eb, false);
-		err = i915_gem_ww_ctx_backoff(&eb->ww);
+		eb_release_vmas(eb, false, true);
+		err = i915_gem_ww_ctx_backoff(eb->ww);
 		if (!err)
 			goto retry;
 	}
@@ -2398,7 +2401,7 @@ shadow_batch_pin(struct i915_execbuffer *eb,
 	if (IS_ERR(vma))
 		return vma;
 
-	err = i915_vma_pin_ww(vma, &eb->ww, 0, 0, flags);
+	err = i915_vma_pin_ww(vma, eb->ww, 0, 0, flags);
 	if (err)
 		return ERR_PTR(err);
 
@@ -2412,7 +2415,7 @@ static struct i915_vma *eb_dispatch_secure(struct i915_execbuffer *eb, struct i9
 	 * batch" bit. Hence we need to pin secure batches into the global gtt.
 	 * hsw should have this fixed, but bdw mucks it up again. */
 	if (eb->batch_flags & I915_DISPATCH_SECURE)
-		return i915_gem_object_ggtt_pin_ww(vma->obj, &eb->ww, NULL, 0, 0, 0);
+		return i915_gem_object_ggtt_pin_ww(vma->obj, eb->ww, NULL, 0, 0, 0);
 
 	return NULL;
 }
@@ -2458,7 +2461,7 @@ static int eb_parse(struct i915_execbuffer *eb)
 		eb->batch_pool = pool;
 	}
 
-	err = i915_gem_object_lock(pool->obj, &eb->ww);
+	err = i915_gem_object_lock(pool->obj, eb->ww);
 	if (err)
 		goto err;
 
@@ -2666,7 +2669,7 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
 	 * GGTT space, so do this first before we reserve a seqno for
 	 * ourselves.
 	 */
-	err = intel_context_pin_ww(ce, &eb->ww);
+	err = intel_context_pin_ww(ce, eb->ww);
 	if (err)
 		return ERR_PTR(err);
 
@@ -3218,7 +3221,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		       unsigned int batch_number,
 		       struct dma_fence *in_fence,
 		       struct dma_fence *exec_fence,
-		       struct dma_fence **out_fence)
+		       struct dma_fence **out_fence,
+		       struct i915_gem_ww_ctx *ww)
 {
 	struct drm_i915_private *i915 = to_i915(dev);
 	struct i915_execbuffer eb;
@@ -3239,7 +3243,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 	eb.exec = exec;
 	eb.vma = (struct eb_vma *)(exec + args->buffer_count + 1);
-	eb.vma[0].vma = NULL;
+	if (first)
+		eb.vma[0].vma = NULL;
 	eb.reloc_pool = eb.batch_pool = NULL;
 	eb.reloc_context = NULL;
 
@@ -3251,6 +3256,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	eb.batch_len = args->batch_len;
 	eb.trampoline = NULL;
 	eb.composite_fence = NULL;
+	eb.ww = ww;
 
 	eb.fences = NULL;
 	eb.num_fences = 0;
@@ -3269,9 +3275,14 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (err)
 		goto err_ext;
 
-	err = eb_create(&eb);
-	if (err)
-		goto err_ext;
+	if (first) {
+		err = eb_create(&eb);
+		if (err)
+			goto err_ext;
+	} else {
+		eb.lut_size = -eb.buffer_count;
+	}
+
 
 	GEM_BUG_ON(!eb.lut_size);
 
@@ -3286,15 +3297,22 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (unlikely(err))
 		goto err_context;
 
-	err = eb_lookup_vmas(&eb);
-	if (err) {
-		eb_release_vmas(&eb, true);
-		goto err_engine;
+	if (first) {
+		err = eb_lookup_vmas(&eb);
+		if (err) {
+			eb_release_vmas(&eb, true, true);
+			goto err_engine;
+		}
+
+	} else {
+		eb.batch = &eb.vma[eb.batch_index];
 	}
 
-	i915_gem_ww_ctx_init(&eb.ww, true);
 
-	err = eb_relocate_parse(&eb);
+	if (first)
+		i915_gem_ww_ctx_init(eb.ww, true);
+
+	err = eb_relocate_parse(&eb, first);
 	if (err) {
 		/*
 		 * If the user expects the execobject.offset and
@@ -3307,7 +3325,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_vma;
 	}
 
-	ww_acquire_done(&eb.ww.ctx);
+	if (first)
+		ww_acquire_done(&eb.ww->ctx);
 
 	batch = eb.batch->vma;
 
@@ -3410,11 +3429,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	i915_request_put(eb.request);
 
 err_vma:
-	eb_release_vmas(&eb, true);
+	eb_release_vmas(&eb, true, err || last);
 	if (eb.trampoline)
 		i915_vma_unpin(eb.trampoline);
 	WARN_ON(err == -EDEADLK);
-	i915_gem_ww_ctx_fini(&eb.ww);
+	if (err || last)
+		i915_gem_ww_ctx_fini(eb.ww);
 
 	if (eb.batch_pool)
 		intel_gt_buffer_pool_put(eb.batch_pool);
@@ -3476,6 +3496,7 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	const size_t count = args->buffer_count;
 	int err;
 	struct i915_gem_context *ctx;
+	struct i915_gem_ww_ctx ww;
 	struct intel_context *parent = NULL;
 	unsigned int num_batches = 1, i;
 	bool is_parallel = false;
@@ -3602,7 +3623,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 				     0,
 				     in_fence,
 				     exec_fence,
-				     out_fences);
+				     out_fences,
+				     &ww);
 
 	for (i = 1; err == 0 && i < num_batches; i++)
 		err = i915_gem_do_execbuffer(dev, file, args, exec2_list,
@@ -3612,7 +3634,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 					     i,
 					     NULL,
 					     NULL,
-					     out_fences);
+					     out_fences,
+					     &ww);
 
 	if (is_parallel)
 		mutex_unlock(&parent->parallel_submit);
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
index 16162fc2782d..710d2700e5b4 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
@@ -32,11 +32,11 @@ static int __igt_gpu_reloc(struct i915_execbuffer *eb,
 	if (IS_ERR(vma))
 		return PTR_ERR(vma);
 
-	err = i915_gem_object_lock(obj, &eb->ww);
+	err = i915_gem_object_lock(obj, eb->ww);
 	if (err)
 		return err;
 
-	err = i915_vma_pin_ww(vma, &eb->ww, 0, 0, PIN_USER | PIN_HIGH);
+	err = i915_vma_pin_ww(vma, eb->ww, 0, 0, PIN_USER | PIN_HIGH);
 	if (err)
 		return err;
 
@@ -106,10 +106,12 @@ static int __igt_gpu_reloc(struct i915_execbuffer *eb,
 static int igt_gpu_reloc(void *arg)
 {
 	struct i915_execbuffer eb;
+	struct i915_gem_ww_ctx ww;
 	struct drm_i915_gem_object *scratch;
 	int err = 0;
 	u32 *map;
 
+	eb.ww = &ww;
 	eb.i915 = arg;
 
 	scratch = i915_gem_object_create_internal(eb.i915, 4096);
@@ -141,20 +143,20 @@ static int igt_gpu_reloc(void *arg)
 		eb.reloc_pool = NULL;
 		eb.reloc_context = NULL;
 
-		i915_gem_ww_ctx_init(&eb.ww, false);
+		i915_gem_ww_ctx_init(eb.ww, false);
 retry:
-		err = intel_context_pin_ww(eb.context, &eb.ww);
+		err = intel_context_pin_ww(eb.context, eb.ww);
 		if (!err) {
 			err = __igt_gpu_reloc(&eb, scratch);
 
 			intel_context_unpin(eb.context);
 		}
 		if (err == -EDEADLK) {
-			err = i915_gem_ww_ctx_backoff(&eb.ww);
+			err = i915_gem_ww_ctx_backoff(eb.ww);
 			if (!err)
 				goto retry;
 		}
-		i915_gem_ww_ctx_fini(&eb.ww);
+		i915_gem_ww_ctx_fini(eb.ww);
 
 		if (eb.reloc_pool)
 			intel_gt_buffer_pool_put(eb.reloc_pool);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 42/46] drm/i915: Hold all parallel requests until last request, properly handle error
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (40 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 41/46] drm/i915: Eliminate unnecessary VMA calls for multi-BB submission Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 43/46] drm/i915/guc: Handle errors in multi-lrc requests Matthew Brost
                   ` (8 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Hold all parallel requests, via a submit fence, until the last request
is generated. If an error occurs in the middle of generating the
requests, skip the requests and signal the backend of the error via a
request flag.
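
The mechanics, roughly (a sketch condensed from the diff below, error paths
omitted): i915_sw_fence_await() on the first request's own submit fence adds
an extra pending count, so the first request (and, through it, the rest of
the set) cannot be submitted until the ioctl drops that count again with
i915_sw_fence_complete() once the last request exists, or once every request
that was generated has been marked as skipped:

	/* In i915_gem_do_execbuffer(), when building request 0 of the set: */
	if (intel_context_is_parallel(eb.context) && first)
		i915_sw_fence_await(&eb.request->submit);	/* hold the set */

	/* Back in the ioctl, after all requests have been generated: */
	if (is_parallel) {
		/* On error, flag every request that was generated as skipped. */
		for (j = 0; err && j < i; ++j)
			if (out_fences[j]) {
				__i915_request_skip(to_request(out_fences[j]));
				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
					&out_fences[j]->flags);
			}

		/* Drop the hold: the set may now be submitted (or skipped). */
		if (out_fences[0])
			i915_sw_fence_complete(&to_request(out_fences[0])->submit);
	}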

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 40 +++++++++++++++++--
 drivers/gpu/drm/i915/i915_request.h           |  9 +++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 70784779872a..64af5c704ca7 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3351,7 +3351,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	}
 
 	if (out_fence) {
-		/* Move ownership to caller (i915_gem_execbuffer2_ioctl) */
+		/*
+		 * Move ownership to caller (i915_gem_execbuffer2_ioctl), this
+		 * must be done before anything in this function can jump to the
+		 * 'err_request' label so the caller can safely cleanup any
+		 * errors.
+		 */
 		out_fence[batch_number] = dma_fence_get(&eb.request->fence);
 
 		/*
@@ -3402,10 +3407,21 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	err = eb_submit(&eb, batch, first, last);
 
 err_request:
-	if (last)
+	if (last || err)
 		set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
 			&eb.request->fence.flags);
 
+	/*
+	 * If the execbuf IOCTL is generating more than 1 request, we hold all
+	 * the requests until the last request has been generated in case any of
+	 * the requests hit an error. If an error is hit the caller is
+	 * responsible for flagging all the requests generated with an error. The
+	 * caller is always responsible for releasing the fence on the first
+	 * request.
+	 */
+	if (intel_context_is_parallel(eb.context) && first)
+		i915_sw_fence_await(&eb.request->submit);
+
 	i915_request_get(eb.request);
 	err = eb_request_add(&eb, err);
 
@@ -3498,7 +3514,7 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 	struct i915_gem_context *ctx;
 	struct i915_gem_ww_ctx ww;
 	struct intel_context *parent = NULL;
-	unsigned int num_batches = 1, i;
+	unsigned int num_batches = 1, i = 0, j;
 	bool is_parallel = false;
 
 	if (!check_buffer_count(count)) {
@@ -3637,8 +3653,24 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 					     out_fences,
 					     &ww);
 
-	if (is_parallel)
+	if (is_parallel) {
+		/*
+		 * Mark all requests generated with an error if any of the
+		 * requests encountered an error.
+		 */
+		for (j = 0; err && j < i; ++j)
+			if (out_fences[j]) {
+				__i915_request_skip(to_request(out_fences[j]));
+				set_bit(I915_FENCE_FLAG_SKIP_PARALLEL,
+					&out_fences[j]->flags);
+			}
+
+		/* Release fence on first request generated */
+		if (out_fences[0])
+			i915_sw_fence_complete(&to_request(out_fences[0])->submit);
+
 		mutex_unlock(&parent->parallel_submit);
+	}
 
 	/*
 	 * Now that we have begun execution of the batchbuffer, we ignore
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index d6d5bf0a5eb5..7f3f66ddf21b 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -153,6 +153,15 @@ enum {
 	 * tail.
 	 */
 	I915_FENCE_FLAG_SUBMIT_PARALLEL,
+
+	/*
+	 * I915_FENCE_FLAG_SKIP_PARALLEL - request with a context in a
+	 * parent-child relationship (parallel submission, multi-lrc) that
+	 * hit an error while generating requests in the execbuf IOCTL.
+	 * Indicates this request should be skipped as another request in
+	 * submission / relationship encoutered an error.
+	 */
+	I915_FENCE_FLAG_SKIP_PARALLEL,
 };
 
 /**
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 43/46] drm/i915/guc: Handle errors in multi-lrc requests
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (41 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 42/46] drm/i915: Hold all parallel requests until last request, properly handle error Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 44/46] drm/i915: Enable multi-bb execbuf Matthew Brost
                   ` (7 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

If an error occurs in the front end while multi-lrc requests are being
generated we need to skip these in the backend, but we still need to
emit the breadcrumb seqno. An issue arises because with multi-lrc
breadcrumbs there is a handshake between the parent and children to make
forward progress. If all the requests are not present this handshake
doesn't work. To work around this, if a multi-lrc request has an error we
skip the handshake but still emit the breadcrumb seqno.
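
The workaround relies on the fini breadcrumb having a fixed size per engine
(ce->engine->emit_fini_breadcrumb_dw dwords): when a request is flagged with
I915_FENCE_FLAG_SKIP_PARALLEL, the dwords normally holding the parent/child
handshake are written as zeros, which the command streamer executes as
MI_NOOPs, and only the trailing seqno write is kept. Sketch of the parent
side, condensed from the hunks below:

	if (unlikely(skip_handshake(rq))) {
		/*
		 * Replace the handshake portion with zero dwords; a zero
		 * dword is an MI_NOOP, so the ring contents stay valid and
		 * the breadcrumb keeps its fixed size.
		 */
		memset(cs, 0, sizeof(u32) *
		       (ce->engine->emit_fini_breadcrumb_dw - 6));
		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
	} else {
		cs = __emit_fini_breadcrumb_parent_no_preempt_mid_batch(rq, cs);
	}

	/*
	 * The remaining 6 dwords, the seqno write itself (and presumably the
	 * user interrupt), are still emitted by the code that follows.
	 */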

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 61 ++++++++++++++++++-
 1 file changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index d61c45d1ac2c..cd1893edf43a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4394,8 +4394,8 @@ static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
 }
 
 static u32 *
-emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
-						 u32 *cs)
+__emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						   u32 *cs)
 {
 	struct intel_context *ce = rq->context;
 	u8 i;
@@ -4423,6 +4423,41 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
 				  get_children_go_addr(ce),
 				  0);
 
+	return cs;
+}
+
+/*
+ * If this is true, a submission of multi-lrc requests had an error and the
+ * requests need to be skipped. The front end (execbuf IOCTL) should've called
+ * i915_request_skip which squashes the BB but we still need to emit the fini
+ * breadcrumb seqno write. At this point we don't know how many of the
+ * requests in the multi-lrc submission were generated so we can't do the
+ * handshake between the parent and children (e.g. if 4 requests should be
+ * generated but the 2nd hit an error, only 1 would be seen by the GuC backend).
+ * Simply skip the handshake, but still emit the breadcrumb seqno, if an error
+ * has occurred on any of the requests in submission / relationship.
+ */
+static inline bool skip_handshake(struct i915_request *rq)
+{
+	return test_bit(I915_FENCE_FLAG_SKIP_PARALLEL, &rq->fence.flags);
+}
+
+static u32 *
+emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
+						 u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	if (unlikely(skip_handshake(rq))) {
+		memset(cs, 0, sizeof(u32) *
+		       (ce->engine->emit_fini_breadcrumb_dw - 6));
+		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
+	} else {
+		cs = __emit_fini_breadcrumb_parent_no_preempt_mid_batch(rq, cs);
+	}
+
 	/* Emit fini breadcrumb */
 	cs = gen8_emit_ggtt_write(cs,
 				  rq->fence.seqno,
@@ -4439,7 +4474,8 @@ emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
 }
 
 static u32 *
-emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
+__emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
+						  u32 *cs)
 {
 	struct intel_context *ce = rq->context;
 
@@ -4465,6 +4501,25 @@ emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs
 	*cs++ = get_children_go_addr(ce->parent);
 	*cs++ = 0;
 
+	return cs;
+}
+
+static u32 *
+emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
+						u32 *cs)
+{
+	struct intel_context *ce = rq->context;
+
+	GEM_BUG_ON(!intel_context_is_child(ce));
+
+	if (unlikely(skip_handshake(rq))) {
+		memset(cs, 0, sizeof(u32) *
+		       (ce->engine->emit_fini_breadcrumb_dw - 6));
+		cs += ce->engine->emit_fini_breadcrumb_dw - 6;
+	} else {
+		cs = __emit_fini_breadcrumb_child_no_preempt_mid_batch(rq, cs);
+	}
+
 	/* Emit fini breadcrumb */
 	cs = gen8_emit_ggtt_write(cs,
 				  rq->fence.seqno,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 44/46] drm/i915: Enable multi-bb execbuf
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (42 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 43/46] drm/i915/guc: Handle errors in multi-lrc requests Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 45/46] drm/i915/execlists: Weak parallel submission support for execlists Matthew Brost
                   ` (6 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Enable multi-bb execbuf by enabling the set_parallel extension.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 2b0dd3ff4db8..ac886b82796d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -529,9 +529,6 @@ set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
 	struct intel_engine_cs **siblings = NULL;
 	intel_engine_mask_t prev_mask;
 
-	/* Disabling for now */
-	return -ENODEV;
-
 	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
 		return -ENODEV;
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 45/46] drm/i915/execlists: Weak parallel submission support for execlists
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (43 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 44/46] drm/i915: Enable multi-bb execbuf Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts Matthew Brost
                   ` (5 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

A weak implementation of parallel submission (multi-bb execbuf IOCTL) for
execlists. This does as little as possible to support the interface for
execlists: submit fences are simply passed between each request generated,
and virtual engines are not allowed. This is on par with what exists for
the current (hopefully soon to be deprecated) bonding interface.

As with the GuC implementation, the pinning interface layering is broken.
This will get cleaned up once the pre_pin / post_unpin layering violations
are fixed. Rather than try to fix it here, just do what the GuC backend
does in the meantime.
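
Condensed, the setup looks like the sketch below (taken from
execlists_create_parallel() in the diff, with error handling and the
num_siblings bookkeeping stripped out). Each context in the set is forced to
be non-preemptable and single-submission, so the execlists scheduler does not
coalesce a parallel context with another context in a submission port and
does not preempt the set mid batch; ordering between the requests then comes
purely from the submit fences the execbuf path already inserts:

	for (i = 0; i < width; ++i) {
		ce = intel_context_create(engines[i]);

		if (i == 0)
			parent = ce;
		else
			intel_context_bind_parent_child(parent, ce);
	}

	intel_context_set_nopreempt(parent);
	intel_context_set_single_submission(parent);
	for_each_child(parent, ce) {
		intel_context_set_nopreempt(ce);
		intel_context_set_single_submission(ce);
	}

	parent->ops = &parent_context_ops;
	for_each_child(parent, ce)
		ce->ops = &child_context_ops;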

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |   9 +-
 drivers/gpu/drm/i915/gt/intel_context.c       |   1 -
 .../drm/i915/gt/intel_execlists_submission.c  | 201 +++++++++++++++++-
 3 files changed, 205 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index ac886b82796d..b199d59bd2c4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -529,9 +529,6 @@ set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
 	struct intel_engine_cs **siblings = NULL;
 	intel_engine_mask_t prev_mask;
 
-	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
-		return -ENODEV;
-
 	if (get_user(slot, &ext->engine_index))
 		return -EFAULT;
 
@@ -541,6 +538,12 @@ set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
 	if (get_user(num_siblings, &ext->num_siblings))
 		return -EFAULT;
 
+	if (!intel_uc_uses_guc_submission(&i915->gt.uc) && num_siblings != 1) {
+		drm_dbg(&i915->drm, "Only 1 sibling (%d) supported in non-GuC mode\n",
+			num_siblings);
+		return -EINVAL;
+	}
+
 	if (slot >= set->num_engines) {
 		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
 			slot, set->num_engines);
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 2c07f5f22c94..8e90a4a0b7b0 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -627,7 +627,6 @@ void intel_context_bind_parent_child(struct intel_context *parent,
 	 * Callers responsibility to validate that this function is used
 	 * correctly but we use GEM_BUG_ON here ensure that they do.
 	 */
-	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
 	GEM_BUG_ON(intel_context_is_pinned(parent));
 	GEM_BUG_ON(intel_context_is_child(parent));
 	GEM_BUG_ON(intel_context_is_pinned(child));
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 769480e026bb..5e0f4983de75 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -927,8 +927,7 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
 
 static bool ctx_single_port_submission(const struct intel_context *ce)
 {
-	return (IS_ENABLED(CONFIG_DRM_I915_GVT) &&
-		intel_context_force_single_submission(ce));
+	return intel_context_force_single_submission(ce);
 }
 
 static bool can_merge_ctx(const struct intel_context *prev,
@@ -2602,6 +2601,203 @@ static void execlists_context_cancel_request(struct intel_context *ce,
 				      current->comm);
 }
 
+static int execlists_parent_context_pre_pin(struct intel_context *ce,
+					    struct i915_gem_ww_ctx *ww)
+{
+	struct intel_context *child;
+	int err, i = 0, j = 0;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	for_each_child(ce, child) {
+		err = i915_active_acquire(&child->active);
+		if (unlikely(err))
+			goto unwind_active;
+		++i;
+	}
+
+	for_each_child(ce, child) {
+		err = __execlists_context_pre_pin(child, child->engine, ww);
+		if (unlikely(err))
+			goto unwind_pre_pin;
+		++j;
+	}
+
+	err = __execlists_context_pre_pin(ce, ce->engine, ww);
+	if (unlikely(err))
+		goto unwind_pre_pin;
+
+	return 0;
+
+unwind_pre_pin:
+	for_each_child(ce, child) {
+		if (!j--)
+			break;
+		lrc_post_unpin(child);
+	}
+
+unwind_active:
+	for_each_child(ce, child) {
+		if (!i--)
+			break;
+		i915_active_release(&child->active);
+	}
+
+	return err;
+}
+
+static void execlists_parent_context_post_unpin(struct intel_context *ce)
+{
+	struct intel_context *child;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	for_each_child(ce, child)
+		lrc_post_unpin(child);
+	lrc_post_unpin(ce);
+
+	for_each_child(ce, child) {
+		intel_context_get(child);
+		i915_active_release(&child->active);
+		intel_context_put(child);
+	}
+}
+
+static int execlists_parent_context_pin(struct intel_context *ce)
+{
+	int ret, i = 0, j = 0;
+	struct intel_context *child;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	for_each_child(ce, child) {
+		ret = lrc_pin(child, child->engine);
+		if (unlikely(ret))
+			goto unwind_pin;
+		++i;
+	}
+	ret = lrc_pin(ce, ce->engine);
+	if (unlikely(ret))
+		goto unwind_pin;
+
+	return 0;
+
+unwind_pin:
+	for_each_child(ce, child) {
+		if (++j > i)
+			break;
+		lrc_unpin(child);
+	}
+
+	return ret;
+}
+
+static void execlists_parent_context_unpin(struct intel_context *ce)
+{
+	struct intel_context *child;
+
+	GEM_BUG_ON(!intel_context_is_parent(ce));
+
+	for_each_child(ce, child)
+		lrc_unpin(child);
+	lrc_unpin(ce);
+}
+
+static const struct intel_context_ops parent_context_ops = {
+	.flags = COPS_HAS_INFLIGHT,
+
+	.alloc = execlists_context_alloc,
+
+	.cancel_request = execlists_context_cancel_request,
+
+	.pre_pin = execlists_parent_context_pre_pin,
+	.pin = execlists_parent_context_pin,
+	.unpin = execlists_parent_context_unpin,
+	.post_unpin = execlists_parent_context_post_unpin,
+
+	.enter = intel_context_enter_engine,
+	.exit = intel_context_exit_engine,
+
+	.destroy = lrc_destroy,
+};
+
+static const struct intel_context_ops child_context_ops = {
+	.flags = COPS_HAS_INFLIGHT,
+
+	.alloc = execlists_context_alloc,
+
+	.cancel_request = execlists_context_cancel_request,
+
+	.enter = intel_context_enter_engine,
+	.exit = intel_context_exit_engine,
+
+	.destroy = lrc_destroy,
+};
+
+static struct intel_context *
+execlists_create_parallel(struct intel_engine_cs **engines,
+			  unsigned int num_siblings,
+			  unsigned int width)
+{
+	struct intel_engine_cs **siblings = NULL;
+	struct intel_context *parent = NULL, *ce, *err;
+	int i, j;
+	int ret;
+
+	GEM_BUG_ON(num_siblings != 1);
+
+	siblings = kmalloc_array(num_siblings,
+				 sizeof(*siblings),
+				 GFP_KERNEL);
+	if (!siblings)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < width; ++i) {
+		for (j = 0; j < num_siblings; ++j)
+			siblings[j] = engines[i * num_siblings + j];
+
+		ce = intel_context_create(siblings[0]);
+		if (IS_ERR(ce)) {
+			err = ce;
+			goto unwind;
+		}
+
+		if (i == 0) {
+			parent = ce;
+		} else {
+			intel_context_bind_parent_child(parent, ce);
+			ret = intel_context_alloc_state(ce);
+			if (ret) {
+				err = ERR_PTR(ret);
+				goto unwind;
+			}
+		}
+	}
+
+	intel_context_set_nopreempt(parent);
+	intel_context_set_single_submission(parent);
+	for_each_child(parent, ce) {
+		intel_context_set_nopreempt(ce);
+		intel_context_set_single_submission(ce);
+	}
+
+	parent->ops = &parent_context_ops;
+	for_each_child(parent, ce)
+		ce->ops = &child_context_ops;
+
+	kfree(siblings);
+	return parent;
+
+unwind:
+	if (parent) {
+		for_each_child(parent, ce)
+			intel_context_put(ce);
+		intel_context_put(parent);
+	}
+	kfree(siblings);
+	return err;
+}
+
 static const struct intel_context_ops execlists_context_ops = {
 	.flags = COPS_HAS_INFLIGHT,
 
@@ -2620,6 +2816,7 @@ static const struct intel_context_ops execlists_context_ops = {
 	.reset = lrc_reset,
 	.destroy = lrc_destroy,
 
+	.create_parallel = execlists_create_parallel,
 	.create_virtual = execlists_create_virtual,
 };
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (44 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 45/46] drm/i915/execlists: Weak parallel submission support for execlists Matthew Brost
@ 2021-08-03 22:29 ` Matthew Brost
  2021-08-09 17:17   ` Daniel Vetter
  2021-08-12 19:26   ` Daniel Vetter
  2021-08-03 22:51 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev2) Patchwork
                   ` (4 subsequent siblings)
  50 siblings, 2 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-03 22:29 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Some workloads use lots of contexts that are continually pinned and
unpinned. With GuC submission an unpin translates to a schedule-disable
H2G, which puts pressure on both the i915 and the GuC. A schedule disable
can also block future requests from being submitted until the operation
completes. None of this is ideal.

Add a delay period, configurable via debugfs, before the schedule disable
is issued. The default delay period is 1 second. The delay period is
skipped if more than 3/4 of the guc_ids are in use.

This patch also turns the delay period off in the selftests, as the extra
time would likely cause many of them to fail. Follow-up patches will fix
all the selftests and enable the delay period.
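
The moving parts, roughly (a sketch of the lifecycle implemented in the hunks
below; locking, reset and flush paths are omitted, and the exact call sites
for the add/delete helpers live in parts of the patch not quoted here):

	/* Where the schedule-disable H2G used to be sent, the context is now
	 * parked on guc->sched_disable_list with a timestamp; the hrtimer is
	 * armed with sched_disable_delay_ns if the list was empty.
	 */
	sched_disable_context_add(guc, ce);

	/* If the context is needed again (e.g. re-pinned, banned, blocked or
	 * hit by a reset) before the timer fires, it is simply taken off the
	 * list and no H2G is ever sent.
	 */
	sched_disable_context_delete(ce);

	/* When the timer does fire, every context that has already waited at
	 * least 3/4 of the delay gets its schedule disable issued, so a single
	 * timer invocation can flush several contexts.
	 */
	if (should_sched_be_disabled(guc, now, ce)) {
		guc_id = prep_context_pending_disable(ce);
		__guc_context_sched_disable(guc, ce, guc_id);
	}

The delay is exposed through the new guc_sched_disable_delay_ns debugfs file
added in intel_guc_debugfs.c, so it can be tuned or zeroed at runtime (under
the GT uc debugfs directory, e.g. /sys/kernel/debug/dri/0/gt/uc/ on a typical
setup; the exact path depends on the debugfs layout).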

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
 .../i915/gem/selftests/i915_gem_coherency.c   |   2 +-
 .../drm/i915/gem/selftests/i915_gem_dmabuf.c  |   2 +-
 .../drm/i915/gem/selftests/i915_gem_mman.c    |   2 +-
 .../drm/i915/gem/selftests/i915_gem_object.c  |   2 +-
 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
 drivers/gpu/drm/i915/gt/intel_context.h       |   9 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   8 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +
 .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  28 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 322 +++++++++++++++++-
 .../i915/gt/uc/selftest_guc_flow_control.c    |  19 +-
 drivers/gpu/drm/i915/i915_selftest.h          |   2 +
 drivers/gpu/drm/i915/i915_trace.h             |  10 +
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   2 +-
 drivers/gpu/drm/i915/selftests/i915_perf.c    |   2 +-
 drivers/gpu/drm/i915/selftests/i915_request.c |   2 +-
 drivers/gpu/drm/i915/selftests/i915_vma.c     |   2 +-
 18 files changed, 405 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index b199d59bd2c4..1553287e5491 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -1298,7 +1298,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
 		int err;
 
 		/* serialises with execbuf */
-		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
+		intel_context_close(ce);
 		if (!intel_context_pin_if_active(ce))
 			continue;
 
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
index 13b088cc787e..a666d7e610f5 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
@@ -434,5 +434,5 @@ int i915_gem_coherency_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(igt_gem_coherency),
 	};
 
-	return i915_subtests(tests, i915);
+	return i915_live_subtests(tests, i915);
 }
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
index ffae7df5e4d7..2c92afa9d608 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
@@ -474,5 +474,5 @@ int i915_gem_dmabuf_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(igt_dmabuf_import_same_driver_lmem_smem),
 	};
 
-	return i915_subtests(tests, i915);
+	return i915_live_subtests(tests, i915);
 }
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
index b20f5621f62b..4745c78a48de 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
@@ -1414,5 +1414,5 @@ int i915_gem_mman_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(igt_mmap_gpu),
 	};
 
-	return i915_subtests(tests, i915);
+	return i915_live_subtests(tests, i915);
 }
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
index 740ee8086a27..ae1361c7c4cf 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
@@ -95,5 +95,5 @@ int i915_gem_object_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(igt_gem_huge),
 	};
 
-	return i915_subtests(tests, i915);
+	return i915_live_subtests(tests, i915);
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 8e90a4a0b7b0..96643040defd 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -472,6 +472,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
 	ce->guc_id = GUC_INVALID_LRC_ID;
 	INIT_LIST_HEAD(&ce->guc_id_link);
 
+	INIT_LIST_HEAD(&ce->guc_sched_disable_link);
+
 	mutex_init(&ce->parallel_submit);
 	ce->fence_context = dma_fence_context_alloc(1);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index a302599e436a..f4c9036f7f03 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -215,6 +215,15 @@ static inline bool intel_context_is_barrier(const struct intel_context *ce)
 	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
 }
 
+static inline void intel_context_close(struct intel_context *ce)
+{
+	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
+
+	trace_intel_context_close(ce);
+	if (ce->ops->close)
+		ce->ops->close(ce);
+}
+
 static inline bool intel_context_is_closed(const struct intel_context *ce)
 {
 	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 8af9ace4c052..53f00657a45c 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -11,6 +11,7 @@
 #include <linux/list.h>
 #include <linux/mutex.h>
 #include <linux/types.h>
+#include <linux/ktime.h>
 
 #include "i915_active_types.h"
 #include "i915_sw_fence.h"
@@ -38,6 +39,7 @@ struct intel_context_ops {
 	int (*alloc)(struct intel_context *ce);
 
 	void (*ban)(struct intel_context *ce, struct i915_request *rq);
+	void (*close)(struct intel_context *ce);
 
 	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
 	int (*pin)(struct intel_context *ce);
@@ -203,6 +205,12 @@ struct intel_context {
 	 */
 	struct list_head guc_id_link;
 
+	/*
+	 * GuC schedule disable link / time
+	 */
+	struct list_head guc_sched_disable_link;
+	ktime_t guc_sched_disable_time;
+
 	/* GuC context blocked fence */
 	struct i915_sw_fence guc_blocked;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 30a0f364db8f..90b5b657d411 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -60,6 +60,7 @@ struct intel_guc {
 	struct ida guc_ids;
 	u32 num_guc_ids;
 	u32 max_guc_ids;
+	u32 guc_ids_in_use[GUC_SUBMIT_ENGINE_MAX];
 	unsigned long *guc_ids_bitmap;
 #define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
 	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
@@ -69,6 +70,12 @@ struct intel_guc {
 	struct list_head destroyed_contexts;
 	struct intel_gt_pm_unpark_work destroy_worker;
 
+	spinlock_t sched_disable_lock;	/* protects schedule disable list */
+	struct list_head sched_disable_list;
+	struct hrtimer sched_disable_timer;
+#define SCHED_DISABLE_DELAY_NS	1000000000
+	u64 sched_disable_delay_ns;
+
 	bool submission_supported;
 	bool submission_selected;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
index 7c479c5e7b3a..53a6f3da6cce 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
@@ -80,12 +80,40 @@ static int guc_num_id_set(void *data, u64 val)
 }
 DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
 
+static int guc_sched_disable_delay_ns_get(void *data, u64 *val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	*val = guc->sched_disable_delay_ns;
+
+	return 0;
+}
+
+static int guc_sched_disable_delay_ns_set(void *data, u64 val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	guc->sched_disable_delay_ns = val;
+
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ns_fops,
+			guc_sched_disable_delay_ns_get,
+			guc_sched_disable_delay_ns_set, "%lld\n");
+
 void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
 {
 	static const struct debugfs_gt_file files[] = {
 		{ "guc_info", &guc_info_fops, NULL },
 		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
 		{ "guc_num_id", &guc_num_id_fops, NULL },
+		{ "guc_sched_disable_delay_ns", &guc_sched_disable_delay_ns_fops, NULL },
 	};
 
 	if (!intel_guc_is_supported(guc))
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index cd1893edf43a..dc0d6a099bee 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -654,11 +654,15 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
 	return (timeout < 0) ? timeout : 0;
 }
 
+static void sched_disable_contexts_flush(struct intel_guc *guc);
+
 int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
 {
 	if (!intel_uc_uses_guc_submission(&guc_to_gt(guc)->uc))
 		return 0;
 
+	sched_disable_contexts_flush(guc);
+
 	return intel_guc_wait_for_pending_msg(guc,
 					      &guc->outstanding_submission_g2h,
 					      true, timeout);
@@ -1135,6 +1139,7 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce);
 static void guc_signal_context_fence(struct intel_context *ce);
 static void guc_cancel_context_requests(struct intel_context *ce);
 static void guc_blocked_fence_complete(struct intel_context *ce);
+static void sched_disable_context_delete(struct intel_context *ce);
 
 static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 {
@@ -1160,6 +1165,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
 		deregister = context_wait_for_deregister_to_register(ce);
 		banned = context_banned(ce);
 		init_sched_state(ce);
+		sched_disable_context_delete(ce);
 
 		if (pending_enable || destroyed || deregister) {
 			atomic_dec(&guc->outstanding_submission_g2h);
@@ -1299,6 +1305,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 
 	intel_gt_park_heartbeats(guc_to_gt(guc));
 	disable_submission(guc);
+	hrtimer_cancel(&guc->sched_disable_timer);
 	guc->interrupts.disable(guc);
 
 	/* Flush IRQ handler */
@@ -1656,6 +1663,8 @@ static void guc_lrcd_reg_fini(struct intel_guc *guc);
 
 static void destroy_worker_func(struct work_struct *w);
 
+static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer);
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
@@ -1687,6 +1696,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->destroyed_contexts);
 	intel_gt_pm_unpark_work_init(&guc->destroy_worker, destroy_worker_func);
 
+	spin_lock_init(&guc->sched_disable_lock);
+	INIT_LIST_HEAD(&guc->sched_disable_list);
+	hrtimer_init(&guc->sched_disable_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL);
+	guc->sched_disable_timer.function = sched_disable_timer_func;
+	guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS;
+
 	return 0;
 }
 
@@ -1852,6 +1868,12 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	if (unlikely(ret < 0))
 		return ret;
 
+	if (intel_context_is_parent(ce))
+		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
+			order_base_2(ce->guc_number_children + 1);
+	else
+		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]++;
+
 	ce->guc_id = ret;
 	return 0;
 }
@@ -1860,13 +1882,18 @@ static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 {
 	GEM_BUG_ON(intel_context_is_child(ce));
 	if (!context_guc_id_invalid(ce)) {
-		if (intel_context_is_parent(ce))
+		if (intel_context_is_parent(ce)) {
+			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
+				order_base_2(ce->guc_number_children + 1);
 			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
 					      order_base_2(ce->guc_number_children
 							   + 1));
-		else
+		} else {
+			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]--;
 			ida_simple_remove(&guc->guc_ids, ce->guc_id);
+		}
 		clr_lrc_desc_registered(guc, ce->guc_id);
+
 		set_context_guc_id_invalid(ce);
 	}
 	if (!list_empty(&ce->guc_id_link))
@@ -1931,9 +1958,13 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
 			 * from another context that has more guc_id that itself.
 			 */
 			if (cn_o2 != ce_o2) {
+				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
+					order_base_2(cn->guc_number_children + 1);
 				bitmap_release_region(guc->guc_ids_bitmap,
 						      cn->guc_id,
 						      cn_o2);
+				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
+					order_base_2(ce->guc_number_children + 1);
 				bitmap_allocate_region(guc->guc_ids_bitmap,
 						       ce->guc_id,
 						       ce_o2);
@@ -2538,7 +2569,7 @@ static void guc_context_unpin(struct intel_context *ce)
 	__guc_context_unpin(ce);
 
 	if (likely(!intel_context_is_barrier(ce)))
-		intel_engine_pm_put(ce->engine);
+		intel_engine_pm_put_async(ce->engine);
 }
 
 static void guc_context_post_unpin(struct intel_context *ce)
@@ -2665,11 +2696,11 @@ static void guc_parent_context_unpin(struct intel_context *ce)
 
 	for_each_engine_masked(engine, ce->engine->gt,
 			       ce->engine->mask, tmp)
-		intel_engine_pm_put(engine);
+		intel_engine_pm_put_async(engine);
 	for_each_child(ce, child)
 		for_each_engine_masked(engine, child->engine->gt,
 				       child->engine->mask, tmp)
-			intel_engine_pm_put(engine);
+			intel_engine_pm_put_async(engine);
 }
 
 static void __guc_context_sched_enable(struct intel_guc *guc,
@@ -2788,6 +2819,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
 
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 
+	sched_disable_context_delete(ce);
+
 	with_intel_runtime_pm(runtime_pm, wakeref)
 		__guc_context_sched_disable(guc, ce, guc_id);
 
@@ -2914,8 +2947,202 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
 								     1);
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 	}
+
+	sched_disable_context_delete(ce);
+}
+
+#define next_sched_disable_time(guc, now, ce) \
+	(guc->sched_disable_delay_ns - \
+	 (ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)))
+static void ____sched_disable_context_delete(struct intel_guc *guc,
+					     struct intel_context *ce)
+{
+	bool is_first;
+
+	lockdep_assert_held(&guc->sched_disable_lock);
+	GEM_BUG_ON(intel_context_is_child(ce));
+	GEM_BUG_ON(list_empty(&ce->guc_sched_disable_link));
+
+	is_first = list_is_first(&ce->guc_sched_disable_link,
+				 &guc->sched_disable_list);
+	list_del_init(&ce->guc_sched_disable_link);
+	if (list_empty(&guc->sched_disable_list)) {
+		hrtimer_try_to_cancel(&guc->sched_disable_timer);
+	} else if (is_first) {
+		struct intel_context *first =
+			list_first_entry(&guc->sched_disable_list,
+					 typeof(*first),
+					 guc_sched_disable_link);
+		u64 next_time = next_sched_disable_time(guc, ktime_get(),
+							first);
+
+		hrtimer_start(&guc->sched_disable_timer,
+			      ns_to_ktime(next_time),
+			      HRTIMER_MODE_REL_PINNED);
+	}
+}
+
+static void __sched_disable_context_delete(struct intel_guc *guc,
+					   struct intel_context *ce)
+{
+	lockdep_assert_held(&guc->sched_disable_lock);
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	if (!list_empty(&ce->guc_sched_disable_link)) {
+		intel_context_sched_disable_unpin(ce);
+		____sched_disable_context_delete(guc, ce);
+	}
+}
+
+static void sched_disable_context_delete(struct intel_context *ce)
+{
+	struct intel_guc *guc = ce_to_guc(ce);
+	unsigned long flags;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	if (!list_empty(&ce->guc_sched_disable_link)) {
+		spin_lock_irqsave(&guc->sched_disable_lock, flags);
+		__sched_disable_context_delete(guc, ce);
+		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
+	}
+}
+
+static void sched_disable_context_add(struct intel_guc *guc,
+				      struct intel_context *ce)
+{
+	unsigned long flags;
+
+	GEM_BUG_ON(intel_context_is_child(ce));
+	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
+
+	ce->guc_sched_disable_time = ktime_get();
+
+	spin_lock_irqsave(&guc->sched_disable_lock, flags);
+	if (list_empty(&guc->sched_disable_list))
+		hrtimer_start(&guc->sched_disable_timer,
+			      ns_to_ktime(guc->sched_disable_delay_ns),
+			      HRTIMER_MODE_REL_PINNED);
+	list_add_tail(&ce->guc_sched_disable_link, &guc->sched_disable_list);
+	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
+}
+
+static void sched_disable_contexts_flush(struct intel_guc *guc)
+{
+	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->sched_disable_lock, flags);
+
+	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
+					 guc_sched_disable_link) {
+		intel_wakeref_t wakeref;
+		bool enabled;
+		u16 guc_id;
+
+		list_del_init(&ce->guc_sched_disable_link);
+
+		spin_lock(&ce->guc_state.lock);
+		enabled = context_enabled(ce);
+		if (unlikely(!enabled || submission_disabled(guc))) {
+			if (enabled)
+				clr_context_enabled(ce);
+			spin_unlock(&ce->guc_state.lock);
+			intel_context_sched_disable_unpin(ce);
+			continue;
+		}
+		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
+			spin_unlock(&ce->guc_state.lock);
+			continue;
+		}
+		guc_id = prep_context_pending_disable(ce);
+		spin_unlock(&ce->guc_state.lock);
+
+		with_intel_runtime_pm(runtime_pm, wakeref)
+			__guc_context_sched_disable(guc, ce, guc_id);
+	}
+
+	hrtimer_try_to_cancel(&guc->sched_disable_timer);
+
+	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
 }
 
+#define should_sched_be_disabled(guc, now, ce) \
+	((ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)) > \
+	(guc->sched_disable_delay_ns / 4) * 3)
+static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer)
+{
+	struct intel_guc *guc = container_of(hrtimer, struct intel_guc,
+					     sched_disable_timer);
+	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
+	struct intel_context *ce, *cn;
+	unsigned long flags;
+	ktime_t now;
+
+	if (list_empty(&guc->sched_disable_list))
+		return HRTIMER_NORESTART;
+
+	now = ktime_get();
+
+	spin_lock_irqsave(&guc->sched_disable_lock, flags);
+
+	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
+					 guc_sched_disable_link) {
+		intel_wakeref_t wakeref;
+		bool enabled;
+		u16 guc_id;
+
+		/*
+		 * If a context has been waiting for 3/4 of its delay or more,
+		 * issue the schedule disable. Using this heuristic allows more
+		 * than 1 context to have its scheduling disabled when this
+		 * timer is run.
+		 */
+		if (!should_sched_be_disabled(guc, now, ce))
+			break;
+
+		list_del_init(&ce->guc_sched_disable_link);
+
+		spin_lock(&ce->guc_state.lock);
+		enabled = context_enabled(ce);
+		if (unlikely(!enabled || submission_disabled(guc))) {
+			if (enabled)
+				clr_context_enabled(ce);
+			spin_unlock(&ce->guc_state.lock);
+			intel_context_sched_disable_unpin(ce);
+			continue;
+		}
+		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
+			spin_unlock(&ce->guc_state.lock);
+			continue;
+		}
+		guc_id = prep_context_pending_disable(ce);
+		spin_unlock(&ce->guc_state.lock);
+
+		with_intel_runtime_pm(runtime_pm, wakeref)
+			__guc_context_sched_disable(guc, ce, guc_id);
+	}
+
+	if (!list_empty(&guc->sched_disable_list)) {
+		struct intel_context *first =
+			list_first_entry(&guc->sched_disable_list,
+					 typeof(*first),
+					 guc_sched_disable_link);
+		u64 next_time = next_sched_disable_time(guc, now, first);
+
+		hrtimer_forward(hrtimer, now, ns_to_ktime(next_time));
+		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
+
+		return HRTIMER_RESTART;
+	} else {
+		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
+
+		return HRTIMER_NORESTART;
+	}
+}
+
+#define guc_id_pressure(max, in_use)	(in_use > (max / 4) * 3)
 static void guc_context_sched_disable(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
@@ -2924,8 +3151,14 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	intel_wakeref_t wakeref;
 	u16 guc_id;
 	bool enabled;
+	int guc_id_index = intel_context_is_parent(ce) ?
+		GUC_SUBMIT_ENGINE_MULTI_LRC : GUC_SUBMIT_ENGINE_SINGLE_LRC;
+	int max_guc_ids = intel_context_is_parent(ce) ?
+	       NUMBER_MULTI_LRC_GUC_ID(guc) :
+	       guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
 
 	GEM_BUG_ON(intel_context_is_child(ce));
+	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
 
 	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
 	    !lrc_desc_registered(guc, ce->guc_id)) {
@@ -2936,6 +3169,18 @@ static void guc_context_sched_disable(struct intel_context *ce)
 	if (!context_enabled(ce))
 		goto unpin;
 
+	/*
+	 * If there is no guc_id pressure and the context isn't closed, delay
+	 * the schedule disable so we don't continuously disable / enable
+	 * scheduling, which puts pressure on both the i915 and the GuC. The
+	 * delay is configurable via debugfs and defaults to 1s.
+	 */
+	if (!guc_id_pressure(max_guc_ids, guc->guc_ids_in_use[guc_id_index]) &&
+	    !intel_context_is_closed(ce) && guc->sched_disable_delay_ns) {
+		sched_disable_context_add(guc, ce);
+		return;
+	}
+
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	/*
@@ -3294,6 +3539,58 @@ static void remove_from_context(struct i915_request *rq)
 	i915_request_notify_execute_cb_imm(rq);
 }
 
+static void __guc_context_close(struct intel_guc *guc,
+				struct intel_context *ce)
+{
+	lockdep_assert_held(&guc->sched_disable_lock);
+	GEM_BUG_ON(intel_context_is_child(ce));
+
+	if (!list_empty(&ce->guc_sched_disable_link)) {
+		struct intel_runtime_pm *runtime_pm =
+			ce->engine->uncore->rpm;
+		intel_wakeref_t wakeref;
+		bool enabled;
+		u16 guc_id;
+
+		spin_lock(&ce->guc_state.lock);
+		enabled = context_enabled(ce);
+		if (unlikely(!enabled || submission_disabled(guc))) {
+			if (enabled)
+				clr_context_enabled(ce);
+			spin_unlock(&ce->guc_state.lock);
+			intel_context_sched_disable_unpin(ce);
+			goto update_list;
+		}
+		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
+			spin_unlock(&ce->guc_state.lock);
+			goto update_list;
+		}
+		guc_id = prep_context_pending_disable(ce);
+		spin_unlock(&ce->guc_state.lock);
+
+		with_intel_runtime_pm(runtime_pm, wakeref)
+			__guc_context_sched_disable(guc, ce, guc_id);
+update_list:
+		____sched_disable_context_delete(guc, ce);
+	}
+}
+
+static void guc_context_close(struct intel_context *ce)
+{
+	struct intel_guc *guc = ce_to_guc(ce);
+	unsigned long flags;
+
+	/*
+	 * If the context is closed while a schedule disable is still pending
+	 * its delay, issue the schedule disable immediately.
+	 */
+	if (!list_empty(&ce->guc_sched_disable_link)) {
+		spin_lock_irqsave(&guc->sched_disable_lock, flags);
+		__guc_context_close(guc, ce);
+		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
+	}
+}
+
 static struct intel_context *
 guc_create_parallel(struct intel_engine_cs **engines,
 		    unsigned int num_siblings,
@@ -3308,6 +3605,7 @@ static const struct intel_context_ops guc_context_ops = {
 	.post_unpin = guc_context_post_unpin,
 
 	.ban = guc_context_ban,
+	.close = guc_context_close,
 
 	.cancel_request = guc_context_cancel_request,
 
@@ -3538,6 +3836,10 @@ static int guc_request_alloc(struct i915_request *rq)
 
 	rq->reserved_space -= GUC_REQUEST_SIZE;
 
+	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link) &&
+		   atomic_read(&ce->pin_count) < 3);
+	sched_disable_context_delete(ce);
+
 	/*
 	 * guc_ids are exhausted or a heuristic is met indicating too many
 	 * guc_ids are waiting on requests with submission dependencies (not
@@ -3667,7 +3969,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
 	__guc_context_unpin(ce);
 
 	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
-		intel_engine_pm_put(engine);
+		intel_engine_pm_put_async(engine);
 }
 
 static void guc_virtual_context_enter(struct intel_context *ce)
@@ -3708,6 +4010,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 	.post_unpin = guc_context_post_unpin,
 
 	.ban = guc_context_ban,
+	.close = guc_context_close,
 
 	.cancel_request = guc_context_cancel_request,
 
@@ -3819,6 +4122,7 @@ static const struct intel_context_ops virtual_parent_context_ops = {
 	.post_unpin = guc_parent_context_post_unpin,
 
 	.ban = guc_context_ban,
+	.close = guc_context_close,
 
 	.enter = guc_virtual_context_enter,
 	.exit = guc_virtual_context_exit,
@@ -4924,7 +5228,11 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
 	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
 		   atomic_read(&guc->outstanding_submission_g2h));
 	drm_printf(p, "GuC Number GuC IDs: %d\n", guc->num_guc_ids);
-	drm_printf(p, "GuC Max Number GuC IDs: %d\n\n", guc->max_guc_ids);
+	drm_printf(p, "GuC Max Number GuC IDs: %d\n", guc->max_guc_ids);
+	drm_printf(p, "GuC single-lrc GuC IDs in use: %d\n",
+		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]);
+	drm_printf(p, "GuC multi-lrc GuC IDs in use: %d\n",
+		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC]);
 	drm_printf(p, "GuC max context registered: %u\n\n",
 		   guc->lrcd_reg.max_idx);
 
diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
index 9cfecf9d368e..ad70b3159ce4 100644
--- a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
+++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
@@ -174,7 +174,8 @@ static int multi_lrc_not_blocked(struct intel_gt *gt, bool flow_control)
 #define NUM_RQ_PER_CONTEXT	2
 #define HEARTBEAT_INTERVAL	1500
 
-static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang)
+static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids,
+					bool hang, bool sched_disable_delay)
 {
 	struct intel_gt *gt = arg;
 	struct intel_guc *guc = &gt->uc.guc;
@@ -203,6 +204,9 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
 	if (limit_guc_ids)
 		guc->num_guc_ids = NUM_GUC_ID;
 
+	if (sched_disable_delay)
+		guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS / 5;
+
 	ce = intel_context_create(intel_selftest_find_any_engine(gt));
 	if (IS_ERR(ce)) {
 		ret = PTR_ERR(ce);
@@ -391,6 +395,7 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
 	guc->num_guc_ids = guc->max_guc_ids;
 	guc->gse_hang_expected = false;
 	guc->inject_bad_sched_disable = false;
+	guc->sched_disable_delay_ns = 0;
 	kfree(contexts);
 
 	return ret;
@@ -398,17 +403,22 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
 
 static int intel_guc_flow_control_guc_ids(void *arg)
 {
-	return __intel_guc_flow_control_guc(arg, true, false);
+	return __intel_guc_flow_control_guc(arg, true, false, false);
+}
+
+static int intel_guc_flow_control_guc_ids_sched_disable_delay(void *arg)
+{
+	return __intel_guc_flow_control_guc(arg, true, false, true);
 }
 
 static int intel_guc_flow_control_lrcd_reg(void *arg)
 {
-	return __intel_guc_flow_control_guc(arg, false, false);
+	return __intel_guc_flow_control_guc(arg, false, false, false);
 }
 
 static int intel_guc_flow_control_hang_state_machine(void *arg)
 {
-	return __intel_guc_flow_control_guc(arg, true, true);
+	return __intel_guc_flow_control_guc(arg, true, true, false);
 }
 
 #define NUM_RQ_STRESS_CTBS	0x4000
@@ -861,6 +871,7 @@ int intel_guc_flow_control(struct drm_i915_private *i915)
 	static const struct i915_subtest tests[] = {
 		SUBTEST(intel_guc_flow_control_stress_ctbs),
 		SUBTEST(intel_guc_flow_control_guc_ids),
+		SUBTEST(intel_guc_flow_control_guc_ids_sched_disable_delay),
 		SUBTEST(intel_guc_flow_control_lrcd_reg),
 		SUBTEST(intel_guc_flow_control_hang_state_machine),
 		SUBTEST(intel_guc_flow_control_multi_lrc_guc_ids),
diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
index f54de0499be7..bf464db7affe 100644
--- a/drivers/gpu/drm/i915/i915_selftest.h
+++ b/drivers/gpu/drm/i915/i915_selftest.h
@@ -92,12 +92,14 @@ int __i915_subtests(const char *caller,
 			T, ARRAY_SIZE(T), data)
 #define i915_live_subtests(T, data) ({ \
 	typecheck(struct drm_i915_private *, data); \
+	(data)->gt.uc.guc.sched_disable_delay_ns = 0; \
 	__i915_subtests(__func__, \
 			__i915_live_setup, __i915_live_teardown, \
 			T, ARRAY_SIZE(T), data); \
 })
 #define intel_gt_live_subtests(T, data) ({ \
 	typecheck(struct intel_gt *, data); \
+	(data)->uc.guc.sched_disable_delay_ns = 0; \
 	__i915_subtests(__func__, \
 			__intel_gt_live_setup, __intel_gt_live_teardown, \
 			T, ARRAY_SIZE(T), data); \
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 806ad688274b..57ba7065d5ab 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -933,6 +933,11 @@ DEFINE_EVENT(intel_context, intel_context_reset,
 	     TP_ARGS(ce)
 );
 
+DEFINE_EVENT(intel_context, intel_context_close,
+	     TP_PROTO(struct intel_context *ce),
+	     TP_ARGS(ce)
+);
+
 DEFINE_EVENT(intel_context, intel_context_ban,
 	     TP_PROTO(struct intel_context *ce),
 	     TP_ARGS(ce)
@@ -1035,6 +1040,11 @@ trace_intel_context_reset(struct intel_context *ce)
 {
 }
 
+static inline void
+trace_intel_context_close(struct intel_context *ce)
+{
+}
+
 static inline void
 trace_intel_context_ban(struct intel_context *ce)
 {
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
index f843a5040706..d54c280217fe 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
@@ -2112,5 +2112,5 @@ int i915_gem_gtt_live_selftests(struct drm_i915_private *i915)
 
 	GEM_BUG_ON(offset_in_page(i915->ggtt.vm.total));
 
-	return i915_subtests(tests, i915);
+	return i915_live_subtests(tests, i915);
 }
diff --git a/drivers/gpu/drm/i915/selftests/i915_perf.c b/drivers/gpu/drm/i915/selftests/i915_perf.c
index 9e9a6cb1d9e5..86bad00cca95 100644
--- a/drivers/gpu/drm/i915/selftests/i915_perf.c
+++ b/drivers/gpu/drm/i915/selftests/i915_perf.c
@@ -431,7 +431,7 @@ int i915_perf_live_selftests(struct drm_i915_private *i915)
 	if (err)
 		return err;
 
-	err = i915_subtests(tests, i915);
+	err = i915_live_subtests(tests, i915);
 
 	destroy_empty_config(&i915->perf);
 
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index d67710d10615..afbf88865a8b 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -1693,7 +1693,7 @@ int i915_request_live_selftests(struct drm_i915_private *i915)
 	if (intel_gt_is_wedged(&i915->gt))
 		return 0;
 
-	return i915_subtests(tests, i915);
+	return i915_live_subtests(tests, i915);
 }
 
 static int switch_to_kernel_sync(struct intel_context *ce, int err)
diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
index dd0607254a95..f4b157451851 100644
--- a/drivers/gpu/drm/i915/selftests/i915_vma.c
+++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
@@ -1085,5 +1085,5 @@ int i915_vma_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(igt_vma_remapped_gtt),
 	};
 
-	return i915_subtests(tests, i915);
+	return i915_live_subtests(tests, i915);
 }
-- 
2.28.0



* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev2)
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (45 preceding siblings ...)
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts Matthew Brost
@ 2021-08-03 22:51 ` Patchwork
  2021-08-03 22:53 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
                   ` (3 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Patchwork @ 2021-08-03 22:51 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev2)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
5e27ca906d33 drm/i915/guc: Allow flexible number of context ids
27bc3be5a17a drm/i915/guc: Connect the number of guc_ids to debugfs
314754bac86b drm/i915/guc: Don't return -EAGAIN to user when guc_ids exhausted
9a569e392672 drm/i915/guc: Don't allow requests not ready to consume all guc_ids
112dd7286938 drm/i915/guc: Introduce guc_submit_engine object
-:1235: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#1235: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 1243 lines checked
889784d21320 drm/i915/guc: Check return of __xa_store when registering a context
f482b6e988b8 drm/i915/guc: Non-static lrc descriptor registration buffer
ac6f010842ee drm/i915/guc: Take GT PM ref when deregistering context
-:35: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#35: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:35: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#35: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:44:
+#define with_intel_gt_pm(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:38: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#38: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:47:
+#define with_intel_gt_pm_async(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)

-:38: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#38: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:47:
+#define with_intel_gt_pm_async(gt, tmp) \
+	for (tmp = 1, intel_gt_pm_get(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)

-:41: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#41: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:50:
+#define with_intel_gt_pm_if_awake(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:41: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#41: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:50:
+#define with_intel_gt_pm_if_awake(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put(gt), tmp = 0)

-:44: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'gt' - possible side-effects?
#44: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:53:
+#define with_intel_gt_pm_if_awake_async(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)

-:44: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'tmp' - possible side-effects?
#44: FILE: drivers/gpu/drm/i915/gt/intel_gt_pm.h:53:
+#define with_intel_gt_pm_if_awake_async(gt, tmp) \
+	for (tmp = intel_gt_pm_get_if_awake(gt); tmp; \
+	     intel_gt_pm_put_async(gt), tmp = 0)

total: 0 errors, 0 warnings, 8 checks, 217 lines checked
8c650cfb86b5 drm/i915: Add GT PM unpark worker
-:59: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#59: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 184 lines checked
68d99ae2054b drm/i915/guc: Take engine PM when a context is pinned with GuC submission
da629af1ed2d drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
d89645be64e8 drm/i915/guc: Selftest for GuC flow control
-:220: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#220: 
new file mode 100644

-:362: WARNING:OOM_MESSAGE: Possible unnecessary 'out of memory' message
#362: FILE: drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c:138:
+	if (!contexts) {
+		pr_err("Context array allocation failed\n");

total: 0 errors, 2 warnings, 0 checks, 774 lines checked
a00a22e7333d drm/i915: Add logical engine mapping
4fc6c6ac891e drm/i915: Expose logical engine instance to user
fa22d6168e5d drm/i915/guc: Introduce context parent-child relationship
ffe691bcda89 drm/i915/guc: Implement GuC parent-child context pin / unpin functions
de52a524ffff drm/i915/guc: Add multi-lrc context registration
9b9cf89ec1fa drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
affe37a98124 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
-:96: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'guc' - possible side-effects?
#96: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:178:
+#define NUMBER_MULTI_LRC_GUC_ID(guc) \
+	((guc)->num_guc_ids / 16 > 32 ? (guc)->num_guc_ids / 16 : 32)

total: 0 errors, 0 warnings, 1 checks, 428 lines checked
1e00caa2078b drm/i915/guc: Add hang check to GuC submit engine
dac8e22d9b0f drm/i915/guc: Add guc_child_context_destroy
dae43b9ae4ca drm/i915/guc: Implement multi-lrc submission
-:186: CHECK:SPACING: spaces preferred around that '*' (ctx:ExV)
#186: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:906:
+		*wqi++ = child->ring->tail / sizeof(u64);
 		^

total: 0 errors, 0 warnings, 1 checks, 345 lines checked
b48693c26c56 drm/i915/guc: Insert submit fences between requests in parent-child relationship
e99b23e86490 drm/i915/guc: Implement multi-lrc reset
e278330f18d8 drm/i915/guc: Update debugfs for GuC multi-lrc
d70a91bd7039 drm/i915: Connect UAPI to GuC multi-lrc interface
f5b94aff1055 drm/i915/doc: Update parallel submit doc to point to i915_drm.h
-:12: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#12: 
deleted file mode 100644

total: 0 errors, 1 warnings, 0 checks, 10 lines checked
1e623d060bfa drm/i915/guc: Add basic GuC multi-lrc selftest
-:22: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#22: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 180 lines checked
f04f2fb6c221 drm/i915/guc: Extend GuC flow control selftest for multi-lrc
-:127: WARNING:OOM_MESSAGE: Possible unnecessary 'out of memory' message
#127: FILE: drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c:654:
+	if (!contexts) {
+		pr_err("Context array allocation failed\n");

-:182: WARNING:LONG_LINE: line length of 105 exceeds 100 columns
#182: FILE: drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c:709:
+					contexts[j] = multi_lrc_create_parent(gt, VIDEO_DECODE_CLASS, 0);

total: 0 errors, 2 warnings, 0 checks, 360 lines checked
87f926fb4e0e drm/i915/guc: Implement no mid batch preemption for multi-lrc
94ba50b6aaae drm/i915: Move secure execbuf check to execbuf2
722cabe63be0 drm/i915: Move input/exec fence handling to i915_gem_execbuffer2
866830d7b8df drm/i915: Move output fence handling to i915_gem_execbuffer2
ca7de92f4b87 drm/i915: Return output fence from i915_gem_do_execbuffer
866efe3b7b62 drm/i915: Store batch index in struct i915_execbuffer
e092e2a0aa4a drm/i915: Allow callers of i915_gem_do_execbuffer to override the batch index
53e1b9709de9 drm/i915: Teach execbuf there can be more than one batch in the objects list
b48c4d1c82fb drm/i915: Only track object dependencies on first request
e159516409c6 drm/i915: Force parallel contexts to use copy engine for reloc
77d80c169b21 drm/i915: Multi-batch execbuffer2
3e6d8fb0d85c drm/i915: Eliminate unnecessary VMA calls for multi-BB submission
692205dd3cff drm/i915: Hold all parallel requests until last request, properly handle error
be69dbafd8a5 drm/i915/guc: Handle errors in multi-lrc requests
dd80d82c5c4f drm/i915: Enable multi-bb execbuf
fe8d1a1c8b89 drm/i915/execlists: Weak parallel submission support for execlists
31f37bf214eb drm/i915/guc: Add delay before disabling scheduling on contexts
-:371: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'guc' may be better as '(guc)' to avoid precedence issues
#371: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:2954:
+#define next_sched_disable_time(guc, now, ce) \
+	(guc->sched_disable_delay_ns - \
+	 (ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)))

-:371: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'ce' may be better as '(ce)' to avoid precedence issues
#371: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:2954:
+#define next_sched_disable_time(guc, now, ce) \
+	(guc->sched_disable_delay_ns - \
+	 (ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)))

-:488: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'guc' may be better as '(guc)' to avoid precedence issues
#488: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3071:
+#define should_sched_be_disabled(guc, now, ce) \
+	((ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)) > \
+	(guc->sched_disable_delay_ns / 4) * 3)

-:488: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'ce' may be better as '(ce)' to avoid precedence issues
#488: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3071:
+#define should_sched_be_disabled(guc, now, ce) \
+	((ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)) > \
+	(guc->sched_disable_delay_ns / 4) * 3)

-:555: WARNING:UNNECESSARY_ELSE: else is not generally useful after a break or return
#555: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3138:
+		return HRTIMER_RESTART;
+	} else {

-:562: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'max' may be better as '(max)' to avoid precedence issues
#562: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3145:
+#define guc_id_pressure(max, in_use)	(in_use > (max / 4) * 3)

-:562: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'in_use' may be better as '(in_use)' to avoid precedence issues
#562: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:3145:
+#define guc_id_pressure(max, in_use)	(in_use > (max / 4) * 3)

total: 0 errors, 1 warnings, 6 checks, 735 lines checked




* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Parallel submission aka multi-bb execbuf (rev2)
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (46 preceding siblings ...)
  2021-08-03 22:51 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev2) Patchwork
@ 2021-08-03 22:53 ` Patchwork
  2021-08-03 22:57 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
                   ` (2 subsequent siblings)
  50 siblings, 0 replies; 111+ messages in thread
From: Patchwork @ 2021-08-03 22:53 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev2)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
+drivers/gpu/drm/i915/intel_wakeref.c:142:19: warning: context imbalance in 'wakeref_auto_timeout' - unexpected unlock




* [Intel-gfx] ✗ Fi.CI.DOCS: warning for Parallel submission aka multi-bb execbuf (rev2)
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (47 preceding siblings ...)
  2021-08-03 22:53 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
@ 2021-08-03 22:57 ` Patchwork
  2021-08-03 23:19 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
  2021-08-05  3:53 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
  50 siblings, 0 replies; 111+ messages in thread
From: Patchwork @ 2021-08-03 22:57 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev2)
URL   : https://patchwork.freedesktop.org/series/92789/
State : warning

== Summary ==

$ make htmldocs 2>&1 > /dev/null | grep i915
/home/cidrm/kernel/Documentation/gpu/i915:525: ./drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:75: WARNING: Unexpected indentation.
/home/cidrm/kernel/Documentation/gpu/i915:525: ./drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:76: WARNING: Block quote ends without a blank line; unexpected unindent.
/home/cidrm/kernel/Documentation/gpu/i915:525: ./drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:78: WARNING: Unexpected indentation.
/home/cidrm/kernel/Documentation/gpu/i915:525: ./drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:79: WARNING: Block quote ends without a blank line; unexpected unindent.
/home/cidrm/kernel/Documentation/gpu/i915:525: ./drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:81: WARNING: Unexpected indentation.




* [Intel-gfx] ✓ Fi.CI.BAT: success for Parallel submission aka multi-bb execbuf (rev2)
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (48 preceding siblings ...)
  2021-08-03 22:57 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
@ 2021-08-03 23:19 ` Patchwork
  2021-08-05  3:53 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
  50 siblings, 0 replies; 111+ messages in thread
From: Patchwork @ 2021-08-03 23:19 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx


== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev2)
URL   : https://patchwork.freedesktop.org/series/92789/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_10442 -> Patchwork_20767
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/index.html

New tests
---------

  New tests have been introduced between CI_DRM_10442 and Patchwork_20767:

### New IGT tests (2) ###

  * igt@i915_selftest@live@guc_flow_control:
    - Statuses : 28 pass(s)
    - Exec time: [0.39, 4.85] s

  * igt@i915_selftest@live@guc_multi_lrc:
    - Statuses : 28 pass(s)
    - Exec time: [0.39, 5.20] s

  


Changes
-------

  No changes found


Participating hosts (37 -> 33)
------------------------------

  Missing    (4): fi-bdw-samus fi-bsw-cyan bat-jsl-1 fi-hsw-4200u 


Build changes
-------------

  * Linux: CI_DRM_10442 -> Patchwork_20767

  CI-20190529: 20190529
  CI_DRM_10442: d3816ffe379da79a69188424318fe2b5d458347b @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6159: 6135b9cc319ed965e3aafb5b2ae2abf4762a06b2 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_20767: 31f37bf214eb7f2a01feb3f13802662fc4b1d1d6 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

31f37bf214eb drm/i915/guc: Add delay before disabling scheduling on contexts
fe8d1a1c8b89 drm/i915/execlists: Weak parallel submission support for execlists
dd80d82c5c4f drm/i915: Enable multi-bb execbuf
be69dbafd8a5 drm/i915/guc: Handle errors in multi-lrc requests
692205dd3cff drm/i915: Hold all parallel requests until last request, properly handle error
3e6d8fb0d85c drm/i915: Eliminate unnecessary VMA calls for multi-BB submission
77d80c169b21 drm/i915: Multi-batch execbuffer2
e159516409c6 drm/i915: Force parallel contexts to use copy engine for reloc
b48c4d1c82fb drm/i915: Only track object dependencies on first request
53e1b9709de9 drm/i915: Teach execbuf there can be more than one batch in the objects list
e092e2a0aa4a drm/i915: Allow callers of i915_gem_do_execbuffer to override the batch index
866efe3b7b62 drm/i915: Store batch index in struct i915_execbuffer
ca7de92f4b87 drm/i915: Return output fence from i915_gem_do_execbuffer
866830d7b8df drm/i915: Move output fence handling to i915_gem_execbuffer2
722cabe63be0 drm/i915: Move input/exec fence handling to i915_gem_execbuffer2
94ba50b6aaae drm/i915: Move secure execbuf check to execbuf2
87f926fb4e0e drm/i915/guc: Implement no mid batch preemption for multi-lrc
f04f2fb6c221 drm/i915/guc: Extend GuC flow control selftest for multi-lrc
1e623d060bfa drm/i915/guc: Add basic GuC multi-lrc selftest
f5b94aff1055 drm/i915/doc: Update parallel submit doc to point to i915_drm.h
d70a91bd7039 drm/i915: Connect UAPI to GuC multi-lrc interface
e278330f18d8 drm/i915/guc: Update debugfs for GuC multi-lrc
e99b23e86490 drm/i915/guc: Implement multi-lrc reset
b48693c26c56 drm/i915/guc: Insert submit fences between requests in parent-child relationship
dae43b9ae4ca drm/i915/guc: Implement multi-lrc submission
dac8e22d9b0f drm/i915/guc: Add guc_child_context_destroy
1e00caa2078b drm/i915/guc: Add hang check to GuC submit engine
affe37a98124 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
9b9cf89ec1fa drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
de52a524ffff drm/i915/guc: Add multi-lrc context registration
ffe691bcda89 drm/i915/guc: Implement GuC parent-child context pin / unpin functions
fa22d6168e5d drm/i915/guc: Introduce context parent-child relationship
4fc6c6ac891e drm/i915: Expose logical engine instance to user
a00a22e7333d drm/i915: Add logical engine mapping
d89645be64e8 drm/i915/guc: Selftest for GuC flow control
da629af1ed2d drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
68d99ae2054b drm/i915/guc: Take engine PM when a context is pinned with GuC submission
8c650cfb86b5 drm/i915: Add GT PM unpark worker
ac6f010842ee drm/i915/guc: Take GT PM ref when deregistering context
f482b6e988b8 drm/i915/guc: Non-static lrc descriptor registration buffer
889784d21320 drm/i915/guc: Check return of __xa_store when registering a context
112dd7286938 drm/i915/guc: Introduce guc_submit_engine object
9a569e392672 drm/i915/guc: Don't allow requests not ready to consume all guc_ids
314754bac86b drm/i915/guc: Don't return -EAGAIN to user when guc_ids exhausted
27bc3be5a17a drm/i915/guc: Connect the number of guc_ids to debugfs
5e27ca906d33 drm/i915/guc: Allow flexible number of context ids

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/index.html



* [Intel-gfx] ✓ Fi.CI.IGT: success for Parallel submission aka multi-bb execbuf (rev2)
  2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
                   ` (49 preceding siblings ...)
  2021-08-03 23:19 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
@ 2021-08-05  3:53 ` Patchwork
  50 siblings, 0 replies; 111+ messages in thread
From: Patchwork @ 2021-08-05  3:53 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx


== Series Details ==

Series: Parallel submission aka multi-bb execbuf (rev2)
URL   : https://patchwork.freedesktop.org/series/92789/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_10442_full -> Patchwork_20767_full
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

New tests
---------

  New tests have been introduced between CI_DRM_10442_full and Patchwork_20767_full:

### New IGT tests (2) ###

  * igt@i915_selftest@live@guc_flow_control:
    - Statuses : 5 pass(s)
    - Exec time: [1.05, 4.88] s

  * igt@i915_selftest@live@guc_multi_lrc:
    - Statuses : 5 pass(s)
    - Exec time: [1.04, 4.83] s

  

Known issues
------------

  Here are the changes found in Patchwork_20767_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_ctx_persistence@legacy-engines-hang@render:
    - shard-iclb:         [PASS][1] -> [FAIL][2] ([i915#2410])
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb2/igt@gem_ctx_persistence@legacy-engines-hang@render.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb5/igt@gem_ctx_persistence@legacy-engines-hang@render.html
    - shard-tglb:         [PASS][3] -> [FAIL][4] ([i915#2410])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-tglb2/igt@gem_ctx_persistence@legacy-engines-hang@render.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb6/igt@gem_ctx_persistence@legacy-engines-hang@render.html
    - shard-apl:          NOTRUN -> [FAIL][5] ([i915#2410])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl2/igt@gem_ctx_persistence@legacy-engines-hang@render.html

  * igt@gem_ctx_persistence@legacy-engines-mixed:
    - shard-snb:          NOTRUN -> [SKIP][6] ([fdo#109271] / [i915#1099]) +2 similar issues
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-snb6/igt@gem_ctx_persistence@legacy-engines-mixed.html

  * igt@gem_exec_endless@dispatch@bcs0:
    - shard-skl:          NOTRUN -> [SKIP][7] ([fdo#109271]) +6 similar issues
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl10/igt@gem_exec_endless@dispatch@bcs0.html

  * igt@gem_exec_fair@basic-deadline:
    - shard-apl:          NOTRUN -> [FAIL][8] ([i915#2846])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl8/igt@gem_exec_fair@basic-deadline.html

  * igt@gem_exec_fair@basic-none-solo@rcs0:
    - shard-kbl:          NOTRUN -> [FAIL][9] ([i915#2842])
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@gem_exec_fair@basic-none-solo@rcs0.html

  * igt@gem_exec_fair@basic-none-vip@rcs0:
    - shard-glk:          [PASS][10] -> [FAIL][11] ([i915#2842])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-glk5/igt@gem_exec_fair@basic-none-vip@rcs0.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-glk6/igt@gem_exec_fair@basic-none-vip@rcs0.html

  * igt@gem_exec_fair@basic-none@rcs0:
    - shard-tglb:         [PASS][12] -> [FAIL][13] ([i915#2842])
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-tglb1/igt@gem_exec_fair@basic-none@rcs0.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb5/igt@gem_exec_fair@basic-none@rcs0.html

  * igt@gem_exec_fair@basic-none@vecs0:
    - shard-kbl:          [PASS][14] -> [FAIL][15] ([i915#2842]) +2 similar issues
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl7/igt@gem_exec_fair@basic-none@vecs0.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl2/igt@gem_exec_fair@basic-none@vecs0.html

  * igt@gem_exec_fair@basic-pace@vcs1:
    - shard-kbl:          [PASS][16] -> [SKIP][17] ([fdo#109271])
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl4/igt@gem_exec_fair@basic-pace@vcs1.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl6/igt@gem_exec_fair@basic-pace@vcs1.html

  * igt@gem_huc_copy@huc-copy:
    - shard-apl:          NOTRUN -> [SKIP][18] ([fdo#109271] / [i915#2190])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl1/igt@gem_huc_copy@huc-copy.html

  * igt@gem_mmap_gtt@cpuset-medium-copy:
    - shard-iclb:         [PASS][19] -> [FAIL][20] ([i915#307])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb1/igt@gem_mmap_gtt@cpuset-medium-copy.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb3/igt@gem_mmap_gtt@cpuset-medium-copy.html

  * igt@gem_pwrite@basic-exhaustion:
    - shard-kbl:          NOTRUN -> [WARN][21] ([i915#2658])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@gem_pwrite@basic-exhaustion.html
    - shard-apl:          NOTRUN -> [WARN][22] ([i915#2658])
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl7/igt@gem_pwrite@basic-exhaustion.html

  * igt@gem_render_copy@x-tiled-to-vebox-yf-tiled:
    - shard-kbl:          NOTRUN -> [SKIP][23] ([fdo#109271]) +124 similar issues
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@gem_render_copy@x-tiled-to-vebox-yf-tiled.html

  * igt@gem_userptr_blits@dmabuf-sync:
    - shard-apl:          NOTRUN -> [SKIP][24] ([fdo#109271] / [i915#3323])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl8/igt@gem_userptr_blits@dmabuf-sync.html

  * igt@gem_userptr_blits@vma-merge:
    - shard-snb:          NOTRUN -> [FAIL][25] ([i915#2724])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-snb2/igt@gem_userptr_blits@vma-merge.html

  * igt@gem_workarounds@suspend-resume-context:
    - shard-apl:          [PASS][26] -> [DMESG-WARN][27] ([i915#180])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-apl7/igt@gem_workarounds@suspend-resume-context.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl2/igt@gem_workarounds@suspend-resume-context.html

  * igt@i915_pm_dc@dc5-dpms:
    - shard-kbl:          NOTRUN -> [FAIL][28] ([i915#545])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@i915_pm_dc@dc5-dpms.html

  * igt@i915_pm_rpm@modeset-lpsp-stress:
    - shard-apl:          NOTRUN -> [SKIP][29] ([fdo#109271]) +253 similar issues
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl2/igt@i915_pm_rpm@modeset-lpsp-stress.html

  * igt@i915_pm_sseu@full-enable:
    - shard-skl:          [PASS][30] -> [FAIL][31] ([i915#3650])
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl3/igt@i915_pm_sseu@full-enable.html
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl6/igt@i915_pm_sseu@full-enable.html

  * igt@i915_suspend@sysfs-reader:
    - shard-apl:          NOTRUN -> [DMESG-WARN][32] ([i915#180])
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl3/igt@i915_suspend@sysfs-reader.html

  * igt@kms_async_flips@alternate-sync-async-flip:
    - shard-skl:          [PASS][33] -> [FAIL][34] ([i915#2521])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl5/igt@kms_async_flips@alternate-sync-async-flip.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl7/igt@kms_async_flips@alternate-sync-async-flip.html

  * igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-hflip:
    - shard-kbl:          NOTRUN -> [SKIP][35] ([fdo#109271] / [i915#3777]) +1 similar issue
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-hflip.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-hflip:
    - shard-apl:          NOTRUN -> [SKIP][36] ([fdo#109271] / [i915#3777]) +3 similar issues
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl2/igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-hflip.html

  * igt@kms_ccs@pipe-b-ccs-on-another-bo-y_tiled_gen12_mc_ccs:
    - shard-apl:          NOTRUN -> [SKIP][37] ([fdo#109271] / [i915#3886]) +9 similar issues
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl2/igt@kms_ccs@pipe-b-ccs-on-another-bo-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_gen12_mc_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][38] ([i915#3689] / [i915#3886])
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb7/igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-b-missing-ccs-buffer-y_tiled_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][39] ([i915#3689])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb7/igt@kms_ccs@pipe-b-missing-ccs-buffer-y_tiled_ccs.html

  * igt@kms_ccs@pipe-c-ccs-on-another-bo-y_tiled_gen12_mc_ccs:
    - shard-kbl:          NOTRUN -> [SKIP][40] ([fdo#109271] / [i915#3886]) +3 similar issues
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@kms_ccs@pipe-c-ccs-on-another-bo-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-d-bad-pixel-format-y_tiled_ccs:
    - shard-snb:          NOTRUN -> [SKIP][41] ([fdo#109271]) +223 similar issues
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-snb6/igt@kms_ccs@pipe-d-bad-pixel-format-y_tiled_ccs.html

  * igt@kms_chamelium@hdmi-edid-change-during-suspend:
    - shard-apl:          NOTRUN -> [SKIP][42] ([fdo#109271] / [fdo#111827]) +21 similar issues
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl3/igt@kms_chamelium@hdmi-edid-change-during-suspend.html

  * igt@kms_color_chamelium@pipe-a-ctm-0-25:
    - shard-snb:          NOTRUN -> [SKIP][43] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-snb6/igt@kms_color_chamelium@pipe-a-ctm-0-25.html

  * igt@kms_color_chamelium@pipe-a-ctm-0-75:
    - shard-kbl:          NOTRUN -> [SKIP][44] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl2/igt@kms_color_chamelium@pipe-a-ctm-0-75.html

  * igt@kms_color_chamelium@pipe-c-gamma:
    - shard-tglb:         NOTRUN -> [SKIP][45] ([fdo#109284] / [fdo#111827])
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb7/igt@kms_color_chamelium@pipe-c-gamma.html

  * igt@kms_content_protection@legacy:
    - shard-kbl:          NOTRUN -> [TIMEOUT][46] ([i915#1319])
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl2/igt@kms_content_protection@legacy.html
    - shard-tglb:         NOTRUN -> [SKIP][47] ([fdo#111828])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb7/igt@kms_content_protection@legacy.html

  * igt@kms_content_protection@srm:
    - shard-apl:          NOTRUN -> [TIMEOUT][48] ([i915#1319])
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl7/igt@kms_content_protection@srm.html

  * igt@kms_content_protection@uevent:
    - shard-kbl:          NOTRUN -> [FAIL][49] ([i915#2105])
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@kms_content_protection@uevent.html

  * igt@kms_cursor_crc@pipe-c-cursor-suspend:
    - shard-skl:          [PASS][50] -> [INCOMPLETE][51] ([i915#300])
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl2/igt@kms_cursor_crc@pipe-c-cursor-suspend.html
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl4/igt@kms_cursor_crc@pipe-c-cursor-suspend.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions:
    - shard-skl:          [PASS][52] -> [FAIL][53] ([i915#2346])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size:
    - shard-skl:          [PASS][54] -> [FAIL][55] ([i915#2346] / [i915#533])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl9/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl1/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html

  * igt@kms_draw_crc@draw-method-xrgb8888-blt-xtiled:
    - shard-skl:          [PASS][56] -> [DMESG-WARN][57] ([i915#1982]) +2 similar issues
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl9/igt@kms_draw_crc@draw-method-xrgb8888-blt-xtiled.html
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl6/igt@kms_draw_crc@draw-method-xrgb8888-blt-xtiled.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@b-hdmi-a2:
    - shard-glk:          [PASS][58] -> [FAIL][59] ([i915#79])
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-glk7/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-hdmi-a2.html
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-glk5/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-hdmi-a2.html

  * igt@kms_flip@flip-vs-expired-vblank@a-edp1:
    - shard-skl:          NOTRUN -> [FAIL][60] ([i915#79])
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl6/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html

  * igt@kms_flip@flip-vs-suspend-interruptible@b-dp1:
    - shard-kbl:          [PASS][61] -> [DMESG-WARN][62] ([i915#180])
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl6/igt@kms_flip@flip-vs-suspend-interruptible@b-dp1.html
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl7/igt@kms_flip@flip-vs-suspend-interruptible@b-dp1.html

  * igt@kms_flip@plain-flip-fb-recreate@b-edp1:
    - shard-skl:          [PASS][63] -> [FAIL][64] ([i915#2122])
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl6/igt@kms_flip@plain-flip-fb-recreate@b-edp1.html
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl4/igt@kms_flip@plain-flip-fb-recreate@b-edp1.html

  * igt@kms_flip@plain-flip-ts-check-interruptible@c-hdmi-a1:
    - shard-glk:          [PASS][65] -> [FAIL][66] ([i915#2122])
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-glk7/igt@kms_flip@plain-flip-ts-check-interruptible@c-hdmi-a1.html
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-glk5/igt@kms_flip@plain-flip-ts-check-interruptible@c-hdmi-a1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs:
    - shard-apl:          NOTRUN -> [SKIP][67] ([fdo#109271] / [i915#2672]) +1 similar issue
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl2/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytileccs-to-64bpp-ytile:
    - shard-tglb:         NOTRUN -> [SKIP][68] ([i915#2587])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb7/igt@kms_flip_scaled_crc@flip-32bpp-ytileccs-to-64bpp-ytile.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-32bpp-ytilercccs:
    - shard-kbl:          NOTRUN -> [SKIP][69] ([fdo#109271] / [i915#2672])
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-32bpp-ytilercccs.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-shrfb-pgflip-blt:
    - shard-tglb:         NOTRUN -> [SKIP][70] ([fdo#111825]) +3 similar issues
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb7/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-shrfb-pgflip-blt.html

  * igt@kms_pipe_crc_basic@disable-crc-after-crtc-pipe-d:
    - shard-apl:          NOTRUN -> [SKIP][71] ([fdo#109271] / [i915#533]) +1 similar issue
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl2/igt@kms_pipe_crc_basic@disable-crc-after-crtc-pipe-d.html

  * igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb:
    - shard-apl:          NOTRUN -> [FAIL][72] ([fdo#108145] / [i915#265]) +2 similar issues
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl3/igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb.html

  * igt@kms_plane_alpha_blend@pipe-a-coverage-7efc:
    - shard-skl:          [PASS][73] -> [FAIL][74] ([fdo#108145] / [i915#265])
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl8/igt@kms_plane_alpha_blend@pipe-a-coverage-7efc.html
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl5/igt@kms_plane_alpha_blend@pipe-a-coverage-7efc.html

  * igt@kms_plane_alpha_blend@pipe-b-alpha-basic:
    - shard-kbl:          NOTRUN -> [FAIL][75] ([fdo#108145] / [i915#265])
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@kms_plane_alpha_blend@pipe-b-alpha-basic.html

  * igt@kms_plane_alpha_blend@pipe-b-alpha-transparent-fb:
    - shard-kbl:          NOTRUN -> [FAIL][76] ([i915#265]) +1 similar issue
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@kms_plane_alpha_blend@pipe-b-alpha-transparent-fb.html

  * igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-4:
    - shard-apl:          NOTRUN -> [SKIP][77] ([fdo#109271] / [i915#658]) +4 similar issues
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl8/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-4.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1:
    - shard-kbl:          NOTRUN -> [SKIP][78] ([fdo#109271] / [i915#658]) +1 similar issue
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1.html

  * igt@kms_psr@psr2_primary_page_flip:
    - shard-iclb:         [PASS][79] -> [SKIP][80] ([fdo#109441]) +1 similar issue
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb2/igt@kms_psr@psr2_primary_page_flip.html
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb1/igt@kms_psr@psr2_primary_page_flip.html

  * igt@kms_writeback@writeback-fb-id:
    - shard-apl:          NOTRUN -> [SKIP][81] ([fdo#109271] / [i915#2437])
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl2/igt@kms_writeback@writeback-fb-id.html

  * igt@nouveau_crc@pipe-a-ctx-flip-skip-current-frame:
    - shard-tglb:         NOTRUN -> [SKIP][82] ([i915#2530])
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb7/igt@nouveau_crc@pipe-a-ctx-flip-skip-current-frame.html

  * igt@sysfs_clients@sema-10:
    - shard-apl:          NOTRUN -> [SKIP][83] ([fdo#109271] / [i915#2994]) +1 similar issue
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl1/igt@sysfs_clients@sema-10.html

  * igt@sysfs_clients@split-10:
    - shard-skl:          NOTRUN -> [SKIP][84] ([fdo#109271] / [i915#2994])
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl10/igt@sysfs_clients@split-10.html

  
#### Possible fixes ####

  * igt@feature_discovery@psr2:
    - shard-iclb:         [SKIP][85] ([i915#658]) -> [PASS][86]
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb6/igt@feature_discovery@psr2.html
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb2/igt@feature_discovery@psr2.html

  * igt@gem_ctx_persistence@legacy-engines-hang@render:
    - shard-kbl:          [FAIL][87] ([i915#2410]) -> [PASS][88]
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl4/igt@gem_ctx_persistence@legacy-engines-hang@render.html
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl3/igt@gem_ctx_persistence@legacy-engines-hang@render.html

  * igt@gem_exec_fair@basic-none-share@rcs0:
    - shard-iclb:         [FAIL][89] ([i915#2842]) -> [PASS][90] +1 similar issue
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb5/igt@gem_exec_fair@basic-none-share@rcs0.html
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb4/igt@gem_exec_fair@basic-none-share@rcs0.html

  * igt@gem_exec_fair@basic-pace-share@rcs0:
    - shard-glk:          [FAIL][91] ([i915#2842]) -> [PASS][92]
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-glk7/igt@gem_exec_fair@basic-pace-share@rcs0.html
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-glk5/igt@gem_exec_fair@basic-pace-share@rcs0.html

  * igt@gem_exec_fair@basic-pace@rcs0:
    - shard-tglb:         [FAIL][93] ([i915#2842]) -> [PASS][94]
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-tglb3/igt@gem_exec_fair@basic-pace@rcs0.html
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb8/igt@gem_exec_fair@basic-pace@rcs0.html

  * igt@gem_huc_copy@huc-copy:
    - shard-tglb:         [SKIP][95] ([i915#2190]) -> [PASS][96]
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-tglb6/igt@gem_huc_copy@huc-copy.html
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-tglb5/igt@gem_huc_copy@huc-copy.html

  * igt@gem_mmap_gtt@cpuset-medium-copy-xy:
    - shard-iclb:         [FAIL][97] ([i915#307]) -> [PASS][98] +1 similar issue
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb8/igt@gem_mmap_gtt@cpuset-medium-copy-xy.html
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb2/igt@gem_mmap_gtt@cpuset-medium-copy-xy.html

  * igt@i915_pm_dc@dc6-psr:
    - shard-iclb:         [FAIL][99] ([i915#454]) -> [PASS][100]
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb6/igt@i915_pm_dc@dc6-psr.html
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb6/igt@i915_pm_dc@dc6-psr.html

  * igt@kms_cursor_crc@pipe-a-cursor-suspend:
    - shard-kbl:          [DMESG-WARN][101] ([i915#180]) -> [PASS][102] +4 similar issues
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl7/igt@kms_cursor_crc@pipe-a-cursor-suspend.html
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl2/igt@kms_cursor_crc@pipe-a-cursor-suspend.html

  * igt@kms_fbcon_fbt@fbc-suspend:
    - shard-kbl:          [INCOMPLETE][103] ([i915#155] / [i915#180] / [i915#636]) -> [PASS][104]
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl4/igt@kms_fbcon_fbt@fbc-suspend.html
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl4/igt@kms_fbcon_fbt@fbc-suspend.html

  * igt@kms_flip@flip-vs-suspend@a-dp1:
    - shard-apl:          [DMESG-WARN][105] ([i915#180]) -> [PASS][106] +2 similar issues
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-apl8/igt@kms_flip@flip-vs-suspend@a-dp1.html
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl7/igt@kms_flip@flip-vs-suspend@a-dp1.html

  * igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1:
    - shard-skl:          [FAIL][107] ([i915#2122]) -> [PASS][108] +1 similar issue
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl4/igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1.html
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl2/igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1.html

  * igt@kms_frontbuffer_tracking@psr-1p-rte:
    - shard-skl:          [DMESG-WARN][109] ([i915#1982]) -> [PASS][110]
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl8/igt@kms_frontbuffer_tracking@psr-1p-rte.html
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl5/igt@kms_frontbuffer_tracking@psr-1p-rte.html

  * igt@kms_hdr@bpc-switch-suspend:
    - shard-skl:          [FAIL][111] ([i915#1188]) -> [PASS][112] +1 similar issue
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl9/igt@kms_hdr@bpc-switch-suspend.html
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl10/igt@kms_hdr@bpc-switch-suspend.html

  * igt@kms_plane_alpha_blend@pipe-b-coverage-7efc:
    - shard-skl:          [FAIL][113] ([fdo#108145] / [i915#265]) -> [PASS][114]
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-skl8/igt@kms_plane_alpha_blend@pipe-b-coverage-7efc.html
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-skl9/igt@kms_plane_alpha_blend@pipe-b-coverage-7efc.html

  * igt@kms_psr2_su@frontbuffer:
    - shard-iclb:         [SKIP][115] ([fdo#109642] / [fdo#111068] / [i915#658]) -> [PASS][116]
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb3/igt@kms_psr2_su@frontbuffer.html
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb2/igt@kms_psr2_su@frontbuffer.html

  * igt@kms_psr@psr2_cursor_render:
    - shard-iclb:         [SKIP][117] ([fdo#109441]) -> [PASS][118]
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb8/igt@kms_psr@psr2_cursor_render.html
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb2/igt@kms_psr@psr2_cursor_render.html

  
#### Warnings ####

  * igt@i915_pm_rc6_residency@rc6-idle:
    - shard-iclb:         [WARN][119] ([i915#2684]) -> [WARN][120] ([i915#1804] / [i915#2684])
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb1/igt@i915_pm_rc6_residency@rc6-idle.html
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb3/igt@i915_pm_rc6_residency@rc6-idle.html

  * igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-4:
    - shard-iclb:         [SKIP][121] ([i915#2920]) -> [SKIP][122] ([i915#658]) +2 similar issues
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb2/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-4.html
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb1/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-4.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1:
    - shard-iclb:         [SKIP][123] ([i915#658]) -> [SKIP][124] ([i915#2920]) +2 similar issues
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-iclb6/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1.html
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-iclb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1.html

  * igt@runner@aborted:
    - shard-kbl:          ([FAIL][125], [FAIL][126], [FAIL][127], [FAIL][128], [FAIL][129], [FAIL][130]) ([i915#180] / [i915#1814] / [i915#2505] / [i915#3002] / [i915#3363] / [i915#92]) -> ([FAIL][131], [FAIL][132], [FAIL][133]) ([i915#180] / [i915#3002] / [i915#3363])
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl7/igt@runner@aborted.html
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl7/igt@runner@aborted.html
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl6/igt@runner@aborted.html
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl4/igt@runner@aborted.html
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl7/igt@runner@aborted.html
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-kbl4/igt@runner@aborted.html
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl2/igt@runner@aborted.html
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl7/igt@runner@aborted.html
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-kbl7/igt@runner@aborted.html
    - shard-apl:          ([FAIL][134], [FAIL][135], [FAIL][136], [FAIL][137], [FAIL][138]) ([i915#1610] / [i915#180] / [i915#1814] / [i915#3002] / [i915#3363]) -> ([FAIL][139], [FAIL][140], [FAIL][141], [FAIL][142], [FAIL][143]) ([fdo#109271] / [i915#180] / [i915#3002] / [i915#3363])
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-apl8/igt@runner@aborted.html
   [135]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-apl6/igt@runner@aborted.html
   [136]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-apl1/igt@runner@aborted.html
   [137]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-apl2/igt@runner@aborted.html
   [138]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10442/shard-apl8/igt@runner@aborted.html
   [139]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl6/igt@runner@aborted.html
   [140]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl3/igt@runner@aborted.html
   [141]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl3/igt@runner@aborted.html
   [142]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl8/igt@runner@aborted.html
   [143]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/shard-apl2/igt@runner@aborted.html

  
  [fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109284]: https://bugs.freedesktop.org/show_bug.cgi?id=109284
  [fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
  [fdo#109642]: https://bugs.freedesktop.org/show_bug.cgi?id=109642
  [fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
  [fdo#111825]: https://bugs.freedesktop.org/show_bug.cgi?id=111825
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [fdo#111828]: https://bugs.freedesktop.org/show_bug.cgi?id=111828
  [i915#1099]: https://gitlab.freedesktop.org/drm/intel/issues/1099
  [i915#1188]: https://gitlab.freedesktop.org/drm/intel/issues/1188
  [i915#1319]: https://gitlab.freedesktop.org/drm/intel/issues/1319
  [i915#155]: https://gitlab.freedesktop.org/drm/intel/issues/155
  [i915#1610]: https://gitlab.freedesktop.org/drm/intel/issues/1610
  [i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
  [i915#1804]: https://gitlab.freedesktop.org/drm/intel/issues/1804
  [i915#1814]: https://gitlab.freedesktop.org/drm/

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20767/index.html

[-- Attachment #2: Type: text/html, Size: 36338 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 03/46] drm/i915/guc: Don't return -EAGAIN to user when guc_ids exhausted
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 03/46] drm/i915/guc: Don't return -EAGAIN to user when guc_ids exhausted Matthew Brost
@ 2021-08-05  8:27   ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-05  8:27 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:00PM -0700, Matthew Brost wrote:
> Rather than returning -EAGAIN to the user when no guc_ids are available,
> implement a fair sharing algorithm in the kernel which blocks submissions
> until guc_ids become available. Submissions are released one at a time,
> based on priority, until the guc_id pressure is released to ensure fair
> sharing of the guc_ids. Once the pressure is fully released, the normal
> guc_id allocation (at request creation time in guc_request_alloc) can
> resume as this allocation path should be significantly faster and a fair
> sharing algorithm isn't needed when guc_ids are plentiful.
> 
> The fair sharing algorithm is implemented by forcing all submissions to
> the tasklet which serializes submissions, dequeuing one at a time.
> 
> If the submission doesn't have a guc_id and a new guc_id can't be found,
> two lists are searched, one list with contexts that are not pinned but
> still registered with the guc (searched first) and another list with
> contexts that are pinned but do not have any submissions actively in
> flight (scheduling enabled + registered, searched second). If no
> guc_ids can be found we kick a workqueue which will retire requests
> hopefully freeing a guc_id. The workqueue + tasklet ping / pong back and
> forth until a guc_id can be found.
> 
> Once a guc_id is found, we may have to disable context scheduling
> depending on which list the context is stolen from. When we disable
> scheduling, we block the tasklet from executing until the completion G2H
> returns. The disable scheduling must be issued from the workqueue
> because of the locking structure. When we deregister a context, we also
> do the same thing (waiting on the G2H) but we can safely issue the
> deregister H2G from the tasklet.
> 
> Once all the G2H have returned we can trigger a submission on the
> context.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 +-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 805 ++++++++++++++++--
>  drivers/gpu/drm/i915/i915_request.h           |   6 +
>  4 files changed, 754 insertions(+), 86 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index e54351a170e2..8ed964ef967b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -185,6 +185,9 @@ struct intel_context {
>  	/* GuC LRC descriptor reference count */
>  	atomic_t guc_id_ref;
>  
> +	/* Number of rq submitted without a guc_id */
> +	u16 guc_num_rq_submit_no_id;
> +
>  	/*
>  	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
>  	 */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 1d7cb118e70f..e76579396efd 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -33,7 +33,28 @@ struct intel_guc {
>  
>  	/* Global engine used to submit requests to GuC */
>  	struct i915_sched_engine *sched_engine;
> -	struct i915_request *stalled_request;
> +
> +	/* Global state related to submission tasklet */
> +	struct i915_request *stalled_rq;
> +	struct intel_context *stalled_context;
> +	struct work_struct retire_worker;
> +	unsigned long flags;
> +	int total_num_rq_with_no_guc_id;
> +
> +	/*
> +	 * Submission stall reason. See intel_guc_submission.c for a detailed
> +	 * description.
> +	 */

I think documenting this kind of stuff inline as kerneldoc is neater, and
closer to where it's generally needed. Source navigation tools point you
here, not to the comment that's buried somewhere else.
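
For illustration, the inline form being suggested could look roughly like
this (a sketch only: the description wording and the container struct name
are assumptions, and in the patch the member actually lives in struct
intel_guc):

	/* Sketch of inline kerneldoc on the stall-reason member. */
	struct example_guc_flow_control {
		/**
		 * @submission_stall_reason: why the submission tasklet is
		 * currently stalled; see the flow control state machine
		 * description in intel_guc_submission.c for the possible
		 * transitions.
		 */
		enum {
			STALL_NONE,
			STALL_GUC_ID_WORKQUEUE,
			STALL_GUC_ID_TASKLET,
			STALL_SCHED_DISABLE,
			STALL_REGISTER_CONTEXT,
			STALL_DEREGISTER_CONTEXT,
			STALL_MOVE_LRC_TAIL,
			STALL_ADD_REQUEST,
		} submission_stall_reason;
	};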

> +	enum {
> +		STALL_NONE,
> +		STALL_GUC_ID_WORKQUEUE,
> +		STALL_GUC_ID_TASKLET,
> +		STALL_SCHED_DISABLE,
> +		STALL_REGISTER_CONTEXT,
> +		STALL_DEREGISTER_CONTEXT,
> +		STALL_MOVE_LRC_TAIL,
> +		STALL_ADD_REQUEST,
> +	} submission_stall_reason;
>  
>  	/* intel_guc_recv interrupt related state */
>  	spinlock_t irq_lock;
> @@ -55,7 +76,8 @@ struct intel_guc {
>  	struct ida guc_ids;
>  	u32 num_guc_ids;
>  	u32 max_guc_ids;
> -	struct list_head guc_id_list;
> +	struct list_head guc_id_list_no_ref;
> +	struct list_head guc_id_list_unpinned;
>  
>  	bool submission_supported;
>  	bool submission_selected;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3b555c05c01c..f42a707f60ca 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -59,6 +59,25 @@
>   * ELSP context descriptor dword into Work Item.
>   * See guc_add_request()
>   *
> + * GuC flow control state machine:
> + * The tasklet, workqueue (retire_worker), and the G2H handlers together more or
> + * less form a state machine which is used to submit requests + flow control
> + * requests, while waiting on resources / actions, if necessary. The enum,
> + * submission_stall_reason, controls the handoff of stalls between these
> + * entities with stalled_rq & stalled_context being the arguments. Each state
> + * is described below.
> + *
> + * STALL_NONE			No stall condition
> + * STALL_GUC_ID_WORKQUEUE	Workqueue will try to free guc_ids
> + * STALL_GUC_ID_TASKLET		Tasklet will try to find guc_id
> + * STALL_SCHED_DISABLE		Workqueue will issue context schedule disable
> + *				H2G
> + * STALL_REGISTER_CONTEXT	Tasklet needs to register context
> + * STALL_DEREGISTER_CONTEXT	G2H handler is waiting for context deregister,
> + *				will register context upon receipt of G2H
> + * STALL_MOVE_LRC_TAIL		Tasklet will try to move LRC tail
> + * STALL_ADD_REQUEST		Tasklet will try to add the request (submit
> + *				context)
>   */
>  
>  /* GuC Virtual Engine */
> @@ -72,6 +91,83 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
>  
>  #define GUC_REQUEST_SIZE 64 /* bytes */
>  
> +/*
> + * Global GuC flags helper functions
> + */
> +enum {
> +	GUC_STATE_TASKLET_BLOCKED,
> +	GUC_STATE_GUC_IDS_EXHAUSTED,
> +};
> +
> +static bool tasklet_blocked(struct intel_guc *guc)
> +{
> +	return test_bit(GUC_STATE_TASKLET_BLOCKED, &guc->flags);

So I know the existing code absolutely loves its atomic bitflags for a
state machine in every corner, but that doesn't make it a good idea.

Roughly the justification needs to be:
- first try to get rid of your state machine as much as possible. Any
  transition or state you can remove is good

- next use big dumb locks for your state machine and state. Stuff like
  wrapping your entire tasklet into a mutex (see the sketch after this
  comment for the rough shape).

- if you do smaller locks you already have to think way too hard about
  ordering and what happens if everything gets reordered/delayed in
  unexpected ways and how things slip through. That needs substantial
  comments already if it's not some obvious pattern. Obvious here means
  you're just using established existing primitives like queue_work() and
  flush_work() to manage your state machine (not sure what all tasklets
  provide here, but probably worth a look to make sure we're not
  reinventing wheels)

- if smaller locks are not good enough from a perf pov, then a) you must
  supply perf data to justify them, and b) the commit message really needs
  to explain why all this complexity is needed, and your review
  requirements just went through the roof.

- anything that doesn't use locks and doesn't use established,
  well-understood primitives needs barriers. Those barriers each need a
  comment which explains where the counterparty is. If this doesn't feel
  like you're writing a small academic paper, you're not doing it right.
  Note that with dgpu we can't shrug this all off with "IA is a TSO,
  reordering never happens". Also, even on a TSO machine the compiler is
  still allowed to reorder everything.


Given that "guc is stalling" isn't really a perf critical thing, I think
this needs to be substantially dumbed down, or backed by substantially more
benchmark data proving that we need it.

And no, there's simply no budget anymore in i915-gem for frivolous use of
overengineered lockless stuff; we have way, way too much of that.
We need to reduce this, not add more.
-Daniel
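
As a concrete illustration of the "big dumb lock" bullet above (a sketch
only: the names and field layout are assumptions, not taken from the
series; a spinlock rather than a mutex is used because the submission
tasklet runs in softirq context and cannot sleep):

	/* One lock, plain fields: no atomic bitflags, no barriers needed. */
	struct guc_flow_control {
		spinlock_t lock;	/* protects all fields below */
		bool tasklet_blocked;
		bool guc_ids_exhausted;
		int stall_reason;
		struct i915_request *stalled_rq;
	};

	static bool fc_tasklet_blocked(struct guc_flow_control *fc)
	{
		lockdep_assert_held(&fc->lock);
		return fc->tasklet_blocked;
	}

	static void fc_block_tasklet(struct guc_flow_control *fc,
				     struct i915_request *rq, int reason)
	{
		lockdep_assert_held(&fc->lock);
		fc->tasklet_blocked = true;
		fc->stall_reason = reason;
		fc->stalled_rq = rq;
	}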


> +}
> +
> +static void set_tasklet_blocked(struct intel_guc *guc)
> +{
> +	lockdep_assert_held(&guc->sched_engine->lock);
> +	set_bit(GUC_STATE_TASKLET_BLOCKED, &guc->flags);
> +}
> +
> +static void __clr_tasklet_blocked(struct intel_guc *guc)
> +{
> +	lockdep_assert_held(&guc->sched_engine->lock);
> +	clear_bit(GUC_STATE_TASKLET_BLOCKED, &guc->flags);
> +}
> +
> +static void clr_tasklet_blocked(struct intel_guc *guc)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->sched_engine->lock, flags);
> +	__clr_tasklet_blocked(guc);
> +	spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
> +}
> +
> +static bool guc_ids_exhausted(struct intel_guc *guc)
> +{
> +	return test_bit(GUC_STATE_GUC_IDS_EXHAUSTED, &guc->flags);
> +}
> +
> +static bool test_and_update_guc_ids_exhausted(struct intel_guc *guc)
> +{
> +	unsigned long flags;
> +	bool ret = false;
> +
> +	/*
> +	 * Strict ordering on checking if guc_ids are exhausted isn't required,
> +	 * so let's avoid grabbing the submission lock if possible.
> +	 */
> +	if (guc_ids_exhausted(guc)) {
> +		spin_lock_irqsave(&guc->sched_engine->lock, flags);
> +		ret = guc_ids_exhausted(guc);
> +		if (ret)
> +			++guc->total_num_rq_with_no_guc_id;
> +		spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
> +	}
> +
> +	return ret;
> +}
> +
> +static void set_and_update_guc_ids_exhausted(struct intel_guc *guc)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->sched_engine->lock, flags);
> +	++guc->total_num_rq_with_no_guc_id;
> +	set_bit(GUC_STATE_GUC_IDS_EXHAUSTED, &guc->flags);
> +	spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
> +}
> +
> +static void clr_guc_ids_exhausted(struct intel_guc *guc)
> +{
> +	lockdep_assert_held(&guc->sched_engine->lock);
> +	GEM_BUG_ON(guc->total_num_rq_with_no_guc_id);
> +
> +	clear_bit(GUC_STATE_GUC_IDS_EXHAUSTED, &guc->flags);
> +}
> +
>  /*
>   * Below is a set of functions which control the GuC scheduling state which do
>   * not require a lock as all state transitions are mutually exclusive. i.e. It
> @@ -82,6 +178,9 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
>  #define SCHED_STATE_NO_LOCK_ENABLED			BIT(0)
>  #define SCHED_STATE_NO_LOCK_PENDING_ENABLE		BIT(1)
>  #define SCHED_STATE_NO_LOCK_REGISTERED			BIT(2)
> +#define SCHED_STATE_NO_LOCK_BLOCK_TASKLET		BIT(3)
> +#define SCHED_STATE_NO_LOCK_GUC_ID_STOLEN		BIT(4)
> +#define SCHED_STATE_NO_LOCK_NEEDS_REGISTER		BIT(5)
>  static inline bool context_enabled(struct intel_context *ce)
>  {
>  	return (atomic_read(&ce->guc_sched_state_no_lock) &
> @@ -135,6 +234,60 @@ static inline void clr_context_registered(struct intel_context *ce)
>  		   &ce->guc_sched_state_no_lock);
>  }
>  
> +static inline bool context_block_tasklet(struct intel_context *ce)
> +{
> +	return (atomic_read(&ce->guc_sched_state_no_lock) &
> +		SCHED_STATE_NO_LOCK_BLOCK_TASKLET);
> +}
> +
> +static inline void set_context_block_tasklet(struct intel_context *ce)
> +{
> +	atomic_or(SCHED_STATE_NO_LOCK_BLOCK_TASKLET,
> +		  &ce->guc_sched_state_no_lock);
> +}
> +
> +static inline void clr_context_block_tasklet(struct intel_context *ce)
> +{
> +	atomic_and((u32)~SCHED_STATE_NO_LOCK_BLOCK_TASKLET,
> +		   &ce->guc_sched_state_no_lock);
> +}
> +
> +static inline bool context_guc_id_stolen(struct intel_context *ce)
> +{
> +	return (atomic_read(&ce->guc_sched_state_no_lock) &
> +		SCHED_STATE_NO_LOCK_GUC_ID_STOLEN);
> +}
> +
> +static inline void set_context_guc_id_stolen(struct intel_context *ce)
> +{
> +	atomic_or(SCHED_STATE_NO_LOCK_GUC_ID_STOLEN,
> +		  &ce->guc_sched_state_no_lock);
> +}
> +
> +static inline void clr_context_guc_id_stolen(struct intel_context *ce)
> +{
> +	atomic_and((u32)~SCHED_STATE_NO_LOCK_GUC_ID_STOLEN,
> +		   &ce->guc_sched_state_no_lock);
> +}
> +
> +static inline bool context_needs_register(struct intel_context *ce)
> +{
> +	return (atomic_read(&ce->guc_sched_state_no_lock) &
> +		SCHED_STATE_NO_LOCK_NEEDS_REGISTER);
> +}
> +
> +static inline void set_context_needs_register(struct intel_context *ce)
> +{
> +	atomic_or(SCHED_STATE_NO_LOCK_NEEDS_REGISTER,
> +		  &ce->guc_sched_state_no_lock);
> +}
> +
> +static inline void clr_context_needs_register(struct intel_context *ce)
> +{
> +	atomic_and((u32)~SCHED_STATE_NO_LOCK_NEEDS_REGISTER,
> +		   &ce->guc_sched_state_no_lock);
> +}
> +
>  /*
>   * Below is a set of functions which control the GuC scheduling state which
>   * require a lock, aside from the special case where the functions are called
> @@ -418,9 +571,12 @@ int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
>  					      true, timeout);
>  }
>  
> -static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
> +static inline bool request_has_no_guc_id(struct i915_request *rq)
> +{
> +	return test_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
> +}
>  
> -static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> +static int __guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>  {
>  	int err = 0;
>  	struct intel_context *ce = rq->context;
> @@ -439,18 +595,15 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>  		goto out;
>  	}
>  
> +	/* Ensure context is in correct state before a submission */
> +	GEM_BUG_ON(ce->guc_num_rq_submit_no_id);
> +	GEM_BUG_ON(request_has_no_guc_id(rq));
>  	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
> +	GEM_BUG_ON(context_needs_register(ce));
>  	GEM_BUG_ON(context_guc_id_invalid(ce));
> -
> -	/*
> -	 * Corner case where the GuC firmware was blown away and reloaded while
> -	 * this context was pinned.
> -	 */
> -	if (unlikely(!lrc_desc_registered(guc, ce->guc_id))) {
> -		err = guc_lrc_desc_pin(ce, false);
> -		if (unlikely(err))
> -			goto out;
> -	}
> +	GEM_BUG_ON(context_pending_disable(ce));
> +	GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
> +	GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id));
>  
>  	/*
>  	 * The request / context will be run on the hardware when scheduling
> @@ -462,6 +615,8 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>  	enabled = context_enabled(ce);
>  
>  	if (!enabled) {
> +		GEM_BUG_ON(context_pending_enable(ce));
> +
>  		action[len++] = INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
>  		action[len++] = ce->guc_id;
>  		action[len++] = GUC_CONTEXT_ENABLE;
> @@ -489,6 +644,67 @@ static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
>  	return err;
>  }
>  
> +static int guc_add_request(struct intel_guc *guc, struct i915_request *rq)
> +{
> +	int ret;
> +
> +	lockdep_assert_held(&guc->sched_engine->lock);
> +
> +	ret = __guc_add_request(guc, rq);
> +	if (ret == -EBUSY) {
> +		guc->stalled_rq = rq;
> +		guc->submission_stall_reason = STALL_ADD_REQUEST;
> +	} else {
> +		guc->stalled_rq = NULL;
> +		guc->submission_stall_reason = STALL_NONE;
> +	}
> +
> +	return ret;
> +}
> +
> +static int guc_lrc_desc_pin(struct intel_context *ce, bool loop);
> +
> +static int tasklet_register_context(struct intel_guc *guc,
> +				    struct i915_request *rq)
> +{
> +	struct intel_context *ce = rq->context;
> +	int ret = 0;
> +
> +	/* Check state */
> +	lockdep_assert_held(&guc->sched_engine->lock);
> +	GEM_BUG_ON(ce->guc_num_rq_submit_no_id);
> +	GEM_BUG_ON(request_has_no_guc_id(rq));
> +	GEM_BUG_ON(context_guc_id_invalid(ce));
> +	GEM_BUG_ON(!atomic_read(&ce->guc_id_ref));
> +
> +	/*
> +	 * The guc_id is getting pinned during the tasklet and we need to
> +	 * register this context, or handle the corner case where the GuC firmware
> +	 * was blown away and reloaded while this context was pinned.
> +	 */
> +	if (unlikely((!lrc_desc_registered(guc, ce->guc_id) ||
> +		      context_needs_register(ce)) &&
> +		     !intel_context_is_banned(ce))) {
> +		GEM_BUG_ON(context_pending_disable(ce));
> +		GEM_BUG_ON(context_wait_for_deregister_to_register(ce));
> +
> +		ret = guc_lrc_desc_pin(ce, false);
> +
> +		if (likely(ret != -EBUSY))
> +			clr_context_needs_register(ce);
> +
> +		if (unlikely(ret == -EBUSY)) {
> +			guc->stalled_rq = rq;
> +			guc->submission_stall_reason = STALL_REGISTER_CONTEXT;
> +		} else if (unlikely(ret == -EINPROGRESS)) {
> +			guc->stalled_rq = rq;
> +			guc->submission_stall_reason = STALL_DEREGISTER_CONTEXT;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
>  static inline void guc_set_lrc_tail(struct i915_request *rq)
>  {
>  	rq->context->lrc_reg_state[CTX_RING_TAIL] =
> @@ -500,77 +716,142 @@ static inline int rq_prio(const struct i915_request *rq)
>  	return rq->sched.attr.priority;
>  }
>  
> +static void kick_retire_wq(struct intel_guc *guc)
> +{
> +	queue_work(system_unbound_wq, &guc->retire_worker);
> +}
> +
> +static int tasklet_pin_guc_id(struct intel_guc *guc, struct i915_request *rq);
> +
>  static int guc_dequeue_one_context(struct intel_guc *guc)
>  {
>  	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> -	struct i915_request *last = NULL;
> -	bool submit = false;
> +	struct i915_request *last = guc->stalled_rq;
> +	bool submit = !!last;
>  	struct rb_node *rb;
>  	int ret;
>  
>  	lockdep_assert_held(&sched_engine->lock);
> +	GEM_BUG_ON(guc->stalled_context);
> +	GEM_BUG_ON(!submit && guc->submission_stall_reason);
>  
> -	if (guc->stalled_request) {
> -		submit = true;
> -		last = guc->stalled_request;
> -		goto resubmit;
> -	}
> +	if (submit) {
> +		/* Flow control conditions */
> +		switch (guc->submission_stall_reason) {
> +		case STALL_GUC_ID_TASKLET:
> +			goto done;
> +		case STALL_REGISTER_CONTEXT:
> +			goto register_context;
> +		case STALL_MOVE_LRC_TAIL:
> +			goto move_lrc_tail;
> +		case STALL_ADD_REQUEST:
> +			goto add_request;
> +		default:
> +			GEM_BUG_ON("Invalid stall state");
> +		}
> +	} else {
> +		GEM_BUG_ON(!guc->total_num_rq_with_no_guc_id &&
> +			   guc_ids_exhausted(guc));
>  
> -	while ((rb = rb_first_cached(&sched_engine->queue))) {
> -		struct i915_priolist *p = to_priolist(rb);
> -		struct i915_request *rq, *rn;
> +		while ((rb = rb_first_cached(&sched_engine->queue))) {
> +			struct i915_priolist *p = to_priolist(rb);
> +			struct i915_request *rq, *rn;
>  
> -		priolist_for_each_request_consume(rq, rn, p) {
> -			if (last && rq->context != last->context)
> -				goto done;
> +			priolist_for_each_request_consume(rq, rn, p) {
> +				if (last && rq->context != last->context)
> +					goto done;
>  
> -			list_del_init(&rq->sched.link);
> +				list_del_init(&rq->sched.link);
>  
> -			__i915_request_submit(rq);
> +				__i915_request_submit(rq);
>  
> -			trace_i915_request_in(rq, 0);
> -			last = rq;
> -			submit = true;
> -		}
> +				trace_i915_request_in(rq, 0);
> +				last = rq;
> +				submit = true;
> +			}
>  
> -		rb_erase_cached(&p->node, &sched_engine->queue);
> -		i915_priolist_free(p);
> +			rb_erase_cached(&p->node, &sched_engine->queue);
> +			i915_priolist_free(p);
> +		}
>  	}
> +
>  done:
>  	if (submit) {
> +		struct intel_context *ce = last->context;
> +
> +		if (ce->guc_num_rq_submit_no_id) {
> +			ret = tasklet_pin_guc_id(guc, last);
> +			if (ret)
> +				goto blk_tasklet_kick;
> +		}
> +
> +register_context:
> +		ret = tasklet_register_context(guc, last);
> +		if (unlikely(ret == -EINPROGRESS)) {
> +			goto blk_tasklet;
> +		} else if (unlikely(ret == -EPIPE)) {
> +			goto deadlk;
> +		} else if (ret == -EBUSY) {
> +			goto schedule_tasklet;
> +		} else if (unlikely(ret != 0)) {
> +			GEM_WARN_ON(ret);	/* Unexpected */
> +			goto deadlk;
> +		}
> +
> +move_lrc_tail:
>  		guc_set_lrc_tail(last);
> -resubmit:
> +
> +add_request:
>  		ret = guc_add_request(guc, last);
> -		if (unlikely(ret == -EPIPE))
> +		if (unlikely(ret == -EPIPE)) {
> +			goto deadlk;
> +		} else if (ret == -EBUSY) {
> +			goto schedule_tasklet;
> +		} else if (unlikely(ret != 0)) {
> +			GEM_WARN_ON(ret);	/* Unexpected */
>  			goto deadlk;
> -		else if (ret == -EBUSY) {
> -			tasklet_schedule(&sched_engine->tasklet);
> -			guc->stalled_request = last;
> -			return false;
>  		}
>  	}
>  
> -	guc->stalled_request = NULL;
> +	/*
> +	 * No requests without a guc_id, enable guc_id allocation at request
> +	 * creation time (guc_request_alloc).
> +	 */
> +	if (!guc->total_num_rq_with_no_guc_id)
> +		clr_guc_ids_exhausted(guc);
> +
>  	return submit;
>  
> +schedule_tasklet:
> +	tasklet_schedule(&sched_engine->tasklet);
> +	return false;
> +
>  deadlk:
>  	sched_engine->tasklet.callback = NULL;
>  	tasklet_disable_nosync(&sched_engine->tasklet);
>  	return false;
> +
> +blk_tasklet_kick:
> +	kick_retire_wq(guc);
> +blk_tasklet:
> +	set_tasklet_blocked(guc);
> +	return false;
>  }
>  
>  static void guc_submission_tasklet(struct tasklet_struct *t)
>  {
>  	struct i915_sched_engine *sched_engine =
>  		from_tasklet(sched_engine, t, tasklet);
> +	struct intel_guc *guc = sched_engine->private_data;
>  	unsigned long flags;
>  	bool loop;
>  
>  	spin_lock_irqsave(&sched_engine->lock, flags);
>  
> -	do {
> -		loop = guc_dequeue_one_context(sched_engine->private_data);
> -	} while (loop);
> +	if (likely(!tasklet_blocked(guc)))
> +		do {
> +			loop = guc_dequeue_one_context(guc);
> +		} while (loop);
>  
>  	i915_sched_engine_reset_on_empty(sched_engine);
>  
> @@ -653,6 +934,14 @@ submission_disabled(struct intel_guc *guc)
>  			!__tasklet_is_enabled(&sched_engine->tasklet));
>  }
>  
> +static void kick_tasklet(struct intel_guc *guc)
> +{
> +	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> +
> +	if (likely(!tasklet_blocked(guc)))
> +		tasklet_hi_schedule(&sched_engine->tasklet);
> +}
> +
>  static void disable_submission(struct intel_guc *guc)
>  {
>  	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> @@ -676,8 +965,16 @@ static void enable_submission(struct intel_guc *guc)
>  	    __tasklet_enable(&sched_engine->tasklet)) {
>  		GEM_BUG_ON(!guc->ct.enabled);
>  
> +		/* Reset tasklet state */
> +		guc->stalled_rq = NULL;
> +		if (guc->stalled_context)
> +			intel_context_put(guc->stalled_context);
> +		guc->stalled_context = NULL;
> +		guc->submission_stall_reason = STALL_NONE;
> +		guc->flags = 0;
> +
>  		/* And kick in case we missed a new request submission. */
> -		tasklet_hi_schedule(&sched_engine->tasklet);
> +		kick_tasklet(guc);
>  	}
>  	spin_unlock_irqrestore(&guc->sched_engine->lock, flags);
>  }
> @@ -856,6 +1153,7 @@ static void __guc_reset_context(struct intel_context *ce, bool stalled)
>  out_replay:
>  	guc_reset_state(ce, head, stalled);
>  	__unwind_incomplete_requests(ce);
> +	ce->guc_num_rq_submit_no_id = 0;
>  	intel_context_put(ce);
>  }
>  
> @@ -888,6 +1186,7 @@ static void guc_cancel_context_requests(struct intel_context *ce)
>  	spin_lock(&ce->guc_active.lock);
>  	list_for_each_entry(rq, &ce->guc_active.requests, sched.link)
>  		i915_request_put(i915_request_mark_eio(rq));
> +	ce->guc_num_rq_submit_no_id = 0;
>  	spin_unlock(&ce->guc_active.lock);
>  	spin_unlock_irqrestore(&sched_engine->lock, flags);
>  }
> @@ -924,11 +1223,15 @@ guc_cancel_sched_engine_requests(struct i915_sched_engine *sched_engine)
>  		struct i915_priolist *p = to_priolist(rb);
>  
>  		priolist_for_each_request_consume(rq, rn, p) {
> +			struct intel_context *ce = rq->context;
> +
>  			list_del_init(&rq->sched.link);
>  
>  			__i915_request_submit(rq);
>  
>  			i915_request_put(i915_request_mark_eio(rq));
> +
> +			ce->guc_num_rq_submit_no_id = 0;
>  		}
>  
>  		rb_erase_cached(&p->node, &sched_engine->queue);
> @@ -980,6 +1283,51 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>  	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>  }
>  
> +static void retire_worker_sched_disable(struct intel_guc *guc,
> +					struct intel_context *ce);
> +
> +static void retire_worker_func(struct work_struct *w)
> +{
> +	struct intel_guc *guc =
> +		container_of(w, struct intel_guc, retire_worker);
> +
> +	/*
> +	 * It is possible that another thread issues the schedule disable and that
> +	 * the G2H completes, moving the state machine further along to a point
> +	 * where nothing needs to be done here. Let's be paranoid and kick the
> +	 * tasklet in that case.
> +	 */
> +	if (guc->submission_stall_reason != STALL_SCHED_DISABLE &&
> +	    guc->submission_stall_reason != STALL_GUC_ID_WORKQUEUE) {
> +		kick_tasklet(guc);
> +		return;
> +	}
> +
> +	if (guc->submission_stall_reason == STALL_SCHED_DISABLE) {
> +		GEM_BUG_ON(!guc->stalled_context);
> +		GEM_BUG_ON(context_guc_id_invalid(guc->stalled_context));
> +
> +		retire_worker_sched_disable(guc, guc->stalled_context);
> +	}
> +
> +	/*
> +	 * guc_id pressure, always try to release it regardless of state,
> +	 * albeit after possibly issuing a schedule disable as that is async
> +	 * operation.
> +	 */
> +	intel_gt_retire_requests(guc_to_gt(guc));
> +
> +	if (guc->submission_stall_reason == STALL_GUC_ID_WORKQUEUE) {
> +		GEM_BUG_ON(guc->stalled_context);
> +
> +		/* Hopefully guc_ids are now available, kick tasklet */
> +		guc->submission_stall_reason = STALL_GUC_ID_TASKLET;
> +		clr_tasklet_blocked(guc);
> +
> +		kick_tasklet(guc);
> +	}
> +}
> +
>  /*
>   * Set up the memory resources to be shared with the GuC (via the GGTT)
>   * at firmware loading time.
> @@ -1003,9 +1351,12 @@ int intel_guc_submission_init(struct intel_guc *guc)
>  	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
>  
>  	spin_lock_init(&guc->contexts_lock);
> -	INIT_LIST_HEAD(&guc->guc_id_list);
> +	INIT_LIST_HEAD(&guc->guc_id_list_no_ref);
> +	INIT_LIST_HEAD(&guc->guc_id_list_unpinned);
>  	ida_init(&guc->guc_ids);
>  
> +	INIT_WORK(&guc->retire_worker, retire_worker_func);
> +
>  	return 0;
>  }
>  
> @@ -1022,10 +1373,28 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
>  				 struct i915_request *rq,
>  				 int prio)
>  {
> +	bool empty = i915_sched_engine_is_empty(sched_engine);
> +
>  	GEM_BUG_ON(!list_empty(&rq->sched.link));
>  	list_add_tail(&rq->sched.link,
>  		      i915_sched_lookup_priolist(sched_engine, prio));
>  	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> +
> +	if (empty)
> +		kick_tasklet(&rq->engine->gt->uc.guc);
> +}
> +
> +static bool need_tasklet(struct intel_guc *guc, struct intel_context *ce)
> +{
> +	struct i915_sched_engine * const sched_engine =
> +		ce->engine->sched_engine;
> +
> +	lockdep_assert_held(&sched_engine->lock);
> +
> +	return guc_ids_exhausted(guc) || submission_disabled(guc) ||
> +		guc->stalled_rq || guc->stalled_context ||
> +		!lrc_desc_registered(guc, ce->guc_id) ||
> +		!i915_sched_engine_is_empty(sched_engine);
>  }
>  
>  static int guc_bypass_tasklet_submit(struct intel_guc *guc,
> @@ -1039,8 +1408,6 @@ static int guc_bypass_tasklet_submit(struct intel_guc *guc,
>  
>  	guc_set_lrc_tail(rq);
>  	ret = guc_add_request(guc, rq);
> -	if (ret == -EBUSY)
> -		guc->stalled_request = rq;
>  
>  	if (unlikely(ret == -EPIPE))
>  		disable_submission(guc);
> @@ -1057,11 +1424,10 @@ static void guc_submit_request(struct i915_request *rq)
>  	/* Will be called from irq-context when using foreign fences. */
>  	spin_lock_irqsave(&sched_engine->lock, flags);
>  
> -	if (submission_disabled(guc) || guc->stalled_request ||
> -	    !i915_sched_engine_is_empty(sched_engine))
> +	if (need_tasklet(guc, rq->context))
>  		queue_request(sched_engine, rq, rq_prio(rq));
>  	else if (guc_bypass_tasklet_submit(guc, rq) == -EBUSY)
> -		tasklet_hi_schedule(&sched_engine->tasklet);
> +		kick_tasklet(guc);
>  
>  	spin_unlock_irqrestore(&sched_engine->lock, flags);
>  }
> @@ -1093,32 +1459,71 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>  }
>  
> -static int steal_guc_id(struct intel_guc *guc)
> +/*
> + * We have two lists for guc_ids available to steal. One list is for contexts
> + * that have a zero guc_id_ref but are still pinned (scheduling enabled, only
> + * available inside tasklet) and the other is for contexts that are not pinned
> + * but still registered (available both outside and inside tasklet). Stealing
> + * from the latter only requires a deregister H2G, while the former requires a
> + * schedule disable H2G + a deregister H2G.
> + */
> +static struct list_head *get_guc_id_list(struct intel_guc *guc,
> +					 bool unpinned)
> +{
> +	if (unpinned)
> +		return &guc->guc_id_list_unpinned;
> +	else
> +		return &guc->guc_id_list_no_ref;
> +}
> +
> +static int steal_guc_id(struct intel_guc *guc, bool unpinned)
>  {
>  	struct intel_context *ce;
>  	int guc_id;
> +	struct list_head *guc_id_list = get_guc_id_list(guc, unpinned);
>  
>  	lockdep_assert_held(&guc->contexts_lock);
>  
> -	if (!list_empty(&guc->guc_id_list)) {
> -		ce = list_first_entry(&guc->guc_id_list,
> +	if (!list_empty(guc_id_list)) {
> +		ce = list_first_entry(guc_id_list,
>  				      struct intel_context,
>  				      guc_id_link);
>  
> +		/* Ensure context getting stolen in expected state */
>  		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
>  		GEM_BUG_ON(context_guc_id_invalid(ce));
> +		GEM_BUG_ON(context_guc_id_stolen(ce));
>  
>  		list_del_init(&ce->guc_id_link);
>  		guc_id = ce->guc_id;
>  		clr_context_registered(ce);
> -		set_context_guc_id_invalid(ce);
> +
> +		/*
> +		 * If stealing from the pinned list, defer invalidating
> +		 * the guc_id until the retire workqueue processes this
> +		 * context.
> +		 */
> +		if (!unpinned) {
> +			GEM_BUG_ON(guc->stalled_context);
> +			guc->stalled_context = intel_context_get(ce);
> +			set_context_guc_id_stolen(ce);
> +		} else {
> +			set_context_guc_id_invalid(ce);
> +		}
> +
>  		return guc_id;
>  	} else {
>  		return -EAGAIN;
>  	}
>  }
>  
> -static int assign_guc_id(struct intel_guc *guc, u16 *out)
> +enum {	/* Return values for pin_guc_id / assign_guc_id */
> +	SAME_GUC_ID		= 0,
> +	NEW_GUC_ID_DISABLED	= 1,
> +	NEW_GUC_ID_ENABLED	= 2,
> +};
> +
> +static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
>  {
>  	int ret;
>  
> @@ -1126,17 +1531,33 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out)
>  
>  	ret = new_guc_id(guc);
>  	if (unlikely(ret < 0)) {
> -		ret = steal_guc_id(guc);
> -		if (ret < 0)
> -			return ret;
> +		ret = steal_guc_id(guc, true);
> +		if (ret >= 0) {
> +			*out = ret;
> +			ret = NEW_GUC_ID_DISABLED;
> +		} else if (ret < 0 && tasklet) {
> +			/*
> +			 * We only steal a guc_id from a context with scheduling
> +			 * enabled if guc_ids are exhausted and we are submitting
> +			 * from the tasklet.
> +			 */
> +			ret = steal_guc_id(guc, false);
> +			if (ret >= 0) {
> +				*out = ret;
> +				ret = NEW_GUC_ID_ENABLED;
> +			}
> +		}
> +	} else {
> +		*out = ret;
> +		ret = SAME_GUC_ID;
>  	}
>  
> -	*out = ret;
> -	return 0;
> +	return ret;
>  }
>  
>  #define PIN_GUC_ID_TRIES	4
> -static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> +static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
> +		      bool tasklet)
>  {
>  	int ret = 0;
>  	unsigned long flags, tries = PIN_GUC_ID_TRIES;
> @@ -1146,11 +1567,15 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  try_again:
>  	spin_lock_irqsave(&guc->contexts_lock, flags);
>  
> +	if (!tasklet && guc_ids_exhausted(guc)) {
> +		ret = -EAGAIN;
> +		goto out_unlock;
> +	}
> +
>  	if (context_guc_id_invalid(ce)) {
> -		ret = assign_guc_id(guc, &ce->guc_id);
> -		if (ret)
> +		ret = assign_guc_id(guc, &ce->guc_id, tasklet);
> +		if (unlikely(ret < 0))
>  			goto out_unlock;
> -		ret = 1;	/* Indidcates newly assigned guc_id */
>  	}
>  	if (!list_empty(&ce->guc_id_link))
>  		list_del_init(&ce->guc_id_link);
> @@ -1166,8 +1591,11 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  	 * attempting to retire more requests. Double the sleep period each
>  	 * subsequent pass before finally giving up. The sleep period has max of
>  	 * 100ms and minimum of 1ms.
> +	 *
> +	 * We only try this outside the tasklet; inside the tasklet we have a
> +	 * different (slower, more complex, blocking) flow control algorithm.
>  	 */
> -	if (ret == -EAGAIN && --tries) {
> +	if (ret == -EAGAIN && --tries && !tasklet) {
>  		if (PIN_GUC_ID_TRIES - tries > 1) {
>  			unsigned int timeslice_shifted =
>  				ce->engine->props.timeslice_duration_ms <<
> @@ -1184,7 +1612,9 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  	return ret;
>  }
>  
> -static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
> +static void unpin_guc_id(struct intel_guc *guc,
> +			 struct intel_context *ce,
> +			 bool unpinned)
>  {
>  	unsigned long flags;
>  
> @@ -1194,9 +1624,17 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  		return;
>  
>  	spin_lock_irqsave(&guc->contexts_lock, flags);
> -	if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id_link) &&
> -	    !atomic_read(&ce->guc_id_ref))
> -		list_add_tail(&ce->guc_id_link, &guc->guc_id_list);
> +
> +	if (!list_empty(&ce->guc_id_link))
> +		list_del_init(&ce->guc_id_link);
> +
> +	if (!context_guc_id_invalid(ce) && !context_guc_id_stolen(ce) &&
> +	    !atomic_read(&ce->guc_id_ref)) {
> +		struct list_head *head = get_guc_id_list(guc, unpinned);
> +
> +		list_add_tail(&ce->guc_id_link, head);
> +	}
> +
>  	spin_unlock_irqrestore(&guc->contexts_lock, flags);
>  }
>  
> @@ -1300,6 +1738,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>  	int ret = 0;
>  
>  	GEM_BUG_ON(!engine->mask);
> +	GEM_BUG_ON(context_guc_id_invalid(ce));
>  
>  	/*
>  	 * Ensure LRC + CT vmas are is same region as write barrier is done
> @@ -1342,6 +1781,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>  		trace_intel_context_steal_guc_id(ce);
>  		if (!loop) {
>  			set_context_wait_for_deregister_to_register(ce);
> +			set_context_block_tasklet(ce);
>  			intel_context_get(ce);
>  		} else {
>  			bool disabled;
> @@ -1369,7 +1809,14 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>  			ret = deregister_context(ce, ce->guc_id, loop);
>  		if (unlikely(ret == -EBUSY)) {
>  			clr_context_wait_for_deregister_to_register(ce);
> +			clr_context_block_tasklet(ce);
>  			intel_context_put(ce);
> +		} else if (!loop && !ret) {
> +			/*
> +			 * A context de-registration has been issued from within
> +			 * the tasklet. Need to block until it completes.
> +			 */
> +			return -EINPROGRESS;
>  		} else if (unlikely(ret == -ENODEV)) {
>  			ret = 0;	/* Will get registered later */
>  		}
> @@ -1425,7 +1872,9 @@ static void guc_context_unpin(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
>  
> -	unpin_guc_id(guc, ce);
> +	GEM_BUG_ON(context_enabled(ce));
> +
> +	unpin_guc_id(guc, ce, true);
>  	lrc_unpin(ce);
>  }
>  
> @@ -1764,6 +2213,8 @@ static void guc_context_destroy(struct kref *kref)
>  	unsigned long flags;
>  	bool disabled;
>  
> +	GEM_BUG_ON(context_guc_id_stolen(ce));
> +
>  	/*
>  	 * If the guc_id is invalid this context has been stolen and we can free
>  	 * it immediately. Also can be freed immediately if the context is not
> @@ -1925,6 +2376,9 @@ static void add_to_context(struct i915_request *rq)
>  	spin_lock(&ce->guc_active.lock);
>  	list_move_tail(&rq->sched.link, &ce->guc_active.requests);
>  
> +	if (unlikely(request_has_no_guc_id(rq)))
> +		++ce->guc_num_rq_submit_no_id;
> +
>  	if (rq->guc_prio == GUC_PRIO_INIT) {
>  		rq->guc_prio = new_guc_prio;
>  		add_context_inflight_prio(ce, rq->guc_prio);
> @@ -1966,7 +2420,12 @@ static void remove_from_context(struct i915_request *rq)
>  
>  	spin_unlock_irq(&ce->guc_active.lock);
>  
> -	atomic_dec(&ce->guc_id_ref);
> +	if (likely(!request_has_no_guc_id(rq)))
> +		atomic_dec(&ce->guc_id_ref);
> +	else
> +		--ce_to_guc(rq->context)->total_num_rq_with_no_guc_id;
> +	unpin_guc_id(ce_to_guc(ce), ce, false);
> +
>  	i915_request_notify_execute_cb_imm(rq);
>  }
>  
> @@ -2018,13 +2477,144 @@ static void guc_signal_context_fence(struct intel_context *ce)
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  }
>  
> -static bool context_needs_register(struct intel_context *ce, bool new_guc_id)
> +static void invalidate_guc_id_sched_disable(struct intel_context *ce)
> +{
> +	set_context_guc_id_invalid(ce);
> +	wmb();	/* Make sure guc_id invalidation visible first */
> +	clr_context_guc_id_stolen(ce);
> +}
> +
> +static void retire_worker_sched_disable(struct intel_guc *guc,
> +					struct intel_context *ce)
> +{
> +	unsigned long flags;
> +	bool disabled;
> +
> +	guc->stalled_context = NULL;
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	disabled = submission_disabled(guc);
> +	if (!disabled && !context_pending_disable(ce) && context_enabled(ce)) {
> +		/*
> +		 * Still enabled, issue schedule disable + configure state so
> +		 * when G2H returns tasklet is kicked.
> +		 */
> +
> +		struct intel_runtime_pm *runtime_pm =
> +			&ce->engine->gt->i915->runtime_pm;
> +		intel_wakeref_t wakeref;
> +		u16 guc_id;
> +
> +		/*
> +		 * We add +2 here as the schedule disable complete CTB handler
> +		 * calls intel_context_sched_disable_unpin (-2 to pin_count).
> +		 */
> +		GEM_BUG_ON(!atomic_read(&ce->pin_count));
> +		atomic_add(2, &ce->pin_count);
> +
> +		set_context_block_tasklet(ce);
> +		guc_id = prep_context_pending_disable(ce);
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +
> +		with_intel_runtime_pm(runtime_pm, wakeref)
> +			__guc_context_sched_disable(guc, ce, guc_id);
> +
> +		invalidate_guc_id_sched_disable(ce);
> +	} else if (!disabled && context_pending_disable(ce)) {
> +		/*
> +		 * Schedule disable in flight, set bit to kick tasklet in G2H
> +		 * handler and call it a day.
> +		 */
> +
> +		set_context_block_tasklet(ce);
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +
> +		invalidate_guc_id_sched_disable(ce);
> +	} else {
> +		/* Schedule disable is done, kick tasklet */
> +
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +
> +		invalidate_guc_id_sched_disable(ce);
> +
> +		guc->submission_stall_reason = STALL_REGISTER_CONTEXT;
> +		clr_tasklet_blocked(guc);
> +
> +		kick_tasklet(ce_to_guc(ce));
> +	}
> +
> +	intel_context_put(ce);
> +}
> +
> +static bool context_needs_lrc_desc_pin(struct intel_context *ce, bool new_guc_id)
>  {
>  	return (new_guc_id || test_bit(CONTEXT_LRCA_DIRTY, &ce->flags) ||
>  		!lrc_desc_registered(ce_to_guc(ce), ce->guc_id)) &&
>  		!submission_disabled(ce_to_guc(ce));
>  }
>  
> +static int tasklet_pin_guc_id(struct intel_guc *guc, struct i915_request *rq)
> +{
> +	struct intel_context *ce = rq->context;
> +	int ret = 0;
> +
> +	lockdep_assert_held(&guc->sched_engine->lock);
> +	GEM_BUG_ON(!ce->guc_num_rq_submit_no_id);
> +
> +	if (atomic_add_unless(&ce->guc_id_ref, ce->guc_num_rq_submit_no_id, 0))
> +		goto out;
> +
> +	ret = pin_guc_id(guc, ce, true);
> +	if (unlikely(ret < 0)) {
> +		/*
> +		 * No guc_ids available, disable the tasklet and kick the retire
> +		 * workqueue hopefully freeing up some guc_ids.
> +		 */
> +		guc->stalled_rq = rq;
> +		guc->submission_stall_reason = STALL_GUC_ID_WORKQUEUE;
> +		return ret;
> +	}
> +
> +	if (ce->guc_num_rq_submit_no_id - 1 > 0)
> +		atomic_add(ce->guc_num_rq_submit_no_id - 1,
> +			   &ce->guc_id_ref);
> +
> +	if (context_needs_lrc_desc_pin(ce, !!ret))
> +		set_context_needs_register(ce);
> +
> +	if (ret == NEW_GUC_ID_ENABLED) {
> +		guc->stalled_rq = rq;
> +		guc->submission_stall_reason = STALL_SCHED_DISABLE;
> +	}
> +
> +	clear_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
> +out:
> +	guc->total_num_rq_with_no_guc_id -= ce->guc_num_rq_submit_no_id;
> +	GEM_BUG_ON(guc->total_num_rq_with_no_guc_id < 0);
> +
> +	list_for_each_entry_reverse(rq, &ce->guc_active.requests, sched.link)
> +		if (request_has_no_guc_id(rq)) {
> +			--ce->guc_num_rq_submit_no_id;
> +			clear_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED,
> +				  &rq->fence.flags);
> +		} else if (!ce->guc_num_rq_submit_no_id) {
> +			break;
> +		}
> +
> +	GEM_BUG_ON(ce->guc_num_rq_submit_no_id);
> +
> +	/*
> +	 * When NEW_GUC_ID_ENABLED is returned it means we are stealing a guc_id
> +	 * from a context that has scheduling enabled. We have to disable
> +	 * scheduling before deregistering the context and it isn't safe to do
> +	 * in the tasklet because of lock inversion (ce->guc_state.lock must be
> +	 * acquired before guc->sched_engine->lock). To work around this
> +	 * we do the schedule disable in retire workqueue and block the tasklet
> +	 * until the schedule done G2H returns. Returning non-zero here kicks
> +	 * the workqueue.
> +	 */
> +	return (ret == NEW_GUC_ID_ENABLED) ? ret : 0;
> +}
> +
>  static int guc_request_alloc(struct i915_request *rq)
>  {
>  	struct intel_context *ce = rq->context;
> @@ -2056,6 +2646,15 @@ static int guc_request_alloc(struct i915_request *rq)
>  
>  	rq->reserved_space -= GUC_REQUEST_SIZE;
>  
> +	/*
> +	 * guc_ids are exhausted, don't allocate one here, defer to submission
> +	 * in the tasklet.
> +	 */
> +	if (test_and_update_guc_ids_exhausted(guc)) {
> +		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
> +		goto out;
> +	}
> +
>  	/*
>  	 * Call pin_guc_id here rather than in the pinning step as with
>  	 * dma_resv, contexts can be repeatedly pinned / unpinned trashing the
> @@ -2063,9 +2662,7 @@ static int guc_request_alloc(struct i915_request *rq)
>  	 * when guc_ids are being stolen due to over subscription. By the time
>  	 * this function is reached, it is guaranteed that the guc_id will be
>  	 * persistent until the generated request is retired. Thus, sealing these
> -	 * race conditions. It is still safe to fail here if guc_ids are
> -	 * exhausted and return -EAGAIN to the user indicating that they can try
> -	 * again in the future.
> +	 * race conditions.
>  	 *
>  	 * There is no need for a lock here as the timeline mutex ensures at
>  	 * most one context can be executing this code path at once. The
> @@ -2076,10 +2673,26 @@ static int guc_request_alloc(struct i915_request *rq)
>  	if (atomic_add_unless(&ce->guc_id_ref, 1, 0))
>  		goto out;
>  
> -	ret = pin_guc_id(guc, ce);	/* returns 1 if new guc_id assigned */
> -	if (unlikely(ret < 0))
> +	ret = pin_guc_id(guc, ce, false);	/* > 0 indicates new guc_id */
> +	if (unlikely(ret == -EAGAIN)) {
> +		/*
> +		 * No guc_ids available, so we force this submission and all
> +		 * future submissions to be serialized in the tasklet, sharing
> +		 * the guc_ids on a per submission basis to ensure (more) fair
> +		 * scheduling of submissions. Once the tasklet is flushed of
> +		 * submissions we return to allocating guc_ids in this function.
> +		 */
> +		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
> +		set_and_update_guc_ids_exhausted(guc);
> +
> +		return 0;
> +	} else if (unlikely(ret < 0)) {
>  		return ret;
> -	if (context_needs_register(ce, !!ret)) {
> +	}
> +
> +	GEM_BUG_ON(ret == NEW_GUC_ID_ENABLED);
> +
> +	if (context_needs_lrc_desc_pin(ce, !!ret)) {
>  		ret = guc_lrc_desc_pin(ce, true);
>  		if (unlikely(ret)) {	/* unwind */
>  			if (ret == -EPIPE) {
> @@ -2087,7 +2700,7 @@ static int guc_request_alloc(struct i915_request *rq)
>  				goto out;	/* GPU will be reset */
>  			}
>  			atomic_dec(&ce->guc_id_ref);
> -			unpin_guc_id(guc, ce);
> +			unpin_guc_id(guc, ce, true);
>  			return ret;
>  		}
>  	}
> @@ -2358,7 +2971,7 @@ static inline void guc_kernel_context_pin(struct intel_guc *guc,
>  					  struct intel_context *ce)
>  {
>  	if (context_guc_id_invalid(ce))
> -		pin_guc_id(guc, ce);
> +		pin_guc_id(guc, ce, false);
>  	guc_lrc_desc_pin(ce, true);
>  }
>  
> @@ -2625,6 +3238,16 @@ int intel_guc_deregister_done_process_msg(struct intel_guc *guc,
>  		with_intel_runtime_pm(runtime_pm, wakeref)
>  			register_context(ce, true);
>  		guc_signal_context_fence(ce);
> +		if (context_block_tasklet(ce)) {
> +			GEM_BUG_ON(guc->submission_stall_reason !=
> +				   STALL_DEREGISTER_CONTEXT);
> +
> +			clr_context_block_tasklet(ce);
> +			guc->submission_stall_reason = STALL_MOVE_LRC_TAIL;
> +			clr_tasklet_blocked(guc);
> +
> +			kick_tasklet(ce_to_guc(ce));
> +		}
>  		intel_context_put(ce);
>  	} else if (context_destroyed(ce)) {
>  		/* Context has been destroyed */
> @@ -2688,6 +3311,14 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc,
>  		guc_blocked_fence_complete(ce);
>  		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  
> +		if (context_block_tasklet(ce)) {
> +			clr_context_block_tasklet(ce);
> +			guc->submission_stall_reason = STALL_REGISTER_CONTEXT;
> +			clr_tasklet_blocked(guc);
> +
> +			kick_tasklet(ce_to_guc(ce));
> +		}
> +
>  		if (banned) {
>  			guc_cancel_context_requests(ce);
>  			intel_engine_signal_breadcrumbs(ce->engine);
> @@ -2716,10 +3347,8 @@ static void capture_error_state(struct intel_guc *guc,
>  
>  static void guc_context_replay(struct intel_context *ce)
>  {
> -	struct i915_sched_engine *sched_engine = ce->engine->sched_engine;
> -
>  	__guc_reset_context(ce, true);
> -	tasklet_hi_schedule(&sched_engine->tasklet);
> +	kick_tasklet(ce_to_guc(ce));
>  }
>  
>  static void guc_handle_context_reset(struct intel_guc *guc,
> @@ -2878,8 +3507,16 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>  		   atomic_read(&guc->outstanding_submission_g2h));
>  	drm_printf(p, "GuC Number GuC IDs: %u\n", guc->num_guc_ids);
>  	drm_printf(p, "GuC Max GuC IDs: %u\n", guc->max_guc_ids);
> -	drm_printf(p, "GuC tasklet count: %u\n\n",
> +	drm_printf(p, "GuC tasklet count: %u\n",
>  		   atomic_read(&sched_engine->tasklet.count));
> +	drm_printf(p, "GuC submit flags: 0x%04lx\n", guc->flags);
> +	drm_printf(p, "GuC total number request without guc_id: %d\n",
> +		   guc->total_num_rq_with_no_guc_id);
> +	drm_printf(p, "GuC stall reason: %d\n", guc->submission_stall_reason);
> +	drm_printf(p, "GuC stalled request: %s\n",
> +		   yesno(guc->stalled_rq));
> +	drm_printf(p, "GuC stalled context: %s\n\n",
> +		   yesno(guc->stalled_context));
>  
>  	spin_lock_irqsave(&sched_engine->lock, flags);
>  	drm_printf(p, "Requests in GuC submit tasklet:\n");
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 1bc1349ba3c2..807f76750cf4 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -139,6 +139,12 @@ enum {
>  	 * the GPU. Here we track such boost requests on a per-request basis.
>  	 */
>  	I915_FENCE_FLAG_BOOST,
> +
> +	/*
> +	 * I915_FENCE_FLAG_GUC_ID_NOT_PINNED - Set to signal the GuC submission
> +	 * tasklet that the guc_id isn't pinned.
> +	 */
> +	I915_FENCE_FLAG_GUC_ID_NOT_PINNED,
>  };
>  
>  /**
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 04/46] drm/i915/guc: Don't allow requests not ready to consume all guc_ids
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 04/46] drm/i915/guc: Don't allow requests not ready to consume all guc_ids Matthew Brost
@ 2021-08-05  8:29   ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-05  8:29 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:01PM -0700, Matthew Brost wrote:
> Add a heuristic which checks if over half of the available guc_ids are
> currently consumed by requests not ready to be submitted. If this
> heuristic trips at request creation time (the normal guc_id allocation
> point), force all submissions and guc_id allocations into the tasklet.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context_types.h |  3 ++
>  drivers/gpu/drm/i915/gt/intel_reset.c         |  9 ++++
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  1 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 53 +++++++++++++++++--
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
>  5 files changed, 65 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 8ed964ef967b..c01530d7dc67 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -188,6 +188,9 @@ struct intel_context {
>  	/* Number of rq submitted without a guc_id */
>  	u16 guc_num_rq_submit_no_id;
>  
> +	/* GuC number of requests not ready */
> +	atomic_t guc_num_rq_not_ready;

atomic_t by default is unordered. This needs some gigantic comments and
explainers on why this is totally ok and why we don't need barriers here.

I think this is a good excuse to convert all the docs here into kerneldoc.
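
Something along these lines is the shape I have in mind - purely a
sketch, and the actual justification for why unordered accesses are fine
has to come from you, I'm only guessing it's the "this is only a
heuristic" argument:

	/**
	 * @guc_num_rq_not_ready: Number of requests on this context that
	 * are not yet ready to submit.
	 *
	 * Plain (unordered) atomic_t accesses are assumed to be sufficient
	 * because the value only feeds the too_many_guc_ids_not_ready()
	 * heuristic and never a correctness decision; a stale read can only
	 * make us fall back to the tasklet a bit earlier or later.
	 */
	atomic_t guc_num_rq_not_ready;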
-Daniel

> +
>  	/*
>  	 * GuC ID link - in list when unpinned but guc_id still valid in GuC
>  	 */
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 91200c43951f..ea763138197f 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -22,6 +22,7 @@
>  #include "intel_reset.h"
>  
>  #include "uc/intel_guc.h"
> +#include "uc/intel_guc_submission.h"
>  
>  #define RESET_MAX_RETRIES 3
>  
> @@ -850,6 +851,14 @@ static void nop_submit_request(struct i915_request *request)
>  {
>  	RQ_TRACE(request, "-EIO\n");
>  
> +	/*
> +	 * XXX: Kinda ugly to check for GuC submission here but this function is
> +	 * going away once we switch to the DRM scheduler so we can live with
> +	 * this for now.
> +	 */
> +	if (intel_engine_uses_guc(request->engine))
> +		intel_guc_decr_num_rq_not_ready(request->context);
> +
>  	request = i915_request_mark_eio(request);
>  	if (request) {
>  		i915_request_submit(request);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index e76579396efd..917352c9f323 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -76,6 +76,7 @@ struct intel_guc {
>  	struct ida guc_ids;
>  	u32 num_guc_ids;
>  	u32 max_guc_ids;
> +	atomic_t num_guc_ids_not_ready;
>  	struct list_head guc_id_list_no_ref;
>  	struct list_head guc_id_list_unpinned;
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f42a707f60ca..ba750fc87af1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1384,6 +1384,41 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
>  		kick_tasklet(&rq->engine->gt->uc.guc);
>  }
>  
> +/* Macro to tweak heuristic, using a simple over 50% not ready for now */
> +#define TOO_MANY_GUC_IDS_NOT_READY(avail, consumed) \
> +	((consumed) > (avail) / 2)
> +static bool too_many_guc_ids_not_ready(struct intel_guc *guc,
> +				       struct intel_context *ce)
> +{
> +	u32 available_guc_ids, guc_ids_consumed;
> +
> +	available_guc_ids = guc->num_guc_ids;
> +	guc_ids_consumed = atomic_read(&guc->num_guc_ids_not_ready);
> +
> +	if (TOO_MANY_GUC_IDS_NOT_READY(available_guc_ids, guc_ids_consumed)) {
> +		set_and_update_guc_ids_exhausted(guc);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void incr_num_rq_not_ready(struct intel_context *ce)
> +{
> +	struct intel_guc *guc = ce_to_guc(ce);
> +
> +	if (!atomic_fetch_add(1, &ce->guc_num_rq_not_ready))
> +		atomic_inc(&guc->num_guc_ids_not_ready);
> +}
> +
> +void intel_guc_decr_num_rq_not_ready(struct intel_context *ce)
> +{
> +	struct intel_guc *guc = ce_to_guc(ce);
> +
> +	if (atomic_fetch_add(-1, &ce->guc_num_rq_not_ready) == 1)
> +		atomic_dec(&guc->num_guc_ids_not_ready);
> +}
> +
>  static bool need_tasklet(struct intel_guc *guc, struct intel_context *ce)
>  {
>  	struct i915_sched_engine * const sched_engine =
> @@ -1430,6 +1465,8 @@ static void guc_submit_request(struct i915_request *rq)
>  		kick_tasklet(guc);
>  
>  	spin_unlock_irqrestore(&sched_engine->lock, flags);
> +
> +	intel_guc_decr_num_rq_not_ready(rq->context);
>  }
>  
>  static int new_guc_id(struct intel_guc *guc)
> @@ -2647,10 +2684,13 @@ static int guc_request_alloc(struct i915_request *rq)
>  	rq->reserved_space -= GUC_REQUEST_SIZE;
>  
>  	/*
> -	 * guc_ids are exhausted, don't allocate one here, defer to submission
> -	 * in the tasklet.
> +	 * guc_ids are exhausted or a heuristic is met indicating too many
> +	 * guc_ids are waiting on requests with submission dependencies (not
> +	 * ready to submit). Don't allocate one here, defer to submission in the
> +	 * tasklet.
>  	 */
> -	if (test_and_update_guc_ids_exhausted(guc)) {
> +	if (test_and_update_guc_ids_exhausted(guc) ||
> +	    too_many_guc_ids_not_ready(guc, ce)) {
>  		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
>  		goto out;
>  	}
> @@ -2684,6 +2724,7 @@ static int guc_request_alloc(struct i915_request *rq)
>  		 */
>  		set_bit(I915_FENCE_FLAG_GUC_ID_NOT_PINNED, &rq->fence.flags);
>  		set_and_update_guc_ids_exhausted(guc);
> +		incr_num_rq_not_ready(ce);
>  
>  		return 0;
>  	} else if (unlikely(ret < 0)) {
> @@ -2708,6 +2749,8 @@ static int guc_request_alloc(struct i915_request *rq)
>  	clear_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
>  
>  out:
> +	incr_num_rq_not_ready(ce);
> +
>  	/*
>  	 * We block all requests on this context if a G2H is pending for a
>  	 * schedule disable or context deregistration as the GuC will fail a
> @@ -3512,6 +3555,8 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>  	drm_printf(p, "GuC submit flags: 0x%04lx\n", guc->flags);
>  	drm_printf(p, "GuC total number request without guc_id: %d\n",
>  		   guc->total_num_rq_with_no_guc_id);
> +	drm_printf(p, "GuC Number GuC IDs not ready: %d\n",
> +		   atomic_read(&guc->num_guc_ids_not_ready));
>  	drm_printf(p, "GuC stall reason: %d\n", guc->submission_stall_reason);
>  	drm_printf(p, "GuC stalled request: %s\n",
>  		   yesno(guc->stalled_rq));
> @@ -3567,6 +3612,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>  			   atomic_read(&ce->pin_count));
>  		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
>  			   atomic_read(&ce->guc_id_ref));
> +		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> +			   atomic_read(&ce->guc_num_rq_not_ready));
>  		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
>  			   ce->guc_state.sched_state,
>  			   atomic_read(&ce->guc_sched_state_no_lock));
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> index c7ef44fa0c36..17af5e123b09 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> @@ -51,4 +51,6 @@ static inline bool intel_guc_submission_is_used(struct intel_guc *guc)
>  	return intel_guc_is_used(guc) && intel_guc_submission_is_wanted(guc);
>  }
>  
> +void intel_guc_decr_num_rq_not_ready(struct intel_context *ce);
> +
>  #endif
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 10/46] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 10/46] drm/i915/guc: Take engine PM when a context is pinned with GuC submission Matthew Brost
@ 2021-08-09 14:23   ` Daniel Vetter
  2021-08-09 18:11     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 14:23 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:07PM -0700, Matthew Brost wrote:
> Take a PM reference to prevent intel_gt_wait_for_idle from short
> circuiting while scheduling of a user context could be enabled.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/Makefile                 |  1 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
>  2 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 903de270f2db..5e3a1e2095b0 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -103,6 +103,7 @@ gt-y += \
>  	gt/intel_gt_clock_utils.o \
>  	gt/intel_gt_irq.o \
>  	gt/intel_gt_pm.o \
> +	gt/intel_gt_pm_unpark_work.o \

This file isn't here?

Also, pm stuff tends to have very nasty locking requirements, and doing
special stuff like this in the backend tends to lead to really big
surprises. I see two options to make sure our locking design stays
consistent:
- Lift this to generic code.
- Expose some engine_pm_might_get/put() calls which do have the right set
  of might_lock annotations (rough sketch below), and call those in the
  generic code.

Imo the worst kernel abstractions are those where all implementations
look&act the same, except for locking. Unfortunately i915-gem code is full
of this stuff, and we need to stop this by enlisting lockdep to check the
contracts for us.
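
For the second option I'm thinking of roughly the below - untested
sketch, and it assumes the engine wakeref exposes a mutex (called
wakeref.mutex here) that the real intel_engine_pm_get() path takes on the
0 -> 1 transition:

/* sketch only, e.g. in intel_engine_pm.h */
static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
{
	/*
	 * Teach lockdep about the locks the real pm get path can take, so
	 * callers are checked even when the engine happens to be awake
	 * already and the slow path is never hit in testing.
	 */
	might_lock(&engine->wakeref.mutex);
}

Then the generic pin path calls intel_engine_pm_might_get()
unconditionally, while the backend keeps the actual
intel_engine_pm_get()/put().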
-Daniel

>  	gt/intel_gt_pm_irq.o \
>  	gt/intel_gt_requests.o \
>  	gt/intel_gtt.o \
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 7fe4d1559a81..c5d9548bfd00 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2056,7 +2056,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
>  
>  static int guc_context_pin(struct intel_context *ce, void *vaddr)
>  {
> -	return __guc_context_pin(ce, ce->engine, vaddr);
> +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> +
> +	if (likely(!ret && !intel_context_is_barrier(ce)))
> +		intel_engine_pm_get(ce->engine);
> +
> +	return ret;
>  }
>  
>  static void guc_context_unpin(struct intel_context *ce)
> @@ -2067,6 +2072,9 @@ static void guc_context_unpin(struct intel_context *ce)
>  
>  	unpin_guc_id(guc, ce, true);
>  	lrc_unpin(ce);
> +
> +	if (likely(!intel_context_is_barrier(ce)))
> +		intel_engine_pm_put(ce->engine);
>  }
>  
>  static void guc_context_post_unpin(struct intel_context *ce)
> @@ -3002,8 +3010,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
>  static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
>  {
>  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> +	int ret = __guc_context_pin(ce, engine, vaddr);
> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> +
> +	if (likely(!ret))
> +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> +			intel_engine_pm_get(engine);
>  
> -	return __guc_context_pin(ce, engine, vaddr);
> +	return ret;
> +}
> +
> +static void guc_virtual_context_unpin(struct intel_context *ce)
> +{
> +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> +	struct intel_engine_cs *engine;
> +	struct intel_guc *guc = ce_to_guc(ce);
> +
> +	GEM_BUG_ON(context_enabled(ce));
> +	GEM_BUG_ON(intel_context_is_barrier(ce));
> +
> +	unpin_guc_id(guc, ce, true);
> +	lrc_unpin(ce);
> +
> +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> +		intel_engine_pm_put(engine);
>  }
>  
>  static void guc_virtual_context_enter(struct intel_context *ce)
> @@ -3040,7 +3070,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
>  
>  	.pre_pin = guc_virtual_context_pre_pin,
>  	.pin = guc_virtual_context_pin,
> -	.unpin = guc_context_unpin,
> +	.unpin = guc_virtual_context_unpin,
>  	.post_unpin = guc_context_post_unpin,
>  
>  	.ban = guc_context_ban,
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 11/46] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 11/46] drm/i915/guc: Don't call switch_to_kernel_context " Matthew Brost
@ 2021-08-09 14:27   ` Daniel Vetter
  2021-08-09 18:20     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 14:27 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:08PM -0700, Matthew Brost wrote:
> Calling switch_to_kernel_context isn't needed if the engine PM reference
> is taken while all contexts are pinned. By not calling
> switch_to_kernel_context we save on issuing a request to the engine.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_pm.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> index 1f07ac4e0672..58099de6bf07 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> @@ -162,6 +162,10 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>  	unsigned long flags;
>  	bool result = true;
>  
> +	/* No need to switch_to_kernel_context if GuC submission */

Maybe whack a big FIXME on here that we should unravel this properly.
Currently the execlists backend assumptions are leaked all over the place,
leading to stuff like this, which means extremely fragile code.

I currently don't have a great idea on how exactly we should do that, but
oh well.

btw just in case we ever want to make guc lrc properly evictable (which
was the original use-case for this function, way, way back), would we need
to fully unregister them from guc? At least I'm assuming there's no other
trick like the one below.

Another aside: How does the perf/OA patching work on GuC?

Anyway, patch looks legit:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>


> +	if (intel_engine_uses_guc(engine))
> +		return true;
> +
>  	/* GPU is pointing to the void, as good as in the kernel context. */
>  	if (intel_gt_is_wedged(engine->gt))
>  		return true;
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 13/46] drm/i915: Add logical engine mapping
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 13/46] drm/i915: Add logical engine mapping Matthew Brost
@ 2021-08-09 14:28   ` Daniel Vetter
  2021-08-09 18:28     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 14:28 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:10PM -0700, Matthew Brost wrote:
> Add logical engine mapping. This is required for split-frame, as
> workloads need to be placed on engines in a logically contiguous manner.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  1 +
>  .../drm/i915/gt/intel_execlists_submission.c  |  1 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
>  5 files changed, 56 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 0d9105a31d84..4d790f9a65dd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
>  	GEM_DEBUG_WARN_ON(iir);
>  }
>  
> -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
> +			      u8 logical_instance)
>  {
>  	const struct engine_info *info = &intel_engines[id];
>  	struct drm_i915_private *i915 = gt->i915;
> @@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
>  
>  	engine->class = info->class;
>  	engine->instance = info->instance;
> +	engine->logical_mask = BIT(logical_instance);
>  	__sprint_engine_name(engine);
>  
>  	engine->props.heartbeat_interval_ms =
> @@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
>  	return info->engine_mask;
>  }
>  
> +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
> +				 u8 class, const u8 *map, u8 num_instances)
> +{
> +	int i, j;
> +	u8 current_logical_id = 0;
> +
> +	for (j = 0; j < num_instances; ++j) {
> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> +			if (!HAS_ENGINE(gt, i) ||
> +			    intel_engines[i].class != class)
> +				continue;
> +
> +			if (intel_engines[i].instance == map[j]) {
> +				logical_ids[intel_engines[i].instance] =
> +					current_logical_id++;
> +				break;
> +			}
> +		}
> +	}
> +}
> +
> +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
> +{
> +	int i;
> +	u8 map[MAX_ENGINE_INSTANCE + 1];
> +
> +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
> +		map[i] = i;
> +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
> +}
> +
>  /**
>   * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
>   * @gt: pointer to struct intel_gt
> @@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>  	struct drm_i915_private *i915 = gt->i915;
>  	const unsigned int engine_mask = init_engine_mask(gt);
>  	unsigned int mask = 0;
> -	unsigned int i;
> +	unsigned int i, class;
> +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
>  	int err;
>  
>  	drm_WARN_ON(&i915->drm, engine_mask == 0);
> @@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
>  	if (i915_inject_probe_failure(i915))
>  		return -ENODEV;
>  
> -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
> -		if (!HAS_ENGINE(gt, i))
> -			continue;
> +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
> +		setup_logical_ids(gt, logical_ids, class);
>  
> -		err = intel_engine_setup(gt, i);
> -		if (err)
> -			goto cleanup;
> +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> +			u8 instance = intel_engines[i].instance;
> +
> +			if (intel_engines[i].class != class ||
> +			    !HAS_ENGINE(gt, i))
> +				continue;
>  
> -		mask |= BIT(i);
> +			err = intel_engine_setup(gt, i,
> +						 logical_ids[instance]);
> +			if (err)
> +				goto cleanup;
> +
> +			mask |= BIT(i);
> +		}
>  	}
>  
>  	/*
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index ed91bcff20eb..85e5c9a9e502 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -266,6 +266,7 @@ struct intel_engine_cs {
>  	unsigned int guc_id;
>  
>  	intel_engine_mask_t mask;
> +	intel_engine_mask_t logical_mask;

Kerneldoc at least for new stuff. Bonus points if you get the
struct/header file up to speed (with dummy/fixme comments if need be) so
we can include it into our overall html hierarchy.
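
Roughly something like this as a starting point (wording is just my
guess, adjust as needed):

	/**
	 * @logical_mask: logical mask of engine, i.e. a single bit set at
	 * the engine's position in the logically contiguous, per-class
	 * numbering that is reported to userspace and handed to the GuC.
	 * Can differ from @instance / @mask depending on fusing.
	 */
	intel_engine_mask_t logical_mask;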
-Daniel

>  
>  	u8 class;
>  	u8 instance;
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index de5f9c86b9a4..baa1797af1c8 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -3879,6 +3879,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>  
>  		ve->siblings[ve->num_siblings++] = sibling;
>  		ve->base.mask |= sibling->mask;
> +		ve->base.logical_mask |= sibling->logical_mask;
>  
>  		/*
>  		 * All physical engines must be compatible for their emission
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 6926919bcac6..9f5f43a16182 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
>  	for_each_engine(engine, gt, id) {
>  		u8 guc_class = engine_class_to_guc_class(engine->class);
>  
> -		system_info->mapping_table[guc_class][engine->instance] =
> +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
>  			engine->instance;
>  	}
>  }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 310116f40509..dec757d319a2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1795,23 +1795,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
>  	return __guc_action_deregister_context(guc, guc_id, loop);
>  }
>  
> -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
> -{
> -	switch (class) {
> -	case RENDER_CLASS:
> -		return mask >> RCS0;
> -	case VIDEO_ENHANCEMENT_CLASS:
> -		return mask >> VECS0;
> -	case VIDEO_DECODE_CLASS:
> -		return mask >> VCS0;
> -	case COPY_ENGINE_CLASS:
> -		return mask >> BCS0;
> -	default:
> -		MISSING_CASE(class);
> -		return 0;
> -	}
> -}
> -
>  static void guc_context_policy_init(struct intel_engine_cs *engine,
>  				    struct guc_lrc_desc *desc)
>  {
> @@ -1952,8 +1935,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>  
>  	desc = __get_lrc_desc(guc, ce->guc_lrcd_reg_idx);
>  	desc->engine_class = engine_class_to_guc_class(engine->class);
> -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> -						      engine->mask);
> +	desc->engine_submit_mask = engine->logical_mask;
>  	desc->hw_context_desc = ce->lrc.lrca;
>  	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
>  	desc->priority = ce->guc_prio;
> @@ -3978,6 +3960,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>  		}
>  
>  		ve->base.mask |= sibling->mask;
> +		ve->base.logical_mask |= sibling->logical_mask;
>  
>  		if (n != 0 && ve->base.class != sibling->class) {
>  			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 14/46] drm/i915: Expose logical engine instance to user
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 14/46] drm/i915: Expose logical engine instance to user Matthew Brost
@ 2021-08-09 14:30   ` Daniel Vetter
  2021-08-09 18:37     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 14:30 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:11PM -0700, Matthew Brost wrote:
> Expose the logical engine instance to user space via the query engine
> info IOCTL. This is required for split-frame workloads as these need to
> be placed on engines in a logically contiguous order. The logical mapping
> can change based on fusing. Rather than requiring the user to have
> knowledge of the fusing, we simply expose the logical mapping with the
> existing query engine info IOCTL.
> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Uapi must have a link to the userspace MR/patch set using this, and to the
igt patch set validating it.

Ideally in each patch, since it's unfortunately way too hard to find the
cover letter later on.

Jason even went as far as making this a hard requirement because he wasted
a bit too much time trying to find the userspace for new uapi:

https://lore.kernel.org/dri-devel/20210804185704.624883-1-jason@jlekstrand.net/

Cheers, Daniel

> ---
>  drivers/gpu/drm/i915/i915_query.c | 2 ++
>  include/uapi/drm/i915_drm.h       | 8 +++++++-
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
> index e49da36c62fb..8a72923fbdba 100644
> --- a/drivers/gpu/drm/i915/i915_query.c
> +++ b/drivers/gpu/drm/i915/i915_query.c
> @@ -124,7 +124,9 @@ query_engine_info(struct drm_i915_private *i915,
>  	for_each_uabi_engine(engine, i915) {
>  		info.engine.engine_class = engine->uabi_class;
>  		info.engine.engine_instance = engine->uabi_instance;
> +		info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;
>  		info.capabilities = engine->uabi_capabilities;
> +		info.logical_instance = ilog2(engine->logical_mask);
>  
>  		if (copy_to_user(info_ptr, &info, sizeof(info)))
>  			return -EFAULT;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f13d241417f..ef72e07fe08c 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -2706,14 +2706,20 @@ struct drm_i915_engine_info {
>  
>  	/** @flags: Engine flags. */
>  	__u64 flags;
> +#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
>  
>  	/** @capabilities: Capabilities of this engine. */
>  	__u64 capabilities;
>  #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
>  #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
>  
> +	/** @logical_instance: Logical instance of engine */
> +	__u16 logical_instance;
> +
>  	/** @rsvd1: Reserved fields. */
> -	__u64 rsvd1[4];
> +	__u16 rsvd1[3];
> +	/** @rsvd2: Reserved fields. */
> +	__u64 rsvd2[3];
>  };
>  
>  /**
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship Matthew Brost
@ 2021-08-09 14:37   ` Daniel Vetter
  2021-08-09 14:40     ` Daniel Vetter
  2021-08-09 18:44     ` Matthew Brost
  0 siblings, 2 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 14:37 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:12PM -0700, Matthew Brost wrote:
> Introduce context parent-child relationship. Once this relationship is
> created all pinning / unpinning operations are directed to the parent
> context. The parent context is responsible for pinning all of its
> children and itself.
> 
> This is a precursor to the full GuC multi-lrc implementation but aligns
> to how the GuC multi-lrc interface is defined - a single H2G is used to
> register / deregister all of the contexts simultaneously.
> 
> Subsequent patches in the series will implement the pinning / unpinning
> operations for parent / child contexts.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context.c       | 29 +++++++++++++++++++
>  drivers/gpu/drm/i915/gt/intel_context.h       | 18 ++++++++++++
>  drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++++++++
>  3 files changed, 59 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 745e84c72c90..8cb92b10b547 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -395,6 +395,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>  	spin_lock_init(&ce->guc_state.lock);
>  	INIT_LIST_HEAD(&ce->guc_state.fences);
>  
> +	INIT_LIST_HEAD(&ce->guc_child_list);
> +
>  	spin_lock_init(&ce->guc_active.lock);
>  	INIT_LIST_HEAD(&ce->guc_active.requests);
>  
> @@ -414,10 +416,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>  
>  void intel_context_fini(struct intel_context *ce)
>  {
> +	struct intel_context *child, *next;
> +
>  	if (ce->timeline)
>  		intel_timeline_put(ce->timeline);
>  	i915_vm_put(ce->vm);
>  
> +	/* Need to put the creation ref for the children */
> +	if (intel_context_is_parent(ce))
> +		for_each_child_safe(ce, child, next)
> +			intel_context_put(child);
> +
>  	mutex_destroy(&ce->pin_mutex);
>  	i915_active_fini(&ce->active);
>  }
> @@ -533,6 +542,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>  	return active;
>  }
>  
> +void intel_context_bind_parent_child(struct intel_context *parent,
> +				     struct intel_context *child)
> +{
> +	/*
> +	 * Callers responsibility to validate that this function is used
> +	 * correctly but we use GEM_BUG_ON here ensure that they do.
> +	 */
> +	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
> +	GEM_BUG_ON(intel_context_is_pinned(parent));
> +	GEM_BUG_ON(intel_context_is_child(parent));
> +	GEM_BUG_ON(intel_context_is_pinned(child));
> +	GEM_BUG_ON(intel_context_is_child(child));
> +	GEM_BUG_ON(intel_context_is_parent(child));
> +
> +	parent->guc_number_children++;
> +	list_add_tail(&child->guc_child_link,
> +		      &parent->guc_child_list);
> +	child->parent = parent;
> +}
> +
>  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>  #include "selftest_context.c"
>  #endif
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index c41098950746..ad6ce5ac4824 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -44,6 +44,24 @@ void intel_context_free(struct intel_context *ce);
>  int intel_context_reconfigure_sseu(struct intel_context *ce,
>  				   const struct intel_sseu sseu);
>  
> +static inline bool intel_context_is_child(struct intel_context *ce)
> +{
> +	return !!ce->parent;
> +}
> +
> +static inline bool intel_context_is_parent(struct intel_context *ce)
> +{
> +	return !!ce->guc_number_children;
> +}
> +
> +void intel_context_bind_parent_child(struct intel_context *parent,
> +				     struct intel_context *child);
> +
> +#define for_each_child(parent, ce)\
> +	list_for_each_entry(ce, &(parent)->guc_child_list, guc_child_link)
> +#define for_each_child_safe(parent, ce, cn)\
> +	list_for_each_entry_safe(ce, cn, &(parent)->guc_child_list, guc_child_link)
> +
>  /**
>   * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
>   * @ce - the context
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 2df79ba39867..66b22b370a72 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -202,6 +202,18 @@ struct intel_context {
>  	/* GuC context blocked fence */
>  	struct i915_sw_fence guc_blocked;
>  
> +	/* Head of children list or link in parent's children list */

Kerneldoc layout would be nice, plus explaining when exactly this is
set or the list empty (e.g. guc_child_list is non-empty if and only if
guc_number_children > 0 and parent == NULL).

Also mentioning that these are invariant over the lifetime of the object
would be nice.

Finally some words on refcounting (like who holds a reference on whom and
how we guarantee that use-after-free doesn't go boom since you have links
both ways). It looks like parent holds a reference on the child, so how do
you make sure the child looking at the parent doesn't go boom?
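
As a strawman for what I'd like to see (the refcounting story here is
just my reading of the patch, so please correct it rather than copy it):

	/**
	 * @parent: pointer to parent if child, NULL otherwise. Set once in
	 * intel_context_bind_parent_child() and invariant afterwards.
	 */
	struct intel_context *parent;

	/**
	 * @guc_number_children: number of children if parent, 0 otherwise.
	 * Non-zero if and only if @guc_child_list is non-empty. The parent
	 * holds the creation reference on each child (dropped in
	 * intel_context_fini()); what keeps @parent from dangling while a
	 * child still looks at it is exactly the question above and needs
	 * to be spelled out here.
	 */
	u8 guc_number_children;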
-Daniel

> +	union {
> +		struct list_head guc_child_list;	/* parent */
> +		struct list_head guc_child_link;	/* child */
> +	};
> +
> +	/* Pointer to parent */
> +	struct intel_context *parent;
> +
> +	/* Number of children if parent */
> +	u8 guc_number_children;
> +
>  	/*
>  	 * GuC priority management
>  	 */
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship
  2021-08-09 14:37   ` Daniel Vetter
@ 2021-08-09 14:40     ` Daniel Vetter
  2021-08-09 18:45       ` Matthew Brost
  2021-08-09 18:44     ` Matthew Brost
  1 sibling, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 14:40 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 04:37:55PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:12PM -0700, Matthew Brost wrote:
> > Introduce context parent-child relationship. Once this relationship is
> > created all pinning / unpinning operations are directed to the parent
> > context. The parent context is responsible for pinning all of its
> > children and itself.
> > 
> > This is a precursor to the full GuC multi-lrc implementation but aligns
> > to how the GuC multi-lrc interface is defined - a single H2G is used to
> > register / deregister all of the contexts simultaneously.
> > 
> > Subsequent patches in the series will implement the pinning / unpinning
> > operations for parent / child contexts.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_context.c       | 29 +++++++++++++++++++
> >  drivers/gpu/drm/i915/gt/intel_context.h       | 18 ++++++++++++
> >  drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++++++++
> >  3 files changed, 59 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 745e84c72c90..8cb92b10b547 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -395,6 +395,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >  	spin_lock_init(&ce->guc_state.lock);
> >  	INIT_LIST_HEAD(&ce->guc_state.fences);
> >  
> > +	INIT_LIST_HEAD(&ce->guc_child_list);
> > +
> >  	spin_lock_init(&ce->guc_active.lock);
> >  	INIT_LIST_HEAD(&ce->guc_active.requests);
> >  
> > @@ -414,10 +416,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >  
> >  void intel_context_fini(struct intel_context *ce)
> >  {
> > +	struct intel_context *child, *next;
> > +
> >  	if (ce->timeline)
> >  		intel_timeline_put(ce->timeline);
> >  	i915_vm_put(ce->vm);
> >  
> > +	/* Need to put the creation ref for the children */
> > +	if (intel_context_is_parent(ce))
> > +		for_each_child_safe(ce, child, next)
> > +			intel_context_put(child);
> > +
> >  	mutex_destroy(&ce->pin_mutex);
> >  	i915_active_fini(&ce->active);
> >  }
> > @@ -533,6 +542,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
> >  	return active;
> >  }
> >  
> > +void intel_context_bind_parent_child(struct intel_context *parent,
> > +				     struct intel_context *child)
> > +{
> > +	/*
> > +	 * Callers responsibility to validate that this function is used
> > +	 * correctly but we use GEM_BUG_ON here ensure that they do.
> > +	 */
> > +	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
> > +	GEM_BUG_ON(intel_context_is_pinned(parent));
> > +	GEM_BUG_ON(intel_context_is_child(parent));
> > +	GEM_BUG_ON(intel_context_is_pinned(child));
> > +	GEM_BUG_ON(intel_context_is_child(child));
> > +	GEM_BUG_ON(intel_context_is_parent(child));
> > +
> > +	parent->guc_number_children++;
> > +	list_add_tail(&child->guc_child_link,
> > +		      &parent->guc_child_list);
> > +	child->parent = parent;
> > +}
> > +
> >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> >  #include "selftest_context.c"
> >  #endif
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index c41098950746..ad6ce5ac4824 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -44,6 +44,24 @@ void intel_context_free(struct intel_context *ce);
> >  int intel_context_reconfigure_sseu(struct intel_context *ce,
> >  				   const struct intel_sseu sseu);
> >  
> > +static inline bool intel_context_is_child(struct intel_context *ce)
> > +{
> > +	return !!ce->parent;
> > +}
> > +
> > +static inline bool intel_context_is_parent(struct intel_context *ce)
> > +{
> > +	return !!ce->guc_number_children;
> > +}
> > +
> > +void intel_context_bind_parent_child(struct intel_context *parent,
> > +				     struct intel_context *child);
> > +
> > +#define for_each_child(parent, ce)\
> > +	list_for_each_entry(ce, &(parent)->guc_child_list, guc_child_link)
> > +#define for_each_child_safe(parent, ce, cn)\
> > +	list_for_each_entry_safe(ce, cn, &(parent)->guc_child_list, guc_child_link)
> > +
> >  /**
> >   * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
> >   * @ce - the context
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 2df79ba39867..66b22b370a72 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -202,6 +202,18 @@ struct intel_context {
> >  	/* GuC context blocked fence */
> >  	struct i915_sw_fence guc_blocked;
> >  
> > +	/* Head of children list or link in parent's children list */
> 
> Kerneldoc layout would be nice, plus explaining when exactly this is
> set or the list empty (e.g. guc_child_list is non-empty if and only if
> guc_number_children > 0 and parent == NULL).
> 
> Also mentioning that these are invariant over the lifetime of the object
> would be nice.
> 
> Finally some words on refcounting (like who holds a reference on whom and
> how we guarantee that use-after-free doesn't go boom since you have links
> both ways). It looks like parent holds a reference on the child, so how do
> you make sure the child looking at the parent doesn't go boom?
> -Daniel
> 
> > +	union {
> > +		struct list_head guc_child_list;	/* parent */
> > +		struct list_head guc_child_link;	/* child */
> > +	};
> > +
> > +	/* Pointer to parent */
> > +	struct intel_context *parent;
> > +
> > +	/* Number of children if parent */
> > +	u8 guc_number_children;

Another one: Can we really not afford an int here? The nasty thing about
unsigned is that wrap-around is well-defined, which is why gcc won't ever
complain about it, which hides bugs. Same for the next patch, which also
micro-optimizes a few fields to be tiny.

We generally don't have thousands of contexts hanging around, and unless
there's a reason (which should be documented) this feels like it's
squarely on the wrong side of "don't prematurely optimize".
-Daniel

> > +
> >  	/*
> >  	 * GuC priority management
> >  	 */
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions Matthew Brost
@ 2021-08-09 15:17   ` Daniel Vetter
  2021-08-09 18:58     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 15:17 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:13PM -0700, Matthew Brost wrote:
> Implement GuC parent-child context pin / unpin functions in which, if
> any context in the relationship is pinned, all the contexts are pinned.
> The parent owns most of the pinning / unpinning process and the children
> direct any pins / unpins to the parent.
> 
> The patch implements a number of currently unused functions that will be
> connected later in the series.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context.c       | 187 ++++++++++++++++--
>  drivers/gpu/drm/i915/gt/intel_context.h       |  43 +---
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
>  .../drm/i915/gt/intel_execlists_submission.c  |  25 ++-
>  drivers/gpu/drm/i915/gt/intel_lrc.c           |  26 +--
>  drivers/gpu/drm/i915/gt/intel_lrc.h           |   6 +-
>  .../gpu/drm/i915/gt/intel_ring_submission.c   |   5 +-
>  drivers/gpu/drm/i915/gt/mock_engine.c         |   4 +-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 183 +++++++++++++++--
>  9 files changed, 371 insertions(+), 112 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 8cb92b10b547..bb4c14656067 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -158,8 +158,8 @@ static void __ring_retire(struct intel_ring *ring)
>  	intel_ring_unpin(ring);
>  }
>  
> -static int intel_context_pre_pin(struct intel_context *ce,
> -				 struct i915_gem_ww_ctx *ww)
> +static int __intel_context_pre_pin(struct intel_context *ce,
> +				   struct i915_gem_ww_ctx *ww)
>  {
>  	int err;
>  
> @@ -190,7 +190,7 @@ static int intel_context_pre_pin(struct intel_context *ce,
>  	return err;
>  }
>  
> -static void intel_context_post_unpin(struct intel_context *ce)
> +static void __intel_context_post_unpin(struct intel_context *ce)
>  {
>  	if (ce->state)
>  		__context_unpin_state(ce->state);
> @@ -199,13 +199,85 @@ static void intel_context_post_unpin(struct intel_context *ce)
>  	__ring_retire(ce->ring);
>  }
>  
> -int __intel_context_do_pin_ww(struct intel_context *ce,
> -			      struct i915_gem_ww_ctx *ww)
> +static int intel_context_pre_pin(struct intel_context *ce,
> +				 struct i915_gem_ww_ctx *ww)
>  {
> -	bool handoff = false;
> -	void *vaddr;
> +	struct intel_context *child;
> +	int err, i = 0;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	for_each_child(ce, child) {
> +		err = __intel_context_pre_pin(child, ww);
> +		if (unlikely(err))
> +			goto unwind;
> +		++i;
> +	}
> +
> +	err = __intel_context_pre_pin(ce, ww);
> +	if (unlikely(err))
> +		goto unwind;
> +
> +	return 0;
> +
> +unwind:
> +	for_each_child(ce, child) {
> +		if (!i--)
> +			break;
> +		__intel_context_post_unpin(ce);
> +	}
> +
> +	return err;
> +}
> +
> +static void intel_context_post_unpin(struct intel_context *ce)
> +{
> +	struct intel_context *child;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	for_each_child(ce, child)
> +		__intel_context_post_unpin(child);
> +
> +	__intel_context_post_unpin(ce);
> +}
> +
> +static int __do_ww_lock(struct intel_context *ce,
> +			struct i915_gem_ww_ctx *ww)
> +{
> +	int err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> +
> +	if (!err && ce->ring->vma->obj)
> +		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> +	if (!err && ce->state)
> +		err = i915_gem_object_lock(ce->state->obj, ww);
> +
> +	return err;
> +}
> +
> +static int do_ww_lock(struct intel_context *ce,
> +		      struct i915_gem_ww_ctx *ww)
> +{
> +	struct intel_context *child;
>  	int err = 0;
>  
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	for_each_child(ce, child) {
> +		err = __do_ww_lock(child, ww);
> +		if (unlikely(err))
> +			return err;
> +	}
> +
> +	return __do_ww_lock(ce, ww);
> +}
> +
> +static int __intel_context_do_pin_ww(struct intel_context *ce,
> +				     struct i915_gem_ww_ctx *ww)
> +{
> +	bool handoff = false;
> +	int err;
> +
>  	if (unlikely(!test_bit(CONTEXT_ALLOC_BIT, &ce->flags))) {
>  		err = intel_context_alloc_state(ce);
>  		if (err)
> @@ -217,14 +289,11 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
>  	 * refcount for __intel_context_active(), which prevent a lock
>  	 * inversion of ce->pin_mutex vs dma_resv_lock().
>  	 */
> +	err = do_ww_lock(ce, ww);
> +	if (err)
> +		return err;
>  
> -	err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> -	if (!err && ce->ring->vma->obj)
> -		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> -	if (!err && ce->state)
> -		err = i915_gem_object_lock(ce->state->obj, ww);
> -	if (!err)
> -		err = intel_context_pre_pin(ce, ww);
> +	err = intel_context_pre_pin(ce, ww);
>  	if (err)
>  		return err;
>  
> @@ -232,7 +301,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
>  	if (err)
>  		goto err_ctx_unpin;
>  
> -	err = ce->ops->pre_pin(ce, ww, &vaddr);
> +	err = ce->ops->pre_pin(ce, ww);
>  	if (err)
>  		goto err_release;
>  
> @@ -250,7 +319,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
>  		if (unlikely(err))
>  			goto err_unlock;
>  
> -		err = ce->ops->pin(ce, vaddr);
> +		err = ce->ops->pin(ce);
>  		if (err) {
>  			intel_context_active_release(ce);
>  			goto err_unlock;
> @@ -290,7 +359,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
>  	return err;
>  }
>  
> -int __intel_context_do_pin(struct intel_context *ce)
> +static int __intel_context_do_pin(struct intel_context *ce)
>  {
>  	struct i915_gem_ww_ctx ww;
>  	int err;
> @@ -337,7 +406,7 @@ static void __intel_context_retire(struct i915_active *active)
>  		 intel_context_get_avg_runtime_ns(ce));
>  
>  	set_bit(CONTEXT_VALID_BIT, &ce->flags);
> -	intel_context_post_unpin(ce);
> +	__intel_context_post_unpin(ce);
>  	intel_context_put(ce);
>  }
>  
> @@ -562,6 +631,88 @@ void intel_context_bind_parent_child(struct intel_context *parent,
>  	child->parent = parent;
>  }
>  
> +static inline int ____intel_context_pin(struct intel_context *ce)
> +{
> +	if (likely(intel_context_pin_if_active(ce)))
> +		return 0;
> +
> +	return __intel_context_do_pin(ce);
> +}
> +
> +static inline int __intel_context_pin_ww(struct intel_context *ce,
> +					 struct i915_gem_ww_ctx *ww)
> +{
> +	if (likely(intel_context_pin_if_active(ce)))
> +		return 0;
> +
> +	return __intel_context_do_pin_ww(ce, ww);
> +}
> +
> +static inline void __intel_context_unpin(struct intel_context *ce)
> +{
> +	if (!ce->ops->sched_disable) {
> +		__intel_context_do_unpin(ce, 1);
> +	} else {
> +		/*
> +		 * Move ownership of this pin to the scheduling disable which is
> +		 * an async operation. When that operation completes the above
> +		 * intel_context_sched_disable_unpin is called potentially
> +		 * unpinning the context.
> +		 */
> +		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> +			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {

Uh man lockless algorithms.

Unless this comes:
- with essentially an academic-looking paper that describes the abstract
  model of the lockless algorithm and proves it against the linux kernel
  memory model.

- lockless stuff generally needs barriers, and those barriers must all be
  documented. This means a) a comment next to each barrier in the code b)
  pointing to its counterpart (see the toy example at the bottom) c) with
  the overall design also explained in the kerneldoc for those data
  structures.

  If you don't know where your barriers are, see above point about "it
  should look more like an academic paper in the commit message"

- hard perf data about how this is absolutely required, based on a
  real-world use-case (which then sometimes justifies a microbenchmark
  metric for the details, but it always needs to be real-world based). And
  also a thorough explainer of why the perf issue isn't fixable through
  better design. If that's not doable, just protect the state machine with
  a big dumb lock and move on.

- Also, because the current code is in such bad shape wrt lockless
  algorithms and premature optimizations: Overall complexity should go
  down (it's way too high right now), so pay down your new lockless trick
  by removing one of the existing ones that we only have because we can.

Yes this is steep, but we're way out in the woods here and need to somehow
get back.
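
To make b) concrete, the kind of pairing comments I mean, on a made-up
publish/consume toy and not on the pin_count machinery in this patch:

/* toy example only */
struct pub {
	int payload;
	int ready;
};

static void publish(struct pub *p, int v)
{
	p->payload = v;
	/*
	 * Publish the payload: pairs with the smp_load_acquire() in
	 * consume(), which guarantees the store to ->payload above is
	 * visible before ->ready reads back as 1.
	 */
	smp_store_release(&p->ready, 1);
}

static int consume(struct pub *p)
{
	/* Pairs with the smp_store_release() in publish(). */
	if (!smp_load_acquire(&p->ready))
		return -EAGAIN;
	return p->payload;
}

Every barrier names its counterpart, and the overall scheme (who
publishes, who consumes, what is protected) goes into the kerneldoc of
the struct.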
-Daniel

> +				ce->ops->sched_disable(ce);
> +				break;
> +			}
> +		}
> +	}
> +}
> +
> +/*
> + * FIXME: This is ugly, these branches are only needed for parallel contexts in
> + * GuC submission. Basically the idea is if any of the contexts, that are
> + * configured for parallel submission, are pinned all the contexts need to be
> + * pinned in order to register these contexts with the GuC. We are adding the
> + * layer here while it should probably be pushed to the backend via a vfunc. But
> + * since we already have ce->pin + a layer atop it is confusing. Definitely
> + * needs a bit of rework how to properly layer / structure this code path. What
> + * is in place works but is not ideal.
> + */
> +int intel_context_pin(struct intel_context *ce)
> +{
> +	if (intel_context_is_child(ce)) {
> +		if (!atomic_fetch_add(1, &ce->pin_count))
> +			return ____intel_context_pin(ce->parent);
> +		else
> +			return 0;
> +	} else {
> +		return ____intel_context_pin(ce);
> +	}
> +}
> +
> +int intel_context_pin_ww(struct intel_context *ce,
> +			 struct i915_gem_ww_ctx *ww)
> +{
> +	if (intel_context_is_child(ce)) {
> +		if (!atomic_fetch_add(1, &ce->pin_count))
> +			return __intel_context_pin_ww(ce->parent, ww);
> +		else
> +			return 0;
> +	} else {
> +		return __intel_context_pin_ww(ce, ww);
> +	}
> +}
> +
> +void intel_context_unpin(struct intel_context *ce)
> +{
> +	if (intel_context_is_child(ce)) {
> +		if (atomic_fetch_add(-1, &ce->pin_count) == 1)
> +			__intel_context_unpin(ce->parent);
> +	} else {
> +		__intel_context_unpin(ce);
> +	}
> +}
> +
>  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>  #include "selftest_context.c"
>  #endif
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index ad6ce5ac4824..c208691fc87d 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -110,31 +110,15 @@ static inline void intel_context_unlock_pinned(struct intel_context *ce)
>  	mutex_unlock(&ce->pin_mutex);
>  }
>  
> -int __intel_context_do_pin(struct intel_context *ce);
> -int __intel_context_do_pin_ww(struct intel_context *ce,
> -			      struct i915_gem_ww_ctx *ww);
> -
>  static inline bool intel_context_pin_if_active(struct intel_context *ce)
>  {
>  	return atomic_inc_not_zero(&ce->pin_count);
>  }
>  
> -static inline int intel_context_pin(struct intel_context *ce)
> -{
> -	if (likely(intel_context_pin_if_active(ce)))
> -		return 0;
> -
> -	return __intel_context_do_pin(ce);
> -}
> -
> -static inline int intel_context_pin_ww(struct intel_context *ce,
> -				       struct i915_gem_ww_ctx *ww)
> -{
> -	if (likely(intel_context_pin_if_active(ce)))
> -		return 0;
> +int intel_context_pin(struct intel_context *ce);
>  
> -	return __intel_context_do_pin_ww(ce, ww);
> -}
> +int intel_context_pin_ww(struct intel_context *ce,
> +			 struct i915_gem_ww_ctx *ww);
>  
>  static inline void __intel_context_pin(struct intel_context *ce)
>  {
> @@ -146,28 +130,11 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub);
>  
>  static inline void intel_context_sched_disable_unpin(struct intel_context *ce)
>  {
> +	GEM_BUG_ON(intel_context_is_child(ce));
>  	__intel_context_do_unpin(ce, 2);
>  }
>  
> -static inline void intel_context_unpin(struct intel_context *ce)
> -{
> -	if (!ce->ops->sched_disable) {
> -		__intel_context_do_unpin(ce, 1);
> -	} else {
> -		/*
> -		 * Move ownership of this pin to the scheduling disable which is
> -		 * an async operation. When that operation completes the above
> -		 * intel_context_sched_disable_unpin is called potentially
> -		 * unpinning the context.
> -		 */
> -		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> -			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> -				ce->ops->sched_disable(ce);
> -				break;
> -			}
> -		}
> -	}
> -}
> +void intel_context_unpin(struct intel_context *ce);
>  
>  void intel_context_enter_engine(struct intel_context *ce);
>  void intel_context_exit_engine(struct intel_context *ce);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 66b22b370a72..eb82be15b7a2 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -39,8 +39,8 @@ struct intel_context_ops {
>  
>  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
>  
> -	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
> -	int (*pin)(struct intel_context *ce, void *vaddr);
> +	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> +	int (*pin)(struct intel_context *ce);
>  	void (*unpin)(struct intel_context *ce);
>  	void (*post_unpin)(struct intel_context *ce);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index baa1797af1c8..fc74ca28f245 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -2554,16 +2554,17 @@ static void execlists_submit_request(struct i915_request *request)
>  static int
>  __execlists_context_pre_pin(struct intel_context *ce,
>  			    struct intel_engine_cs *engine,
> -			    struct i915_gem_ww_ctx *ww, void **vaddr)
> +			    struct i915_gem_ww_ctx *ww)
>  {
>  	int err;
>  
> -	err = lrc_pre_pin(ce, engine, ww, vaddr);
> +	err = lrc_pre_pin(ce, engine, ww);
>  	if (err)
>  		return err;
>  
>  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags)) {
> -		lrc_init_state(ce, engine, *vaddr);
> +		lrc_init_state(ce, engine, ce->lrc_reg_state -
> +			       LRC_STATE_OFFSET / sizeof(*ce->lrc_reg_state));
>  
>  		 __i915_gem_object_flush_map(ce->state->obj, 0, engine->context_size);
>  	}
> @@ -2572,15 +2573,14 @@ __execlists_context_pre_pin(struct intel_context *ce,
>  }
>  
>  static int execlists_context_pre_pin(struct intel_context *ce,
> -				     struct i915_gem_ww_ctx *ww,
> -				     void **vaddr)
> +				     struct i915_gem_ww_ctx *ww)
>  {
> -	return __execlists_context_pre_pin(ce, ce->engine, ww, vaddr);
> +	return __execlists_context_pre_pin(ce, ce->engine, ww);
>  }
>  
> -static int execlists_context_pin(struct intel_context *ce, void *vaddr)
> +static int execlists_context_pin(struct intel_context *ce)
>  {
> -	return lrc_pin(ce, ce->engine, vaddr);
> +	return lrc_pin(ce, ce->engine);
>  }
>  
>  static int execlists_context_alloc(struct intel_context *ce)
> @@ -3570,20 +3570,19 @@ static int virtual_context_alloc(struct intel_context *ce)
>  }
>  
>  static int virtual_context_pre_pin(struct intel_context *ce,
> -				   struct i915_gem_ww_ctx *ww,
> -				   void **vaddr)
> +				   struct i915_gem_ww_ctx *ww)
>  {
>  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
>  
>  	 /* Note: we must use a real engine class for setting up reg state */
> -	return __execlists_context_pre_pin(ce, ve->siblings[0], ww, vaddr);
> +	return __execlists_context_pre_pin(ce, ve->siblings[0], ww);
>  }
>  
> -static int virtual_context_pin(struct intel_context *ce, void *vaddr)
> +static int virtual_context_pin(struct intel_context *ce)
>  {
>  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
>  
> -	return lrc_pin(ce, ve->siblings[0], vaddr);
> +	return lrc_pin(ce, ve->siblings[0]);
>  }
>  
>  static void virtual_context_enter(struct intel_context *ce)
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index bb4af4977920..c466fc966005 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -947,30 +947,30 @@ void lrc_reset(struct intel_context *ce)
>  int
>  lrc_pre_pin(struct intel_context *ce,
>  	    struct intel_engine_cs *engine,
> -	    struct i915_gem_ww_ctx *ww,
> -	    void **vaddr)
> +	    struct i915_gem_ww_ctx *ww)
>  {
> +	void *vaddr;
>  	GEM_BUG_ON(!ce->state);
>  	GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
>  
> -	*vaddr = i915_gem_object_pin_map(ce->state->obj,
> -					 i915_coherent_map_type(ce->engine->i915,
> -								ce->state->obj,
> -								false) |
> -					 I915_MAP_OVERRIDE);
> +	vaddr = i915_gem_object_pin_map(ce->state->obj,
> +					i915_coherent_map_type(ce->engine->i915,
> +							       ce->state->obj,
> +							       false) |
> +					I915_MAP_OVERRIDE);
>  
> -	return PTR_ERR_OR_ZERO(*vaddr);
> +	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> +
> +	return PTR_ERR_OR_ZERO(vaddr);
>  }
>  
>  int
>  lrc_pin(struct intel_context *ce,
> -	struct intel_engine_cs *engine,
> -	void *vaddr)
> +	struct intel_engine_cs *engine)
>  {
> -	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> -
>  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags))
> -		lrc_init_state(ce, engine, vaddr);
> +		lrc_init_state(ce, engine,
> +			       (void *)ce->lrc_reg_state - LRC_STATE_OFFSET);
>  
>  	ce->lrc.lrca = lrc_update_regs(ce, engine, ce->ring->tail);
>  	return 0;
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h b/drivers/gpu/drm/i915/gt/intel_lrc.h
> index 7f697845c4cf..837fcf00270d 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
> @@ -38,12 +38,10 @@ void lrc_destroy(struct kref *kref);
>  int
>  lrc_pre_pin(struct intel_context *ce,
>  	    struct intel_engine_cs *engine,
> -	    struct i915_gem_ww_ctx *ww,
> -	    void **vaddr);
> +	    struct i915_gem_ww_ctx *ww);
>  int
>  lrc_pin(struct intel_context *ce,
> -	struct intel_engine_cs *engine,
> -	void *vaddr);
> +	struct intel_engine_cs *engine);
>  void lrc_unpin(struct intel_context *ce);
>  void lrc_post_unpin(struct intel_context *ce);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> index 2958e2fae380..f4f301bfb9f7 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> @@ -472,8 +472,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
>  }
>  
>  static int ring_context_pre_pin(struct intel_context *ce,
> -				struct i915_gem_ww_ctx *ww,
> -				void **unused)
> +				struct i915_gem_ww_ctx *ww)
>  {
>  	struct i915_address_space *vm;
>  	int err = 0;
> @@ -576,7 +575,7 @@ static int ring_context_alloc(struct intel_context *ce)
>  	return 0;
>  }
>  
> -static int ring_context_pin(struct intel_context *ce, void *unused)
> +static int ring_context_pin(struct intel_context *ce)
>  {
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
> index 2c1af030310c..826b5d7a4573 100644
> --- a/drivers/gpu/drm/i915/gt/mock_engine.c
> +++ b/drivers/gpu/drm/i915/gt/mock_engine.c
> @@ -167,12 +167,12 @@ static int mock_context_alloc(struct intel_context *ce)
>  }
>  
>  static int mock_context_pre_pin(struct intel_context *ce,
> -				struct i915_gem_ww_ctx *ww, void **unused)
> +				struct i915_gem_ww_ctx *ww)
>  {
>  	return 0;
>  }
>  
> -static int mock_context_pin(struct intel_context *ce, void *unused)
> +static int mock_context_pin(struct intel_context *ce)
>  {
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index dec757d319a2..c5c73c42bcf7 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1905,6 +1905,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>  
>  	GEM_BUG_ON(!engine->mask);
>  	GEM_BUG_ON(context_guc_id_invalid(ce));
> +	GEM_BUG_ON(intel_context_is_child(ce));
>  
>  	/*
>  	 * Ensure LRC + CT vmas are in the same region as write barrier is done
> @@ -2008,15 +2009,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
>  
>  static int __guc_context_pre_pin(struct intel_context *ce,
>  				 struct intel_engine_cs *engine,
> -				 struct i915_gem_ww_ctx *ww,
> -				 void **vaddr)
> +				 struct i915_gem_ww_ctx *ww)
>  {
> -	return lrc_pre_pin(ce, engine, ww, vaddr);
> +	return lrc_pre_pin(ce, engine, ww);
>  }
>  
>  static int __guc_context_pin(struct intel_context *ce,
> -			     struct intel_engine_cs *engine,
> -			     void *vaddr)
> +			     struct intel_engine_cs *engine)
>  {
>  	if (i915_ggtt_offset(ce->state) !=
>  	    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK))
> @@ -2027,20 +2026,33 @@ static int __guc_context_pin(struct intel_context *ce,
>  	 * explanation of why.
>  	 */
>  
> -	return lrc_pin(ce, engine, vaddr);
> +	return lrc_pin(ce, engine);
> +}
> +
> +static void __guc_context_unpin(struct intel_context *ce)
> +{
> +	lrc_unpin(ce);
> +}
> +
> +static void __guc_context_post_unpin(struct intel_context *ce)
> +{
> +	lrc_post_unpin(ce);
>  }
>  
>  static int guc_context_pre_pin(struct intel_context *ce,
> -			       struct i915_gem_ww_ctx *ww,
> -			       void **vaddr)
> +			       struct i915_gem_ww_ctx *ww)
>  {
> -	return __guc_context_pre_pin(ce, ce->engine, ww, vaddr);
> +	return __guc_context_pre_pin(ce, ce->engine, ww);
>  }
>  
> -static int guc_context_pin(struct intel_context *ce, void *vaddr)
> +static int guc_context_pin(struct intel_context *ce)
>  {
> -	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> +	int ret;
>  
> +	GEM_BUG_ON(intel_context_is_parent(ce) ||
> +		   intel_context_is_child(ce));
> +
> +	ret = __guc_context_pin(ce, ce->engine);
>  	if (likely(!ret && !intel_context_is_barrier(ce)))
>  		intel_engine_pm_get(ce->engine);
>  
> @@ -2054,7 +2066,7 @@ static void guc_context_unpin(struct intel_context *ce)
>  	GEM_BUG_ON(context_enabled(ce));
>  
>  	unpin_guc_id(guc, ce, true);
> -	lrc_unpin(ce);
> +	__guc_context_unpin(ce);
>  
>  	if (likely(!intel_context_is_barrier(ce)))
>  		intel_engine_pm_put(ce->engine);
> @@ -2062,7 +2074,141 @@ static void guc_context_unpin(struct intel_context *ce)
>  
>  static void guc_context_post_unpin(struct intel_context *ce)
>  {
> -	lrc_post_unpin(ce);
> +	__guc_context_post_unpin(ce);
> +}
> +
> +/* Future patches will use this function */
> +__maybe_unused
> +static int guc_parent_context_pre_pin(struct intel_context *ce,
> +				      struct i915_gem_ww_ctx *ww)
> +{
> +	struct intel_context *child;
> +	int err, i = 0, j = 0;
> +
> +	for_each_child(ce, child) {
> +		err = i915_active_acquire(&child->active);
> +		if (unlikely(err))
> +			goto unwind_active;
> +		++i;
> +	}
> +
> +	for_each_child(ce, child) {
> +		err = __guc_context_pre_pin(child, child->engine, ww);
> +		if (unlikely(err))
> +			goto unwind_pre_pin;
> +		++j;
> +	}
> +
> +	err = __guc_context_pre_pin(ce, ce->engine, ww);
> +	if (unlikely(err))
> +		goto unwind_pre_pin;
> +
> +	return 0;
> +
> +unwind_pre_pin:
> +	for_each_child(ce, child) {
> +		if (!j--)
> +			break;
> +		__guc_context_post_unpin(child);
> +	}
> +
> +unwind_active:
> +	for_each_child(ce, child) {
> +		if (!i--)
> +			break;
> +		i915_active_release(&child->active);
> +	}
> +
> +	return err;
> +}
> +
> +/* Future patches will use this function */
> +__maybe_unused
> +static void guc_parent_context_post_unpin(struct intel_context *ce)
> +{
> +	struct intel_context *child;
> +
> +	for_each_child(ce, child)
> +		__guc_context_post_unpin(child);
> +	__guc_context_post_unpin(ce);
> +
> +	for_each_child(ce, child) {
> +		intel_context_get(child);
> +		i915_active_release(&child->active);
> +		intel_context_put(child);
> +	}
> +}
> +
> +/* Future patches will use this function */
> +__maybe_unused
> +static int guc_parent_context_pin(struct intel_context *ce)
> +{
> +	int ret, i = 0, j = 0;
> +	struct intel_context *child;
> +	struct intel_engine_cs *engine;
> +	intel_engine_mask_t tmp;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +
> +	for_each_child(ce, child) {
> +		ret = __guc_context_pin(child, child->engine);
> +		if (unlikely(ret))
> +			goto unwind_pin;
> +		++i;
> +	}
> +	ret = __guc_context_pin(ce, ce->engine);
> +	if (unlikely(ret))
> +		goto unwind_pin;
> +
> +	for_each_child(ce, child)
> +		if (test_bit(CONTEXT_LRCA_DIRTY, &child->flags)) {
> +			set_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
> +			break;
> +		}
> +
> +	for_each_engine_masked(engine, ce->engine->gt,
> +			       ce->engine->mask, tmp)
> +		intel_engine_pm_get(engine);
> +	for_each_child(ce, child)
> +		for_each_engine_masked(engine, child->engine->gt,
> +				       child->engine->mask, tmp)
> +			intel_engine_pm_get(engine);
> +
> +	return 0;
> +
> +unwind_pin:
> +	for_each_child(ce, child) {
> +		if (++j > i)
> +			break;
> +		__guc_context_unpin(child);
> +	}
> +
> +	return ret;
> +}
> +
> +/* Future patches will use this function */
> +__maybe_unused
> +static void guc_parent_context_unpin(struct intel_context *ce)
> +{
> +	struct intel_context *child;
> +	struct intel_engine_cs *engine;
> +	intel_engine_mask_t tmp;
> +
> +	GEM_BUG_ON(!intel_context_is_parent(ce));
> +	GEM_BUG_ON(context_enabled(ce));
> +
> +	unpin_guc_id(ce_to_guc(ce), ce, true);
> +	for_each_child(ce, child)
> +		__guc_context_unpin(child);
> +	__guc_context_unpin(ce);
> +
> +	for_each_engine_masked(engine, ce->engine->gt,
> +			       ce->engine->mask, tmp)
> +		intel_engine_pm_put(engine);
> +	for_each_child(ce, child)
> +		for_each_engine_masked(engine, child->engine->gt,
> +				       child->engine->mask, tmp)
> +			intel_engine_pm_put(engine);
>  }
>  
>  static void __guc_context_sched_enable(struct intel_guc *guc,
> @@ -2993,18 +3139,17 @@ static int guc_request_alloc(struct i915_request *rq)
>  }
>  
>  static int guc_virtual_context_pre_pin(struct intel_context *ce,
> -				       struct i915_gem_ww_ctx *ww,
> -				       void **vaddr)
> +				       struct i915_gem_ww_ctx *ww)
>  {
>  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
>  
> -	return __guc_context_pre_pin(ce, engine, ww, vaddr);
> +	return __guc_context_pre_pin(ce, engine, ww);
>  }
>  
> -static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> +static int guc_virtual_context_pin(struct intel_context *ce)
>  {
>  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> -	int ret = __guc_context_pin(ce, engine, vaddr);
> +	int ret = __guc_context_pin(ce, engine);
>  	intel_engine_mask_t tmp, mask = ce->engine->mask;
>  
>  	if (likely(!ret))
> @@ -3024,7 +3169,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
>  	GEM_BUG_ON(intel_context_is_barrier(ce));
>  
>  	unpin_guc_id(guc, ce, true);
> -	lrc_unpin(ce);
> +	__guc_context_unpin(ce);
>  
>  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
>  		intel_engine_pm_put(engine);
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 19/46] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 19/46] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids Matthew Brost
@ 2021-08-09 15:31   ` Daniel Vetter
  2021-08-09 19:03     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 15:31 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:16PM -0700, Matthew Brost wrote:
> Assign contexts in parent-child relationship consecutive guc_ids. This
> is accomplished by partitioning guc_id space between ones that need to
> be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
> available guc_ids). The consecutive search is implemented via the bitmap
> API.
> 
> This is a precursor to the full GuC multi-lrc implementation but aligns
> to how the GuC multi-lrc interface is defined - guc_ids must be consecutive
> when using the GuC multi-lrc interface.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context.h       |   6 +
>  drivers/gpu/drm/i915/gt/intel_reset.c         |   3 +-
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 222 ++++++++++++------
>  .../i915/gt/uc/intel_guc_submission_types.h   |  10 +
>  5 files changed, 179 insertions(+), 69 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index c208691fc87d..7ce3b3d2edb7 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -54,6 +54,12 @@ static inline bool intel_context_is_parent(struct intel_context *ce)
>  	return !!ce->guc_number_children;
>  }
>  
> +static inline struct intel_context *
> +intel_context_to_parent(struct intel_context *ce)
> +{
> +	return intel_context_is_child(ce) ? ce->parent : ce;
> +}
> +
>  void intel_context_bind_parent_child(struct intel_context *parent,
>  				     struct intel_context *child);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index ea763138197f..c3d4baa1b2b8 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -849,6 +849,7 @@ static void reset_finish(struct intel_gt *gt, intel_engine_mask_t awake)
>  
>  static void nop_submit_request(struct i915_request *request)
>  {
> +	struct intel_context *ce = intel_context_to_parent(request->context);
>  	RQ_TRACE(request, "-EIO\n");
>  
>  	/*
> @@ -857,7 +858,7 @@ static void nop_submit_request(struct i915_request *request)
>  	 * this for now.
>  	 */
>  	if (intel_engine_uses_guc(request->engine))
> -		intel_guc_decr_num_rq_not_ready(request->context);
> +		intel_guc_decr_num_rq_not_ready(ce);
>  
>  	request = i915_request_mark_eio(request);
>  	if (request) {
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index c0c60ccabfa4..30a0f364db8f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -24,6 +24,7 @@ struct __guc_ads_blob;
>  
>  enum {
>  	GUC_SUBMIT_ENGINE_SINGLE_LRC,
> +	GUC_SUBMIT_ENGINE_MULTI_LRC,
>  	GUC_SUBMIT_ENGINE_MAX
>  };
>  
> @@ -59,8 +60,10 @@ struct intel_guc {
>  	struct ida guc_ids;
>  	u32 num_guc_ids;
>  	u32 max_guc_ids;
> -	struct list_head guc_id_list_no_ref;
> -	struct list_head guc_id_list_unpinned;
> +	unsigned long *guc_ids_bitmap;
> +#define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
> +	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
> +	struct list_head guc_id_list_unpinned[MAX_GUC_ID_ORDER + 1];

Random new global lists definitely need kerneldoc about what is on them,
how they're linked, what their lifetime rules are and what locks we're
holding.

Leaving this all to reviewers to figure out, and worse, to future readers of
your code, is not kind.
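
E.g. a rough sketch along these lines (this is only my reading of the patch,
so the exact semantics may well be off, but something next to the fields
would already help a lot):

	/**
	 * @guc_id_list_no_ref: lists (one per power-of-two block size) of
	 *	contexts that still own a registered guc_id but have dropped
	 *	all references to it; candidates for stealing. Linked via
	 *	intel_context.guc_id_link, protected by guc->contexts_lock.
	 * @guc_id_list_unpinned: same indexing and locking, but for contexts
	 *	whose guc_id can be stolen immediately, without the deferred
	 *	schedule disable / deregister dance.
	 */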

>  	spinlock_t destroy_lock;	/* protects list / worker */
>  	struct list_head destroyed_contexts;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f23dd716723f..afb9b4bb8971 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -169,6 +169,15 @@ static void clr_guc_ids_exhausted(struct guc_submit_engine *gse)
>  	clear_bit(GSE_STATE_GUC_IDS_EXHAUSTED, &gse->flags);
>  }
>  
> +/*
> + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous

I think it'd be good to put down the reason here for why. Is this a
requirement of the guc interface, or just an artifact of our current
implementation? In the latter case also explain what exactly the
constraint is (but honestly I can't think of many reasons for that)
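
Just to make the split concrete (assuming the usual ~64k num_guc_ids, so
purely as an illustration): 65535 / 16 = 4095 guc_ids go to the multi-lrc
bitmap allocator and the remaining ~61440 stay with the ida; the "minimum
of 32" fallback in NUMBER_MULTI_LRC_GUC_ID only matters once num_guc_ids has
been shrunk a lot, e.g. through the debugfs override earlier in the series.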
-Daniel

> + * and a different allocation algorithm is used (bitmap vs. ida). We believe the
> + * number of multi-lrc contexts in use should be low and 1/16 should be
> + * sufficient. Minimum of 32 ids for multi-lrc.
> + */
> +#define NUMBER_MULTI_LRC_GUC_ID(guc) \
> +	((guc)->num_guc_ids / 16 > 32 ? (guc)->num_guc_ids / 16 : 32)
> +
>  /*
>   * Below is a set of functions which control the GuC scheduling state which do
>   * not require a lock as all state transitions are mutually exclusive. i.e. It
> @@ -405,16 +414,10 @@ static inline void decr_context_blocked(struct intel_context *ce)
>  	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
>  }
>  
> -static inline struct intel_context *
> -to_parent(struct intel_context *ce)
> -{
> -	return intel_context_is_child(ce) ? ce->parent : ce;
> -}
> -
>  static inline struct intel_context *
>  request_to_scheduling_context(struct i915_request *rq)
>  {
> -	return to_parent(rq->context);
> +	return intel_context_to_parent(rq->context);
>  }
>  
>  static inline bool context_guc_id_invalid(struct intel_context *ce)
> @@ -1436,7 +1439,7 @@ static void destroy_worker_func(struct work_struct *w);
>   */
>  int intel_guc_submission_init(struct intel_guc *guc)
>  {
> -	int ret;
> +	int ret, i;
>  
>  	if (guc_submission_initialized(guc))
>  		return 0;
> @@ -1448,9 +1451,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
>  	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
>  
>  	spin_lock_init(&guc->contexts_lock);
> -	INIT_LIST_HEAD(&guc->guc_id_list_no_ref);
> -	INIT_LIST_HEAD(&guc->guc_id_list_unpinned);
> +	for (i = 0; i < MAX_GUC_ID_ORDER + 1; ++i) {
> +		INIT_LIST_HEAD(&guc->guc_id_list_no_ref[i]);
> +		INIT_LIST_HEAD(&guc->guc_id_list_unpinned[i]);
> +	}
>  	ida_init(&guc->guc_ids);
> +	guc->guc_ids_bitmap =
> +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
>  
>  	spin_lock_init(&guc->destroy_lock);
>  
> @@ -1476,6 +1483,8 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>  
>  		i915_sched_engine_put(sched_engine);
>  	}
> +
> +	bitmap_free(guc->guc_ids_bitmap);
>  }
>  
>  static inline void queue_request(struct i915_sched_engine *sched_engine,
> @@ -1499,11 +1508,13 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
>  static bool too_many_guc_ids_not_ready(struct guc_submit_engine *gse,
>  				       struct intel_context *ce)
>  {
> -	u32 available_guc_ids, guc_ids_consumed;
>  	struct intel_guc *guc = gse->sched_engine.private_data;
> +	u32 available_guc_ids = intel_context_is_parent(ce) ?
> +		NUMBER_MULTI_LRC_GUC_ID(guc) :
> +		guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
> +	u32 guc_ids_consumed = atomic_read(&gse->num_guc_ids_not_ready);
>  
> -	available_guc_ids = guc->num_guc_ids;
> -	guc_ids_consumed = atomic_read(&gse->num_guc_ids_not_ready);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>  
>  	if (TOO_MANY_GUC_IDS_NOT_READY(available_guc_ids, guc_ids_consumed)) {
>  		set_and_update_guc_ids_exhausted(gse);
> @@ -1517,17 +1528,26 @@ static void incr_num_rq_not_ready(struct intel_context *ce)
>  {
>  	struct guc_submit_engine *gse = ce_to_gse(ce);
>  
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(!intel_context_is_parent(ce) &&
> +		   ce->guc_number_children);
> +
>  	if (!atomic_fetch_add(1, &ce->guc_num_rq_not_ready))
> -		atomic_inc(&gse->num_guc_ids_not_ready);
> +		atomic_add(ce->guc_number_children + 1,
> +			   &gse->num_guc_ids_not_ready);
>  }
>  
>  void intel_guc_decr_num_rq_not_ready(struct intel_context *ce)
>  {
>  	struct guc_submit_engine *gse = ce_to_gse(ce);
>  
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>  	if (atomic_fetch_add(-1, &ce->guc_num_rq_not_ready) == 1) {
>  		GEM_BUG_ON(!atomic_read(&gse->num_guc_ids_not_ready));
> -		atomic_dec(&gse->num_guc_ids_not_ready);
> +
> +		atomic_sub(ce->guc_number_children + 1,
> +			   &gse->num_guc_ids_not_ready);
>  	}
>  }
>  
> @@ -1579,20 +1599,42 @@ static void guc_submit_request(struct i915_request *rq)
>  
>  	spin_unlock_irqrestore(&sched_engine->lock, flags);
>  
> -	intel_guc_decr_num_rq_not_ready(rq->context);
> +	intel_guc_decr_num_rq_not_ready(request_to_scheduling_context(rq));
>  }
>  
> -static int new_guc_id(struct intel_guc *guc)
> +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  {
> -	return ida_simple_get(&guc->guc_ids, 0,
> -			      guc->num_guc_ids, GFP_KERNEL |
> -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +	int ret;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (intel_context_is_parent(ce))
> +		ret = bitmap_find_free_region(guc->guc_ids_bitmap,
> +					      NUMBER_MULTI_LRC_GUC_ID(guc),
> +					      order_base_2(ce->guc_number_children
> +							   + 1));
> +	else
> +		ret = ida_simple_get(&guc->guc_ids,
> +				     NUMBER_MULTI_LRC_GUC_ID(guc),
> +				     guc->num_guc_ids, GFP_KERNEL |
> +				     __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +	if (unlikely(ret < 0))
> +		return ret;
> +
> +	ce->guc_id = ret;
> +	return 0;
>  }
>  
>  static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  {
> +	GEM_BUG_ON(intel_context_is_child(ce));
>  	if (!context_guc_id_invalid(ce)) {
> -		ida_simple_remove(&guc->guc_ids, ce->guc_id);
> +		if (intel_context_is_parent(ce))
> +			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
> +					      order_base_2(ce->guc_number_children
> +							   + 1));
> +		else
> +			ida_simple_remove(&guc->guc_ids, ce->guc_id);
>  		clr_lrc_desc_registered(guc, ce->guc_id);
>  		set_context_guc_id_invalid(ce);
>  	}
> @@ -1604,6 +1646,8 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  {
>  	unsigned long flags;
>  
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
>  	spin_lock_irqsave(&guc->contexts_lock, flags);
>  	__release_guc_id(guc, ce);
>  	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> @@ -1618,54 +1662,93 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   * schedule disable H2G + a deregister H2G.
>   */
>  static struct list_head *get_guc_id_list(struct intel_guc *guc,
> +					 u8 number_children,
>  					 bool unpinned)
>  {
> +	GEM_BUG_ON(order_base_2(number_children + 1) > MAX_GUC_ID_ORDER);
> +
>  	if (unpinned)
> -		return &guc->guc_id_list_unpinned;
> +		return &guc->guc_id_list_unpinned[order_base_2(number_children + 1)];
>  	else
> -		return &guc->guc_id_list_no_ref;
> +		return &guc->guc_id_list_no_ref[order_base_2(number_children + 1)];
>  }
>  
> -static int steal_guc_id(struct intel_guc *guc, bool unpinned)
> +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
> +			bool unpinned)
>  {
> -	struct intel_context *ce;
> -	int guc_id;
> -	struct list_head *guc_id_list = get_guc_id_list(guc, unpinned);
> +	struct intel_context *cn;
> +	u8 number_children = ce->guc_number_children;
>  
>  	lockdep_assert_held(&guc->contexts_lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>  
> -	if (!list_empty(guc_id_list)) {
> -		ce = list_first_entry(guc_id_list,
> -				      struct intel_context,
> -				      guc_id_link);
> +	do {
> +		struct list_head *guc_id_list =
> +			get_guc_id_list(guc, number_children, unpinned);
>  
> -		/* Ensure context getting stolen in expected state */
> -		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
> -		GEM_BUG_ON(context_guc_id_invalid(ce));
> -		GEM_BUG_ON(context_guc_id_stolen(ce));
> +		if (!list_empty(guc_id_list)) {
> +			u8 cn_o2, ce_o2 =
> +				order_base_2(ce->guc_number_children + 1);
>  
> -		list_del_init(&ce->guc_id_link);
> -		guc_id = ce->guc_id;
> -		clr_context_registered(ce);
> +			cn = list_first_entry(guc_id_list,
> +					      struct intel_context,
> +					      guc_id_link);
> +			cn_o2 = order_base_2(cn->guc_number_children + 1);
> +
> +			/*
> +			 * Corner case where a multi-lrc context steals a guc_id
> +			 * from another context that has more guc_ids than itself.
> +			 */
> +			if (cn_o2 != ce_o2) {
> +				bitmap_release_region(guc->guc_ids_bitmap,
> +						      cn->guc_id,
> +						      cn_o2);
> +				bitmap_allocate_region(guc->guc_ids_bitmap,
> +						       ce->guc_id,
> +						       ce_o2);
> +			}
> +
> +			/* Ensure context getting stolen in expected state */
> +			GEM_BUG_ON(atomic_read(&cn->guc_id_ref));
> +			GEM_BUG_ON(context_guc_id_invalid(cn));
> +			GEM_BUG_ON(context_guc_id_stolen(cn));
> +			GEM_BUG_ON(ce_to_gse(ce) != ce_to_gse(cn));
> +
> +			list_del_init(&cn->guc_id_link);
> +			ce->guc_id = cn->guc_id;
> +
> +			/*
> +			 * If stealing from the pinned list, defer invalidating
> +			 * the guc_id until the retire workqueue processes this
> +			 * context.
> +			 */
> +			clr_context_registered(cn);
> +			if (!unpinned) {
> +				GEM_BUG_ON(ce_to_gse(cn)->stalled_context);
> +				ce_to_gse(cn)->stalled_context =
> +					intel_context_get(cn);
> +				set_context_guc_id_stolen(cn);
> +			} else {
> +				set_context_guc_id_invalid(cn);
> +			}
> +
> +			return 0;
> +		}
>  
>  		/*
> -		 * If stealing from the pinned list, defer invalidating
> -		 * the guc_id until the retire workqueue processes this
> -		 * context.
> +		 * When using multi-lrc we search the guc_id_lists with the
> +		 * least number of guc_ids required first but will consume a
> +		 * larger block of guc_ids if necessary. 2x the children always
> +		 * moves you to the next list.
>  		 */
> -		if (!unpinned) {
> -			GEM_BUG_ON(ce_to_gse(ce)->stalled_context);
> +		if (!number_children ||
> +		    order_base_2(number_children + 1) == MAX_GUC_ID_ORDER)
> +			break;
>  
> -			ce_to_gse(ce)->stalled_context = intel_context_get(ce);
> -			set_context_guc_id_stolen(ce);
> -		} else {
> -			set_context_guc_id_invalid(ce);
> -		}
> +		number_children *= 2;
> +	} while (true);
>  
> -		return guc_id;
> -	} else {
> -		return -EAGAIN;
> -	}
> +	return -EAGAIN;
>  }
>  
>  enum {	/* Return values for pin_guc_id / assign_guc_id */
> @@ -1674,17 +1757,18 @@ enum {	/* Return values for pin_guc_id / assign_guc_id */
>  	NEW_GUC_ID_ENABLED	= 2,
>  };
>  
> -static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
> +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce,
> +			 bool tasklet)
>  {
>  	int ret;
>  
>  	lockdep_assert_held(&guc->contexts_lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>  
> -	ret = new_guc_id(guc);
> +	ret = new_guc_id(guc, ce);
>  	if (unlikely(ret < 0)) {
> -		ret = steal_guc_id(guc, true);
> -		if (ret >= 0) {
> -			*out = ret;
> +		ret = steal_guc_id(guc, ce, true);
> +		if (!ret) {
>  			ret = NEW_GUC_ID_DISABLED;
>  		} else if (ret < 0 && tasklet) {
>  			/*
> @@ -1692,15 +1776,18 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
>  			 * enabled if guc_ids are exhausted and we are submitting
>  			 * from the tasklet.
>  			 */
> -			ret = steal_guc_id(guc, false);
> -			if (ret >= 0) {
> -				*out = ret;
> +			ret = steal_guc_id(guc, ce, false);
> +			if (!ret)
>  				ret = NEW_GUC_ID_ENABLED;
> -			}
>  		}
> -	} else {
> -		*out = ret;
> -		ret = SAME_GUC_ID;
> +	}
> +
> +	if (!(ret < 0) && intel_context_is_parent(ce)) {
> +		struct intel_context *child;
> +		int i = 1;
> +
> +		for_each_child(ce, child)
> +			child->guc_id = ce->guc_id + i++;
>  	}
>  
>  	return ret;
> @@ -1713,6 +1800,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
>  	int ret = 0;
>  	unsigned long flags, tries = PIN_GUC_ID_TRIES;
>  
> +	GEM_BUG_ON(intel_context_is_child(ce));
>  	GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
>  
>  try_again:
> @@ -1724,7 +1812,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
>  	}
>  
>  	if (context_guc_id_invalid(ce)) {
> -		ret = assign_guc_id(guc, &ce->guc_id, tasklet);
> +		ret = assign_guc_id(guc, ce, tasklet);
>  		if (unlikely(ret < 0))
>  			goto out_unlock;
>  	}
> @@ -1770,6 +1858,7 @@ static void unpin_guc_id(struct intel_guc *guc,
>  	unsigned long flags;
>  
>  	GEM_BUG_ON(atomic_read(&ce->guc_id_ref) < 0);
> +	GEM_BUG_ON(intel_context_is_child(ce));
>  
>  	if (unlikely(context_guc_id_invalid(ce)))
>  		return;
> @@ -1781,7 +1870,8 @@ static void unpin_guc_id(struct intel_guc *guc,
>  
>  	if (!context_guc_id_invalid(ce) && !context_guc_id_stolen(ce) &&
>  	    !atomic_read(&ce->guc_id_ref)) {
> -		struct list_head *head = get_guc_id_list(guc, unpinned);
> +		struct list_head *head =
> +			get_guc_id_list(guc, ce->guc_number_children, unpinned);
>  
>  		list_add_tail(&ce->guc_id_link, head);
>  	}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> index 7069b7248f55..a5933e07bdd2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> @@ -22,6 +22,16 @@ struct guc_virtual_engine {
>  /*
>   * Object which encapsulates the globally operated on i915_sched_engine +
>   * the GuC submission state machine described in intel_guc_submission.c.
> + *
> + * Currently we have two instances of these per GuC. One for single-lrc and one
> + * for multi-lrc submission. We split these into two submission engines as they
> + * can operate in parallel allowing a blocking condition on one not to affect
> + * the other. i.e. guc_ids are statically allocated between these two submission
> + * modes. One mode may have guc_ids exhausted which requires blocking while the
> + * other has plenty of guc_ids and can make forward progress.
> + *
> + * In the future if different submission use cases arise we can simply
> + * instantiate another of these objects and assign it to the context.
>   */
>  struct guc_submit_engine {
>  	struct i915_sched_engine sched_engine;
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine Matthew Brost
@ 2021-08-09 15:35   ` Daniel Vetter
  2021-08-09 19:05     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 15:35 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:17PM -0700, Matthew Brost wrote:
> The heartbeat uses a single instance of a GuC submit engine (GSE) to do
> the hang check. As such, if a different GSE's state machine hangs, the
> heartbeat cannot detect this hang. Add a timer to each GSE which in turn
> can disable all submissions if it is hung.
> 
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++++
>  .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index afb9b4bb8971..2d8296bcc583 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine *gse)
>  	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
>  }
>  
> +/* 2 seconds seems like a reasonable timeout waiting for a G2H */
> +#define MAX_TASKLET_BLOCKED_NS	2000000000
>  static void set_tasklet_blocked(struct guc_submit_engine *gse)
>  {
>  	lockdep_assert_held(&gse->sched_engine.lock);
> +	hrtimer_start_range_ns(&gse->hang_timer,
> +			       ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
> +			       HRTIMER_MODE_REL_PINNED);
>  	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);

So with drm/scheduler the reset handling is assumed to be
single-threaded, and there are quite complex rules around that. I've
recently worked with Boris Brezillion to clarify all this a bit and
improve docs. Does this all still work in that glorious future? Might be
good to at least sprinkle some comments/thoughts around in the commit
message about the envisaged future direction for all this stuff, to keep
people in the loop. Especially future people.

Ofc plan is still to just largely land all this.

Also: set_bit is an unordered atomic, which means you need barriers, which
means ... *insert the full rant about justifying/documenting lockless
algorithms from earlier *
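
(For reference, the usual shape of that pairing - the field below is made
up, it is only there to show where the barriers go:

	/* writer: publish the data first, then the flag the tasklet tests */
	WRITE_ONCE(gse->example_stalled_rq, rq);	/* hypothetical field */
	smp_mb__before_atomic();
	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);

	/* reader side, e.g. in the tasklet */
	if (test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags)) {
		smp_rmb();	/* pairs with the full barrier above */
		rq = READ_ONCE(gse->example_stalled_rq);
	}

Whatever the actual scheme is, it needs to be written down.)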

But I think this all falls out with the removal of the guc-id allocation
scheme?
-Daniel

>  }
>  
>  static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
>  {
>  	lockdep_assert_held(&gse->sched_engine.lock);
> +	hrtimer_cancel(&gse->hang_timer);
>  	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
>  }
>  
> @@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc *guc)
>  		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>  			GEM_BUG_ON(!guc->ct.enabled);
>  			__tasklet_disable_sync_once(&sched_engine->tasklet);
> +			hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
>  			sched_engine->tasklet.callback = NULL;
>  		}
>  	}
> @@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref *kref)
>  	kfree(gse);
>  }
>  
> +static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
> +{
> +	struct guc_submit_engine *gse =
> +		container_of(hrtimer, struct guc_submit_engine, hang_timer);
> +	struct intel_guc *guc = gse->sched_engine.private_data;
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +	if (guc->gse_hang_expected)
> +		drm_dbg(&guc_to_gt(guc)->i915->drm,
> +			"GSE[%i] hung, disabling submission", gse->id);
> +	else
> +		drm_err(&guc_to_gt(guc)->i915->drm,
> +			"GSE[%i] hung, disabling submission", gse->id);
> +#else
> +	drm_err(&guc_to_gt(guc)->i915->drm,
> +		"GSE[%i] hung, disabling submission", gse->id);
> +#endif
> +
> +	/*
> +	 * Tasklet not making forward progress, disable submission which in turn
> +	 * will kick in the heartbeat to do a full GPU reset.
> +	 */
> +	disable_submission(guc);
> +
> +	return HRTIMER_NORESTART;
> +}
> +
>  static void guc_submit_engine_init(struct intel_guc *guc,
>  				   struct guc_submit_engine *gse,
>  				   int id)
> @@ -3767,6 +3801,8 @@ static void guc_submit_engine_init(struct intel_guc *guc,
>  	sched_engine->retire_inflight_request_prio =
>  		guc_retire_inflight_request_prio;
>  	sched_engine->private_data = guc;
> +	hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	gse->hang_timer.function = gse_hang;
>  	gse->id = id;
>  }
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> index a5933e07bdd2..eae2e9725ede 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> @@ -6,6 +6,8 @@
>  #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
>  #define _INTEL_GUC_SUBMISSION_TYPES_H_
>  
> +#include <linux/xarray.h>
> +
>  #include "gt/intel_engine_types.h"
>  #include "gt/intel_context_types.h"
>  #include "i915_scheduler_types.h"
> @@ -41,6 +43,7 @@ struct guc_submit_engine {
>  	unsigned long flags;
>  	int total_num_rq_with_no_guc_id;
>  	atomic_t num_guc_ids_not_ready;
> +	struct hrtimer hang_timer;
>  	int id;
>  
>  	/*
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 21/46] drm/i915/guc: Add guc_child_context_destroy
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 21/46] drm/i915/guc: Add guc_child_context_destroy Matthew Brost
@ 2021-08-09 15:36   ` Daniel Vetter
  2021-08-09 19:06     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 15:36 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:18PM -0700, Matthew Brost wrote:
> Since child contexts do not own the guc_ids or GuC context registration,
> child contexts can simply be freed on destroy. Add
> guc_child_context_destroy context operation to do this.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 2d8296bcc583..850edeff9230 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2828,6 +2828,13 @@ static void destroy_worker_func(struct work_struct *w)
>  		intel_gt_pm_unpark_work_add(gt, destroy_worker);
>  }
>  
> +/* Future patches will use this function */
> +__maybe_unused

Pure bikeshed, but for something this small just squash it in with the
first user. This kinda does nothing alone.
-Daniel

> +static void guc_child_context_destroy(struct kref *kref)
> +{
> +	__guc_context_destroy(container_of(kref, struct intel_context, ref));
> +}
> +
>  static void guc_context_destroy(struct kref *kref)
>  {
>  	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 23/46] drm/i915/guc: Insert submit fences between requests in parent-child relationship
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 23/46] drm/i915/guc: Insert submit fences between requests in parent-child relationship Matthew Brost
@ 2021-08-09 16:32   ` Daniel Vetter
  2021-08-09 16:39     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 16:32 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:20PM -0700, Matthew Brost wrote:
> The GuC must receive requests in the order submitted for contexts in a
> parent-child relationship to function correctly. To ensure this, insert
> a submit fence between the current request and the last request submitted
> for requests / contexts in a parent-child relationship. This is
> conceptually similar to a single timeline.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
>  drivers/gpu/drm/i915/gt/intel_context.h       |   5 +
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   3 +-
>  drivers/gpu/drm/i915/i915_request.c           | 120 ++++++++++++++----
>  5 files changed, 105 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index bb4c14656067..98ef2d0f7a39 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -487,6 +487,8 @@ void intel_context_fini(struct intel_context *ce)
>  {
>  	struct intel_context *child, *next;
>  
> +	if (ce->last_rq)
> +		i915_request_put(ce->last_rq);
>  	if (ce->timeline)
>  		intel_timeline_put(ce->timeline);
>  	i915_vm_put(ce->vm);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index 7ce3b3d2edb7..a302599e436a 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -60,6 +60,11 @@ intel_context_to_parent(struct intel_context *ce)
>  	return intel_context_is_child(ce) ? ce->parent : ce;
>  }
>  
> +static inline bool intel_context_is_parallel(struct intel_context *ce)
> +{
> +	return intel_context_is_child(ce) || intel_context_is_parent(ce);
> +}
> +
>  void intel_context_bind_parent_child(struct intel_context *parent,
>  				     struct intel_context *child);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 9665cb31bab0..f4fc81f64921 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -225,6 +225,9 @@ struct intel_context {
>  	 */
>  	u8 guc_prio;
>  	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> +
> +	/* Last request submitted on a parent */
> +	struct i915_request *last_rq;
>  };
>  
>  #endif /* __INTEL_CONTEXT_TYPES__ */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index d1d4a1e59e8d..1cb382f7d79d 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -820,8 +820,7 @@ static inline int rq_prio(const struct i915_request *rq)
>  
>  static inline bool is_multi_lrc_rq(struct i915_request *rq)
>  {
> -	return intel_context_is_child(rq->context) ||
> -		intel_context_is_parent(rq->context);
> +	return intel_context_is_parallel(rq->context);
>  }
>  
>  /*
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index ce446716d092..2e51c8999088 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1546,36 +1546,62 @@ i915_request_await_object(struct i915_request *to,
>  	return ret;
>  }
>  
> +static inline bool is_parallel_rq(struct i915_request *rq)
> +{
> +	return intel_context_is_parallel(rq->context);
> +}
> +
> +static inline struct intel_context *request_to_parent(struct i915_request *rq)
> +{
> +	return intel_context_to_parent(rq->context);
> +}
> +
>  static struct i915_request *
> -__i915_request_add_to_timeline(struct i915_request *rq)
> +__i915_request_ensure_parallel_ordering(struct i915_request *rq,
> +					struct intel_timeline *timeline)
>  {
> -	struct intel_timeline *timeline = i915_request_timeline(rq);
>  	struct i915_request *prev;
>  
> -	/*
> -	 * Dependency tracking and request ordering along the timeline
> -	 * is special cased so that we can eliminate redundant ordering
> -	 * operations while building the request (we know that the timeline
> -	 * itself is ordered, and here we guarantee it).
> -	 *
> -	 * As we know we will need to emit tracking along the timeline,
> -	 * we embed the hooks into our request struct -- at the cost of
> -	 * having to have specialised no-allocation interfaces (which will
> -	 * be beneficial elsewhere).
> -	 *
> -	 * A second benefit to open-coding i915_request_await_request is
> -	 * that we can apply a slight variant of the rules specialised
> -	 * for timelines that jump between engines (such as virtual engines).
> -	 * If we consider the case of virtual engine, we must emit a dma-fence
> -	 * to prevent scheduling of the second request until the first is
> -	 * complete (to maximise our greedy late load balancing) and this
> -	 * precludes optimising to use semaphores serialisation of a single
> -	 * timeline across engines.
> -	 */
> +	GEM_BUG_ON(!is_parallel_rq(rq));
> +
> +	prev = request_to_parent(rq)->last_rq;
> +	if (prev) {
> +		if (!__i915_request_is_complete(prev)) {
> +			i915_sw_fence_await_sw_fence(&rq->submit,
> +						     &prev->submit,
> +						     &rq->submitq);
> +
> +			if (rq->engine->sched_engine->schedule)
> +				__i915_sched_node_add_dependency(&rq->sched,
> +								 &prev->sched,
> +								 &rq->dep,
> +								 0);
> +		}
> +		i915_request_put(prev);
> +	}
> +
> +	request_to_parent(rq)->last_rq = i915_request_get(rq);
> +
> +	return to_request(__i915_active_fence_set(&timeline->last_request,
> +						  &rq->fence));
> +}
> +
> +static struct i915_request *
> +__i915_request_ensure_ordering(struct i915_request *rq,
> +			       struct intel_timeline *timeline)
> +{
> +	struct i915_request *prev;
> +
> +	GEM_BUG_ON(is_parallel_rq(rq));
> +
>  	prev = to_request(__i915_active_fence_set(&timeline->last_request,
>  						  &rq->fence));
> +
>  	if (prev && !__i915_request_is_complete(prev)) {
>  		bool uses_guc = intel_engine_uses_guc(rq->engine);
> +		bool pow2 = is_power_of_2(READ_ONCE(prev->engine)->mask |
> +					  rq->engine->mask);
> +		bool same_context = prev->context == rq->context;
>  
>  		/*
>  		 * The requests are supposed to be kept in order. However,
> @@ -1583,13 +1609,11 @@ __i915_request_add_to_timeline(struct i915_request *rq)
>  		 * is used as a barrier for external modification to this
>  		 * context.
>  		 */
> -		GEM_BUG_ON(prev->context == rq->context &&
> +		GEM_BUG_ON(same_context &&
>  			   i915_seqno_passed(prev->fence.seqno,
>  					     rq->fence.seqno));
>  
> -		if ((!uses_guc &&
> -		     is_power_of_2(READ_ONCE(prev->engine)->mask | rq->engine->mask)) ||
> -		    (uses_guc && prev->context == rq->context))
> +		if ((same_context && uses_guc) || (!uses_guc && pow2))
>  			i915_sw_fence_await_sw_fence(&rq->submit,
>  						     &prev->submit,
>  						     &rq->submitq);
> @@ -1604,6 +1628,50 @@ __i915_request_add_to_timeline(struct i915_request *rq)
>  							 0);
>  	}
>  
> +	return prev;
> +}
> +
> +static struct i915_request *
> +__i915_request_add_to_timeline(struct i915_request *rq)
> +{
> +	struct intel_timeline *timeline = i915_request_timeline(rq);
> +	struct i915_request *prev;
> +
> +	/*
> +	 * Dependency tracking and request ordering along the timeline
> +	 * is special cased so that we can eliminate redundant ordering
> +	 * operations while building the request (we know that the timeline
> +	 * itself is ordered, and here we guarantee it).
> +	 *
> +	 * As we know we will need to emit tracking along the timeline,
> +	 * we embed the hooks into our request struct -- at the cost of
> +	 * having to have specialised no-allocation interfaces (which will
> +	 * be beneficial elsewhere).
> +	 *
> +	 * A second benefit to open-coding i915_request_await_request is
> +	 * that we can apply a slight variant of the rules specialised
> +	 * for timelines that jump between engines (such as virtual engines).
> +	 * If we consider the case of virtual engine, we must emit a dma-fence
> +	 * to prevent scheduling of the second request until the first is
> +	 * complete (to maximise our greedy late load balancing) and this
> +	 * precludes optimising to use semaphores serialisation of a single
> +	 * timeline across engines.
> +	 *

Can we put a big FIXME in here that this should all be resolved with a
proper interface which passes the entire thing down to the backend?
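
Something as blunt as this would already do (wording up for debate, it just
needs to be greppable):

	/*
	 * FIXME: The ordering of parallel requests is open-coded here for
	 * now; longer term this should be handed to the backend through a
	 * proper interface instead of being special cased in this layer.
	 */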

Or is that no longer (or wasn't ever) the long-term goal?
-Daniel

> +	 * We do not order parallel submission requests on the timeline as each
> +	 * parallel submission context has its own timeline and the ordering
> +	 * rules for parallel requests are that they must be submitted in the
> +	 * order received from the execbuf IOCTL. So rather than using the
> +	 * timeline we store a pointer to the last request submitted in the
> +	 * relationship in the gem context and insert a submission fence
> +	 * between that request and the request passed into this function, or
> +	 * alternatively we use a completion fence if the gem context has a
> +	 * single timeline and this is the first submission of an execbuf IOCTL.
> +	 */
> +	if (likely(!is_parallel_rq(rq)))
> +		prev = __i915_request_ensure_ordering(rq, timeline);
> +	else
> +		prev = __i915_request_ensure_parallel_ordering(rq, timeline);
> +
>  	/*
>  	 * Make sure that no request gazumped us - if it was allocated after
>  	 * our i915_request_alloc() and called __i915_request_add() before
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc Matthew Brost
@ 2021-08-09 16:36   ` Daniel Vetter
  2021-08-09 19:13     ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 16:36 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:22PM -0700, Matthew Brost wrote:
> Display the workqueue status in debugfs for GuC contexts that are in a
> parent-child relationship.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 +++++++++++++------
>  1 file changed, 39 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 30df1c8db491..44a7582c9aed 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4527,31 +4527,53 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>  		gse_log_submission_info(guc->gse[i], p, i);
>  }
>  
> +static inline void guc_log_context(struct drm_printer *p,
> +				   struct intel_context *ce)
> +{
> +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> +		   ce->ring->head,
> +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> +		   ce->ring->tail,
> +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> +		   atomic_read(&ce->pin_count));
> +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> +		   atomic_read(&ce->guc_id_ref));
> +	drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> +		   atomic_read(&ce->guc_num_rq_not_ready));
> +	drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> +		   ce->guc_state.sched_state,
> +		   atomic_read(&ce->guc_sched_state_no_lock));

It's all debugfs, but I think proper locking even there is good. It at
least reduces the confusion when the locking scheme is largely
undocumented. Also, given how much we use rcu for everything, it would be good
to double-check all pointer dereferences are properly protected.

> +}
> +
>  void intel_guc_submission_print_context_info(struct intel_guc *guc,
>  					     struct drm_printer *p)
>  {
>  	struct intel_context *ce;
>  	unsigned long index;
>  	xa_for_each(&guc->context_lookup, index, ce) {

xa_for_each doesn't provide any guarantees, so doesn't protect against
concurrent removal or anything like that. We need to do better than that.
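
A minimal sketch of what I mean (untested, and whether holding the xarray
lock across all of the printing is acceptable is a separate question):

	struct intel_context *ce;
	unsigned long index, flags;

	xa_lock_irqsave(&guc->context_lookup, flags);
	xa_for_each(&guc->context_lookup, index, ce) {
		GEM_BUG_ON(intel_context_is_child(ce));

		guc_log_context(p, ce);
		guc_log_context_priority(p, ce);
		/* plus the parent/child workqueue dump from this patch */
	}
	xa_unlock_irqrestore(&guc->context_lookup, flags);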
-Daniel

> -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> -			   ce->ring->head,
> -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> -			   ce->ring->tail,
> -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> -			   atomic_read(&ce->pin_count));
> -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> -			   atomic_read(&ce->guc_id_ref));
> -		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> -			   atomic_read(&ce->guc_num_rq_not_ready));
> -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> -			   ce->guc_state.sched_state,
> -			   atomic_read(&ce->guc_sched_state_no_lock));
> +		GEM_BUG_ON(intel_context_is_child(ce));
>  
> +		guc_log_context(p, ce);
>  		guc_log_context_priority(p, ce);
> +
> +		if (intel_context_is_parent(ce)) {
> +			struct guc_process_desc *desc = __get_process_desc(ce);
> +			struct intel_context *child;
> +
> +			drm_printf(p, "\t\tWQI Head: %u\n",
> +				   READ_ONCE(desc->head));
> +			drm_printf(p, "\t\tWQI Tail: %u\n",
> +				   READ_ONCE(desc->tail));
> +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> +				   READ_ONCE(desc->wq_status));
> +
> +			for_each_child(ce, child)
> +				guc_log_context(p, child);
> +		}
>  	}
>  }
>  
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 26/46] drm/i915: Connect UAPI to GuC multi-lrc interface
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 26/46] drm/i915: Connect UAPI to GuC multi-lrc interface Matthew Brost
@ 2021-08-09 16:37   ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 16:37 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:23PM -0700, Matthew Brost wrote:
> Introduce 'set parallel submit' extension to connect UAPI to GuC
> multi-lrc interface. Kernel doc in new uAPI should explain it all.
> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

UMD merge request link + igt patchwork link because this is uapi please.
-Daniel

> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 157 +++++++++++++++++-
>  .../gpu/drm/i915/gem/i915_gem_context_types.h |   6 +
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   8 +-
>  drivers/gpu/drm/i915/gt/intel_engine.h        |  12 +-
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   6 +-
>  .../drm/i915/gt/intel_execlists_submission.c  |   6 +-
>  drivers/gpu/drm/i915/gt/selftest_execlists.c  |  12 +-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 111 +++++++++++--
>  include/uapi/drm/i915_drm.h                   | 128 ++++++++++++++
>  9 files changed, 417 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index cff72679ad7c..2b0dd3ff4db8 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -515,9 +515,149 @@ set_proto_ctx_engines_bond(struct i915_user_extension __user *base, void *data)
>  	return 0;
>  }
>  
> +static int
> +set_proto_ctx_engines_parallel_submit(struct i915_user_extension __user *base,
> +				      void *data)
> +{
> +	struct i915_context_engines_parallel_submit __user *ext =
> +		container_of_user(base, typeof(*ext), base);
> +	const struct set_proto_ctx_engines *set = data;
> +	struct drm_i915_private *i915 = set->i915;
> +	u64 flags;
> +	int err = 0, n, i, j;
> +	u16 slot, width, num_siblings;
> +	struct intel_engine_cs **siblings = NULL;
> +	intel_engine_mask_t prev_mask;
> +
> +	/* Disabling for now */
> +	return -ENODEV;
> +
> +	if (!(intel_uc_uses_guc_submission(&i915->gt.uc)))
> +		return -ENODEV;
> +
> +	if (get_user(slot, &ext->engine_index))
> +		return -EFAULT;
> +
> +	if (get_user(width, &ext->width))
> +		return -EFAULT;
> +
> +	if (get_user(num_siblings, &ext->num_siblings))
> +		return -EFAULT;
> +
> +	if (slot >= set->num_engines) {
> +		drm_dbg(&i915->drm, "Invalid placement value, %d >= %d\n",
> +			slot, set->num_engines);
> +		return -EINVAL;
> +	}
> +
> +	if (set->engines[slot].type != I915_GEM_ENGINE_TYPE_INVALID) {
> +		drm_dbg(&i915->drm,
> +			"Invalid placement[%d], already occupied\n", slot);
> +		return -EINVAL;
> +	}
> +
> +	if (get_user(flags, &ext->flags))
> +		return -EFAULT;
> +
> +	if (flags) {
> +		drm_dbg(&i915->drm, "Unknown flags 0x%02llx", flags);
> +		return -EINVAL;
> +	}
> +
> +	for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
> +		err = check_user_mbz(&ext->mbz64[n]);
> +		if (err)
> +			return err;
> +	}
> +
> +	if (width < 2) {
> +		drm_dbg(&i915->drm, "Width (%d) < 2\n", width);
> +		return -EINVAL;
> +	}
> +
> +	if (num_siblings < 1) {
> +		drm_dbg(&i915->drm, "Number siblings (%d) < 1\n",
> +			num_siblings);
> +		return -EINVAL;
> +	}
> +
> +	siblings = kmalloc_array(num_siblings * width,
> +				 sizeof(*siblings),
> +				 GFP_KERNEL);
> +	if (!siblings)
> +		return -ENOMEM;
> +
> +	/* Create contexts / engines */
> +	for (i = 0; i < width; ++i) {
> +		intel_engine_mask_t current_mask = 0;
> +		struct i915_engine_class_instance prev_engine;
> +
> +		for (j = 0; j < num_siblings; ++j) {
> +			struct i915_engine_class_instance ci;
> +
> +			n = i * num_siblings + j;
> +			if (copy_from_user(&ci, &ext->engines[n], sizeof(ci))) {
> +				err = -EFAULT;
> +				goto out_err;
> +			}
> +
> +			siblings[n] =
> +				intel_engine_lookup_user(i915, ci.engine_class,
> +							 ci.engine_instance);
> +			if (!siblings[n]) {
> +				drm_dbg(&i915->drm,
> +					"Invalid sibling[%d]: { class:%d, inst:%d }\n",
> +					n, ci.engine_class, ci.engine_instance);
> +				err = -EINVAL;
> +				goto out_err;
> +			}
> +
> +			if (n) {
> +				if (prev_engine.engine_class !=
> +				    ci.engine_class) {
> +					drm_dbg(&i915->drm,
> +						"Mismatched class %d, %d\n",
> +						prev_engine.engine_class,
> +						ci.engine_class);
> +					err = -EINVAL;
> +					goto out_err;
> +				}
> +			}
> +
> +			prev_engine = ci;
> +			current_mask |= siblings[n]->logical_mask;
> +		}
> +
> +		if (i > 0) {
> +			if (current_mask != prev_mask << 1) {
> +				drm_dbg(&i915->drm,
> +					"Non contiguous logical mask 0x%x, 0x%x\n",
> +					prev_mask, current_mask);
> +				err = -EINVAL;
> +				goto out_err;
> +			}
> +		}
> +		prev_mask = current_mask;
> +	}
> +
> +	set->engines[slot].type = I915_GEM_ENGINE_TYPE_PARALLEL;
> +	set->engines[slot].num_siblings = num_siblings;
> +	set->engines[slot].width = width;
> +	set->engines[slot].siblings = siblings;
> +
> +	return 0;
> +
> +out_err:
> +	kfree(siblings);
> +
> +	return err;
> +}
> +
>  static const i915_user_extension_fn set_proto_ctx_engines_extensions[] = {
>  	[I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_proto_ctx_engines_balance,
>  	[I915_CONTEXT_ENGINES_EXT_BOND] = set_proto_ctx_engines_bond,
> +	[I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT] =
> +		set_proto_ctx_engines_parallel_submit,
>  };
>  
>  static int set_proto_ctx_engines(struct drm_i915_file_private *fpriv,
> @@ -938,7 +1078,7 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>  
>  	e = alloc_engines(num_engines);
>  	for (n = 0; n < num_engines; n++) {
> -		struct intel_context *ce;
> +		struct intel_context *ce, *child;
>  		int ret;
>  
>  		switch (pe[n].type) {
> @@ -948,7 +1088,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>  
>  		case I915_GEM_ENGINE_TYPE_BALANCED:
>  			ce = intel_engine_create_virtual(pe[n].siblings,
> -							 pe[n].num_siblings);
> +							 pe[n].num_siblings, 0);
> +			break;
> +
> +		case I915_GEM_ENGINE_TYPE_PARALLEL:
> +			ce = intel_engine_create_parallel(pe[n].siblings,
> +							  pe[n].num_siblings,
> +							  pe[n].width);
>  			break;
>  
>  		case I915_GEM_ENGINE_TYPE_INVALID:
> @@ -969,6 +1115,13 @@ static struct i915_gem_engines *user_engines(struct i915_gem_context *ctx,
>  			err = ERR_PTR(ret);
>  			goto free_engines;
>  		}
> +		for_each_child(ce, child) {
> +			ret = intel_context_set_gem(child, ctx, pe->sseu);
> +			if (ret) {
> +				err = ERR_PTR(ret);
> +				goto free_engines;
> +			}
> +		}
>  	}
>  	e->num_engines = num_engines;
>  
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> index 94c03a97cb77..7b096d83bca1 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> @@ -78,6 +78,9 @@ enum i915_gem_engine_type {
>  
>  	/** @I915_GEM_ENGINE_TYPE_BALANCED: A load-balanced engine set */
>  	I915_GEM_ENGINE_TYPE_BALANCED,
> +
> +	/** @I915_GEM_ENGINE_TYPE_PARALLEL: A parallel engine set */
> +	I915_GEM_ENGINE_TYPE_PARALLEL,
>  };
>  
>  /**
> @@ -108,6 +111,9 @@ struct i915_gem_proto_engine {
>  	/** @num_siblings: Number of balanced siblings */
>  	unsigned int num_siblings;
>  
> +	/** @width: Width of each sibling */
> +	unsigned int width;
> +
>  	/** @siblings: Balanced siblings */
>  	struct intel_engine_cs **siblings;
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index f4fc81f64921..9cdbea752014 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -55,9 +55,13 @@ struct intel_context_ops {
>  	void (*reset)(struct intel_context *ce);
>  	void (*destroy)(struct kref *kref);
>  
> -	/* virtual engine/context interface */
> +	/* virtual/parallel engine/context interface */
>  	struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
> -						unsigned int count);
> +						unsigned int count,
> +						unsigned long flags);
> +	struct intel_context *(*create_parallel)(struct intel_engine_cs **engines,
> +						 unsigned int num_siblings,
> +						 unsigned int width);
>  	struct intel_engine_cs *(*get_sibling)(struct intel_engine_cs *engine,
>  					       unsigned int sibling);
>  };
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 87579affb952..43f16a8347ee 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -279,9 +279,19 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
>  	return intel_engine_has_preemption(engine);
>  }
>  
> +#define FORCE_VIRTUAL	BIT(0)
>  struct intel_context *
>  intel_engine_create_virtual(struct intel_engine_cs **siblings,
> -			    unsigned int count);
> +			    unsigned int count, unsigned long flags);
> +
> +static inline struct intel_context *
> +intel_engine_create_parallel(struct intel_engine_cs **engines,
> +			     unsigned int num_engines,
> +			     unsigned int width)
> +{
> +	GEM_BUG_ON(!engines[0]->cops->create_parallel);
> +	return engines[0]->cops->create_parallel(engines, num_engines, width);
> +}
>  
>  static inline bool
>  intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 4d790f9a65dd..f66c75c77584 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1923,16 +1923,16 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>  
>  struct intel_context *
>  intel_engine_create_virtual(struct intel_engine_cs **siblings,
> -			    unsigned int count)
> +			    unsigned int count, unsigned long flags)
>  {
>  	if (count == 0)
>  		return ERR_PTR(-EINVAL);
>  
> -	if (count == 1)
> +	if (count == 1 && !(flags & FORCE_VIRTUAL))
>  		return intel_context_create(siblings[0]);
>  
>  	GEM_BUG_ON(!siblings[0]->cops->create_virtual);
> -	return siblings[0]->cops->create_virtual(siblings, count);
> +	return siblings[0]->cops->create_virtual(siblings, count, flags);
>  }
>  
>  struct i915_request *
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index fc74ca28f245..769480e026bb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -201,7 +201,8 @@ static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
>  }
>  
>  static struct intel_context *
> -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +			 unsigned long flags);
>  
>  static struct i915_request *
>  __active_request(const struct intel_timeline * const tl,
> @@ -3785,7 +3786,8 @@ static void virtual_submit_request(struct i915_request *rq)
>  }
>  
>  static struct intel_context *
> -execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> +execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +			 unsigned long flags)
>  {
>  	struct virtual_engine *ve;
>  	unsigned int n;
> diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> index f12ffe797639..e876a9d88a5c 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> @@ -3733,7 +3733,7 @@ static int nop_virtual_engine(struct intel_gt *gt,
>  	GEM_BUG_ON(!nctx || nctx > ARRAY_SIZE(ve));
>  
>  	for (n = 0; n < nctx; n++) {
> -		ve[n] = intel_engine_create_virtual(siblings, nsibling);
> +		ve[n] = intel_engine_create_virtual(siblings, nsibling, 0);
>  		if (IS_ERR(ve[n])) {
>  			err = PTR_ERR(ve[n]);
>  			nctx = n;
> @@ -3929,7 +3929,7 @@ static int mask_virtual_engine(struct intel_gt *gt,
>  	 * restrict it to our desired engine within the virtual engine.
>  	 */
>  
> -	ve = intel_engine_create_virtual(siblings, nsibling);
> +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
>  	if (IS_ERR(ve)) {
>  		err = PTR_ERR(ve);
>  		goto out_close;
> @@ -4060,7 +4060,7 @@ static int slicein_virtual_engine(struct intel_gt *gt,
>  		i915_request_add(rq);
>  	}
>  
> -	ce = intel_engine_create_virtual(siblings, nsibling);
> +	ce = intel_engine_create_virtual(siblings, nsibling, 0);
>  	if (IS_ERR(ce)) {
>  		err = PTR_ERR(ce);
>  		goto out;
> @@ -4112,7 +4112,7 @@ static int sliceout_virtual_engine(struct intel_gt *gt,
>  
>  	/* XXX We do not handle oversubscription and fairness with normal rq */
>  	for (n = 0; n < nsibling; n++) {
> -		ce = intel_engine_create_virtual(siblings, nsibling);
> +		ce = intel_engine_create_virtual(siblings, nsibling, 0);
>  		if (IS_ERR(ce)) {
>  			err = PTR_ERR(ce);
>  			goto out;
> @@ -4214,7 +4214,7 @@ static int preserved_virtual_engine(struct intel_gt *gt,
>  	if (err)
>  		goto out_scratch;
>  
> -	ve = intel_engine_create_virtual(siblings, nsibling);
> +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
>  	if (IS_ERR(ve)) {
>  		err = PTR_ERR(ve);
>  		goto out_scratch;
> @@ -4354,7 +4354,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
>  	if (igt_spinner_init(&spin, gt))
>  		return -ENOMEM;
>  
> -	ve = intel_engine_create_virtual(siblings, nsibling);
> +	ve = intel_engine_create_virtual(siblings, nsibling, 0);
>  	if (IS_ERR(ve)) {
>  		err = PTR_ERR(ve);
>  		goto out_spin;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 44a7582c9aed..89528624710a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -82,7 +82,8 @@
>   */
>  
>  static struct intel_context *
> -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count);
> +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +		   unsigned long flags);
>  
>  #define GUC_REQUEST_SIZE 64 /* bytes */
>  
> @@ -2514,8 +2515,6 @@ static void guc_context_post_unpin(struct intel_context *ce)
>  	__guc_context_post_unpin(ce);
>  }
>  
> -/* Future patches will use this function */
> -__maybe_unused
>  static int guc_parent_context_pre_pin(struct intel_context *ce,
>  				      struct i915_gem_ww_ctx *ww)
>  {
> @@ -2559,8 +2558,6 @@ static int guc_parent_context_pre_pin(struct intel_context *ce,
>  	return err;
>  }
>  
> -/* Future patches will use this function */
> -__maybe_unused
>  static void guc_parent_context_post_unpin(struct intel_context *ce)
>  {
>  	struct intel_context *child;
> @@ -2576,8 +2573,6 @@ static void guc_parent_context_post_unpin(struct intel_context *ce)
>  	}
>  }
>  
> -/* Future patches will use this function */
> -__maybe_unused
>  static int guc_parent_context_pin(struct intel_context *ce)
>  {
>  	int ret, i = 0, j = 0;
> @@ -2623,8 +2618,6 @@ static int guc_parent_context_pin(struct intel_context *ce)
>  	return ret;
>  }
>  
> -/* Future patches will use this function */
> -__maybe_unused
>  static void guc_parent_context_unpin(struct intel_context *ce)
>  {
>  	struct intel_context *child;
> @@ -3048,8 +3041,6 @@ static void destroy_worker_func(struct work_struct *w)
>  		intel_gt_pm_unpark_work_add(gt, destroy_worker);
>  }
>  
> -/* Future patches will use this function */
> -__maybe_unused
>  static void guc_child_context_destroy(struct kref *kref)
>  {
>  	__guc_context_destroy(container_of(kref, struct intel_context, ref));
> @@ -3272,6 +3263,11 @@ static void remove_from_context(struct i915_request *rq)
>  	i915_request_notify_execute_cb_imm(rq);
>  }
>  
> +static struct intel_context *
> +guc_create_parallel(struct intel_engine_cs **engines,
> +		    unsigned int num_siblings,
> +		    unsigned int width);
> +
>  static const struct intel_context_ops guc_context_ops = {
>  	.alloc = guc_context_alloc,
>  
> @@ -3293,6 +3289,7 @@ static const struct intel_context_ops guc_context_ops = {
>  	.destroy = guc_context_destroy,
>  
>  	.create_virtual = guc_create_virtual,
> +	.create_parallel = guc_create_parallel,
>  };
>  
>  static void __guc_signal_context_fence(struct intel_context *ce)
> @@ -3782,6 +3779,91 @@ static void guc_retire_inflight_request_prio(struct i915_request *rq)
>  	spin_unlock(&ce->guc_active.lock);
>  }
>  
> +static const struct intel_context_ops virtual_parent_context_ops = {
> +	.alloc = guc_virtual_context_alloc,
> +
> +	.pre_pin = guc_parent_context_pre_pin,
> +	.pin = guc_parent_context_pin,
> +	.unpin = guc_parent_context_unpin,
> +	.post_unpin = guc_parent_context_post_unpin,
> +
> +	.ban = guc_context_ban,
> +
> +	.enter = guc_virtual_context_enter,
> +	.exit = guc_virtual_context_exit,
> +
> +	.sched_disable = guc_context_sched_disable,
> +
> +	.destroy = guc_context_destroy,
> +
> +	.get_sibling = guc_virtual_get_sibling,
> +};
> +
> +static const struct intel_context_ops virtual_child_context_ops = {
> +	.alloc = guc_virtual_context_alloc,
> +
> +	.enter = guc_virtual_context_enter,
> +	.exit = guc_virtual_context_exit,
> +
> +	.destroy = guc_child_context_destroy,
> +};
> +
> +static struct intel_context *
> +guc_create_parallel(struct intel_engine_cs **engines,
> +		    unsigned int num_siblings,
> +		    unsigned int width)
> +{
> +	struct intel_engine_cs **siblings = NULL;
> +	struct intel_context *parent = NULL, *ce, *err;
> +	int i, j;
> +	int ret;
> +
> +	siblings = kmalloc_array(num_siblings,
> +				 sizeof(*siblings),
> +				 GFP_KERNEL);
> +	if (!siblings)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for (i = 0; i < width; ++i) {
> +		for (j = 0; j < num_siblings; ++j)
> +			siblings[j] = engines[i * num_siblings + j];
> +
> +		ce = intel_engine_create_virtual(siblings, num_siblings,
> +						 FORCE_VIRTUAL);
> +		if (!ce) {
> +			err = ERR_PTR(-ENOMEM);
> +			goto unwind;
> +		}
> +
> +		if (i == 0) {
> +			parent = ce;
> +		} else {
> +			intel_context_bind_parent_child(parent, ce);
> +			ret = intel_context_alloc_state(ce);
> +			if (ret) {
> +				err = ERR_PTR(ret);
> +				goto unwind;
> +			}
> +		}
> +	}
> +
> +	parent->ops = &virtual_parent_context_ops;
> +	for_each_child(parent, ce)
> +		ce->ops = &virtual_child_context_ops;
> +
> +	kfree(siblings);
> +	return parent;
> +
> +unwind:
> +	if (parent) {
> +		for_each_child(parent, ce)
> +			intel_context_put(ce);
> +		intel_context_put(parent);
> +	}
> +	kfree(siblings);
> +	return err;
> +}
> +
>  static void sanitize_hwsp(struct intel_engine_cs *engine)
>  {
>  	struct intel_timeline *tl;
> @@ -4578,7 +4660,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>  }
>  
>  static struct intel_context *
> -guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> +guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count,
> +		   unsigned long flags)
>  {
>  	struct guc_virtual_engine *ve;
>  	struct intel_guc *guc;
> @@ -4591,7 +4674,9 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
>  		return ERR_PTR(-ENOMEM);
>  
>  	guc = &siblings[0]->gt->uc.guc;
> -	sched_engine = guc_to_sched_engine(guc, GUC_SUBMIT_ENGINE_SINGLE_LRC);
> +	sched_engine = guc_to_sched_engine(guc, (flags & FORCE_VIRTUAL) ?
> +					   GUC_SUBMIT_ENGINE_MULTI_LRC :
> +					   GUC_SUBMIT_ENGINE_SINGLE_LRC);
>  
>  	ve->base.i915 = siblings[0]->i915;
>  	ve->base.gt = siblings[0]->gt;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index ef72e07fe08c..a16f0f8908de 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1821,6 +1821,7 @@ struct drm_i915_gem_context_param {
>   * Extensions:
>   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
>   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
>   */
>  #define I915_CONTEXT_PARAM_ENGINES	0xa
>  
> @@ -2046,6 +2047,132 @@ struct i915_context_engines_bond {
>  	struct i915_engine_class_instance engines[N__]; \
>  } __attribute__((packed)) name__
>  
> +/**
> + * struct i915_context_engines_parallel_submit - Configure engine for
> + * parallel submission.
> + *
> + * Set up a slot in the context engine map to allow multiple BBs to be submitted
> + * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
> + * in parallel. Multiple hardware contexts are created internally in the i915 to
> + * run these BBs. Once a slot is configured for N BBs, only N BBs can be
> + * submitted in each execbuf IOCTL and this is implicit behavior, i.e. the user
> + * doesn't tell the execbuf IOCTL there are N BBs; the execbuf IOCTL knows how
> + * many BBs there are based on the slot's configuration. The N BBs are the last
> + * N buffer objects in the list, or the first N if I915_EXEC_BATCH_FIRST is set.
> + *
> + * The default placement behavior is to create implicit bonds between each
> + * context if each context maps to more than 1 physical engine (e.g. the
> + * context is a virtual engine). Also, we only allow contexts of the same
> + * engine class, and these contexts must be in logically contiguous order.
> + * Examples of the placement behavior are described below. Lastly, the default
> + * is to not allow BBs to be preempted mid-BB; rather, coordinated preemption
> + * is inserted on all hardware contexts between each set of BBs. Flags may be
> + * added in the future to change both of these default behaviors.
> + *
> + * Returns -EINVAL if the hardware context placement configuration is invalid
> + * or if the placement configuration isn't supported on the platform /
> + * submission interface.
> + * Returns -ENODEV if the extension isn't supported on the platform /
> + * submission interface.
> + *
> + * .. code-block:: none
> + *
> + *	Example 1 pseudo code:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=1,
> + *		     engines=CS[0],CS[1])
> + *
> + *	Results in the following valid placement:
> + *	CS[0], CS[1]
> + *
> + *	Example 2 pseudo code:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> + *		     engines=CS[0],CS[2],CS[1],CS[3])
> + *
> + *	Results in the following valid placements:
> + *	CS[0], CS[1]
> + *	CS[2], CS[3]
> + *
> + *	This can also be thought of as 2 virtual engines, described by the 2-D
> + *	array in the engines field, with bonds placed between each index of the
> + *	virtual engines, e.g. CS[0] is bonded to CS[1] and CS[2] is bonded to
> + *	CS[3].
> + *	VE[0] = CS[0], CS[2]
> + *	VE[1] = CS[1], CS[3]
> + *
> + *	Example 3 pseudo code:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> + *		     engines=CS[0],CS[1],CS[1],CS[3])
> + *
> + *	Results in the following valid and invalid placements:
> + *	CS[0], CS[1]
> + *	CS[1], CS[3] - Not logically contiguous, returns -EINVAL
> + */
> +struct i915_context_engines_parallel_submit {
> +	/**
> +	 * @base: base user extension.
> +	 */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @engine_index: slot for parallel engine
> +	 */
> +	__u16 engine_index;
> +
> +	/**
> +	 * @width: number of contexts per parallel engine
> +	 */
> +	__u16 width;
> +
> +	/**
> +	 * @num_siblings: number of siblings per context
> +	 */
> +	__u16 num_siblings;
> +
> +	/**
> +	 * @mbz16: reserved for future use; must be zero
> +	 */
> +	__u16 mbz16;
> +
> +	/**
> +	 * @flags: all undefined flags must be zero; currently no flags are defined
> +	 */
> +	__u64 flags;
> +
> +	/**
> +	 * @mbz64: reserved for future use; must be zero
> +	 */
> +	__u64 mbz64[3];
> +
> +	/**
> +	 * @engines: 2-d array of engine instances to configure parallel engine
> +	 *
> +	 * length = width (i) * num_siblings (j)
> +	 * index = j + i * num_siblings
> +	 */
> +	struct i915_engine_class_instance engines[0];
> +
> +} __packed;
> +
> +#define I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(name__, N__) struct { \
> +	struct i915_user_extension base; \
> +	__u16 engine_index; \
> +	__u16 width; \
> +	__u16 num_siblings; \
> +	__u16 mbz16; \
> +	__u64 flags; \
> +	__u64 mbz64[3]; \
> +	struct i915_engine_class_instance engines[N__]; \
> +} __attribute__((packed)) name__
> +
>  /**
>   * DOC: Context Engine Map uAPI
>   *
> @@ -2105,6 +2232,7 @@ struct i915_context_param_engines {
>  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
>  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
>  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
>  	struct i915_engine_class_instance engines[0];
>  } __attribute__((packed));
>  
> -- 
> 2.28.0
> 
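
For readers working from the uAPI above, a rough userspace sketch of how the
extension might be chained into I915_CONTEXT_PARAM_ENGINES. This is
illustrative only and not taken from any UMD: includes, error handling and
the context creation step are elided, and the engine class/instance numbers
are hypothetical.

	/* Example 1 above: width=2, num_siblings=1, i.e. 2 BBs per execbuf,
	 * one fixed placement per BB. */
	I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 2) = {
		.base = { .name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT },
		.engine_index = 0,
		.width = 2,
		.num_siblings = 1,
		.engines = {
			{ I915_ENGINE_CLASS_VIDEO, 0 },	/* CS[0] */
			{ I915_ENGINE_CLASS_VIDEO, 1 },	/* CS[1] */
		},
	};
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
		.extensions = (__u64)(uintptr_t)&parallel,
		/* slot 0 is configured by the parallel extension above */
		.engines = { { I915_ENGINE_CLASS_INVALID,
			       I915_ENGINE_CLASS_INVALID_NONE } },
	};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id,	/* previously created GEM context */
		.param = I915_CONTEXT_PARAM_ENGINES,
		.size = sizeof(engines),
		.value = (__u64)(uintptr_t)&engines,
	};

	drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);

Each execbuf IOCTL on slot 0 of this context then supplies 2 batch buffers,
which are placed on CS[0] and CS[1] respectively per the slot configuration.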

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 23/46] drm/i915/guc: Insert submit fences between requests in parent-child relationship
  2021-08-09 16:32   ` Daniel Vetter
@ 2021-08-09 16:39     ` Matthew Brost
  2021-08-09 17:03       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 16:39 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 06:32:42PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:20PM -0700, Matthew Brost wrote:
> > The GuC must receive requests in the order submitted for contexts in a
> > parent-child relationship to function correctly. To ensure this, insert
> > a submit fence between the current request and last request submitted
> > for requests / contexts in a parent child relationship. This is
> > conceptually similar to a single timeline.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: John Harrison <John.C.Harrison@Intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> >  drivers/gpu/drm/i915/gt/intel_context.h       |   5 +
> >  drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   3 +-
> >  drivers/gpu/drm/i915/i915_request.c           | 120 ++++++++++++++----
> >  5 files changed, 105 insertions(+), 28 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index bb4c14656067..98ef2d0f7a39 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -487,6 +487,8 @@ void intel_context_fini(struct intel_context *ce)
> >  {
> >  	struct intel_context *child, *next;
> >  
> > +	if (ce->last_rq)
> > +		i915_request_put(ce->last_rq);
> >  	if (ce->timeline)
> >  		intel_timeline_put(ce->timeline);
> >  	i915_vm_put(ce->vm);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index 7ce3b3d2edb7..a302599e436a 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -60,6 +60,11 @@ intel_context_to_parent(struct intel_context *ce)
> >  	return intel_context_is_child(ce) ? ce->parent : ce;
> >  }
> >  
> > +static inline bool intel_context_is_parallel(struct intel_context *ce)
> > +{
> > +	return intel_context_is_child(ce) || intel_context_is_parent(ce);
> > +}
> > +
> >  void intel_context_bind_parent_child(struct intel_context *parent,
> >  				     struct intel_context *child);
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 9665cb31bab0..f4fc81f64921 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -225,6 +225,9 @@ struct intel_context {
> >  	 */
> >  	u8 guc_prio;
> >  	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> > +
> > +	/* Last request submitted on a parent */
> > +	struct i915_request *last_rq;
> >  };
> >  
> >  #endif /* __INTEL_CONTEXT_TYPES__ */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index d1d4a1e59e8d..1cb382f7d79d 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -820,8 +820,7 @@ static inline int rq_prio(const struct i915_request *rq)
> >  
> >  static inline bool is_multi_lrc_rq(struct i915_request *rq)
> >  {
> > -	return intel_context_is_child(rq->context) ||
> > -		intel_context_is_parent(rq->context);
> > +	return intel_context_is_parallel(rq->context);
> >  }
> >  
> >  /*
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index ce446716d092..2e51c8999088 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -1546,36 +1546,62 @@ i915_request_await_object(struct i915_request *to,
> >  	return ret;
> >  }
> >  
> > +static inline bool is_parallel_rq(struct i915_request *rq)
> > +{
> > +	return intel_context_is_parallel(rq->context);
> > +}
> > +
> > +static inline struct intel_context *request_to_parent(struct i915_request *rq)
> > +{
> > +	return intel_context_to_parent(rq->context);
> > +}
> > +
> >  static struct i915_request *
> > -__i915_request_add_to_timeline(struct i915_request *rq)
> > +__i915_request_ensure_parallel_ordering(struct i915_request *rq,
> > +					struct intel_timeline *timeline)
> >  {
> > -	struct intel_timeline *timeline = i915_request_timeline(rq);
> >  	struct i915_request *prev;
> >  
> > -	/*
> > -	 * Dependency tracking and request ordering along the timeline
> > -	 * is special cased so that we can eliminate redundant ordering
> > -	 * operations while building the request (we know that the timeline
> > -	 * itself is ordered, and here we guarantee it).
> > -	 *
> > -	 * As we know we will need to emit tracking along the timeline,
> > -	 * we embed the hooks into our request struct -- at the cost of
> > -	 * having to have specialised no-allocation interfaces (which will
> > -	 * be beneficial elsewhere).
> > -	 *
> > -	 * A second benefit to open-coding i915_request_await_request is
> > -	 * that we can apply a slight variant of the rules specialised
> > -	 * for timelines that jump between engines (such as virtual engines).
> > -	 * If we consider the case of virtual engine, we must emit a dma-fence
> > -	 * to prevent scheduling of the second request until the first is
> > -	 * complete (to maximise our greedy late load balancing) and this
> > -	 * precludes optimising to use semaphores serialisation of a single
> > -	 * timeline across engines.
> > -	 */
> > +	GEM_BUG_ON(!is_parallel_rq(rq));
> > +
> > +	prev = request_to_parent(rq)->last_rq;
> > +	if (prev) {
> > +		if (!__i915_request_is_complete(prev)) {
> > +			i915_sw_fence_await_sw_fence(&rq->submit,
> > +						     &prev->submit,
> > +						     &rq->submitq);
> > +
> > +			if (rq->engine->sched_engine->schedule)
> > +				__i915_sched_node_add_dependency(&rq->sched,
> > +								 &prev->sched,
> > +								 &rq->dep,
> > +								 0);
> > +		}
> > +		i915_request_put(prev);
> > +	}
> > +
> > +	request_to_parent(rq)->last_rq = i915_request_get(rq);
> > +
> > +	return to_request(__i915_active_fence_set(&timeline->last_request,
> > +						  &rq->fence));
> > +}
> > +
> > +static struct i915_request *
> > +__i915_request_ensure_ordering(struct i915_request *rq,
> > +			       struct intel_timeline *timeline)
> > +{
> > +	struct i915_request *prev;
> > +
> > +	GEM_BUG_ON(is_parallel_rq(rq));
> > +
> >  	prev = to_request(__i915_active_fence_set(&timeline->last_request,
> >  						  &rq->fence));
> > +
> >  	if (prev && !__i915_request_is_complete(prev)) {
> >  		bool uses_guc = intel_engine_uses_guc(rq->engine);
> > +		bool pow2 = is_power_of_2(READ_ONCE(prev->engine)->mask |
> > +					  rq->engine->mask);
> > +		bool same_context = prev->context == rq->context;
> >  
> >  		/*
> >  		 * The requests are supposed to be kept in order. However,
> > @@ -1583,13 +1609,11 @@ __i915_request_add_to_timeline(struct i915_request *rq)
> >  		 * is used as a barrier for external modification to this
> >  		 * context.
> >  		 */
> > -		GEM_BUG_ON(prev->context == rq->context &&
> > +		GEM_BUG_ON(same_context &&
> >  			   i915_seqno_passed(prev->fence.seqno,
> >  					     rq->fence.seqno));
> >  
> > -		if ((!uses_guc &&
> > -		     is_power_of_2(READ_ONCE(prev->engine)->mask | rq->engine->mask)) ||
> > -		    (uses_guc && prev->context == rq->context))
> > +		if ((same_context && uses_guc) || (!uses_guc && pow2))
> >  			i915_sw_fence_await_sw_fence(&rq->submit,
> >  						     &prev->submit,
> >  						     &rq->submitq);
> > @@ -1604,6 +1628,50 @@ __i915_request_add_to_timeline(struct i915_request *rq)
> >  							 0);
> >  	}
> >  
> > +	return prev;
> > +}
> > +
> > +static struct i915_request *
> > +__i915_request_add_to_timeline(struct i915_request *rq)
> > +{
> > +	struct intel_timeline *timeline = i915_request_timeline(rq);
> > +	struct i915_request *prev;
> > +
> > +	/*
> > +	 * Dependency tracking and request ordering along the timeline
> > +	 * is special cased so that we can eliminate redundant ordering
> > +	 * operations while building the request (we know that the timeline
> > +	 * itself is ordered, and here we guarantee it).
> > +	 *
> > +	 * As we know we will need to emit tracking along the timeline,
> > +	 * we embed the hooks into our request struct -- at the cost of
> > +	 * having to have specialised no-allocation interfaces (which will
> > +	 * be beneficial elsewhere).
> > +	 *
> > +	 * A second benefit to open-coding i915_request_await_request is
> > +	 * that we can apply a slight variant of the rules specialised
> > +	 * for timelines that jump between engines (such as virtual engines).
> > +	 * If we consider the case of virtual engine, we must emit a dma-fence
> > +	 * to prevent scheduling of the second request until the first is
> > +	 * complete (to maximise our greedy late load balancing) and this
> > +	 * precludes optimising to use semaphores serialisation of a single
> > +	 * timeline across engines.
> > +	 *
> 
> Can we put a big FIXME in here that this should all be resolved with a
> proper interface which passes the entire thing down to the backend?
> 
> Or is that no longer (or wasn't ever) the long-term goal?

I know you mentioned this in the past, but I really don't think this is all
that great of an idea, as it would be a pretty intrusive change and I'm not
sure what the real benefit is.

However, when we move to the DRM scheduler this can be dropped because the
ordering of jobs is guaranteed on a sched_entity.

Matt

> -Daniel
> 
> > +	 * We do not order parallel submission requests on the timeline as each
> > +	 * parallel submission context has its own timeline and the ordering
> > +	 * rules for parallel requests are that they must be submitted in the
> > +	 * order received from the execbuf IOCTL. So rather than using the
> > +	 * timeline, we store a pointer to the last request submitted in the
> > +	 * relationship in the gem context and insert a submission fence
> > +	 * between that request and the request passed into this function, or
> > +	 * alternatively we use the completion fence if the gem context has a
> > +	 * single timeline and this is the first submission of an execbuf IOCTL.
> > +	 */
> > +	if (likely(!is_parallel_rq(rq)))
> > +		prev = __i915_request_ensure_ordering(rq, timeline);
> > +	else
> > +		prev = __i915_request_ensure_parallel_ordering(rq, timeline);
> > +
> >  	/*
> >  	 * Make sure that no request gazumped us - if it was allocated after
> >  	 * our i915_request_alloc() and called __i915_request_add() before
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 39/46] drm/i915: Force parallel contexts to use copy engine for reloc
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 39/46] drm/i915: Force parallel contexts to use copy engine for reloc Matthew Brost
@ 2021-08-09 16:39   ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 16:39 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:36PM -0700, Matthew Brost wrote:
> Submitting to a subset of hardware contexts is not allowed, so use the
> copy engine for GPU relocations when using a parallel context.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Luckily I just pushed the patches to delete all this, so you can too.
-Daniel

> ---
>  drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index b224b28530d1..b6143973ac67 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -1386,7 +1386,8 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
>  	if (err)
>  		goto err_unmap;
>  
> -	if (engine == eb->context->engine) {
> +	if (engine == eb->context->engine &&
> +	    !intel_context_is_parallel(eb->context)) {
>  		rq = i915_request_create(eb->context);
>  	} else {
>  		struct intel_context *ce = eb->reloc_context;
> @@ -1483,7 +1484,8 @@ static u32 *reloc_gpu(struct i915_execbuffer *eb,
>  		if (eb_use_cmdparser(eb))
>  			return ERR_PTR(-EWOULDBLOCK);
>  
> -		if (!reloc_can_use_engine(engine)) {
> +		if (!reloc_can_use_engine(engine) ||
> +		    intel_context_is_parallel(eb->context)) {
>  			engine = engine->gt->engine_class[COPY_ENGINE_CLASS][0];
>  			if (!engine)
>  				return ERR_PTR(-ENODEV);
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 40/46] drm/i915: Multi-batch execbuffer2
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 40/46] drm/i915: Multi-batch execbuffer2 Matthew Brost
@ 2021-08-09 17:02   ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 17:02 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:37PM -0700, Matthew Brost wrote:
> For contexts with width set to two or more, we add a mode to execbuf2
> which implies there are N batch buffers in the buffer list, each of
> which will be sent to one of the engines from the engine map array
> (I915_CONTEXT_PARAM_ENGINES, I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT).
> 
> Those N batches can either be the first N or the last N objects in the list,
> as controlled by the existing execbuffer2 flag.
> 
> The N batches will be submitted to consecutive engines from the previously
> configured allowed engine array starting at index 0.
> 
> Input and output fences are fully supported, with the latter getting
> signalled when all batch buffers have completed.
> 
> Last, it isn't safe for subsequent batches to touch any objects written
> to by a multi-BB submission until all the batches in that submission
> complete. As such, all batches in a multi-BB submission must be combined
> into a single composite fence and put into the dma reservation excl
> fence slot.
> 
> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

So either I've missed something, or this has the exact same deadlock issue
as the old submit fence, except it's all internal to the kmd.

Also, this is bad news (if I'm right about what's going on here).

- Between each batch submission we drop the dma_resv_locks on the objects.
  This can currently even happen due to relocations within a submission,
  but since we don't allow relocations on platforms with parallel
  submit/guc scheduler, this could be worked around.

- When the buffer is unlocked someone else could step in and do exactly
  what you say is not allowed, namely touch the object.

- The individual batch fences won't complete until the last one has
  finished, leading to a deadlock which might or might not get resolved by
  gpu reset code. Since the deadlock is on the submission side I'm
  assuming the answer is "it won't be resolved by gpu reset", but maybe
  you do have a "I'm stuck for too long, let's ragequit" timer in your
  state machine somewhere. Old bonded submit would be rescued by the
  hangcheck we readded at least because there it's all completely
  free-floating requests.

- ttm on dgpu makes this all substantially worse.

The fundamental fix is still to build up a single i915_request, go through
the execbuf flow once, and then split things up again in the backend. That
would also mean all your prep work to pull the execbuf prep step out of
do_execbuf() is a pure distraction.

I'm not yet fully understanding all the ordering rules drm/sched has, but
I don't think it will be any happier about this kind of submission model.

tldr; what do?

Cheers, Daniel
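
For readers trying to follow the excl fence handling described in the commit
message, a stripped-down sketch of the aggregation, conceptually what the
patch's setup_composite_fence() plus _i915_vma_move_to_active() below do
together (not the actual patch code; declarations and error handling elided):

	/* One fence per batch, collected as each i915_gem_do_execbuffer()
	 * call creates its request. */
	fences = kmalloc_array(num_batches, sizeof(*fences), GFP_KERNEL);
	for (i = 0; i < num_batches; i++)
		fences[i] = dma_fence_get(out_fences[i]);

	/* The array fence signals only once every batch has completed. */
	composite = dma_fence_array_create(num_batches, fences,
					   fence_context, seqno++,
					   false /* signal_on_any */);

	/* Written objects get the composite as their exclusive fence, so
	 * later submissions cannot touch them until all N batches are done. */
	dma_resv_add_excl_fence(obj->base.resv, &composite->base);
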
> ---
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 262 +++++++++++++++---
>  drivers/gpu/drm/i915/gt/intel_context.c       |   5 +
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +
>  drivers/gpu/drm/i915/i915_vma.c               |  13 +-
>  drivers/gpu/drm/i915/i915_vma.h               |  16 +-
>  5 files changed, 266 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index b6143973ac67..ecdb583cc2eb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -252,6 +252,9 @@ struct i915_execbuffer {
>  	struct eb_vma *batch; /** identity of the batch obj/vma */
>  	struct i915_vma *trampoline; /** trampoline used for chaining */
>  
> +	/** used for excl fence in dma_resv objects when > 1 BB submitted */
> +	struct dma_fence *composite_fence;
> +
>  	/* batch_index in vma list */
>  	unsigned int batch_index;
>  
> @@ -367,11 +370,6 @@ static int eb_create(struct i915_execbuffer *eb)
>  		eb->lut_size = -eb->buffer_count;
>  	}
>  
> -	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
> -		eb->batch_index = 0;
> -	else
> -		eb->batch_index = eb->args->buffer_count - 1;
> -
>  	return 0;
>  }
>  
> @@ -2241,7 +2239,7 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>  	return err;
>  }
>  
> -static int eb_move_to_gpu(struct i915_execbuffer *eb, bool first)
> +static int eb_move_to_gpu(struct i915_execbuffer *eb, bool first, bool last)
>  {
>  	const unsigned int count = eb->buffer_count;
>  	unsigned int i = count;
> @@ -2289,8 +2287,16 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb, bool first)
>  		}
>  
>  		if (err == 0)
> -			err = i915_vma_move_to_active(vma, eb->request,
> -						      flags | __EXEC_OBJECT_NO_RESERVE);
> +			err = _i915_vma_move_to_active(vma, eb->request,
> +						       flags | __EXEC_OBJECT_NO_RESERVE,
> +						       !last ?
> +						       NULL :
> +						       eb->composite_fence ?
> +						       eb->composite_fence :
> +						       &eb->request->fence,
> +						       eb->composite_fence ?
> +						       eb->composite_fence :
> +						       &eb->request->fence);
>  	}
>  
>  #ifdef CONFIG_MMU_NOTIFIER
> @@ -2528,14 +2534,14 @@ static int eb_parse(struct i915_execbuffer *eb)
>  }
>  
>  static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch,
> -		     bool first)
> +		     bool first, bool last)
>  {
>  	int err;
>  
>  	if (intel_context_nopreempt(eb->context))
>  		__set_bit(I915_FENCE_FLAG_NOPREEMPT, &eb->request->fence.flags);
>  
> -	err = eb_move_to_gpu(eb, first);
> +	err = eb_move_to_gpu(eb, first, last);
>  	if (err)
>  		return err;
>  
> @@ -2748,7 +2754,7 @@ eb_select_legacy_ring(struct i915_execbuffer *eb)
>  }
>  
>  static int
> -eb_select_engine(struct i915_execbuffer *eb)
> +eb_select_engine(struct i915_execbuffer *eb, unsigned int batch_number)
>  {
>  	struct intel_context *ce;
>  	unsigned int idx;
> @@ -2763,6 +2769,18 @@ eb_select_engine(struct i915_execbuffer *eb)
>  	if (IS_ERR(ce))
>  		return PTR_ERR(ce);
>  
> +	if (batch_number > 0) {
> +		struct intel_context *parent = ce;
> +
> +		GEM_BUG_ON(!intel_context_is_parent(parent));
> +
> +		for_each_child(parent, ce)
> +			if (!--batch_number)
> +				break;
> +		intel_context_put(parent);
> +		intel_context_get(ce);
> +	}
> +
>  	intel_gt_pm_get(ce->engine->gt);
>  
>  	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
> @@ -3155,13 +3173,49 @@ parse_execbuf2_extensions(struct drm_i915_gem_execbuffer2 *args,
>  				    eb);
>  }
>  
> +static int setup_composite_fence(struct i915_execbuffer *eb,
> +				 struct dma_fence **out_fence,
> +				 unsigned int num_batches)
> +{
> +	struct dma_fence_array *fence_array;
> +	struct dma_fence **fences = kmalloc(num_batches * sizeof(*fences),
> +					    GFP_KERNEL);
> +	struct intel_context *parent = intel_context_to_parent(eb->context);
> +	int i;
> +
> +	if (!fences)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < num_batches; ++i)
> +		fences[i] = out_fence[i];
> +
> +	fence_array = dma_fence_array_create(num_batches,
> +					     fences,
> +					     parent->fence_context,
> +					     ++parent->seqno,
> +					     false);
> +	if (!fence_array) {
> +		kfree(fences);
> +		return -ENOMEM;
> +	}
> +
> +	/* Move ownership to the dma_fence_array created above */
> +	for (i = 0; i < num_batches; ++i)
> +		dma_fence_get(fences[i]);
> +
> +	eb->composite_fence = &fence_array->base;
> +
> +	return 0;
> +}
> +
>  static int
>  i915_gem_do_execbuffer(struct drm_device *dev,
>  		       struct drm_file *file,
>  		       struct drm_i915_gem_execbuffer2 *args,
>  		       struct drm_i915_gem_exec_object2 *exec,
> -		       int batch_index,
> +		       unsigned int batch_index,
>  		       unsigned int num_batches,
> +		       unsigned int batch_number,
>  		       struct dma_fence *in_fence,
>  		       struct dma_fence *exec_fence,
>  		       struct dma_fence **out_fence)
> @@ -3170,6 +3224,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	struct i915_execbuffer eb;
>  	struct i915_vma *batch;
>  	int err;
> +	bool first = batch_number == 0;
> +	bool last = batch_number + 1 == num_batches;
>  
>  	BUILD_BUG_ON(__EXEC_INTERNAL_FLAGS & ~__I915_EXEC_ILLEGAL_FLAGS);
>  	BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS &
> @@ -3194,6 +3250,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	eb.batch_start_offset = args->batch_start_offset;
>  	eb.batch_len = args->batch_len;
>  	eb.trampoline = NULL;
> +	eb.composite_fence = NULL;
>  
>  	eb.fences = NULL;
>  	eb.num_fences = 0;
> @@ -3219,14 +3276,13 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	GEM_BUG_ON(!eb.lut_size);
>  
>  	eb.num_batches = num_batches;
> -	if (batch_index >= 0)
> -		eb.batch_index = batch_index;
> +	eb.batch_index = batch_index;
>  
>  	err = eb_select_context(&eb);
>  	if (unlikely(err))
>  		goto err_destroy;
>  
> -	err = eb_select_engine(&eb);
> +	err = eb_select_engine(&eb, batch_number);
>  	if (unlikely(err))
>  		goto err_context;
>  
> @@ -3275,6 +3331,23 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  			goto err_ext;
>  	}
>  
> +	if (out_fence) {
> +		/* Move ownership to caller (i915_gem_execbuffer2_ioctl) */
> +		out_fence[batch_number] = dma_fence_get(&eb.request->fence);
> +
> +		/*
> +		 * Need to create a composite fence (dma_fence_array,
> +		 * eb.composite_fence) for the excl fence of the dma_resv
> +		 * objects as each BB can write to the object. Since we create
> +		 */
> +		if (num_batches > 1 && last) {
> +			err = setup_composite_fence(&eb, out_fence,
> +						    num_batches);
> +			if (err < 0)
> +				goto err_request;
> +		}
> +	}
> +
>  	if (exec_fence) {
>  		err = i915_request_await_execution(eb.request,
>  						   exec_fence);
> @@ -3307,17 +3380,27 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  		intel_gt_buffer_pool_mark_active(eb.batch_pool, eb.request);
>  
>  	trace_i915_request_queue(eb.request, eb.batch_flags);
> -	err = eb_submit(&eb, batch, true);
> +	err = eb_submit(&eb, batch, first, last);
>  
>  err_request:
> +	if (last)
> +		set_bit(I915_FENCE_FLAG_SUBMIT_PARALLEL,
> +			&eb.request->fence.flags);
> +
>  	i915_request_get(eb.request);
>  	err = eb_request_add(&eb, err);
>  
>  	if (eb.fences)
>  		signal_fence_array(&eb);
>  
> -	if (!err && out_fence)
> -		*out_fence = dma_fence_get(&eb.request->fence);
> +	/*
> +	 * Ownership of the composite fence (dma_fence_array,
> +	 * eb.composite_fence) has been moved to the dma_resv objects these BBs
> +	 * write to in i915_vma_move_to_active. It is ok to release the creation
> +	 * reference of this fence now.
> +	 */
> +	if (eb.composite_fence)
> +		dma_fence_put(eb.composite_fence);
>  
>  	if (unlikely(eb.gem_context->syncobj)) {
>  		drm_syncobj_replace_fence(eb.gem_context->syncobj,
> @@ -3368,6 +3451,17 @@ static bool check_buffer_count(size_t count)
>  	return !(count < 1 || count > INT_MAX || count > SIZE_MAX / sz - 1);
>  }
>  
> +/* Release fences from the dma_fence_get in i915_gem_do_execbuffer. */
> +static inline void put_out_fences(struct dma_fence **out_fences,
> +				  unsigned int num_batches)
> +{
> +	int i;
> +
> +	for (i = 0; i < num_batches; ++i)
> +		if (out_fences[i])
> +			dma_fence_put(out_fences[i]);
> +}
> +
>  int
>  i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
>  			   struct drm_file *file)
> @@ -3375,13 +3469,16 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
>  	struct drm_i915_private *i915 = to_i915(dev);
>  	struct drm_i915_gem_execbuffer2 *args = data;
>  	struct drm_i915_gem_exec_object2 *exec2_list;
> -	struct dma_fence **out_fence_p = NULL;
> -	struct dma_fence *out_fence = NULL;
> +	struct dma_fence **out_fences = NULL;
>  	struct dma_fence *in_fence = NULL;
>  	struct dma_fence *exec_fence = NULL;
>  	int out_fence_fd = -1;
>  	const size_t count = args->buffer_count;
>  	int err;
> +	struct i915_gem_context *ctx;
> +	struct intel_context *parent = NULL;
> +	unsigned int num_batches = 1, i;
> +	bool is_parallel = false;
>  
>  	if (!check_buffer_count(count)) {
>  		drm_dbg(&i915->drm, "execbuf2 with %zd buffers\n", count);
> @@ -3404,10 +3501,39 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
>  	if (err)
>  		return err;
>  
> +	ctx = i915_gem_context_lookup(file->driver_priv, args->rsvd1);
> +	if (IS_ERR(ctx))
> +		return PTR_ERR(ctx);
> +
> +	if (i915_gem_context_user_engines(ctx)) {
> +		parent = i915_gem_context_get_engine(ctx, args->flags &
> +						     I915_EXEC_RING_MASK);
> +		if (IS_ERR(parent)) {
> +			err = PTR_ERR(parent);
> +			goto err_context;
> +		}
> +
> +		if (intel_context_is_parent(parent)) {
> +			if (args->batch_len) {
> +				err = -EINVAL;
> +				goto err_context;
> +			}
> +
> +			num_batches = parent->guc_number_children + 1;
> +			if (num_batches > count) {
> +				i915_gem_context_put(ctx);
> +				goto err_parent;
> +			}
> +			is_parallel = true;
> +		}
> +	}
> +
>  	if (args->flags & I915_EXEC_FENCE_IN) {
>  		in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
> -		if (!in_fence)
> -			return -EINVAL;
> +		if (!in_fence) {
> +			err = -EINVAL;
> +			goto err_parent;
> +		}
>  	}
>  
>  	if (args->flags & I915_EXEC_FENCE_SUBMIT) {
> @@ -3423,13 +3549,25 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
>  		}
>  	}
>  
> -	if (args->flags & I915_EXEC_FENCE_OUT) {
> -		out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
> -		if (out_fence_fd < 0) {
> -			err = out_fence_fd;
> +	/*
> +	 * We always allocate out fences when doing multi-BB submission as
> +	 * this is required to create an excl fence for any dma buf objects
> +	 * these BBs touch.
> +	 */
> +	if (args->flags & I915_EXEC_FENCE_OUT || is_parallel) {
> +		out_fences = kcalloc(num_batches, sizeof(*out_fences),
> +				     GFP_KERNEL);
> +		if (!out_fences) {
> +			err = -ENOMEM;
>  			goto err_out_fence;
>  		}
> -		out_fence_p = &out_fence;
> +		if (args->flags & I915_EXEC_FENCE_OUT) {
> +			out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
> +			if (out_fence_fd < 0) {
> +				err = out_fence_fd;
> +				goto err_out_fence;
> +			}
> +		}
>  	}
>  
>  	/* Allocate extra slots for use by the command parser */
> @@ -3449,8 +3587,35 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
>  		goto err_copy;
>  	}
>  
> -	err = i915_gem_do_execbuffer(dev, file, args, exec2_list, -1, 1,
> -				     in_fence, exec_fence, out_fence_p);
> +	/*
> +	 * Downstream submission code expects all parallel submissions to occur
> +	 * in intel_context sequence, thus only 1 submission can happen at a
> +	 * time.
> +	 */
> +	if (is_parallel)
> +		mutex_lock(&parent->parallel_submit);
> +
> +	err = i915_gem_do_execbuffer(dev, file, args, exec2_list,
> +				     args->flags & I915_EXEC_BATCH_FIRST ?
> +				     0 : count - num_batches,
> +				     num_batches,
> +				     0,
> +				     in_fence,
> +				     exec_fence,
> +				     out_fences);
> +
> +	for (i = 1; err == 0 && i < num_batches; i++)
> +		err = i915_gem_do_execbuffer(dev, file, args, exec2_list,
> +					     args->flags & I915_EXEC_BATCH_FIRST ?
> +					     i : count - num_batches + i,
> +					     num_batches,
> +					     i,
> +					     NULL,
> +					     NULL,
> +					     out_fences);
> +
> +	if (is_parallel)
> +		mutex_unlock(&parent->parallel_submit);
>  
>  	/*
>  	 * Now that we have begun execution of the batchbuffer, we ignore
> @@ -3491,8 +3656,31 @@ end:;
>  	}
>  
>  	if (!err && out_fence_fd >= 0) {
> +		struct dma_fence *out_fence = NULL;
>  		struct sync_file *sync_fence;
>  
> +		if (is_parallel) {
> +			struct dma_fence_array *fence_array;
> +
> +			/*
> +			 * The dma_fence_array now owns out_fences (from
> +			 * dma_fence_get in i915_gem_do_execbuffer) assuming
> +			 * successful creation of dma_fence_array.
> +			 */
> +			fence_array = dma_fence_array_create(num_batches,
> +							     out_fences,
> +							     parent->fence_context,
> +							     ++parent->seqno,
> +							     false);
> +			if (!fence_array)
> +				goto put_out_fences;
> +
> +			out_fence = &fence_array->base;
> +			out_fences = NULL;
> +		} else {
> +			out_fence = out_fences[0];
> +		}
> +
>  		sync_fence = sync_file_create(out_fence);
>  		if (sync_fence) {
>  			fd_install(out_fence_fd, sync_fence->file);
> @@ -3500,9 +3688,15 @@ end:;
>  			args->rsvd2 |= (u64)out_fence_fd << 32;
>  			out_fence_fd = -1;
>  		}
> +
> +		/*
> +		 * The sync_file now owns out_fence, drop the creation
> +		 * reference.
> +		 */
>  		dma_fence_put(out_fence);
> -	} else if (out_fence) {
> -		dma_fence_put(out_fence);
> +	} else if (out_fences) {
> +put_out_fences:
> +		put_out_fences(out_fences, num_batches);
>  	}
>  
>  	args->flags &= ~__I915_EXEC_UNKNOWN_FLAGS;
> @@ -3513,9 +3707,15 @@ end:;
>  	if (out_fence_fd >= 0)
>  		put_unused_fd(out_fence_fd);
>  err_out_fence:
> +	kfree(out_fences);
>  	dma_fence_put(exec_fence);
>  err_exec_fence:
>  	dma_fence_put(in_fence);
> +err_parent:
> +	if (parent)
> +		intel_context_put(parent);
> +err_context:
> +	i915_gem_context_put(ctx);
>  
>  	return err;
>  }
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index f396993374da..2c07f5f22c94 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -472,6 +472,9 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>  	ce->guc_id = GUC_INVALID_LRC_ID;
>  	INIT_LIST_HEAD(&ce->guc_id_link);
>  
> +	mutex_init(&ce->parallel_submit);
> +	ce->fence_context = dma_fence_context_alloc(1);
> +
>  	/*
>  	 * Initialize fence to be complete as this is expected to be complete
>  	 * unless there is a pending schedule disable outstanding.
> @@ -498,6 +501,8 @@ void intel_context_fini(struct intel_context *ce)
>  		for_each_child_safe(ce, child, next)
>  			intel_context_put(child);
>  
> +	mutex_destroy(&ce->parallel_submit);
> +
>  	mutex_destroy(&ce->pin_mutex);
>  	i915_active_fini(&ce->active);
>  }
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index fdc4890335b7..8af9ace4c052 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -235,6 +235,15 @@ struct intel_context {
>  
>  	/* Last request submitted on a parent */
>  	struct i915_request *last_rq;
> +
> +	/* Parallel submission mutex */
> +	struct mutex parallel_submit;
> +
> +	/* Fence context for parallel submission */
> +	u64 fence_context;
> +
> +	/* Seqno for parallel submission */
> +	u32 seqno;
>  };
>  
>  #endif /* __INTEL_CONTEXT_TYPES__ */
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index 4b7fc4647e46..ed4e790276a9 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -1234,9 +1234,11 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>  	return i915_active_add_request(&vma->active, rq);
>  }
>  
> -int i915_vma_move_to_active(struct i915_vma *vma,
> -			    struct i915_request *rq,
> -			    unsigned int flags)
> +int _i915_vma_move_to_active(struct i915_vma *vma,
> +			     struct i915_request *rq,
> +			     unsigned int flags,
> +			     struct dma_fence *shared_fence,
> +			     struct dma_fence *excl_fence)
>  {
>  	struct drm_i915_gem_object *obj = vma->obj;
>  	int err;
> @@ -1257,7 +1259,7 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>  			intel_frontbuffer_put(front);
>  		}
>  
> -		dma_resv_add_excl_fence(vma->resv, &rq->fence);
> +		dma_resv_add_excl_fence(vma->resv, excl_fence);
>  		obj->write_domain = I915_GEM_DOMAIN_RENDER;
>  		obj->read_domains = 0;
>  	} else {
> @@ -1267,7 +1269,8 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>  				return err;
>  		}
>  
> -		dma_resv_add_shared_fence(vma->resv, &rq->fence);
> +		if (shared_fence)
> +			dma_resv_add_shared_fence(vma->resv, shared_fence);
>  		obj->write_domain = 0;
>  	}
>  
> diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> index ed69f66c7ab0..a36da651dbff 100644
> --- a/drivers/gpu/drm/i915/i915_vma.h
> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -57,9 +57,19 @@ static inline bool i915_vma_is_active(const struct i915_vma *vma)
>  
>  int __must_check __i915_vma_move_to_active(struct i915_vma *vma,
>  					   struct i915_request *rq);
> -int __must_check i915_vma_move_to_active(struct i915_vma *vma,
> -					 struct i915_request *rq,
> -					 unsigned int flags);
> +
> +int __must_check _i915_vma_move_to_active(struct i915_vma *vma,
> +					  struct i915_request *rq,
> +					  unsigned int flags,
> +					  struct dma_fence *shared_fence,
> +					  struct dma_fence *excl_fence);
> +static inline int __must_check
> +i915_vma_move_to_active(struct i915_vma *vma,
> +			struct i915_request *rq,
> +			unsigned int flags)
> +{
> +	return _i915_vma_move_to_active(vma, rq, flags, &rq->fence, &rq->fence);
> +}
>  
>  #define __i915_vma_flags(v) ((unsigned long *)&(v)->flags.counter)
>  
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 23/46] drm/i915/guc: Insert submit fences between requests in parent-child relationship
  2021-08-09 16:39     ` Matthew Brost
@ 2021-08-09 17:03       ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 17:03 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 04:39:48PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 06:32:42PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:20PM -0700, Matthew Brost wrote:
> > > The GuC must receive requests in the order submitted for contexts in a
> > > parent-child relationship to function correctly. To ensure this, insert
> > > a submit fence between the current request and last request submitted
> > > for requests / contexts in a parent child relationship. This is
> > > conceptually similar to a single timeline.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Cc: John Harrison <John.C.Harrison@Intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> > >  drivers/gpu/drm/i915/gt/intel_context.h       |   5 +
> > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   3 +-
> > >  drivers/gpu/drm/i915/i915_request.c           | 120 ++++++++++++++----
> > >  5 files changed, 105 insertions(+), 28 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > index bb4c14656067..98ef2d0f7a39 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > @@ -487,6 +487,8 @@ void intel_context_fini(struct intel_context *ce)
> > >  {
> > >  	struct intel_context *child, *next;
> > >  
> > > +	if (ce->last_rq)
> > > +		i915_request_put(ce->last_rq);
> > >  	if (ce->timeline)
> > >  		intel_timeline_put(ce->timeline);
> > >  	i915_vm_put(ce->vm);
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > index 7ce3b3d2edb7..a302599e436a 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > @@ -60,6 +60,11 @@ intel_context_to_parent(struct intel_context *ce)
> > >  	return intel_context_is_child(ce) ? ce->parent : ce;
> > >  }
> > >  
> > > +static inline bool intel_context_is_parallel(struct intel_context *ce)
> > > +{
> > > +	return intel_context_is_child(ce) || intel_context_is_parent(ce);
> > > +}
> > > +
> > >  void intel_context_bind_parent_child(struct intel_context *parent,
> > >  				     struct intel_context *child);
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > index 9665cb31bab0..f4fc81f64921 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > @@ -225,6 +225,9 @@ struct intel_context {
> > >  	 */
> > >  	u8 guc_prio;
> > >  	u32 guc_prio_count[GUC_CLIENT_PRIORITY_NUM];
> > > +
> > > +	/* Last request submitted on a parent */
> > > +	struct i915_request *last_rq;
> > >  };
> > >  
> > >  #endif /* __INTEL_CONTEXT_TYPES__ */
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index d1d4a1e59e8d..1cb382f7d79d 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -820,8 +820,7 @@ static inline int rq_prio(const struct i915_request *rq)
> > >  
> > >  static inline bool is_multi_lrc_rq(struct i915_request *rq)
> > >  {
> > > -	return intel_context_is_child(rq->context) ||
> > > -		intel_context_is_parent(rq->context);
> > > +	return intel_context_is_parallel(rq->context);
> > >  }
> > >  
> > >  /*
> > > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > > index ce446716d092..2e51c8999088 100644
> > > --- a/drivers/gpu/drm/i915/i915_request.c
> > > +++ b/drivers/gpu/drm/i915/i915_request.c
> > > @@ -1546,36 +1546,62 @@ i915_request_await_object(struct i915_request *to,
> > >  	return ret;
> > >  }
> > >  
> > > +static inline bool is_parallel_rq(struct i915_request *rq)
> > > +{
> > > +	return intel_context_is_parallel(rq->context);
> > > +}
> > > +
> > > +static inline struct intel_context *request_to_parent(struct i915_request *rq)
> > > +{
> > > +	return intel_context_to_parent(rq->context);
> > > +}
> > > +
> > >  static struct i915_request *
> > > -__i915_request_add_to_timeline(struct i915_request *rq)
> > > +__i915_request_ensure_parallel_ordering(struct i915_request *rq,
> > > +					struct intel_timeline *timeline)
> > >  {
> > > -	struct intel_timeline *timeline = i915_request_timeline(rq);
> > >  	struct i915_request *prev;
> > >  
> > > -	/*
> > > -	 * Dependency tracking and request ordering along the timeline
> > > -	 * is special cased so that we can eliminate redundant ordering
> > > -	 * operations while building the request (we know that the timeline
> > > -	 * itself is ordered, and here we guarantee it).
> > > -	 *
> > > -	 * As we know we will need to emit tracking along the timeline,
> > > -	 * we embed the hooks into our request struct -- at the cost of
> > > -	 * having to have specialised no-allocation interfaces (which will
> > > -	 * be beneficial elsewhere).
> > > -	 *
> > > -	 * A second benefit to open-coding i915_request_await_request is
> > > -	 * that we can apply a slight variant of the rules specialised
> > > -	 * for timelines that jump between engines (such as virtual engines).
> > > -	 * If we consider the case of virtual engine, we must emit a dma-fence
> > > -	 * to prevent scheduling of the second request until the first is
> > > -	 * complete (to maximise our greedy late load balancing) and this
> > > -	 * precludes optimising to use semaphores serialisation of a single
> > > -	 * timeline across engines.
> > > -	 */
> > > +	GEM_BUG_ON(!is_parallel_rq(rq));
> > > +
> > > +	prev = request_to_parent(rq)->last_rq;
> > > +	if (prev) {
> > > +		if (!__i915_request_is_complete(prev)) {
> > > +			i915_sw_fence_await_sw_fence(&rq->submit,
> > > +						     &prev->submit,
> > > +						     &rq->submitq);
> > > +
> > > +			if (rq->engine->sched_engine->schedule)
> > > +				__i915_sched_node_add_dependency(&rq->sched,
> > > +								 &prev->sched,
> > > +								 &rq->dep,
> > > +								 0);
> > > +		}
> > > +		i915_request_put(prev);
> > > +	}
> > > +
> > > +	request_to_parent(rq)->last_rq = i915_request_get(rq);
> > > +
> > > +	return to_request(__i915_active_fence_set(&timeline->last_request,
> > > +						  &rq->fence));
> > > +}
> > > +
> > > +static struct i915_request *
> > > +__i915_request_ensure_ordering(struct i915_request *rq,
> > > +			       struct intel_timeline *timeline)
> > > +{
> > > +	struct i915_request *prev;
> > > +
> > > +	GEM_BUG_ON(is_parallel_rq(rq));
> > > +
> > >  	prev = to_request(__i915_active_fence_set(&timeline->last_request,
> > >  						  &rq->fence));
> > > +
> > >  	if (prev && !__i915_request_is_complete(prev)) {
> > >  		bool uses_guc = intel_engine_uses_guc(rq->engine);
> > > +		bool pow2 = is_power_of_2(READ_ONCE(prev->engine)->mask |
> > > +					  rq->engine->mask);
> > > +		bool same_context = prev->context == rq->context;
> > >  
> > >  		/*
> > >  		 * The requests are supposed to be kept in order. However,
> > > @@ -1583,13 +1609,11 @@ __i915_request_add_to_timeline(struct i915_request *rq)
> > >  		 * is used as a barrier for external modification to this
> > >  		 * context.
> > >  		 */
> > > -		GEM_BUG_ON(prev->context == rq->context &&
> > > +		GEM_BUG_ON(same_context &&
> > >  			   i915_seqno_passed(prev->fence.seqno,
> > >  					     rq->fence.seqno));
> > >  
> > > -		if ((!uses_guc &&
> > > -		     is_power_of_2(READ_ONCE(prev->engine)->mask | rq->engine->mask)) ||
> > > -		    (uses_guc && prev->context == rq->context))
> > > +		if ((same_context && uses_guc) || (!uses_guc && pow2))
> > >  			i915_sw_fence_await_sw_fence(&rq->submit,
> > >  						     &prev->submit,
> > >  						     &rq->submitq);
> > > @@ -1604,6 +1628,50 @@ __i915_request_add_to_timeline(struct i915_request *rq)
> > >  							 0);
> > >  	}
> > >  
> > > +	return prev;
> > > +}
> > > +
> > > +static struct i915_request *
> > > +__i915_request_add_to_timeline(struct i915_request *rq)
> > > +{
> > > +	struct intel_timeline *timeline = i915_request_timeline(rq);
> > > +	struct i915_request *prev;
> > > +
> > > +	/*
> > > +	 * Dependency tracking and request ordering along the timeline
> > > +	 * is special cased so that we can eliminate redundant ordering
> > > +	 * operations while building the request (we know that the timeline
> > > +	 * itself is ordered, and here we guarantee it).
> > > +	 *
> > > +	 * As we know we will need to emit tracking along the timeline,
> > > +	 * we embed the hooks into our request struct -- at the cost of
> > > +	 * having to have specialised no-allocation interfaces (which will
> > > +	 * be beneficial elsewhere).
> > > +	 *
> > > +	 * A second benefit to open-coding i915_request_await_request is
> > > +	 * that we can apply a slight variant of the rules specialised
> > > +	 * for timelines that jump between engines (such as virtual engines).
> > > +	 * If we consider the case of virtual engine, we must emit a dma-fence
> > > +	 * to prevent scheduling of the second request until the first is
> > > +	 * complete (to maximise our greedy late load balancing) and this
> > > +	 * precludes optimising to use semaphores serialisation of a single
> > > +	 * timeline across engines.
> > > +	 *
> > 
> > Can we put a big FIXME in here that this should all be resolved with a
> > proper interface which passes the entire thing down to the backend?
> > 
> > Or is that no longer (or wasn't ever) the long-term goal?
> 
> I know you mentioned this in the past, but I really don't think this is
> all that great of an idea as it would be a pretty intrusive change, and
> I'm not sure what the real benefit is.
> 
> However, when we move to the DRM scheduler this can be dropped because
> of the ordering of jobs on a sched_entity.

See further down. Your current design deadlocks - or well, can deadlock. I
think at least.
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > > +	 * We do not order parallel submission requests on the timeline as each
> > > +	 * parallel submission context has its own timeline and the ordering
> > > +	 * rules for parallel requests are that they must be submitted in the
> > > +	 * order received from the execbuf IOCTL. So rather than using the
> > > +	 * timeline we store a pointer to last request submitted in the
> > > +	 * relationship in the gem context and insert a submission fence
> > > +	 * between that request and request passed into this function or
> > > +	 * alternatively we use completion fence if gem context has a single
> > > +	 * timeline and this is the first submission of an execbuf IOCTL.
> > > +	 */
> > > +	if (likely(!is_parallel_rq(rq)))
> > > +		prev = __i915_request_ensure_ordering(rq, timeline);
> > > +	else
> > > +		prev = __i915_request_ensure_parallel_ordering(rq, timeline);
> > > +
> > >  	/*
> > >  	 * Make sure that no request gazumped us - if it was allocated after
> > >  	 * our i915_request_alloc() and called __i915_request_add() before
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 41/46] drm/i915: Eliminate unnecessary VMA calls for multi-BB submission
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 41/46] drm/i915: Eliminate unnecessary VMA calls for multi-BB submission Matthew Brost
@ 2021-08-09 17:07   ` Daniel Vetter
  2021-08-09 17:12     ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 17:07 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:38PM -0700, Matthew Brost wrote:
> Certain VMA functions in the execbuf IOCTL only need to be called on the
> first or last BB of a multi-BB submission: eb_relocate() on the first

eb_relocate should be outright disallowed on multi-lrc execbuf ioctl.
There are no users of that left, and it does substantially simplify the
entire locking problem if we outright disallow that.
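
Purely illustrative sketch of such a rejection in i915_gem_execbuffer2_ioctl(),
reusing the is_parallel flag and exec2_list array this series adds (not part of
this patch, just what the check could look like; error unwinding omitted):

	if (is_parallel) {
		/* Refuse relocations outright for multi-BB submissions. */
		for (i = 0; i < args->buffer_count; i++) {
			if (exec2_list[i].relocation_count)
				return -EINVAL;
		}
	}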

> and eb_release_vmas() on the last. Doing so will save CPU / GPU cycles.

Yah for our mix of pin/unpin vs dma_resv_lock/unlock. Now with the current
unpin design this move is ok, but we want/need to switch vma over to
dma_resv_lock. And then it gets really nasty, because you run into a ton
of problems.

The more I read this the less I like this :-/
-Daniel

> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 127 +++++++++++-------
>  .../i915/gem/selftests/i915_gem_execbuffer.c  |  14 +-
>  2 files changed, 83 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index ecdb583cc2eb..70784779872a 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -270,7 +270,7 @@ struct i915_execbuffer {
>  	/** list of vma that have execobj.relocation_count */
>  	struct list_head relocs;
>  
> -	struct i915_gem_ww_ctx ww;
> +	struct i915_gem_ww_ctx *ww;
>  
>  	/**
>  	 * Track the most recently used object for relocations, as we
> @@ -448,7 +448,7 @@ eb_pin_vma(struct i915_execbuffer *eb,
>  		pin_flags |= PIN_GLOBAL;
>  
>  	/* Attempt to reuse the current location if available */
> -	err = i915_vma_pin_ww(vma, &eb->ww, 0, 0, pin_flags);
> +	err = i915_vma_pin_ww(vma, eb->ww, 0, 0, pin_flags);
>  	if (err == -EDEADLK)
>  		return err;
>  
> @@ -457,11 +457,11 @@ eb_pin_vma(struct i915_execbuffer *eb,
>  			return err;
>  
>  		/* Failing that pick any _free_ space if suitable */
> -		err = i915_vma_pin_ww(vma, &eb->ww,
> -					     entry->pad_to_size,
> -					     entry->alignment,
> -					     eb_pin_flags(entry, ev->flags) |
> -					     PIN_USER | PIN_NOEVICT);
> +		err = i915_vma_pin_ww(vma, eb->ww,
> +				      entry->pad_to_size,
> +				      entry->alignment,
> +				      eb_pin_flags(entry, ev->flags) |
> +				      PIN_USER | PIN_NOEVICT);
>  		if (unlikely(err))
>  			return err;
>  	}
> @@ -643,9 +643,9 @@ static int eb_reserve_vma(struct i915_execbuffer *eb,
>  			return err;
>  	}
>  
> -	err = i915_vma_pin_ww(vma, &eb->ww,
> -			   entry->pad_to_size, entry->alignment,
> -			   eb_pin_flags(entry, ev->flags) | pin_flags);
> +	err = i915_vma_pin_ww(vma, eb->ww,
> +			      entry->pad_to_size, entry->alignment,
> +			      eb_pin_flags(entry, ev->flags) | pin_flags);
>  	if (err)
>  		return err;
>  
> @@ -940,7 +940,7 @@ static int eb_lock_vmas(struct i915_execbuffer *eb)
>  		struct eb_vma *ev = &eb->vma[i];
>  		struct i915_vma *vma = ev->vma;
>  
> -		err = i915_gem_object_lock(vma->obj, &eb->ww);
> +		err = i915_gem_object_lock(vma->obj, eb->ww);
>  		if (err)
>  			return err;
>  	}
> @@ -1020,12 +1020,13 @@ eb_get_vma(const struct i915_execbuffer *eb, unsigned long handle)
>  	}
>  }
>  
> -static void eb_release_vmas(struct i915_execbuffer *eb, bool final)
> +static void eb_release_vmas(struct i915_execbuffer *eb, bool final,
> +			    bool unreserve)
>  {
>  	const unsigned int count = eb->buffer_count;
>  	unsigned int i;
>  
> -	for (i = 0; i < count; i++) {
> +	for (i = 0; unreserve && i < count; i++) {
>  		struct eb_vma *ev = &eb->vma[i];
>  		struct i915_vma *vma = ev->vma;
>  
> @@ -1237,7 +1238,7 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
>  		if (err)
>  			return ERR_PTR(err);
>  
> -		vma = i915_gem_object_ggtt_pin_ww(obj, &eb->ww, NULL, 0, 0,
> +		vma = i915_gem_object_ggtt_pin_ww(obj, eb->ww, NULL, 0, 0,
>  						  PIN_MAPPABLE |
>  						  PIN_NONBLOCK /* NOWARN */ |
>  						  PIN_NOEVICT);
> @@ -1361,7 +1362,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
>  	}
>  	eb->reloc_pool = NULL;
>  
> -	err = i915_gem_object_lock(pool->obj, &eb->ww);
> +	err = i915_gem_object_lock(pool->obj, eb->ww);
>  	if (err)
>  		goto err_pool;
>  
> @@ -1380,7 +1381,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
>  		goto err_unmap;
>  	}
>  
> -	err = i915_vma_pin_ww(batch, &eb->ww, 0, 0, PIN_USER | PIN_NONBLOCK);
> +	err = i915_vma_pin_ww(batch, eb->ww, 0, 0, PIN_USER | PIN_NONBLOCK);
>  	if (err)
>  		goto err_unmap;
>  
> @@ -1402,7 +1403,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
>  			eb->reloc_context = ce;
>  		}
>  
> -		err = intel_context_pin_ww(ce, &eb->ww);
> +		err = intel_context_pin_ww(ce, eb->ww);
>  		if (err)
>  			goto err_unpin;
>  
> @@ -2017,8 +2018,8 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  	}
>  
>  	/* We may process another execbuffer during the unlock... */
> -	eb_release_vmas(eb, false);
> -	i915_gem_ww_ctx_fini(&eb->ww);
> +	eb_release_vmas(eb, false, true);
> +	i915_gem_ww_ctx_fini(eb->ww);
>  
>  	if (rq) {
>  		/* nonblocking is always false */
> @@ -2062,7 +2063,7 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  		err = eb_reinit_userptr(eb);
>  
>  err_relock:
> -	i915_gem_ww_ctx_init(&eb->ww, true);
> +	i915_gem_ww_ctx_init(eb->ww, true);
>  	if (err)
>  		goto out;
>  
> @@ -2119,8 +2120,8 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  
>  err:
>  	if (err == -EDEADLK) {
> -		eb_release_vmas(eb, false);
> -		err = i915_gem_ww_ctx_backoff(&eb->ww);
> +		eb_release_vmas(eb, false, true);
> +		err = i915_gem_ww_ctx_backoff(eb->ww);
>  		if (!err)
>  			goto repeat_validate;
>  	}
> @@ -2152,7 +2153,7 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb,
>  	return err;
>  }
>  
> -static int eb_relocate_parse(struct i915_execbuffer *eb)
> +static int eb_relocate_parse(struct i915_execbuffer *eb, bool first)
>  {
>  	int err;
>  	struct i915_request *rq = NULL;
> @@ -2189,14 +2190,16 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>  	/* only throttle once, even if we didn't need to throttle */
>  	throttle = false;
>  
> -	err = eb_validate_vmas(eb);
> -	if (err == -EAGAIN)
> -		goto slow;
> -	else if (err)
> -		goto err;
> +	if (first) {
> +		err = eb_validate_vmas(eb);
> +		if (err == -EAGAIN)
> +			goto slow;
> +		else if (err)
> +			goto err;
> +	}
>  
>  	/* The objects are in their final locations, apply the relocations. */
> -	if (eb->args->flags & __EXEC_HAS_RELOC) {
> +	if (eb->args->flags & __EXEC_HAS_RELOC && first) {
>  		struct eb_vma *ev;
>  
>  		list_for_each_entry(ev, &eb->relocs, reloc_link) {
> @@ -2211,13 +2214,13 @@ static int eb_relocate_parse(struct i915_execbuffer *eb)
>  			goto slow;
>  	}
>  
> -	if (!err)
> +	if (!err && first)
>  		err = eb_parse(eb);
>  
>  err:
>  	if (err == -EDEADLK) {
> -		eb_release_vmas(eb, false);
> -		err = i915_gem_ww_ctx_backoff(&eb->ww);
> +		eb_release_vmas(eb, false, true);
> +		err = i915_gem_ww_ctx_backoff(eb->ww);
>  		if (!err)
>  			goto retry;
>  	}
> @@ -2398,7 +2401,7 @@ shadow_batch_pin(struct i915_execbuffer *eb,
>  	if (IS_ERR(vma))
>  		return vma;
>  
> -	err = i915_vma_pin_ww(vma, &eb->ww, 0, 0, flags);
> +	err = i915_vma_pin_ww(vma, eb->ww, 0, 0, flags);
>  	if (err)
>  		return ERR_PTR(err);
>  
> @@ -2412,7 +2415,7 @@ static struct i915_vma *eb_dispatch_secure(struct i915_execbuffer *eb, struct i9
>  	 * batch" bit. Hence we need to pin secure batches into the global gtt.
>  	 * hsw should have this fixed, but bdw mucks it up again. */
>  	if (eb->batch_flags & I915_DISPATCH_SECURE)
> -		return i915_gem_object_ggtt_pin_ww(vma->obj, &eb->ww, NULL, 0, 0, 0);
> +		return i915_gem_object_ggtt_pin_ww(vma->obj, eb->ww, NULL, 0, 0, 0);
>  
>  	return NULL;
>  }
> @@ -2458,7 +2461,7 @@ static int eb_parse(struct i915_execbuffer *eb)
>  		eb->batch_pool = pool;
>  	}
>  
> -	err = i915_gem_object_lock(pool->obj, &eb->ww);
> +	err = i915_gem_object_lock(pool->obj, eb->ww);
>  	if (err)
>  		goto err;
>  
> @@ -2666,7 +2669,7 @@ static struct i915_request *eb_pin_engine(struct i915_execbuffer *eb, bool throt
>  	 * GGTT space, so do this first before we reserve a seqno for
>  	 * ourselves.
>  	 */
> -	err = intel_context_pin_ww(ce, &eb->ww);
> +	err = intel_context_pin_ww(ce, eb->ww);
>  	if (err)
>  		return ERR_PTR(err);
>  
> @@ -3218,7 +3221,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  		       unsigned int batch_number,
>  		       struct dma_fence *in_fence,
>  		       struct dma_fence *exec_fence,
> -		       struct dma_fence **out_fence)
> +		       struct dma_fence **out_fence,
> +		       struct i915_gem_ww_ctx *ww)
>  {
>  	struct drm_i915_private *i915 = to_i915(dev);
>  	struct i915_execbuffer eb;
> @@ -3239,7 +3243,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  
>  	eb.exec = exec;
>  	eb.vma = (struct eb_vma *)(exec + args->buffer_count + 1);
> -	eb.vma[0].vma = NULL;
> +	if (first)
> +		eb.vma[0].vma = NULL;
>  	eb.reloc_pool = eb.batch_pool = NULL;
>  	eb.reloc_context = NULL;
>  
> @@ -3251,6 +3256,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	eb.batch_len = args->batch_len;
>  	eb.trampoline = NULL;
>  	eb.composite_fence = NULL;
> +	eb.ww = ww;
>  
>  	eb.fences = NULL;
>  	eb.num_fences = 0;
> @@ -3269,9 +3275,14 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	if (err)
>  		goto err_ext;
>  
> -	err = eb_create(&eb);
> -	if (err)
> -		goto err_ext;
> +	if (first) {
> +		err = eb_create(&eb);
> +		if (err)
> +			goto err_ext;
> +	} else {
> +		eb.lut_size = -eb.buffer_count;
> +	}
> +
>  
>  	GEM_BUG_ON(!eb.lut_size);
>  
> @@ -3286,15 +3297,22 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	if (unlikely(err))
>  		goto err_context;
>  
> -	err = eb_lookup_vmas(&eb);
> -	if (err) {
> -		eb_release_vmas(&eb, true);
> -		goto err_engine;
> +	if (first) {
> +		err = eb_lookup_vmas(&eb);
> +		if (err) {
> +			eb_release_vmas(&eb, true, true);
> +			goto err_engine;
> +		}
> +
> +	} else {
> +		eb.batch = &eb.vma[eb.batch_index];
>  	}
>  
> -	i915_gem_ww_ctx_init(&eb.ww, true);
>  
> -	err = eb_relocate_parse(&eb);
> +	if (first)
> +		i915_gem_ww_ctx_init(eb.ww, true);
> +
> +	err = eb_relocate_parse(&eb, first);
>  	if (err) {
>  		/*
>  		 * If the user expects the execobject.offset and
> @@ -3307,7 +3325,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  		goto err_vma;
>  	}
>  
> -	ww_acquire_done(&eb.ww.ctx);
> +	if (first)
> +		ww_acquire_done(&eb.ww->ctx);
>  
>  	batch = eb.batch->vma;
>  
> @@ -3410,11 +3429,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>  	i915_request_put(eb.request);
>  
>  err_vma:
> -	eb_release_vmas(&eb, true);
> +	eb_release_vmas(&eb, true, err || last);
>  	if (eb.trampoline)
>  		i915_vma_unpin(eb.trampoline);
>  	WARN_ON(err == -EDEADLK);
> -	i915_gem_ww_ctx_fini(&eb.ww);
> +	if (err || last)
> +		i915_gem_ww_ctx_fini(eb.ww);
>  
>  	if (eb.batch_pool)
>  		intel_gt_buffer_pool_put(eb.batch_pool);
> @@ -3476,6 +3496,7 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
>  	const size_t count = args->buffer_count;
>  	int err;
>  	struct i915_gem_context *ctx;
> +	struct i915_gem_ww_ctx ww;
>  	struct intel_context *parent = NULL;
>  	unsigned int num_batches = 1, i;
>  	bool is_parallel = false;
> @@ -3602,7 +3623,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
>  				     0,
>  				     in_fence,
>  				     exec_fence,
> -				     out_fences);
> +				     out_fences,
> +				     &ww);
>  
>  	for (i = 1; err == 0 && i < num_batches; i++)
>  		err = i915_gem_do_execbuffer(dev, file, args, exec2_list,
> @@ -3612,7 +3634,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
>  					     i,
>  					     NULL,
>  					     NULL,
> -					     out_fences);
> +					     out_fences,
> +					     &ww);
>  
>  	if (is_parallel)
>  		mutex_unlock(&parent->parallel_submit);
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
> index 16162fc2782d..710d2700e5b4 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
> @@ -32,11 +32,11 @@ static int __igt_gpu_reloc(struct i915_execbuffer *eb,
>  	if (IS_ERR(vma))
>  		return PTR_ERR(vma);
>  
> -	err = i915_gem_object_lock(obj, &eb->ww);
> +	err = i915_gem_object_lock(obj, eb->ww);
>  	if (err)
>  		return err;
>  
> -	err = i915_vma_pin_ww(vma, &eb->ww, 0, 0, PIN_USER | PIN_HIGH);
> +	err = i915_vma_pin_ww(vma, eb->ww, 0, 0, PIN_USER | PIN_HIGH);
>  	if (err)
>  		return err;
>  
> @@ -106,10 +106,12 @@ static int __igt_gpu_reloc(struct i915_execbuffer *eb,
>  static int igt_gpu_reloc(void *arg)
>  {
>  	struct i915_execbuffer eb;
> +	struct i915_gem_ww_ctx ww;
>  	struct drm_i915_gem_object *scratch;
>  	int err = 0;
>  	u32 *map;
>  
> +	eb.ww = &ww;
>  	eb.i915 = arg;
>  
>  	scratch = i915_gem_object_create_internal(eb.i915, 4096);
> @@ -141,20 +143,20 @@ static int igt_gpu_reloc(void *arg)
>  		eb.reloc_pool = NULL;
>  		eb.reloc_context = NULL;
>  
> -		i915_gem_ww_ctx_init(&eb.ww, false);
> +		i915_gem_ww_ctx_init(eb.ww, false);
>  retry:
> -		err = intel_context_pin_ww(eb.context, &eb.ww);
> +		err = intel_context_pin_ww(eb.context, eb.ww);
>  		if (!err) {
>  			err = __igt_gpu_reloc(&eb, scratch);
>  
>  			intel_context_unpin(eb.context);
>  		}
>  		if (err == -EDEADLK) {
> -			err = i915_gem_ww_ctx_backoff(&eb.ww);
> +			err = i915_gem_ww_ctx_backoff(eb.ww);
>  			if (!err)
>  				goto retry;
>  		}
> -		i915_gem_ww_ctx_fini(&eb.ww);
> +		i915_gem_ww_ctx_fini(eb.ww);
>  
>  		if (eb.reloc_pool)
>  			intel_gt_buffer_pool_put(eb.reloc_pool);
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 41/46] drm/i915: Eliminate unnecessary VMA calls for multi-BB submission
  2021-08-09 17:07   ` Daniel Vetter
@ 2021-08-09 17:12     ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 17:12 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:07:44PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:38PM -0700, Matthew Brost wrote:
> > Certain VMA functions in the execbuf IOCTL only need to be called on the
> > first or last BB of a multi-BB submission: eb_relocate() on the first
> 
> eb_relocate should be outright disallowed on multi-lrc execbuf ioctl.
> There are no users of that left, and it does substantially simplify the
> entire locking problem if we outright disallow that.
> 
> > and eb_release_vmas() on the last. Doing so will save CPU / GPU cycles.
> 
> Yah for our mix of pin/unpin vs dma_resv_lock/unlock. Now with the current
> unpin design this move is ok, but we want/need to switch vma over to
> dma_resv_lock. And then it gets really nasty, because you run into a ton
> of problems.

To give a bit more context for how nasty this all is: when you publish a
fence - which, thanks to the rcu lookup of dma_resv, happens when you install
the fence, not when you unlock the dma_resv_lock - you're not allowed to
allocate _any_ memory anymore until your request has finished executing. This
means no allocating anything, including the kmalloc for your i915_request
struct for the remaining batches, or the composite fence, or anything else
you might do.
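
A rough sketch of the ordering that becomes illegal (the dma_resv calls are
the real API, obj/rq0/rq1 are purely illustrative and error handling is
omitted):

	struct i915_request *rq1;

	dma_resv_lock(obj->base.resv, NULL);
	/* From here on the fence is visible to rcu lookups of the resv. */
	dma_resv_add_excl_fence(obj->base.resv, &rq0->fence);
	dma_resv_unlock(obj->base.resv);

	/* ...still building batch 1 of the same multi-BB submission... */
	rq1 = kmalloc(sizeof(*rq1), GFP_KERNEL);
	/*
	 * GFP_KERNEL may recurse into reclaim, reclaim may wait on the fence
	 * published above, and that fence cannot signal before the remaining
	 * batches of the parallel submission (rq1 here) are also queued
	 * -> potential deadlock.
	 */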

userptr also makes this requirement even more fun with additional
serialization requirements against mmu notifier invalidations.

The current execbuf code is a mess in this regard, and the idea is to fix
this with the conversion to drm/sched, because that has a very clear point
of no return. With the design you're pushing you're essentially making
this problem unfixable.
-Daniel

> 
> The more I read this the less I like this :-/
> -Daniel
> 
> > [snip]

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts Matthew Brost
@ 2021-08-09 17:17   ` Daniel Vetter
  2021-08-09 19:32     ` Matthew Brost
  2021-08-12 19:26   ` Daniel Vetter
  1 sibling, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-09 17:17 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:43PM -0700, Matthew Brost wrote:
> Some workloads use lots of contexts and continually pin / unpin them.
> With GuC submission an unpin translates to a schedule disable
> H2G which puts pressure on both the i915 and GuC. A schedule disable can
> also block future requests from being submitted until the operation
> completes. None of this is ideal.
> 
> Add a delay period, configurable via debugfs, before the schedule
> disable is issued. The default delay period is 1 second. The delay
> period is skipped if more than 3/4 of the guc_ids are in use.
> 
> This patch also updates the selftests to turn off this delay period as
> this extra time would likely cause many selftests to fail. Follow up
> patches will fix all the selftests and enable the delay period.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

I think this is more evidence that we should just pin/unpin contexts at
create/destruction time. The current scheme doesn't really work that well
and causes way more pain than benefit, it seems.

If anyone screams, and that's a big if aside from some igts, we can come up
with a proper scheme to evict contexts without pin/unpin and layer hacks
over that misdesign.
-Daniel
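
For anyone skimming the large diff below, a condensed sketch of the deferral
scheme the patch implements (locking, GEM_BUG_ONs and the pin_count juggling
are omitted here - see the real hunks for those):

	/* At unpin time: queue the context instead of sending the H2G now. */
	static void sched_disable_context_add(struct intel_guc *guc,
					      struct intel_context *ce)
	{
		ce->guc_sched_disable_time = ktime_get();
		if (list_empty(&guc->sched_disable_list))
			hrtimer_start(&guc->sched_disable_timer,
				      ns_to_ktime(guc->sched_disable_delay_ns),
				      HRTIMER_MODE_REL_PINNED);
		list_add_tail(&ce->guc_sched_disable_link,
			      &guc->sched_disable_list);
	}

	/*
	 * The hrtimer callback then walks sched_disable_list oldest-first,
	 * issues __guc_context_sched_disable() for every context that has
	 * already waited at least 3/4 of the delay, and re-arms itself for
	 * the oldest entry still on the list.
	 */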

> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
>  .../i915/gem/selftests/i915_gem_coherency.c   |   2 +-
>  .../drm/i915/gem/selftests/i915_gem_dmabuf.c  |   2 +-
>  .../drm/i915/gem/selftests/i915_gem_mman.c    |   2 +-
>  .../drm/i915/gem/selftests/i915_gem_object.c  |   2 +-
>  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
>  drivers/gpu/drm/i915/gt/intel_context.h       |   9 +
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   8 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +
>  .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  28 ++
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 322 +++++++++++++++++-
>  .../i915/gt/uc/selftest_guc_flow_control.c    |  19 +-
>  drivers/gpu/drm/i915/i915_selftest.h          |   2 +
>  drivers/gpu/drm/i915/i915_trace.h             |  10 +
>  drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   2 +-
>  drivers/gpu/drm/i915/selftests/i915_perf.c    |   2 +-
>  drivers/gpu/drm/i915/selftests/i915_request.c |   2 +-
>  drivers/gpu/drm/i915/selftests/i915_vma.c     |   2 +-
>  18 files changed, 405 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index b199d59bd2c4..1553287e5491 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -1298,7 +1298,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
>  		int err;
>  
>  		/* serialises with execbuf */
> -		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> +		intel_context_close(ce);
>  		if (!intel_context_pin_if_active(ce))
>  			continue;
>  
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> index 13b088cc787e..a666d7e610f5 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> @@ -434,5 +434,5 @@ int i915_gem_coherency_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_gem_coherency),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> index ffae7df5e4d7..2c92afa9d608 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> @@ -474,5 +474,5 @@ int i915_gem_dmabuf_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_dmabuf_import_same_driver_lmem_smem),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> index b20f5621f62b..4745c78a48de 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> @@ -1414,5 +1414,5 @@ int i915_gem_mman_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_mmap_gpu),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> index 740ee8086a27..ae1361c7c4cf 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> @@ -95,5 +95,5 @@ int i915_gem_object_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_gem_huge),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 8e90a4a0b7b0..96643040defd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -472,6 +472,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>  	ce->guc_id = GUC_INVALID_LRC_ID;
>  	INIT_LIST_HEAD(&ce->guc_id_link);
>  
> +	INIT_LIST_HEAD(&ce->guc_sched_disable_link);
> +
>  	mutex_init(&ce->parallel_submit);
>  	ce->fence_context = dma_fence_context_alloc(1);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index a302599e436a..f4c9036f7f03 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -215,6 +215,15 @@ static inline bool intel_context_is_barrier(const struct intel_context *ce)
>  	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
>  }
>  
> +static inline void intel_context_close(struct intel_context *ce)
> +{
> +	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> +
> +	trace_intel_context_close(ce);
> +	if (ce->ops->close)
> +		ce->ops->close(ce);
> +}
> +
>  static inline bool intel_context_is_closed(const struct intel_context *ce)
>  {
>  	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 8af9ace4c052..53f00657a45c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -11,6 +11,7 @@
>  #include <linux/list.h>
>  #include <linux/mutex.h>
>  #include <linux/types.h>
> +#include <linux/ktime.h>
>  
>  #include "i915_active_types.h"
>  #include "i915_sw_fence.h"
> @@ -38,6 +39,7 @@ struct intel_context_ops {
>  	int (*alloc)(struct intel_context *ce);
>  
>  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> +	void (*close)(struct intel_context *ce);
>  
>  	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
>  	int (*pin)(struct intel_context *ce);
> @@ -203,6 +205,12 @@ struct intel_context {
>  	 */
>  	struct list_head guc_id_link;
>  
> +	/*
> +	 * GuC schedule disable link / time
> +	 */
> +	struct list_head guc_sched_disable_link;
> +	ktime_t guc_sched_disable_time;
> +
>  	/* GuC context blocked fence */
>  	struct i915_sw_fence guc_blocked;
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 30a0f364db8f..90b5b657d411 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -60,6 +60,7 @@ struct intel_guc {
>  	struct ida guc_ids;
>  	u32 num_guc_ids;
>  	u32 max_guc_ids;
> +	u32 guc_ids_in_use[GUC_SUBMIT_ENGINE_MAX];
>  	unsigned long *guc_ids_bitmap;
>  #define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
>  	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
> @@ -69,6 +70,12 @@ struct intel_guc {
>  	struct list_head destroyed_contexts;
>  	struct intel_gt_pm_unpark_work destroy_worker;
>  
> +	spinlock_t sched_disable_lock;	/* protects schedule disable list */
> +	struct list_head sched_disable_list;
> +	struct hrtimer sched_disable_timer;
> +#define SCHED_DISABLE_DELAY_NS	1000000000
> +	u64 sched_disable_delay_ns;
> +
>  	bool submission_supported;
>  	bool submission_selected;
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> index 7c479c5e7b3a..53a6f3da6cce 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> @@ -80,12 +80,40 @@ static int guc_num_id_set(void *data, u64 val)
>  }
>  DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
>  
> +static int guc_sched_disable_delay_ns_get(void *data, u64 *val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	*val = guc->sched_disable_delay_ns;
> +
> +	return 0;
> +}
> +
> +static int guc_sched_disable_delay_ns_set(void *data, u64 val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	guc->sched_disable_delay_ns = val;
> +
> +	return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ns_fops,
> +			guc_sched_disable_delay_ns_get,
> +			guc_sched_disable_delay_ns_set, "%lld\n");
> +
>  void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
>  {
>  	static const struct debugfs_gt_file files[] = {
>  		{ "guc_info", &guc_info_fops, NULL },
>  		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
>  		{ "guc_num_id", &guc_num_id_fops, NULL },
> +		{ "guc_sched_disable_delay_ns", &guc_sched_disable_delay_ns_fops, NULL },
>  	};
>  
>  	if (!intel_guc_is_supported(guc))
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index cd1893edf43a..dc0d6a099bee 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -654,11 +654,15 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
>  	return (timeout < 0) ? timeout : 0;
>  }
>  
> +static void sched_disable_contexts_flush(struct intel_guc *guc);
> +
>  int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
>  {
>  	if (!intel_uc_uses_guc_submission(&guc_to_gt(guc)->uc))
>  		return 0;
>  
> +	sched_disable_contexts_flush(guc);
> +
>  	return intel_guc_wait_for_pending_msg(guc,
>  					      &guc->outstanding_submission_g2h,
>  					      true, timeout);
> @@ -1135,6 +1139,7 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce);
>  static void guc_signal_context_fence(struct intel_context *ce);
>  static void guc_cancel_context_requests(struct intel_context *ce);
>  static void guc_blocked_fence_complete(struct intel_context *ce);
> +static void sched_disable_context_delete(struct intel_context *ce);
>  
>  static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>  {
> @@ -1160,6 +1165,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>  		deregister = context_wait_for_deregister_to_register(ce);
>  		banned = context_banned(ce);
>  		init_sched_state(ce);
> +		sched_disable_context_delete(ce);
>  
>  		if (pending_enable || destroyed || deregister) {
>  			atomic_dec(&guc->outstanding_submission_g2h);
> @@ -1299,6 +1305,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>  
>  	intel_gt_park_heartbeats(guc_to_gt(guc));
>  	disable_submission(guc);
> +	hrtimer_cancel(&guc->sched_disable_timer);
>  	guc->interrupts.disable(guc);
>  
>  	/* Flush IRQ handler */
> @@ -1656,6 +1663,8 @@ static void guc_lrcd_reg_fini(struct intel_guc *guc);
>  
>  static void destroy_worker_func(struct work_struct *w);
>  
> +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer);
> +
>  /*
>   * Set up the memory resources to be shared with the GuC (via the GGTT)
>   * at firmware loading time.
> @@ -1687,6 +1696,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
>  	INIT_LIST_HEAD(&guc->destroyed_contexts);
>  	intel_gt_pm_unpark_work_init(&guc->destroy_worker, destroy_worker_func);
>  
> +	spin_lock_init(&guc->sched_disable_lock);
> +	INIT_LIST_HEAD(&guc->sched_disable_list);
> +	hrtimer_init(&guc->sched_disable_timer, CLOCK_MONOTONIC,
> +		     HRTIMER_MODE_REL);
> +	guc->sched_disable_timer.function = sched_disable_timer_func;
> +	guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS;
> +
>  	return 0;
>  }
>  
> @@ -1852,6 +1868,12 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  	if (unlikely(ret < 0))
>  		return ret;
>  
> +	if (intel_context_is_parent(ce))
> +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> +			order_base_2(ce->guc_number_children + 1);
> +	else
> +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]++;
> +
>  	ce->guc_id = ret;
>  	return 0;
>  }
> @@ -1860,13 +1882,18 @@ static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  {
>  	GEM_BUG_ON(intel_context_is_child(ce));
>  	if (!context_guc_id_invalid(ce)) {
> -		if (intel_context_is_parent(ce))
> +		if (intel_context_is_parent(ce)) {
> +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> +				order_base_2(ce->guc_number_children + 1);
>  			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
>  					      order_base_2(ce->guc_number_children
>  							   + 1));
> -		else
> +		} else {
> +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]--;
>  			ida_simple_remove(&guc->guc_ids, ce->guc_id);
> +		}
>  		clr_lrc_desc_registered(guc, ce->guc_id);
> +
>  		set_context_guc_id_invalid(ce);
>  	}
>  	if (!list_empty(&ce->guc_id_link))
> @@ -1931,9 +1958,13 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
>  			 * from another context that has more guc_id that itself.
>  			 */
>  			if (cn_o2 != ce_o2) {
> +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> +					order_base_2(cn->guc_number_children + 1);
>  				bitmap_release_region(guc->guc_ids_bitmap,
>  						      cn->guc_id,
>  						      cn_o2);
> +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> +					order_base_2(ce->guc_number_children + 1);
>  				bitmap_allocate_region(guc->guc_ids_bitmap,
>  						       ce->guc_id,
>  						       ce_o2);
> @@ -2538,7 +2569,7 @@ static void guc_context_unpin(struct intel_context *ce)
>  	__guc_context_unpin(ce);
>  
>  	if (likely(!intel_context_is_barrier(ce)))
> -		intel_engine_pm_put(ce->engine);
> +		intel_engine_pm_put_async(ce->engine);
>  }
>  
>  static void guc_context_post_unpin(struct intel_context *ce)
> @@ -2665,11 +2696,11 @@ static void guc_parent_context_unpin(struct intel_context *ce)
>  
>  	for_each_engine_masked(engine, ce->engine->gt,
>  			       ce->engine->mask, tmp)
> -		intel_engine_pm_put(engine);
> +		intel_engine_pm_put_async(engine);
>  	for_each_child(ce, child)
>  		for_each_engine_masked(engine, child->engine->gt,
>  				       child->engine->mask, tmp)
> -			intel_engine_pm_put(engine);
> +			intel_engine_pm_put_async(engine);
>  }
>  
>  static void __guc_context_sched_enable(struct intel_guc *guc,
> @@ -2788,6 +2819,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>  
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  
> +	sched_disable_context_delete(ce);
> +
>  	with_intel_runtime_pm(runtime_pm, wakeref)
>  		__guc_context_sched_disable(guc, ce, guc_id);
>  
> @@ -2914,8 +2947,202 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
>  								     1);
>  		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  	}
> +
> +	sched_disable_context_delete(ce);
> +}
> +
> +#define next_sched_disable_time(guc, now, ce) \
> +	(guc->sched_disable_delay_ns - \
> +	 (ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)))
> +static void ____sched_disable_context_delete(struct intel_guc *guc,
> +					     struct intel_context *ce)
> +{
> +	bool is_first;
> +
> +	lockdep_assert_held(&guc->sched_disable_lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(list_empty(&ce->guc_sched_disable_link));
> +
> +	is_first = list_is_first(&ce->guc_sched_disable_link,
> +				 &guc->sched_disable_list);
> +	list_del_init(&ce->guc_sched_disable_link);
> +	if (list_empty(&guc->sched_disable_list)) {
> +		hrtimer_try_to_cancel(&guc->sched_disable_timer);
> +	} else if (is_first) {
> +		struct intel_context *first =
> +			list_first_entry(&guc->sched_disable_list,
> +					 typeof(*first),
> +					 guc_sched_disable_link);
> +		u64 next_time = next_sched_disable_time(guc, ktime_get(),
> +							first);
> +
> +		hrtimer_start(&guc->sched_disable_timer,
> +			      ns_to_ktime(next_time),
> +			      HRTIMER_MODE_REL_PINNED);
> +	}
> +}
> +
> +static void __sched_disable_context_delete(struct intel_guc *guc,
> +					   struct intel_context *ce)
> +{
> +	lockdep_assert_held(&guc->sched_disable_lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (!list_empty(&ce->guc_sched_disable_link)) {
> +		intel_context_sched_disable_unpin(ce);
> +		____sched_disable_context_delete(guc, ce);
> +	}
> +}
> +
> +static void sched_disable_context_delete(struct intel_context *ce)
> +{
> +	struct intel_guc *guc = ce_to_guc(ce);
> +	unsigned long flags;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (!list_empty(&ce->guc_sched_disable_link)) {
> +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +		__sched_disable_context_delete(guc, ce);
> +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +	}
> +}
> +
> +static void sched_disable_context_add(struct intel_guc *guc,
> +				      struct intel_context *ce)
> +{
> +	unsigned long flags;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> +
> +	ce->guc_sched_disable_time = ktime_get();
> +
> +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +	if (list_empty(&guc->sched_disable_list))
> +		hrtimer_start(&guc->sched_disable_timer,
> +			      ns_to_ktime(guc->sched_disable_delay_ns),
> +			      HRTIMER_MODE_REL_PINNED);
> +	list_add_tail(&ce->guc_sched_disable_link, &guc->sched_disable_list);
> +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +}
> +
> +static void sched_disable_contexts_flush(struct intel_guc *guc)
> +{
> +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +
> +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> +					 guc_sched_disable_link) {
> +		intel_wakeref_t wakeref;
> +		bool enabled;
> +		u16 guc_id;
> +
> +		list_del_init(&ce->guc_sched_disable_link);
> +
> +		spin_lock(&ce->guc_state.lock);
> +		enabled = context_enabled(ce);
> +		if (unlikely(!enabled || submission_disabled(guc))) {
> +			if (enabled)
> +				clr_context_enabled(ce);
> +			spin_unlock(&ce->guc_state.lock);
> +			intel_context_sched_disable_unpin(ce);
> +			continue;
> +		}
> +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> +			spin_unlock(&ce->guc_state.lock);
> +			continue;
> +		}
> +		guc_id = prep_context_pending_disable(ce);
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		with_intel_runtime_pm(runtime_pm, wakeref)
> +			__guc_context_sched_disable(guc, ce, guc_id);
> +	}
> +
> +	hrtimer_try_to_cancel(&guc->sched_disable_timer);
> +
> +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
>  }
>  
> +#define should_sched_be_disabled(guc, now, ce) \
> +	((ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)) > \
> +	(guc->sched_disable_delay_ns / 4) * 3)
> +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer)
> +{
> +	struct intel_guc *guc = container_of(hrtimer, struct intel_guc,
> +					     sched_disable_timer);
> +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +	ktime_t now;
> +
> +	if (list_empty(&guc->sched_disable_list))
> +		return HRTIMER_NORESTART;
> +
> +	now = ktime_get();
> +
> +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +
> +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> +					 guc_sched_disable_link) {
> +		intel_wakeref_t wakeref;
> +		bool enabled;
> +		u16 guc_id;
> +
> +		/*
> +		 * If a context has been waiting for 3/4 of its delay or more,
> +		 * issue the schedule disable. Using this heuristic allows more
> +		 * than 1 context to have its scheduling disabled when this
> +		 * timer is run.
> +		 */
> +		if (!should_sched_be_disabled(guc, now, ce))
> +			break;
> +
> +		list_del_init(&ce->guc_sched_disable_link);
> +
> +		spin_lock(&ce->guc_state.lock);
> +		enabled = context_enabled(ce);
> +		if (unlikely(!enabled || submission_disabled(guc))) {
> +			if (enabled)
> +				clr_context_enabled(ce);
> +			spin_unlock(&ce->guc_state.lock);
> +			intel_context_sched_disable_unpin(ce);
> +			continue;
> +		}
> +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> +			spin_unlock(&ce->guc_state.lock);
> +			continue;
> +		}
> +		guc_id = prep_context_pending_disable(ce);
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		with_intel_runtime_pm(runtime_pm, wakeref)
> +			__guc_context_sched_disable(guc, ce, guc_id);
> +	}
> +
> +	if (!list_empty(&guc->sched_disable_list)) {
> +		struct intel_context *first =
> +			list_first_entry(&guc->sched_disable_list,
> +					 typeof(*first),
> +					 guc_sched_disable_link);
> +		u64 next_time = next_sched_disable_time(guc, now, first);
> +
> +		hrtimer_forward(hrtimer, now, ns_to_ktime(next_time));
> +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +
> +		return HRTIMER_RESTART;
> +	} else {
> +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +
> +		return HRTIMER_NORESTART;
> +	}
> +}
> +
> +#define guc_id_pressure(max, in_use)	(in_use > (max / 4) * 3)
>  static void guc_context_sched_disable(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
> @@ -2924,8 +3151,14 @@ static void guc_context_sched_disable(struct intel_context *ce)
>  	intel_wakeref_t wakeref;
>  	u16 guc_id;
>  	bool enabled;
> +	int guc_id_index = intel_context_is_parent(ce) ?
> +		GUC_SUBMIT_ENGINE_MULTI_LRC : GUC_SUBMIT_ENGINE_SINGLE_LRC;
> +	int max_guc_ids = intel_context_is_parent(ce) ?
> +	       NUMBER_MULTI_LRC_GUC_ID(guc) :
> +	       guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
>  
>  	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
>  
>  	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
>  	    !lrc_desc_registered(guc, ce->guc_id)) {
> @@ -2936,6 +3169,18 @@ static void guc_context_sched_disable(struct intel_context *ce)
>  	if (!context_enabled(ce))
>  		goto unpin;
>  
> +	/*
> +	 * If no guc_id pressure and the context isn't closed we delay the
> +	 * schedule disable so we don't continuously disable / enable scheduling,
> +	 * putting pressure on both the i915 and GuC. Delay is configurable via
> +	 * debugfs, default 1s.
> +	 */
> +	if (!guc_id_pressure(max_guc_ids, guc->guc_ids_in_use[guc_id_index]) &&
> +	    !intel_context_is_closed(ce) && guc->sched_disable_delay_ns) {
> +		sched_disable_context_add(guc, ce);
> +		return;
> +	}
> +
>  	spin_lock_irqsave(&ce->guc_state.lock, flags);
>  
>  	/*
> @@ -3294,6 +3539,58 @@ static void remove_from_context(struct i915_request *rq)
>  	i915_request_notify_execute_cb_imm(rq);
>  }
>  
> +static void __guc_context_close(struct intel_guc *guc,
> +				struct intel_context *ce)
> +{
> +	lockdep_assert_held(&guc->sched_disable_lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (!list_empty(&ce->guc_sched_disable_link)) {
> +		struct intel_runtime_pm *runtime_pm =
> +			ce->engine->uncore->rpm;
> +		intel_wakeref_t wakeref;
> +		bool enabled;
> +		u16 guc_id;
> +
> +		spin_lock(&ce->guc_state.lock);
> +		enabled = context_enabled(ce);
> +		if (unlikely(!enabled || submission_disabled(guc))) {
> +			if (enabled)
> +				clr_context_enabled(ce);
> +			spin_unlock(&ce->guc_state.lock);
> +			intel_context_sched_disable_unpin(ce);
> +			goto update_list;
> +		}
> +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> +			spin_unlock(&ce->guc_state.lock);
> +			goto update_list;
> +		}
> +		guc_id = prep_context_pending_disable(ce);
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		with_intel_runtime_pm(runtime_pm, wakeref)
> +			__guc_context_sched_disable(guc, ce, guc_id);
> +update_list:
> +		____sched_disable_context_delete(guc, ce);
> +	}
> +}
> +
> +static void guc_context_close(struct intel_context *ce)
> +{
> +	struct intel_guc *guc = ce_to_guc(ce);
> +	unsigned long flags;
> +
> +	/*
> +	 * If we close the context while a schedule disable is pending behind a
> +	 * delay, issue it immediately.
> +	 */
> +	if (!list_empty(&ce->guc_sched_disable_link)) {
> +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +		__guc_context_close(guc, ce);
> +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +	}
> +}
> +
>  static struct intel_context *
>  guc_create_parallel(struct intel_engine_cs **engines,
>  		    unsigned int num_siblings,
> @@ -3308,6 +3605,7 @@ static const struct intel_context_ops guc_context_ops = {
>  	.post_unpin = guc_context_post_unpin,
>  
>  	.ban = guc_context_ban,
> +	.close = guc_context_close,
>  
>  	.cancel_request = guc_context_cancel_request,
>  
> @@ -3538,6 +3836,10 @@ static int guc_request_alloc(struct i915_request *rq)
>  
>  	rq->reserved_space -= GUC_REQUEST_SIZE;
>  
> +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link) &&
> +		   atomic_read(&ce->pin_count) < 3);
> +	sched_disable_context_delete(ce);
> +
>  	/*
>  	 * guc_ids are exhausted or a heuristic is met indicating too many
>  	 * guc_ids are waiting on requests with submission dependencies (not
> @@ -3667,7 +3969,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
>  	__guc_context_unpin(ce);
>  
>  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> -		intel_engine_pm_put(engine);
> +		intel_engine_pm_put_async(engine);
>  }
>  
>  static void guc_virtual_context_enter(struct intel_context *ce)
> @@ -3708,6 +4010,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
>  	.post_unpin = guc_context_post_unpin,
>  
>  	.ban = guc_context_ban,
> +	.close = guc_context_close,
>  
>  	.cancel_request = guc_context_cancel_request,
>  
> @@ -3819,6 +4122,7 @@ static const struct intel_context_ops virtual_parent_context_ops = {
>  	.post_unpin = guc_parent_context_post_unpin,
>  
>  	.ban = guc_context_ban,
> +	.close = guc_context_close,
>  
>  	.enter = guc_virtual_context_enter,
>  	.exit = guc_virtual_context_exit,
> @@ -4924,7 +5228,11 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>  	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
>  		   atomic_read(&guc->outstanding_submission_g2h));
>  	drm_printf(p, "GuC Number GuC IDs: %d\n", guc->num_guc_ids);
> -	drm_printf(p, "GuC Max Number GuC IDs: %d\n\n", guc->max_guc_ids);
> +	drm_printf(p, "GuC Max Number GuC IDs: %d\n", guc->max_guc_ids);
> +	drm_printf(p, "GuC single-lrc GuC IDs in use: %d\n",
> +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]);
> +	drm_printf(p, "GuC multi-lrc GuC IDs in use: %d\n",
> +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC]);
>  	drm_printf(p, "GuC max context registered: %u\n\n",
>  		   guc->lrcd_reg.max_idx);
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> index 9cfecf9d368e..ad70b3159ce4 100644
> --- a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> @@ -174,7 +174,8 @@ static int multi_lrc_not_blocked(struct intel_gt *gt, bool flow_control)
>  #define NUM_RQ_PER_CONTEXT	2
>  #define HEARTBEAT_INTERVAL	1500
>  
> -static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang)
> +static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids,
> +					bool hang, bool sched_disable_delay)
>  {
>  	struct intel_gt *gt = arg;
>  	struct intel_guc *guc = &gt->uc.guc;
> @@ -203,6 +204,9 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
>  	if (limit_guc_ids)
>  		guc->num_guc_ids = NUM_GUC_ID;
>  
> +	if (sched_disable_delay)
> +		guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS / 5;
> +
>  	ce = intel_context_create(intel_selftest_find_any_engine(gt));
>  	if (IS_ERR(ce)) {
>  		ret = PTR_ERR(ce);
> @@ -391,6 +395,7 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
>  	guc->num_guc_ids = guc->max_guc_ids;
>  	guc->gse_hang_expected = false;
>  	guc->inject_bad_sched_disable = false;
> +	guc->sched_disable_delay_ns = 0;
>  	kfree(contexts);
>  
>  	return ret;
> @@ -398,17 +403,22 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
>  
>  static int intel_guc_flow_control_guc_ids(void *arg)
>  {
> -	return __intel_guc_flow_control_guc(arg, true, false);
> +	return __intel_guc_flow_control_guc(arg, true, false, false);
> +}
> +
> +static int intel_guc_flow_control_guc_ids_sched_disable_delay(void *arg)
> +{
> +	return __intel_guc_flow_control_guc(arg, true, false, true);
>  }
>  
>  static int intel_guc_flow_control_lrcd_reg(void *arg)
>  {
> -	return __intel_guc_flow_control_guc(arg, false, false);
> +	return __intel_guc_flow_control_guc(arg, false, false, false);
>  }
>  
>  static int intel_guc_flow_control_hang_state_machine(void *arg)
>  {
> -	return __intel_guc_flow_control_guc(arg, true, true);
> +	return __intel_guc_flow_control_guc(arg, true, true, false);
>  }
>  
>  #define NUM_RQ_STRESS_CTBS	0x4000
> @@ -861,6 +871,7 @@ int intel_guc_flow_control(struct drm_i915_private *i915)
>  	static const struct i915_subtest tests[] = {
>  		SUBTEST(intel_guc_flow_control_stress_ctbs),
>  		SUBTEST(intel_guc_flow_control_guc_ids),
> +		SUBTEST(intel_guc_flow_control_guc_ids_sched_disable_delay),
>  		SUBTEST(intel_guc_flow_control_lrcd_reg),
>  		SUBTEST(intel_guc_flow_control_hang_state_machine),
>  		SUBTEST(intel_guc_flow_control_multi_lrc_guc_ids),
> diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
> index f54de0499be7..bf464db7affe 100644
> --- a/drivers/gpu/drm/i915/i915_selftest.h
> +++ b/drivers/gpu/drm/i915/i915_selftest.h
> @@ -92,12 +92,14 @@ int __i915_subtests(const char *caller,
>  			T, ARRAY_SIZE(T), data)
>  #define i915_live_subtests(T, data) ({ \
>  	typecheck(struct drm_i915_private *, data); \
> +	(data)->gt.uc.guc.sched_disable_delay_ns = 0; \
>  	__i915_subtests(__func__, \
>  			__i915_live_setup, __i915_live_teardown, \
>  			T, ARRAY_SIZE(T), data); \
>  })
>  #define intel_gt_live_subtests(T, data) ({ \
>  	typecheck(struct intel_gt *, data); \
> +	(data)->uc.guc.sched_disable_delay_ns = 0; \
>  	__i915_subtests(__func__, \
>  			__intel_gt_live_setup, __intel_gt_live_teardown, \
>  			T, ARRAY_SIZE(T), data); \
> diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> index 806ad688274b..57ba7065d5ab 100644
> --- a/drivers/gpu/drm/i915/i915_trace.h
> +++ b/drivers/gpu/drm/i915/i915_trace.h
> @@ -933,6 +933,11 @@ DEFINE_EVENT(intel_context, intel_context_reset,
>  	     TP_ARGS(ce)
>  );
>  
> +DEFINE_EVENT(intel_context, intel_context_close,
> +	     TP_PROTO(struct intel_context *ce),
> +	     TP_ARGS(ce)
> +);
> +
>  DEFINE_EVENT(intel_context, intel_context_ban,
>  	     TP_PROTO(struct intel_context *ce),
>  	     TP_ARGS(ce)
> @@ -1035,6 +1040,11 @@ trace_intel_context_reset(struct intel_context *ce)
>  {
>  }
>  
> +static inline void
> +trace_intel_context_close(struct intel_context *ce)
> +{
> +}
> +
>  static inline void
>  trace_intel_context_ban(struct intel_context *ce)
>  {
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> index f843a5040706..d54c280217fe 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> @@ -2112,5 +2112,5 @@ int i915_gem_gtt_live_selftests(struct drm_i915_private *i915)
>  
>  	GEM_BUG_ON(offset_in_page(i915->ggtt.vm.total));
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/selftests/i915_perf.c b/drivers/gpu/drm/i915/selftests/i915_perf.c
> index 9e9a6cb1d9e5..86bad00cca95 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_perf.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_perf.c
> @@ -431,7 +431,7 @@ int i915_perf_live_selftests(struct drm_i915_private *i915)
>  	if (err)
>  		return err;
>  
> -	err = i915_subtests(tests, i915);
> +	err = i915_live_subtests(tests, i915);
>  
>  	destroy_empty_config(&i915->perf);
>  
> diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
> index d67710d10615..afbf88865a8b 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_request.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_request.c
> @@ -1693,7 +1693,7 @@ int i915_request_live_selftests(struct drm_i915_private *i915)
>  	if (intel_gt_is_wedged(&i915->gt))
>  		return 0;
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
>  
>  static int switch_to_kernel_sync(struct intel_context *ce, int err)
> diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
> index dd0607254a95..f4b157451851 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_vma.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
> @@ -1085,5 +1085,5 @@ int i915_vma_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_vma_remapped_gtt),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 10/46] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-08-09 14:23   ` Daniel Vetter
@ 2021-08-09 18:11     ` Matthew Brost
  2021-08-10  6:43       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 18:11 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 04:23:42PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:07PM -0700, Matthew Brost wrote:
> > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > circuiting while scheduling of a user context could be enabled.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/Makefile                 |  1 +
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
> >  2 files changed, 34 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > index 903de270f2db..5e3a1e2095b0 100644
> > --- a/drivers/gpu/drm/i915/Makefile
> > +++ b/drivers/gpu/drm/i915/Makefile
> > @@ -103,6 +103,7 @@ gt-y += \
> >  	gt/intel_gt_clock_utils.o \
> >  	gt/intel_gt_irq.o \
> >  	gt/intel_gt_pm.o \
> > +	gt/intel_gt_pm_unpark_work.o \
> 
> This file isn't here?
> 

Yep, included this in the wrong patch. Should be in:
https://patchwork.freedesktop.org/patch/448462/?series=92789&rev=2

> Also pm stuff tends to have very nasty locking requirements; doing special
> stuff like this in the backend tends to lead to really big surprises. I
> think there are two options to make sure our locking design stays consistent:
> - Lift this to generic code.

Not sure I'm following this, intel_engine_pm_get/put are generic calls.
Those calls should have all the correct annotations. If they don't we can
add them.
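
Something roughly like this is what I'd add if an explicit annotation turns
out to be needed (a sketch only, not code from this series; assuming the
wakeref mutex is the right lock to teach lockdep about):

	/* let lockdep see the engine PM locking even on cached wakeref paths */
	static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
	{
		might_lock(&engine->wakeref.mutex);
	}

	static inline void intel_engine_pm_might_put(struct intel_engine_cs *engine)
	{
		might_lock(&engine->wakeref.mutex);
	}

and then the generic pin / unpin paths could call these unconditionally so the
locking contract is checked no matter which backend takes the reference.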

Matt

> - expose some engine_pm_might_get/put() calls which do have the right set
>   of might_lock annotations, and call those in the generic code.
> 
> Imo the worst kernel abstractions are those where all implementations
> look&act the same, except for locking. Unfortunately i915-gem code is full
> of this stuff, and we need to stop this by enlisting lockdep to check the
> contracts for us.
> -Daniel
> 
> >  	gt/intel_gt_pm_irq.o \
> >  	gt/intel_gt_requests.o \
> >  	gt/intel_gtt.o \
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 7fe4d1559a81..c5d9548bfd00 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -2056,7 +2056,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
> >  
> >  static int guc_context_pin(struct intel_context *ce, void *vaddr)
> >  {
> > -	return __guc_context_pin(ce, ce->engine, vaddr);
> > +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > +
> > +	if (likely(!ret && !intel_context_is_barrier(ce)))
> > +		intel_engine_pm_get(ce->engine);
> > +
> > +	return ret;
> >  }
> >  
> >  static void guc_context_unpin(struct intel_context *ce)
> > @@ -2067,6 +2072,9 @@ static void guc_context_unpin(struct intel_context *ce)
> >  
> >  	unpin_guc_id(guc, ce, true);
> >  	lrc_unpin(ce);
> > +
> > +	if (likely(!intel_context_is_barrier(ce)))
> > +		intel_engine_pm_put(ce->engine);
> >  }
> >  
> >  static void guc_context_post_unpin(struct intel_context *ce)
> > @@ -3002,8 +3010,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
> >  static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> >  {
> >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > +	int ret = __guc_context_pin(ce, engine, vaddr);
> > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > +
> > +	if (likely(!ret))
> > +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > +			intel_engine_pm_get(engine);
> >  
> > -	return __guc_context_pin(ce, engine, vaddr);
> > +	return ret;
> > +}
> > +
> > +static void guc_virtual_context_unpin(struct intel_context *ce)
> > +{
> > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > +	struct intel_engine_cs *engine;
> > +	struct intel_guc *guc = ce_to_guc(ce);
> > +
> > +	GEM_BUG_ON(context_enabled(ce));
> > +	GEM_BUG_ON(intel_context_is_barrier(ce));
> > +
> > +	unpin_guc_id(guc, ce, true);
> > +	lrc_unpin(ce);
> > +
> > +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > +		intel_engine_pm_put(engine);
> >  }
> >  
> >  static void guc_virtual_context_enter(struct intel_context *ce)
> > @@ -3040,7 +3070,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> >  
> >  	.pre_pin = guc_virtual_context_pre_pin,
> >  	.pin = guc_virtual_context_pin,
> > -	.unpin = guc_context_unpin,
> > +	.unpin = guc_virtual_context_unpin,
> >  	.post_unpin = guc_context_post_unpin,
> >  
> >  	.ban = guc_context_ban,
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 11/46] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-08-09 14:27   ` Daniel Vetter
@ 2021-08-09 18:20     ` Matthew Brost
  2021-08-10  6:47       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 18:20 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 04:27:01PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:08PM -0700, Matthew Brost wrote:
> > Calling switch_to_kernel_context isn't needed if the engine PM reference
> > is taken while all contexts are pinned. By not calling
> > switch_to_kernel_context we save on issuing a request to the engine.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_engine_pm.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > index 1f07ac4e0672..58099de6bf07 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > @@ -162,6 +162,10 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
> >  	unsigned long flags;
> >  	bool result = true;
> >  
> > +	/* No need to switch_to_kernel_context if GuC submission */
> 
> Maybe whack a big FIXME on here that we should unravel this properly.

Sure, can add a FIXME here.

> Currently the execlist backend assumptions are leaked all over the place,
> leading to stuff like this. Which means extremely fragile code.
>

Yes, this is something required for execlists implemented in what should be
generic code. 

> I currently don't have a great idea on how exactly we should do that, but
> oh well.

Me either, it will be a process.

> 
> btw just in case we ever want to make guc lrc properly evictable (which was
> the og use-case for this function, way, way back), would we need to fully

Can you explain what you mean by fully evictable? Not getting what you
mean in this context.

> unregister them from guc? At least I'm assuming there's no other trick

If scheduling is disabled on the context (currently done on unpin) you are
free to move anything around as the GuC is guaranteed not to touch the
context state. If on re-pin something has moved (e.g. the LRC vaddr is
different), you need to unregister and re-register the context with the
GuC.
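
In pseudo-code the re-pin path boils down to something like this (a rough
sketch only; guc_registered_lrca is a made-up field for illustration, the
other names are from this series):

	/* sched disable was issued at unpin, so the GuC won't touch the state */
	static int guc_context_repin(struct intel_context *ce)
	{
		int ret = 0;

		/* did the context move since it was registered? */
		if (ce->lrc.lrca != ce->guc_registered_lrca) {
			ret = deregister_context(ce, ce->guc_id, true);
			if (!ret)
				/* re-register with the new LRC address */
				ret = guc_lrc_desc_pin(ce, true);
		}

		return ret;
	}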

> like the below one.
> 
> Another aside: How does the perf/OA patching work on GuC?
>

Not my area of expertise but perf is somewhat a WIP. The plan is for the
GuC to write out some stats to HWSP I think? John Harrison is working to
get this fully implemented.

OA is working afaik, with Umesh Nerlige Ramappa being the expert here.

Matt

> Anyway, patch looks legit:
> 
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> 
> > +	if (intel_engine_uses_guc(engine))
> > +		return true;
> > +
> >  	/* GPU is pointing to the void, as good as in the kernel context. */
> >  	if (intel_gt_is_wedged(engine->gt))
> >  		return true;
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 13/46] drm/i915: Add logical engine mapping
  2021-08-09 14:28   ` Daniel Vetter
@ 2021-08-09 18:28     ` Matthew Brost
  2021-08-10  6:49       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 18:28 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 04:28:04PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:10PM -0700, Matthew Brost wrote:
> > Add logical engine mapping. This is required for split-frame, as
> > workloads need to be placed on engines in a logically contiguous manner.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
> >  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  1 +
> >  .../drm/i915/gt/intel_execlists_submission.c  |  1 +
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
> >  5 files changed, 56 insertions(+), 29 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > index 0d9105a31d84..4d790f9a65dd 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
> >  	GEM_DEBUG_WARN_ON(iir);
> >  }
> >  
> > -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> > +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
> > +			      u8 logical_instance)
> >  {
> >  	const struct engine_info *info = &intel_engines[id];
> >  	struct drm_i915_private *i915 = gt->i915;
> > @@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> >  
> >  	engine->class = info->class;
> >  	engine->instance = info->instance;
> > +	engine->logical_mask = BIT(logical_instance);
> >  	__sprint_engine_name(engine);
> >  
> >  	engine->props.heartbeat_interval_ms =
> > @@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
> >  	return info->engine_mask;
> >  }
> >  
> > +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
> > +				 u8 class, const u8 *map, u8 num_instances)
> > +{
> > +	int i, j;
> > +	u8 current_logical_id = 0;
> > +
> > +	for (j = 0; j < num_instances; ++j) {
> > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > +			if (!HAS_ENGINE(gt, i) ||
> > +			    intel_engines[i].class != class)
> > +				continue;
> > +
> > +			if (intel_engines[i].instance == map[j]) {
> > +				logical_ids[intel_engines[i].instance] =
> > +					current_logical_id++;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
> > +{
> > +	int i;
> > +	u8 map[MAX_ENGINE_INSTANCE + 1];
> > +
> > +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
> > +		map[i] = i;
> > +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
> > +}
> > +
> >  /**
> >   * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
> >   * @gt: pointer to struct intel_gt
> > @@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> >  	struct drm_i915_private *i915 = gt->i915;
> >  	const unsigned int engine_mask = init_engine_mask(gt);
> >  	unsigned int mask = 0;
> > -	unsigned int i;
> > +	unsigned int i, class;
> > +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
> >  	int err;
> >  
> >  	drm_WARN_ON(&i915->drm, engine_mask == 0);
> > @@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> >  	if (i915_inject_probe_failure(i915))
> >  		return -ENODEV;
> >  
> > -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
> > -		if (!HAS_ENGINE(gt, i))
> > -			continue;
> > +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
> > +		setup_logical_ids(gt, logical_ids, class);
> >  
> > -		err = intel_engine_setup(gt, i);
> > -		if (err)
> > -			goto cleanup;
> > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > +			u8 instance = intel_engines[i].instance;
> > +
> > +			if (intel_engines[i].class != class ||
> > +			    !HAS_ENGINE(gt, i))
> > +				continue;
> >  
> > -		mask |= BIT(i);
> > +			err = intel_engine_setup(gt, i,
> > +						 logical_ids[instance]);
> > +			if (err)
> > +				goto cleanup;
> > +
> > +			mask |= BIT(i);
> > +		}
> >  	}
> >  
> >  	/*
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > index ed91bcff20eb..85e5c9a9e502 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > @@ -266,6 +266,7 @@ struct intel_engine_cs {
> >  	unsigned int guc_id;
> >  
> >  	intel_engine_mask_t mask;
> > +	intel_engine_mask_t logical_mask;
> 
> Kerneldoc at least for new stuff. Bonus points if you get the
> struct/header file up to speed (with dummy/fixme comments if need be) so

Sure, can add kerneldoc for the new variables. Definitely don't have time to
get all the structs' kerneldoc up to speed at the moment as my backlog is
about a mile long. Perhaps after we get all of GuC submission upstream I can
take some time to go through all the structures and update the doc.
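
For the new field it would be something along these lines (wording is a rough
draft, not final):

	/**
	 * @logical_mask: logical instance mask of the engine, i.e.
	 * BIT(logical instance), where the logical instance is the engine's
	 * position within its class after accounting for fused-off engines.
	 * Invariant after engine setup.
	 */
	intel_engine_mask_t logical_mask;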

Matt

> we can include it into our overall html hierarchy).
> -Daniel
> 
> >  
> >  	u8 class;
> >  	u8 instance;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > index de5f9c86b9a4..baa1797af1c8 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > @@ -3879,6 +3879,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> >  
> >  		ve->siblings[ve->num_siblings++] = sibling;
> >  		ve->base.mask |= sibling->mask;
> > +		ve->base.logical_mask |= sibling->logical_mask;
> >  
> >  		/*
> >  		 * All physical engines must be compatible for their emission
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > index 6926919bcac6..9f5f43a16182 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
> >  	for_each_engine(engine, gt, id) {
> >  		u8 guc_class = engine_class_to_guc_class(engine->class);
> >  
> > -		system_info->mapping_table[guc_class][engine->instance] =
> > +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
> >  			engine->instance;
> >  	}
> >  }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 310116f40509..dec757d319a2 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1795,23 +1795,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> >  	return __guc_action_deregister_context(guc, guc_id, loop);
> >  }
> >  
> > -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
> > -{
> > -	switch (class) {
> > -	case RENDER_CLASS:
> > -		return mask >> RCS0;
> > -	case VIDEO_ENHANCEMENT_CLASS:
> > -		return mask >> VECS0;
> > -	case VIDEO_DECODE_CLASS:
> > -		return mask >> VCS0;
> > -	case COPY_ENGINE_CLASS:
> > -		return mask >> BCS0;
> > -	default:
> > -		MISSING_CASE(class);
> > -		return 0;
> > -	}
> > -}
> > -
> >  static void guc_context_policy_init(struct intel_engine_cs *engine,
> >  				    struct guc_lrc_desc *desc)
> >  {
> > @@ -1952,8 +1935,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >  
> >  	desc = __get_lrc_desc(guc, ce->guc_lrcd_reg_idx);
> >  	desc->engine_class = engine_class_to_guc_class(engine->class);
> > -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> > -						      engine->mask);
> > +	desc->engine_submit_mask = engine->logical_mask;
> >  	desc->hw_context_desc = ce->lrc.lrca;
> >  	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
> >  	desc->priority = ce->guc_prio;
> > @@ -3978,6 +3960,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> >  		}
> >  
> >  		ve->base.mask |= sibling->mask;
> > +		ve->base.logical_mask |= sibling->logical_mask;
> >  
> >  		if (n != 0 && ve->base.class != sibling->class) {
> >  			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 14/46] drm/i915: Expose logical engine instance to user
  2021-08-09 14:30   ` Daniel Vetter
@ 2021-08-09 18:37     ` Matthew Brost
  2021-08-10  6:53       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 18:37 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 04:30:06PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:11PM -0700, Matthew Brost wrote:
> > Expose logical engine instance to user via query engine info IOCTL. This
> > is required for split-frame workloads as these needs to be placed on
> > engines in a logically contiguous order. The logical mapping can change
> > based on fusing. Rather than having the user have knowledge of the fusing
> > we simply expose the logical mapping with the existing query engine
> > info IOCTL.
> > 
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> Uapi must have a link to the userspace MR/patch set using this, and to the
> igt patch set validating it.
> 

Have an IGT:
https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1

Not sure when the media UMD is going to be updated upstream to use this.
Does that mean I can't merge this until the media UMD is ready? Seems
like it but isn't that a circular dependency? How can the media team
develop for a new uAPI that isn't in the kernel yet?

For what it is worth the downstream release is already using this.

Matt

> Ideally in each patch, since it's way too hard to unfortunately find the
> cover letter late on.
> 
> Jason even went as far as making this a hard requirement because he wasted
> a bit too much time trying to find the userspace for new uapi:
> 
> https://lore.kernel.org/dri-devel/20210804185704.624883-1-jason@jlekstrand.net/
> 
> Cheers, Daniel
> 
> >---
> >  drivers/gpu/drm/i915/i915_query.c | 2 ++
> >  include/uapi/drm/i915_drm.h       | 8 +++++++-
> >  2 files changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
> > index e49da36c62fb..8a72923fbdba 100644
> > --- a/drivers/gpu/drm/i915/i915_query.c
> > +++ b/drivers/gpu/drm/i915/i915_query.c
> > @@ -124,7 +124,9 @@ query_engine_info(struct drm_i915_private *i915,
> >  	for_each_uabi_engine(engine, i915) {
> >  		info.engine.engine_class = engine->uabi_class;
> >  		info.engine.engine_instance = engine->uabi_instance;
> > +		info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;
> >  		info.capabilities = engine->uabi_capabilities;
> > +		info.logical_instance = ilog2(engine->logical_mask);
> >  
> >  		if (copy_to_user(info_ptr, &info, sizeof(info)))
> >  			return -EFAULT;
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index 7f13d241417f..ef72e07fe08c 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -2706,14 +2706,20 @@ struct drm_i915_engine_info {
> >  
> >  	/** @flags: Engine flags. */
> >  	__u64 flags;
> > +#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
> >  
> >  	/** @capabilities: Capabilities of this engine. */
> >  	__u64 capabilities;
> >  #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
> >  #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
> >  
> > +	/** @logical_instance: Logical instance of engine */
> > +	__u16 logical_instance;
> > +
> >  	/** @rsvd1: Reserved fields. */
> > -	__u64 rsvd1[4];
> > +	__u16 rsvd1[3];
> > +	/** @rsvd2: Reserved fields. */
> > +	__u64 rsvd2[3];
> >  };
> >  
> >  /**
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship
  2021-08-09 14:37   ` Daniel Vetter
  2021-08-09 14:40     ` Daniel Vetter
@ 2021-08-09 18:44     ` Matthew Brost
  2021-08-10  8:45       ` Daniel Vetter
  1 sibling, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 18:44 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 04:37:55PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:12PM -0700, Matthew Brost wrote:
> > Introduce context parent-child relationship. Once this relationship is
> > created all pinning / unpinning operations are directed to the parent
> > context. The parent context is responsible for pinning all of its'
> > children and itself.
> > 
> > This is a precursor to the full GuC multi-lrc implementation but aligns
> > with how the GuC multi-lrc interface is defined - a single H2G is used to
> > register / deregister all of the contexts simultaneously.
> > 
> > Subsequent patches in the series will implement the pinning / unpinning
> > operations for parent / child contexts.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_context.c       | 29 +++++++++++++++++++
> >  drivers/gpu/drm/i915/gt/intel_context.h       | 18 ++++++++++++
> >  drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++++++++
> >  3 files changed, 59 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 745e84c72c90..8cb92b10b547 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -395,6 +395,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >  	spin_lock_init(&ce->guc_state.lock);
> >  	INIT_LIST_HEAD(&ce->guc_state.fences);
> >  
> > +	INIT_LIST_HEAD(&ce->guc_child_list);
> > +
> >  	spin_lock_init(&ce->guc_active.lock);
> >  	INIT_LIST_HEAD(&ce->guc_active.requests);
> >  
> > @@ -414,10 +416,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >  
> >  void intel_context_fini(struct intel_context *ce)
> >  {
> > +	struct intel_context *child, *next;
> > +
> >  	if (ce->timeline)
> >  		intel_timeline_put(ce->timeline);
> >  	i915_vm_put(ce->vm);
> >  
> > +	/* Need to put the creation ref for the children */
> > +	if (intel_context_is_parent(ce))
> > +		for_each_child_safe(ce, child, next)
> > +			intel_context_put(child);
> > +
> >  	mutex_destroy(&ce->pin_mutex);
> >  	i915_active_fini(&ce->active);
> >  }
> > @@ -533,6 +542,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
> >  	return active;
> >  }
> >  
> > +void intel_context_bind_parent_child(struct intel_context *parent,
> > +				     struct intel_context *child)
> > +{
> > +	/*
> > +	 * Caller's responsibility to validate that this function is used
> > +	 * correctly but we use GEM_BUG_ON here to ensure that they do.
> > +	 */
> > +	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
> > +	GEM_BUG_ON(intel_context_is_pinned(parent));
> > +	GEM_BUG_ON(intel_context_is_child(parent));
> > +	GEM_BUG_ON(intel_context_is_pinned(child));
> > +	GEM_BUG_ON(intel_context_is_child(child));
> > +	GEM_BUG_ON(intel_context_is_parent(child));
> > +
> > +	parent->guc_number_children++;
> > +	list_add_tail(&child->guc_child_link,
> > +		      &parent->guc_child_list);
> > +	child->parent = parent;
> > +}
> > +
> >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> >  #include "selftest_context.c"
> >  #endif
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index c41098950746..ad6ce5ac4824 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -44,6 +44,24 @@ void intel_context_free(struct intel_context *ce);
> >  int intel_context_reconfigure_sseu(struct intel_context *ce,
> >  				   const struct intel_sseu sseu);
> >  
> > +static inline bool intel_context_is_child(struct intel_context *ce)
> > +{
> > +	return !!ce->parent;
> > +}
> > +
> > +static inline bool intel_context_is_parent(struct intel_context *ce)
> > +{
> > +	return !!ce->guc_number_children;
> > +}
> > +
> > +void intel_context_bind_parent_child(struct intel_context *parent,
> > +				     struct intel_context *child);
> > +
> > +#define for_each_child(parent, ce)\
> > +	list_for_each_entry(ce, &(parent)->guc_child_list, guc_child_link)
> > +#define for_each_child_safe(parent, ce, cn)\
> > +	list_for_each_entry_safe(ce, cn, &(parent)->guc_child_list, guc_child_link)
> > +
> >  /**
> >   * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
> >   * @ce - the context
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 2df79ba39867..66b22b370a72 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -202,6 +202,18 @@ struct intel_context {
> >  	/* GuC context blocked fence */
> >  	struct i915_sw_fence guc_blocked;
> >  
> > +	/* Head of children list or link in parent's children list */
> 
> Kerneldoc layout would be nice, plus explaining when exactly this is
> > set or the list empty (e.g. guc_child_list is empty if and only if
> guc_number_children > 0 and parent == NULL).
> 

Sure.

> Also mentioning that these are invariant over the lifetime of the object
> would be nice.
>

Yes, this is a context creation setup step that is done exactly once and
is invariant over the lifetime of these contexts.

> Finally some words on refcounting (like who holds a reference on whom and
> how we guarantee that use-after-free doesn't go boom since you have links
> both ways). It looks like parent holds a reference on the child, so how do
> you make sure the child looking at the parent doesn't go boom?

I hadn't really thought about the child looking at the parent but I
believe it is safe. The child only looks up the parent when submissions
are in flight. We always have refs on the contexts when submissions are
in flight so we should be good - e.g. the last ref to parent is dropped
only after all submissions are done and the context is closed.
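
i.e. the access pattern is always something like (illustrative only, not a
function from this series):

	static struct intel_context *request_to_parent(struct i915_request *rq)
	{
		struct intel_context *ce = rq->context;

		/*
		 * Safe without extra refcounting: the in-flight request holds a
		 * reference on ce, and the last reference on the parent is only
		 * dropped after all submissions are done and the context is
		 * closed.
		 */
		return intel_context_is_child(ce) ? ce->parent : ce;
	}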

Matt

> -Daniel
> 
> > +	union {
> > +		struct list_head guc_child_list;	/* parent */
> > +		struct list_head guc_child_link;	/* child */
> > +	};
> > +
> > +	/* Pointer to parent */
> > +	struct intel_context *parent;
> > +
> > +	/* Number of children if parent */
> > +	u8 guc_number_children;
> > +
> >  	/*
> >  	 * GuC priority management
> >  	 */
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship
  2021-08-09 14:40     ` Daniel Vetter
@ 2021-08-09 18:45       ` Matthew Brost
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 18:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 04:40:11PM +0200, Daniel Vetter wrote:
> On Mon, Aug 09, 2021 at 04:37:55PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:12PM -0700, Matthew Brost wrote:
> > > Introduce context parent-child relationship. Once this relationship is
> > > created all pinning / unpinning operations are directed to the parent
> > > context. The parent context is responsible for pinning all of its
> > > children and itself.
> > > 
> > > This is a precursor to the full GuC multi-lrc implementation but aligns
> > > with how the GuC multi-lrc interface is defined - a single H2G is used to
> > > register / deregister all of the contexts simultaneously.
> > > 
> > > Subsequent patches in the series will implement the pinning / unpinning
> > > operations for parent / child contexts.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_context.c       | 29 +++++++++++++++++++
> > >  drivers/gpu/drm/i915/gt/intel_context.h       | 18 ++++++++++++
> > >  drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++++++++
> > >  3 files changed, 59 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > index 745e84c72c90..8cb92b10b547 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > @@ -395,6 +395,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> > >  	spin_lock_init(&ce->guc_state.lock);
> > >  	INIT_LIST_HEAD(&ce->guc_state.fences);
> > >  
> > > +	INIT_LIST_HEAD(&ce->guc_child_list);
> > > +
> > >  	spin_lock_init(&ce->guc_active.lock);
> > >  	INIT_LIST_HEAD(&ce->guc_active.requests);
> > >  
> > > @@ -414,10 +416,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> > >  
> > >  void intel_context_fini(struct intel_context *ce)
> > >  {
> > > +	struct intel_context *child, *next;
> > > +
> > >  	if (ce->timeline)
> > >  		intel_timeline_put(ce->timeline);
> > >  	i915_vm_put(ce->vm);
> > >  
> > > +	/* Need to put the creation ref for the children */
> > > +	if (intel_context_is_parent(ce))
> > > +		for_each_child_safe(ce, child, next)
> > > +			intel_context_put(child);
> > > +
> > >  	mutex_destroy(&ce->pin_mutex);
> > >  	i915_active_fini(&ce->active);
> > >  }
> > > @@ -533,6 +542,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
> > >  	return active;
> > >  }
> > >  
> > > +void intel_context_bind_parent_child(struct intel_context *parent,
> > > +				     struct intel_context *child)
> > > +{
> > > +	/*
> > > +	 * Caller's responsibility to validate that this function is used
> > > +	 * correctly but we use GEM_BUG_ON here to ensure that they do.
> > > +	 */
> > > +	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
> > > +	GEM_BUG_ON(intel_context_is_pinned(parent));
> > > +	GEM_BUG_ON(intel_context_is_child(parent));
> > > +	GEM_BUG_ON(intel_context_is_pinned(child));
> > > +	GEM_BUG_ON(intel_context_is_child(child));
> > > +	GEM_BUG_ON(intel_context_is_parent(child));
> > > +
> > > +	parent->guc_number_children++;
> > > +	list_add_tail(&child->guc_child_link,
> > > +		      &parent->guc_child_list);
> > > +	child->parent = parent;
> > > +}
> > > +
> > >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > >  #include "selftest_context.c"
> > >  #endif
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > index c41098950746..ad6ce5ac4824 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > @@ -44,6 +44,24 @@ void intel_context_free(struct intel_context *ce);
> > >  int intel_context_reconfigure_sseu(struct intel_context *ce,
> > >  				   const struct intel_sseu sseu);
> > >  
> > > +static inline bool intel_context_is_child(struct intel_context *ce)
> > > +{
> > > +	return !!ce->parent;
> > > +}
> > > +
> > > +static inline bool intel_context_is_parent(struct intel_context *ce)
> > > +{
> > > +	return !!ce->guc_number_children;
> > > +}
> > > +
> > > +void intel_context_bind_parent_child(struct intel_context *parent,
> > > +				     struct intel_context *child);
> > > +
> > > +#define for_each_child(parent, ce)\
> > > +	list_for_each_entry(ce, &(parent)->guc_child_list, guc_child_link)
> > > +#define for_each_child_safe(parent, ce, cn)\
> > > +	list_for_each_entry_safe(ce, cn, &(parent)->guc_child_list, guc_child_link)
> > > +
> > >  /**
> > >   * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
> > >   * @ce - the context
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > index 2df79ba39867..66b22b370a72 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > @@ -202,6 +202,18 @@ struct intel_context {
> > >  	/* GuC context blocked fence */
> > >  	struct i915_sw_fence guc_blocked;
> > >  
> > > +	/* Head of children list or link in parent's children list */
> > 
> > Kerneldoc layout would be nice, plus explaining when exactly this is
> > set or the list empty (e.g. guc_child_list is empty if and only if
> > guc_number_children > 0 and parent == NULL).
> > 
> > Also mentioning that these are invariant over the lifetime of the object
> > would be nice.
> > 
> > Finally some words on refcounting (like who holds a reference on whom and
> > how we guarantee that use-after-free doesn't go boom since you have links
> > both ways). It looks like parent holds a reference on the child, so how do
> > you make sure the child looking at the parent doesn't go boom?
> > -Daniel
> > 
> > > +	union {
> > > +		struct list_head guc_child_list;	/* parent */
> > > +		struct list_head guc_child_link;	/* child */
> > > +	};
> > > +
> > > +	/* Pointer to parent */
> > > +	struct intel_context *parent;
> > > +
> > > +	/* Number of children if parent */
> > > +	u8 guc_number_children;
> 
> Another one: Can we really not afford an int here? The nasty thing about
> unsigned is that wrap-around is well-defined, which is why gcc won't ever
> complain about it. Which hides bugs. Same for next patch, which also
> micro-optimizes a few fields to be tiny.
> 
> We generally don't have thousands of contexts hanging around, unless
> there's a reason (which should be documented) this feels like it's
> squarely on the wrong side of "don't prematurely optimize".

Ok, int it is.

Matt

> -Daniel
> 
> > > +
> > >  	/*
> > >  	 * GuC priority management
> > >  	 */
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  2021-08-09 15:17   ` Daniel Vetter
@ 2021-08-09 18:58     ` Matthew Brost
  2021-08-10  8:53       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 18:58 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 05:17:34PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:13PM -0700, Matthew Brost wrote:
> > Implement GuC parent-child context pin / unpin functions in which if any
> > context in the relationship is pinned, all the contexts are pinned. The
> > parent owns most of the pinning / unpinning process and the children
> > direct any pins / unpins to the parent.
> > 
> > Patch implements a number of unused functions that will be connected
> > later in the series.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_context.c       | 187 ++++++++++++++++--
> >  drivers/gpu/drm/i915/gt/intel_context.h       |  43 +---
> >  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
> >  .../drm/i915/gt/intel_execlists_submission.c  |  25 ++-
> >  drivers/gpu/drm/i915/gt/intel_lrc.c           |  26 +--
> >  drivers/gpu/drm/i915/gt/intel_lrc.h           |   6 +-
> >  .../gpu/drm/i915/gt/intel_ring_submission.c   |   5 +-
> >  drivers/gpu/drm/i915/gt/mock_engine.c         |   4 +-
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 183 +++++++++++++++--
> >  9 files changed, 371 insertions(+), 112 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 8cb92b10b547..bb4c14656067 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -158,8 +158,8 @@ static void __ring_retire(struct intel_ring *ring)
> >  	intel_ring_unpin(ring);
> >  }
> >  
> > -static int intel_context_pre_pin(struct intel_context *ce,
> > -				 struct i915_gem_ww_ctx *ww)
> > +static int __intel_context_pre_pin(struct intel_context *ce,
> > +				   struct i915_gem_ww_ctx *ww)
> >  {
> >  	int err;
> >  
> > @@ -190,7 +190,7 @@ static int intel_context_pre_pin(struct intel_context *ce,
> >  	return err;
> >  }
> >  
> > -static void intel_context_post_unpin(struct intel_context *ce)
> > +static void __intel_context_post_unpin(struct intel_context *ce)
> >  {
> >  	if (ce->state)
> >  		__context_unpin_state(ce->state);
> > @@ -199,13 +199,85 @@ static void intel_context_post_unpin(struct intel_context *ce)
> >  	__ring_retire(ce->ring);
> >  }
> >  
> > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > -			      struct i915_gem_ww_ctx *ww)
> > +static int intel_context_pre_pin(struct intel_context *ce,
> > +				 struct i915_gem_ww_ctx *ww)
> >  {
> > -	bool handoff = false;
> > -	void *vaddr;
> > +	struct intel_context *child;
> > +	int err, i = 0;
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> > +	for_each_child(ce, child) {
> > +		err = __intel_context_pre_pin(child, ww);
> > +		if (unlikely(err))
> > +			goto unwind;
> > +		++i;
> > +	}
> > +
> > +	err = __intel_context_pre_pin(ce, ww);
> > +	if (unlikely(err))
> > +		goto unwind;
> > +
> > +	return 0;
> > +
> > +unwind:
> > +	for_each_child(ce, child) {
> > +		if (!i--)
> > +			break;
> > +		__intel_context_post_unpin(ce);
> > +	}
> > +
> > +	return err;
> > +}
> > +
> > +static void intel_context_post_unpin(struct intel_context *ce)
> > +{
> > +	struct intel_context *child;
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> > +	for_each_child(ce, child)
> > +		__intel_context_post_unpin(child);
> > +
> > +	__intel_context_post_unpin(ce);
> > +}
> > +
> > +static int __do_ww_lock(struct intel_context *ce,
> > +			struct i915_gem_ww_ctx *ww)
> > +{
> > +	int err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > +
> > +	if (!err && ce->ring->vma->obj)
> > +		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > +	if (!err && ce->state)
> > +		err = i915_gem_object_lock(ce->state->obj, ww);
> > +
> > +	return err;
> > +}
> > +
> > +static int do_ww_lock(struct intel_context *ce,
> > +		      struct i915_gem_ww_ctx *ww)
> > +{
> > +	struct intel_context *child;
> >  	int err = 0;
> >  
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> > +	for_each_child(ce, child) {
> > +		err = __do_ww_lock(child, ww);
> > +		if (unlikely(err))
> > +			return err;
> > +	}
> > +
> > +	return __do_ww_lock(ce, ww);
> > +}
> > +
> > +static int __intel_context_do_pin_ww(struct intel_context *ce,
> > +				     struct i915_gem_ww_ctx *ww)
> > +{
> > +	bool handoff = false;
> > +	int err;
> > +
> >  	if (unlikely(!test_bit(CONTEXT_ALLOC_BIT, &ce->flags))) {
> >  		err = intel_context_alloc_state(ce);
> >  		if (err)
> > @@ -217,14 +289,11 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> >  	 * refcount for __intel_context_active(), which prevent a lock
> >  	 * inversion of ce->pin_mutex vs dma_resv_lock().
> >  	 */
> > +	err = do_ww_lock(ce, ww);
> > +	if (err)
> > +		return err;
> >  
> > -	err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > -	if (!err && ce->ring->vma->obj)
> > -		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > -	if (!err && ce->state)
> > -		err = i915_gem_object_lock(ce->state->obj, ww);
> > -	if (!err)
> > -		err = intel_context_pre_pin(ce, ww);
> > +	err = intel_context_pre_pin(ce, ww);
> >  	if (err)
> >  		return err;
> >  
> > @@ -232,7 +301,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> >  	if (err)
> >  		goto err_ctx_unpin;
> >  
> > -	err = ce->ops->pre_pin(ce, ww, &vaddr);
> > +	err = ce->ops->pre_pin(ce, ww);
> >  	if (err)
> >  		goto err_release;
> >  
> > @@ -250,7 +319,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> >  		if (unlikely(err))
> >  			goto err_unlock;
> >  
> > -		err = ce->ops->pin(ce, vaddr);
> > +		err = ce->ops->pin(ce);
> >  		if (err) {
> >  			intel_context_active_release(ce);
> >  			goto err_unlock;
> > @@ -290,7 +359,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> >  	return err;
> >  }
> >  
> > -int __intel_context_do_pin(struct intel_context *ce)
> > +static int __intel_context_do_pin(struct intel_context *ce)
> >  {
> >  	struct i915_gem_ww_ctx ww;
> >  	int err;
> > @@ -337,7 +406,7 @@ static void __intel_context_retire(struct i915_active *active)
> >  		 intel_context_get_avg_runtime_ns(ce));
> >  
> >  	set_bit(CONTEXT_VALID_BIT, &ce->flags);
> > -	intel_context_post_unpin(ce);
> > +	__intel_context_post_unpin(ce);
> >  	intel_context_put(ce);
> >  }
> >  
> > @@ -562,6 +631,88 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> >  	child->parent = parent;
> >  }
> >  
> > +static inline int ____intel_context_pin(struct intel_context *ce)
> > +{
> > +	if (likely(intel_context_pin_if_active(ce)))
> > +		return 0;
> > +
> > +	return __intel_context_do_pin(ce);
> > +}
> > +
> > +static inline int __intel_context_pin_ww(struct intel_context *ce,
> > +					 struct i915_gem_ww_ctx *ww)
> > +{
> > +	if (likely(intel_context_pin_if_active(ce)))
> > +		return 0;
> > +
> > +	return __intel_context_do_pin_ww(ce, ww);
> > +}
> > +
> > +static inline void __intel_context_unpin(struct intel_context *ce)
> > +{
> > +	if (!ce->ops->sched_disable) {
> > +		__intel_context_do_unpin(ce, 1);
> > +	} else {
> > +		/*
> > +		 * Move ownership of this pin to the scheduling disable which is
> > +		 * an async operation. When that operation completes the above
> > +		 * intel_context_sched_disable_unpin is called potentially
> > +		 * unpinning the context.
> > +		 */
> > +		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > +			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> 
> Uh man lockless algorithms.
> 
> Unless this comes:
> - with essentially an academic looking paper that describes the abstract
>   model of the lockless algorithm and proves it against the linux kernel
>   memory model.
> 
> - lockless stuff generally needs barriers, and those barriers must be all
>   documented. This means a) a comment next to each barrier in the code b)
>   pointing to its counterparty c) with the overall design also explained
>   in the kerneldoc for those data structures.
> 
>   If you don't know where your barriers are, see above point about "it
>   should look more like an academic paper in the commit message"
> 
> - hard perf data about how this is absolutely required, based on a
>   real-world use-case (which then sometimes justifies a microbenchmark
>   metric for the details, but it always needs to be real-world based). And
>   also a thorough explainer of how the perf issue isn't fixable through
>   better design. If that's not doable, just protect the state machine with
>   a big dumb lock and move on.
> 
> - Also, because the current code is in such bad shape wrt lockless
>   algorithms and premature optimizations: Overall complexity should go
>   down (it's way too high right now), so pay down your new lockless trick
>   by removing one of the existing ones that we only have because we can.
> 
> Yes this is steep, but we're way out in the woods here and need to somehow
> get back.

See the FIXME below. At one point all of this was hidden in the backend but
the dma-resv patches that landed upstream completely broke the layering,
hence the need for the code here.

I guess I don't really understand what you mean when you say a lockless
algorithm needs barriers - if the atomic functions were not really
atomic, wouldn't the world be broken?

Also, here I don't think it is really as simple as grabbing a big dumb
lock, for a variety of reasons, at least with the current dynamic pin /
unpin code in place. If we move to perma-pinned contexts this could be
cleaned up then.
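
For illustration, a lock-protected variant would look something like the
sketch below (hypothetical, not part of this series; the lock name is made
up). The catch is that intel_context_pin_if_active() still bumps pin_count
outside any such lock, and the final reference still has to be handed off
to the async sched_disable, which is exactly the awkward part:

	/* sketch only: serialise the unpin / sched_disable handoff */
	spin_lock(&ce->pin_lock);	/* hypothetical per-context lock */
	if (atomic_read(&ce->pin_count) > 1) {
		atomic_dec(&ce->pin_count);
	} else {
		/* 1 -> 2: hand the final pin to the async sched disable */
		atomic_inc(&ce->pin_count);
		ce->ops->sched_disable(ce);
	}
	spin_unlock(&ce->pin_lock);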

Matt

> -Daniel
> 
> > +				ce->ops->sched_disable(ce);
> > +				break;
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +/*
> > + * FIXME: This is ugly, these branches are only needed for parallel contexts in
> > + * GuC submission. Basically the idea is if any of the contexts, that are
> > + * configured for parallel submission, are pinned all the contexts need to be
> > + * pinned in order to register these contexts with the GuC. We are adding the
> > + * layer here while it should probably be pushed to the backend via a vfunc. But
> > + * since we already have ce->pin + a layer atop it is confusing. Definitely
> > + * needs a bit of rework how to properly layer / structure this code path. What
> > + * is in place works but is not ideal.
> > + */
> > +int intel_context_pin(struct intel_context *ce)
> > +{
> > +	if (intel_context_is_child(ce)) {
> > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > +			return ____intel_context_pin(ce->parent);
> > +		else
> > +			return 0;
> > +	} else {
> > +		return ____intel_context_pin(ce);
> > +	}
> > +}
> > +
> > +int intel_context_pin_ww(struct intel_context *ce,
> > +			 struct i915_gem_ww_ctx *ww)
> > +{
> > +	if (intel_context_is_child(ce)) {
> > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > +			return __intel_context_pin_ww(ce->parent, ww);
> > +		else
> > +			return 0;
> > +	} else {
> > +		return __intel_context_pin_ww(ce, ww);
> > +	}
> > +}
> > +
> > +void intel_context_unpin(struct intel_context *ce)
> > +{
> > +	if (intel_context_is_child(ce)) {
> > +		if (atomic_fetch_add(-1, &ce->pin_count) == 1)
> > +			__intel_context_unpin(ce->parent);
> > +	} else {
> > +		__intel_context_unpin(ce);
> > +	}
> > +}
> > +
> >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> >  #include "selftest_context.c"
> >  #endif
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index ad6ce5ac4824..c208691fc87d 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -110,31 +110,15 @@ static inline void intel_context_unlock_pinned(struct intel_context *ce)
> >  	mutex_unlock(&ce->pin_mutex);
> >  }
> >  
> > -int __intel_context_do_pin(struct intel_context *ce);
> > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > -			      struct i915_gem_ww_ctx *ww);
> > -
> >  static inline bool intel_context_pin_if_active(struct intel_context *ce)
> >  {
> >  	return atomic_inc_not_zero(&ce->pin_count);
> >  }
> >  
> > -static inline int intel_context_pin(struct intel_context *ce)
> > -{
> > -	if (likely(intel_context_pin_if_active(ce)))
> > -		return 0;
> > -
> > -	return __intel_context_do_pin(ce);
> > -}
> > -
> > -static inline int intel_context_pin_ww(struct intel_context *ce,
> > -				       struct i915_gem_ww_ctx *ww)
> > -{
> > -	if (likely(intel_context_pin_if_active(ce)))
> > -		return 0;
> > +int intel_context_pin(struct intel_context *ce);
> >  
> > -	return __intel_context_do_pin_ww(ce, ww);
> > -}
> > +int intel_context_pin_ww(struct intel_context *ce,
> > +			 struct i915_gem_ww_ctx *ww);
> >  
> >  static inline void __intel_context_pin(struct intel_context *ce)
> >  {
> > @@ -146,28 +130,11 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub);
> >  
> >  static inline void intel_context_sched_disable_unpin(struct intel_context *ce)
> >  {
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >  	__intel_context_do_unpin(ce, 2);
> >  }
> >  
> > -static inline void intel_context_unpin(struct intel_context *ce)
> > -{
> > -	if (!ce->ops->sched_disable) {
> > -		__intel_context_do_unpin(ce, 1);
> > -	} else {
> > -		/*
> > -		 * Move ownership of this pin to the scheduling disable which is
> > -		 * an async operation. When that operation completes the above
> > -		 * intel_context_sched_disable_unpin is called potentially
> > -		 * unpinning the context.
> > -		 */
> > -		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > -			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > -				ce->ops->sched_disable(ce);
> > -				break;
> > -			}
> > -		}
> > -	}
> > -}
> > +void intel_context_unpin(struct intel_context *ce);
> >  
> >  void intel_context_enter_engine(struct intel_context *ce);
> >  void intel_context_exit_engine(struct intel_context *ce);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 66b22b370a72..eb82be15b7a2 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -39,8 +39,8 @@ struct intel_context_ops {
> >  
> >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> >  
> > -	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
> > -	int (*pin)(struct intel_context *ce, void *vaddr);
> > +	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > +	int (*pin)(struct intel_context *ce);
> >  	void (*unpin)(struct intel_context *ce);
> >  	void (*post_unpin)(struct intel_context *ce);
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > index baa1797af1c8..fc74ca28f245 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > @@ -2554,16 +2554,17 @@ static void execlists_submit_request(struct i915_request *request)
> >  static int
> >  __execlists_context_pre_pin(struct intel_context *ce,
> >  			    struct intel_engine_cs *engine,
> > -			    struct i915_gem_ww_ctx *ww, void **vaddr)
> > +			    struct i915_gem_ww_ctx *ww)
> >  {
> >  	int err;
> >  
> > -	err = lrc_pre_pin(ce, engine, ww, vaddr);
> > +	err = lrc_pre_pin(ce, engine, ww);
> >  	if (err)
> >  		return err;
> >  
> >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags)) {
> > -		lrc_init_state(ce, engine, *vaddr);
> > +		lrc_init_state(ce, engine, ce->lrc_reg_state -
> > +			       LRC_STATE_OFFSET / sizeof(*ce->lrc_reg_state));
> >  
> >  		 __i915_gem_object_flush_map(ce->state->obj, 0, engine->context_size);
> >  	}
> > @@ -2572,15 +2573,14 @@ __execlists_context_pre_pin(struct intel_context *ce,
> >  }
> >  
> >  static int execlists_context_pre_pin(struct intel_context *ce,
> > -				     struct i915_gem_ww_ctx *ww,
> > -				     void **vaddr)
> > +				     struct i915_gem_ww_ctx *ww)
> >  {
> > -	return __execlists_context_pre_pin(ce, ce->engine, ww, vaddr);
> > +	return __execlists_context_pre_pin(ce, ce->engine, ww);
> >  }
> >  
> > -static int execlists_context_pin(struct intel_context *ce, void *vaddr)
> > +static int execlists_context_pin(struct intel_context *ce)
> >  {
> > -	return lrc_pin(ce, ce->engine, vaddr);
> > +	return lrc_pin(ce, ce->engine);
> >  }
> >  
> >  static int execlists_context_alloc(struct intel_context *ce)
> > @@ -3570,20 +3570,19 @@ static int virtual_context_alloc(struct intel_context *ce)
> >  }
> >  
> >  static int virtual_context_pre_pin(struct intel_context *ce,
> > -				   struct i915_gem_ww_ctx *ww,
> > -				   void **vaddr)
> > +				   struct i915_gem_ww_ctx *ww)
> >  {
> >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> >  
> >  	 /* Note: we must use a real engine class for setting up reg state */
> > -	return __execlists_context_pre_pin(ce, ve->siblings[0], ww, vaddr);
> > +	return __execlists_context_pre_pin(ce, ve->siblings[0], ww);
> >  }
> >  
> > -static int virtual_context_pin(struct intel_context *ce, void *vaddr)
> > +static int virtual_context_pin(struct intel_context *ce)
> >  {
> >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> >  
> > -	return lrc_pin(ce, ve->siblings[0], vaddr);
> > +	return lrc_pin(ce, ve->siblings[0]);
> >  }
> >  
> >  static void virtual_context_enter(struct intel_context *ce)
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index bb4af4977920..c466fc966005 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -947,30 +947,30 @@ void lrc_reset(struct intel_context *ce)
> >  int
> >  lrc_pre_pin(struct intel_context *ce,
> >  	    struct intel_engine_cs *engine,
> > -	    struct i915_gem_ww_ctx *ww,
> > -	    void **vaddr)
> > +	    struct i915_gem_ww_ctx *ww)
> >  {
> > +	void *vaddr;
> >  	GEM_BUG_ON(!ce->state);
> >  	GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
> >  
> > -	*vaddr = i915_gem_object_pin_map(ce->state->obj,
> > -					 i915_coherent_map_type(ce->engine->i915,
> > -								ce->state->obj,
> > -								false) |
> > -					 I915_MAP_OVERRIDE);
> > +	vaddr = i915_gem_object_pin_map(ce->state->obj,
> > +					i915_coherent_map_type(ce->engine->i915,
> > +							       ce->state->obj,
> > +							       false) |
> > +					I915_MAP_OVERRIDE);
> >  
> > -	return PTR_ERR_OR_ZERO(*vaddr);
> > +	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > +
> > +	return PTR_ERR_OR_ZERO(vaddr);
> >  }
> >  
> >  int
> >  lrc_pin(struct intel_context *ce,
> > -	struct intel_engine_cs *engine,
> > -	void *vaddr)
> > +	struct intel_engine_cs *engine)
> >  {
> > -	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > -
> >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags))
> > -		lrc_init_state(ce, engine, vaddr);
> > +		lrc_init_state(ce, engine,
> > +			       (void *)ce->lrc_reg_state - LRC_STATE_OFFSET);
> >  
> >  	ce->lrc.lrca = lrc_update_regs(ce, engine, ce->ring->tail);
> >  	return 0;
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > index 7f697845c4cf..837fcf00270d 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > @@ -38,12 +38,10 @@ void lrc_destroy(struct kref *kref);
> >  int
> >  lrc_pre_pin(struct intel_context *ce,
> >  	    struct intel_engine_cs *engine,
> > -	    struct i915_gem_ww_ctx *ww,
> > -	    void **vaddr);
> > +	    struct i915_gem_ww_ctx *ww);
> >  int
> >  lrc_pin(struct intel_context *ce,
> > -	struct intel_engine_cs *engine,
> > -	void *vaddr);
> > +	struct intel_engine_cs *engine);
> >  void lrc_unpin(struct intel_context *ce);
> >  void lrc_post_unpin(struct intel_context *ce);
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > index 2958e2fae380..f4f301bfb9f7 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > @@ -472,8 +472,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
> >  }
> >  
> >  static int ring_context_pre_pin(struct intel_context *ce,
> > -				struct i915_gem_ww_ctx *ww,
> > -				void **unused)
> > +				struct i915_gem_ww_ctx *ww)
> >  {
> >  	struct i915_address_space *vm;
> >  	int err = 0;
> > @@ -576,7 +575,7 @@ static int ring_context_alloc(struct intel_context *ce)
> >  	return 0;
> >  }
> >  
> > -static int ring_context_pin(struct intel_context *ce, void *unused)
> > +static int ring_context_pin(struct intel_context *ce)
> >  {
> >  	return 0;
> >  }
> > diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
> > index 2c1af030310c..826b5d7a4573 100644
> > --- a/drivers/gpu/drm/i915/gt/mock_engine.c
> > +++ b/drivers/gpu/drm/i915/gt/mock_engine.c
> > @@ -167,12 +167,12 @@ static int mock_context_alloc(struct intel_context *ce)
> >  }
> >  
> >  static int mock_context_pre_pin(struct intel_context *ce,
> > -				struct i915_gem_ww_ctx *ww, void **unused)
> > +				struct i915_gem_ww_ctx *ww)
> >  {
> >  	return 0;
> >  }
> >  
> > -static int mock_context_pin(struct intel_context *ce, void *unused)
> > +static int mock_context_pin(struct intel_context *ce)
> >  {
> >  	return 0;
> >  }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index dec757d319a2..c5c73c42bcf7 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1905,6 +1905,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >  
> >  	GEM_BUG_ON(!engine->mask);
> >  	GEM_BUG_ON(context_guc_id_invalid(ce));
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >  
> >  	/*
> >  	 * Ensure LRC + CT vmas are is same region as write barrier is done
> > @@ -2008,15 +2009,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> >  
> >  static int __guc_context_pre_pin(struct intel_context *ce,
> >  				 struct intel_engine_cs *engine,
> > -				 struct i915_gem_ww_ctx *ww,
> > -				 void **vaddr)
> > +				 struct i915_gem_ww_ctx *ww)
> >  {
> > -	return lrc_pre_pin(ce, engine, ww, vaddr);
> > +	return lrc_pre_pin(ce, engine, ww);
> >  }
> >  
> >  static int __guc_context_pin(struct intel_context *ce,
> > -			     struct intel_engine_cs *engine,
> > -			     void *vaddr)
> > +			     struct intel_engine_cs *engine)
> >  {
> >  	if (i915_ggtt_offset(ce->state) !=
> >  	    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK))
> > @@ -2027,20 +2026,33 @@ static int __guc_context_pin(struct intel_context *ce,
> >  	 * explaination of why.
> >  	 */
> >  
> > -	return lrc_pin(ce, engine, vaddr);
> > +	return lrc_pin(ce, engine);
> > +}
> > +
> > +static void __guc_context_unpin(struct intel_context *ce)
> > +{
> > +	lrc_unpin(ce);
> > +}
> > +
> > +static void __guc_context_post_unpin(struct intel_context *ce)
> > +{
> > +	lrc_post_unpin(ce);
> >  }
> >  
> >  static int guc_context_pre_pin(struct intel_context *ce,
> > -			       struct i915_gem_ww_ctx *ww,
> > -			       void **vaddr)
> > +			       struct i915_gem_ww_ctx *ww)
> >  {
> > -	return __guc_context_pre_pin(ce, ce->engine, ww, vaddr);
> > +	return __guc_context_pre_pin(ce, ce->engine, ww);
> >  }
> >  
> > -static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > +static int guc_context_pin(struct intel_context *ce)
> >  {
> > -	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > +	int ret;
> >  
> > +	GEM_BUG_ON(intel_context_is_parent(ce) ||
> > +		   intel_context_is_child(ce));
> > +
> > +	ret = __guc_context_pin(ce, ce->engine);
> >  	if (likely(!ret && !intel_context_is_barrier(ce)))
> >  		intel_engine_pm_get(ce->engine);
> >  
> > @@ -2054,7 +2066,7 @@ static void guc_context_unpin(struct intel_context *ce)
> >  	GEM_BUG_ON(context_enabled(ce));
> >  
> >  	unpin_guc_id(guc, ce, true);
> > -	lrc_unpin(ce);
> > +	__guc_context_unpin(ce);
> >  
> >  	if (likely(!intel_context_is_barrier(ce)))
> >  		intel_engine_pm_put(ce->engine);
> > @@ -2062,7 +2074,141 @@ static void guc_context_unpin(struct intel_context *ce)
> >  
> >  static void guc_context_post_unpin(struct intel_context *ce)
> >  {
> > -	lrc_post_unpin(ce);
> > +	__guc_context_post_unpin(ce);
> > +}
> > +
> > +/* Future patches will use this function */
> > +__maybe_unused
> > +static int guc_parent_context_pre_pin(struct intel_context *ce,
> > +				      struct i915_gem_ww_ctx *ww)
> > +{
> > +	struct intel_context *child;
> > +	int err, i = 0, j = 0;
> > +
> > +	for_each_child(ce, child) {
> > +		err = i915_active_acquire(&child->active);
> > +		if (unlikely(err))
> > +			goto unwind_active;
> > +		++i;
> > +	}
> > +
> > +	for_each_child(ce, child) {
> > +		err = __guc_context_pre_pin(child, child->engine, ww);
> > +		if (unlikely(err))
> > +			goto unwind_pre_pin;
> > +		++j;
> > +	}
> > +
> > +	err = __guc_context_pre_pin(ce, ce->engine, ww);
> > +	if (unlikely(err))
> > +		goto unwind_pre_pin;
> > +
> > +	return 0;
> > +
> > +unwind_pre_pin:
> > +	for_each_child(ce, child) {
> > +		if (!j--)
> > +			break;
> > +		__guc_context_post_unpin(child);
> > +	}
> > +
> > +unwind_active:
> > +	for_each_child(ce, child) {
> > +		if (!i--)
> > +			break;
> > +		i915_active_release(&child->active);
> > +	}
> > +
> > +	return err;
> > +}
> > +
> > +/* Future patches will use this function */
> > +__maybe_unused
> > +static void guc_parent_context_post_unpin(struct intel_context *ce)
> > +{
> > +	struct intel_context *child;
> > +
> > +	for_each_child(ce, child)
> > +		__guc_context_post_unpin(child);
> > +	__guc_context_post_unpin(ce);
> > +
> > +	for_each_child(ce, child) {
> > +		intel_context_get(child);
> > +		i915_active_release(&child->active);
> > +		intel_context_put(child);
> > +	}
> > +}
> > +
> > +/* Future patches will use this function */
> > +__maybe_unused
> > +static int guc_parent_context_pin(struct intel_context *ce)
> > +{
> > +	int ret, i = 0, j = 0;
> > +	struct intel_context *child;
> > +	struct intel_engine_cs *engine;
> > +	intel_engine_mask_t tmp;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +
> > +	for_each_child(ce, child) {
> > +		ret = __guc_context_pin(child, child->engine);
> > +		if (unlikely(ret))
> > +			goto unwind_pin;
> > +		++i;
> > +	}
> > +	ret = __guc_context_pin(ce, ce->engine);
> > +	if (unlikely(ret))
> > +		goto unwind_pin;
> > +
> > +	for_each_child(ce, child)
> > +		if (test_bit(CONTEXT_LRCA_DIRTY, &child->flags)) {
> > +			set_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
> > +			break;
> > +		}
> > +
> > +	for_each_engine_masked(engine, ce->engine->gt,
> > +			       ce->engine->mask, tmp)
> > +		intel_engine_pm_get(engine);
> > +	for_each_child(ce, child)
> > +		for_each_engine_masked(engine, child->engine->gt,
> > +				       child->engine->mask, tmp)
> > +			intel_engine_pm_get(engine);
> > +
> > +	return 0;
> > +
> > +unwind_pin:
> > +	for_each_child(ce, child) {
> > +		if (++j > i)
> > +			break;
> > +		__guc_context_unpin(child);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +/* Future patches will use this function */
> > +__maybe_unused
> > +static void guc_parent_context_unpin(struct intel_context *ce)
> > +{
> > +	struct intel_context *child;
> > +	struct intel_engine_cs *engine;
> > +	intel_engine_mask_t tmp;
> > +
> > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > +	GEM_BUG_ON(context_enabled(ce));
> > +
> > +	unpin_guc_id(ce_to_guc(ce), ce, true);
> > +	for_each_child(ce, child)
> > +		__guc_context_unpin(child);
> > +	__guc_context_unpin(ce);
> > +
> > +	for_each_engine_masked(engine, ce->engine->gt,
> > +			       ce->engine->mask, tmp)
> > +		intel_engine_pm_put(engine);
> > +	for_each_child(ce, child)
> > +		for_each_engine_masked(engine, child->engine->gt,
> > +				       child->engine->mask, tmp)
> > +			intel_engine_pm_put(engine);
> >  }
> >  
> >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > @@ -2993,18 +3139,17 @@ static int guc_request_alloc(struct i915_request *rq)
> >  }
> >  
> >  static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > -				       struct i915_gem_ww_ctx *ww,
> > -				       void **vaddr)
> > +				       struct i915_gem_ww_ctx *ww)
> >  {
> >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> >  
> > -	return __guc_context_pre_pin(ce, engine, ww, vaddr);
> > +	return __guc_context_pre_pin(ce, engine, ww);
> >  }
> >  
> > -static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > +static int guc_virtual_context_pin(struct intel_context *ce)
> >  {
> >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > -	int ret = __guc_context_pin(ce, engine, vaddr);
> > +	int ret = __guc_context_pin(ce, engine);
> >  	intel_engine_mask_t tmp, mask = ce->engine->mask;
> >  
> >  	if (likely(!ret))
> > @@ -3024,7 +3169,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> >  	GEM_BUG_ON(intel_context_is_barrier(ce));
> >  
> >  	unpin_guc_id(guc, ce, true);
> > -	lrc_unpin(ce);
> > +	__guc_context_unpin(ce);
> >  
> >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> >  		intel_engine_pm_put(engine);
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 19/46] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-08-09 15:31   ` Daniel Vetter
@ 2021-08-09 19:03     ` Matthew Brost
  2021-08-10  9:12       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 19:03 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 05:31:38PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:16PM -0700, Matthew Brost wrote:
> > Assign contexts in parent-child relationship consecutive guc_ids. This
> > is accomplished by partitioning guc_id space between ones that need to
> > be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
> > available guc_ids). The consecutive search is implemented via the bitmap
> > API.
> > 
> > This is a precursor to the full GuC multi-lrc implementation but aligns
> > to how the GuC multi-lrc interface is defined - guc_ids must be consecutive
> > when using the GuC multi-lrc interface.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_context.h       |   6 +
> >  drivers/gpu/drm/i915/gt/intel_reset.c         |   3 +-
> >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +-
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 222 ++++++++++++------
> >  .../i915/gt/uc/intel_guc_submission_types.h   |  10 +
> >  5 files changed, 179 insertions(+), 69 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index c208691fc87d..7ce3b3d2edb7 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -54,6 +54,12 @@ static inline bool intel_context_is_parent(struct intel_context *ce)
> >  	return !!ce->guc_number_children;
> >  }
> >  
> > +static inline struct intel_context *
> > +intel_context_to_parent(struct intel_context *ce)
> > +{
> > +	return intel_context_is_child(ce) ? ce->parent : ce;
> > +}
> > +
> >  void intel_context_bind_parent_child(struct intel_context *parent,
> >  				     struct intel_context *child);
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> > index ea763138197f..c3d4baa1b2b8 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > @@ -849,6 +849,7 @@ static void reset_finish(struct intel_gt *gt, intel_engine_mask_t awake)
> >  
> >  static void nop_submit_request(struct i915_request *request)
> >  {
> > +	struct intel_context *ce = intel_context_to_parent(request->context);
> >  	RQ_TRACE(request, "-EIO\n");
> >  
> >  	/*
> > @@ -857,7 +858,7 @@ static void nop_submit_request(struct i915_request *request)
> >  	 * this for now.
> >  	 */
> >  	if (intel_engine_uses_guc(request->engine))
> > -		intel_guc_decr_num_rq_not_ready(request->context);
> > +		intel_guc_decr_num_rq_not_ready(ce);
> >  
> >  	request = i915_request_mark_eio(request);
> >  	if (request) {
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index c0c60ccabfa4..30a0f364db8f 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -24,6 +24,7 @@ struct __guc_ads_blob;
> >  
> >  enum {
> >  	GUC_SUBMIT_ENGINE_SINGLE_LRC,
> > +	GUC_SUBMIT_ENGINE_MULTI_LRC,
> >  	GUC_SUBMIT_ENGINE_MAX
> >  };
> >  
> > @@ -59,8 +60,10 @@ struct intel_guc {
> >  	struct ida guc_ids;
> >  	u32 num_guc_ids;
> >  	u32 max_guc_ids;
> > -	struct list_head guc_id_list_no_ref;
> > -	struct list_head guc_id_list_unpinned;
> > +	unsigned long *guc_ids_bitmap;
> > +#define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
> > +	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
> > +	struct list_head guc_id_list_unpinned[MAX_GUC_ID_ORDER + 1];
> 
> Random new global lists definitely need kerneldoc about what is on them,
> how they're linked, what their lifetime rules are and what locks we're
> holding.
> 
> Leaving this all to reviewers to figure out, and worse, future readers of
> your code, is not kind.
>

Got it.
 
> >  	spinlock_t destroy_lock;	/* protects list / worker */
> >  	struct list_head destroyed_contexts;
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index f23dd716723f..afb9b4bb8971 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -169,6 +169,15 @@ static void clr_guc_ids_exhausted(struct guc_submit_engine *gse)
> >  	clear_bit(GSE_STATE_GUC_IDS_EXHAUSTED, &gse->flags);
> >  }
> >  
> > +/*
> > + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
> 
> I think it'd be good to put down the reason here for why. Is this a
> requirement of the guc interface, or just an artifact of our current
> implementation? In the latter case also explain what exactly the
> constraint is (but honestly I can't think of many reasons for that)

Multi-lrc guc_ids need to be sequential between the parent and children
- this is a requirement of the GuC submission interface. Can explicitly
state that here.
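
i.e. something along these lines next to the allocator (a sketch, exact
wording to be refined):

	/*
	 * The GuC submission interface requires a parent and all of its
	 * children to use consecutive guc_ids. Multi-lrc guc_ids are
	 * therefore carved out of a dedicated bitmap and handed out as a
	 * contiguous, power-of-two sized block via
	 * bitmap_find_free_region() (order_base_2(number of children + 1)).
	 */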

Matt

> -Daniel
> 
> > + * and a different allocation algorithm is used (bitmap vs. ida). We believe the
> > + * number of multi-lrc contexts in use should be low and 1/16 should be
> > + * sufficient. Minimum of 32 ids for multi-lrc.
> > + */
> > +#define NUMBER_MULTI_LRC_GUC_ID(guc) \
> > +	((guc)->num_guc_ids / 16 > 32 ? (guc)->num_guc_ids / 16 : 32)
> > +
> >  /*
> >   * Below is a set of functions which control the GuC scheduling state which do
> >   * not require a lock as all state transitions are mutually exclusive. i.e. It
> > @@ -405,16 +414,10 @@ static inline void decr_context_blocked(struct intel_context *ce)
> >  	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
> >  }
> >  
> > -static inline struct intel_context *
> > -to_parent(struct intel_context *ce)
> > -{
> > -	return intel_context_is_child(ce) ? ce->parent : ce;
> > -}
> > -
> >  static inline struct intel_context *
> >  request_to_scheduling_context(struct i915_request *rq)
> >  {
> > -	return to_parent(rq->context);
> > +	return intel_context_to_parent(rq->context);
> >  }
> >  
> >  static inline bool context_guc_id_invalid(struct intel_context *ce)
> > @@ -1436,7 +1439,7 @@ static void destroy_worker_func(struct work_struct *w);
> >   */
> >  int intel_guc_submission_init(struct intel_guc *guc)
> >  {
> > -	int ret;
> > +	int ret, i;
> >  
> >  	if (guc_submission_initialized(guc))
> >  		return 0;
> > @@ -1448,9 +1451,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >  	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
> >  
> >  	spin_lock_init(&guc->contexts_lock);
> > -	INIT_LIST_HEAD(&guc->guc_id_list_no_ref);
> > -	INIT_LIST_HEAD(&guc->guc_id_list_unpinned);
> > +	for (i = 0; i < MAX_GUC_ID_ORDER + 1; ++i) {
> > +		INIT_LIST_HEAD(&guc->guc_id_list_no_ref[i]);
> > +		INIT_LIST_HEAD(&guc->guc_id_list_unpinned[i]);
> > +	}
> >  	ida_init(&guc->guc_ids);
> > +	guc->guc_ids_bitmap =
> > +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> >  
> >  	spin_lock_init(&guc->destroy_lock);
> >  
> > @@ -1476,6 +1483,8 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> >  
> >  		i915_sched_engine_put(sched_engine);
> >  	}
> > +
> > +	bitmap_free(guc->guc_ids_bitmap);
> >  }
> >  
> >  static inline void queue_request(struct i915_sched_engine *sched_engine,
> > @@ -1499,11 +1508,13 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
> >  static bool too_many_guc_ids_not_ready(struct guc_submit_engine *gse,
> >  				       struct intel_context *ce)
> >  {
> > -	u32 available_guc_ids, guc_ids_consumed;
> >  	struct intel_guc *guc = gse->sched_engine.private_data;
> > +	u32 available_guc_ids = intel_context_is_parent(ce) ?
> > +		NUMBER_MULTI_LRC_GUC_ID(guc) :
> > +		guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
> > +	u32 guc_ids_consumed = atomic_read(&gse->num_guc_ids_not_ready);
> >  
> > -	available_guc_ids = guc->num_guc_ids;
> > -	guc_ids_consumed = atomic_read(&gse->num_guc_ids_not_ready);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >  
> >  	if (TOO_MANY_GUC_IDS_NOT_READY(available_guc_ids, guc_ids_consumed)) {
> >  		set_and_update_guc_ids_exhausted(gse);
> > @@ -1517,17 +1528,26 @@ static void incr_num_rq_not_ready(struct intel_context *ce)
> >  {
> >  	struct guc_submit_engine *gse = ce_to_gse(ce);
> >  
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +	GEM_BUG_ON(!intel_context_is_parent(ce) &&
> > +		   ce->guc_number_children);
> > +
> >  	if (!atomic_fetch_add(1, &ce->guc_num_rq_not_ready))
> > -		atomic_inc(&gse->num_guc_ids_not_ready);
> > +		atomic_add(ce->guc_number_children + 1,
> > +			   &gse->num_guc_ids_not_ready);
> >  }
> >  
> >  void intel_guc_decr_num_rq_not_ready(struct intel_context *ce)
> >  {
> >  	struct guc_submit_engine *gse = ce_to_gse(ce);
> >  
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >  	if (atomic_fetch_add(-1, &ce->guc_num_rq_not_ready) == 1) {
> >  		GEM_BUG_ON(!atomic_read(&gse->num_guc_ids_not_ready));
> > -		atomic_dec(&gse->num_guc_ids_not_ready);
> > +
> > +		atomic_sub(ce->guc_number_children + 1,
> > +			   &gse->num_guc_ids_not_ready);
> >  	}
> >  }
> >  
> > @@ -1579,20 +1599,42 @@ static void guc_submit_request(struct i915_request *rq)
> >  
> >  	spin_unlock_irqrestore(&sched_engine->lock, flags);
> >  
> > -	intel_guc_decr_num_rq_not_ready(rq->context);
> > +	intel_guc_decr_num_rq_not_ready(request_to_scheduling_context(rq));
> >  }
> >  
> > -static int new_guc_id(struct intel_guc *guc)
> > +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >  {
> > -	return ida_simple_get(&guc->guc_ids, 0,
> > -			      guc->num_guc_ids, GFP_KERNEL |
> > -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > +	int ret;
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> > +	if (intel_context_is_parent(ce))
> > +		ret = bitmap_find_free_region(guc->guc_ids_bitmap,
> > +					      NUMBER_MULTI_LRC_GUC_ID(guc),
> > +					      order_base_2(ce->guc_number_children
> > +							   + 1));
> > +	else
> > +		ret = ida_simple_get(&guc->guc_ids,
> > +				     NUMBER_MULTI_LRC_GUC_ID(guc),
> > +				     guc->num_guc_ids, GFP_KERNEL |
> > +				     __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > +	if (unlikely(ret < 0))
> > +		return ret;
> > +
> > +	ce->guc_id = ret;
> > +	return 0;
> >  }
> >  
> >  static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >  {
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >  	if (!context_guc_id_invalid(ce)) {
> > -		ida_simple_remove(&guc->guc_ids, ce->guc_id);
> > +		if (intel_context_is_parent(ce))
> > +			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
> > +					      order_base_2(ce->guc_number_children
> > +							   + 1));
> > +		else
> > +			ida_simple_remove(&guc->guc_ids, ce->guc_id);
> >  		clr_lrc_desc_registered(guc, ce->guc_id);
> >  		set_context_guc_id_invalid(ce);
> >  	}
> > @@ -1604,6 +1646,8 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >  {
> >  	unsigned long flags;
> >  
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> >  	spin_lock_irqsave(&guc->contexts_lock, flags);
> >  	__release_guc_id(guc, ce);
> >  	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > @@ -1618,54 +1662,93 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >   * schedule disable H2G + a deregister H2G.
> >   */
> >  static struct list_head *get_guc_id_list(struct intel_guc *guc,
> > +					 u8 number_children,
> >  					 bool unpinned)
> >  {
> > +	GEM_BUG_ON(order_base_2(number_children + 1) > MAX_GUC_ID_ORDER);
> > +
> >  	if (unpinned)
> > -		return &guc->guc_id_list_unpinned;
> > +		return &guc->guc_id_list_unpinned[order_base_2(number_children + 1)];
> >  	else
> > -		return &guc->guc_id_list_no_ref;
> > +		return &guc->guc_id_list_no_ref[order_base_2(number_children + 1)];
> >  }
> >  
> > -static int steal_guc_id(struct intel_guc *guc, bool unpinned)
> > +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
> > +			bool unpinned)
> >  {
> > -	struct intel_context *ce;
> > -	int guc_id;
> > -	struct list_head *guc_id_list = get_guc_id_list(guc, unpinned);
> > +	struct intel_context *cn;
> > +	u8 number_children = ce->guc_number_children;
> >  
> >  	lockdep_assert_held(&guc->contexts_lock);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >  
> > -	if (!list_empty(guc_id_list)) {
> > -		ce = list_first_entry(guc_id_list,
> > -				      struct intel_context,
> > -				      guc_id_link);
> > +	do {
> > +		struct list_head *guc_id_list =
> > +			get_guc_id_list(guc, number_children, unpinned);
> >  
> > -		/* Ensure context getting stolen in expected state */
> > -		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
> > -		GEM_BUG_ON(context_guc_id_invalid(ce));
> > -		GEM_BUG_ON(context_guc_id_stolen(ce));
> > +		if (!list_empty(guc_id_list)) {
> > +			u8 cn_o2, ce_o2 =
> > +				order_base_2(ce->guc_number_children + 1);
> >  
> > -		list_del_init(&ce->guc_id_link);
> > -		guc_id = ce->guc_id;
> > -		clr_context_registered(ce);
> > +			cn = list_first_entry(guc_id_list,
> > +					      struct intel_context,
> > +					      guc_id_link);
> > +			cn_o2 = order_base_2(cn->guc_number_children + 1);
> > +
> > +			/*
> > +			 * Corner case where a multi-lrc context steals a guc_id
> > +			 * from another context that has more guc_ids than itself.
> > +			 */
> > +			if (cn_o2 != ce_o2) {
> > +				bitmap_release_region(guc->guc_ids_bitmap,
> > +						      cn->guc_id,
> > +						      cn_o2);
> > +				bitmap_allocate_region(guc->guc_ids_bitmap,
> > +						       ce->guc_id,
> > +						       ce_o2);
> > +			}
> > +
> > +			/* Ensure context getting stolen in expected state */
> > +			GEM_BUG_ON(atomic_read(&cn->guc_id_ref));
> > +			GEM_BUG_ON(context_guc_id_invalid(cn));
> > +			GEM_BUG_ON(context_guc_id_stolen(cn));
> > +			GEM_BUG_ON(ce_to_gse(ce) != ce_to_gse(cn));
> > +
> > +			list_del_init(&cn->guc_id_link);
> > +			ce->guc_id = cn->guc_id;
> > +
> > +			/*
> > +			 * If stealing from the pinned list, defer invalidating
> > +			 * the guc_id until the retire workqueue processes this
> > +			 * context.
> > +			 */
> > +			clr_context_registered(cn);
> > +			if (!unpinned) {
> > +				GEM_BUG_ON(ce_to_gse(cn)->stalled_context);
> > +				ce_to_gse(cn)->stalled_context =
> > +					intel_context_get(cn);
> > +				set_context_guc_id_stolen(cn);
> > +			} else {
> > +				set_context_guc_id_invalid(cn);
> > +			}
> > +
> > +			return 0;
> > +		}
> >  
> >  		/*
> > -		 * If stealing from the pinned list, defer invalidating
> > -		 * the guc_id until the retire workqueue processes this
> > -		 * context.
> > +		 * When using multi-lrc we search the guc_id_lists with the
> > +		 * least amount of guc_ids required first but will consume a
> > +		 * larger block of guc_ids if necessary. 2x the children always
> > +		 * moves you to the next list.
> >  		 */
> > -		if (!unpinned) {
> > -			GEM_BUG_ON(ce_to_gse(ce)->stalled_context);
> > +		if (!number_children ||
> > +		    order_base_2(number_children + 1) == MAX_GUC_ID_ORDER)
> > +			break;
> >  
> > -			ce_to_gse(ce)->stalled_context = intel_context_get(ce);
> > -			set_context_guc_id_stolen(ce);
> > -		} else {
> > -			set_context_guc_id_invalid(ce);
> > -		}
> > +		number_children *= 2;
> > +	} while (true);
> >  
> > -		return guc_id;
> > -	} else {
> > -		return -EAGAIN;
> > -	}
> > +	return -EAGAIN;
> >  }
> >  
> >  enum {	/* Return values for pin_guc_id / assign_guc_id */
> > @@ -1674,17 +1757,18 @@ enum {	/* Return values for pin_guc_id / assign_guc_id */
> >  	NEW_GUC_ID_ENABLED	= 2,
> >  };
> >  
> > -static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
> > +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce,
> > +			 bool tasklet)
> >  {
> >  	int ret;
> >  
> >  	lockdep_assert_held(&guc->contexts_lock);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >  
> > -	ret = new_guc_id(guc);
> > +	ret = new_guc_id(guc, ce);
> >  	if (unlikely(ret < 0)) {
> > -		ret = steal_guc_id(guc, true);
> > -		if (ret >= 0) {
> > -			*out = ret;
> > +		ret = steal_guc_id(guc, ce, true);
> > +		if (!ret) {
> >  			ret = NEW_GUC_ID_DISABLED;
> >  		} else if (ret < 0 && tasklet) {
> >  			/*
> > @@ -1692,15 +1776,18 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
> >  			 * enabled if guc_ids are exhausted and we are submitting
> >  			 * from the tasklet.
> >  			 */
> > -			ret = steal_guc_id(guc, false);
> > -			if (ret >= 0) {
> > -				*out = ret;
> > +			ret = steal_guc_id(guc, ce, false);
> > +			if (!ret)
> >  				ret = NEW_GUC_ID_ENABLED;
> > -			}
> >  		}
> > -	} else {
> > -		*out = ret;
> > -		ret = SAME_GUC_ID;
> > +	}
> > +
> > +	if (!(ret < 0) && intel_context_is_parent(ce)) {
> > +		struct intel_context *child;
> > +		int i = 1;
> > +
> > +		for_each_child(ce, child)
> > +			child->guc_id = ce->guc_id + i++;
> >  	}
> >  
> >  	return ret;
> > @@ -1713,6 +1800,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
> >  	int ret = 0;
> >  	unsigned long flags, tries = PIN_GUC_ID_TRIES;
> >  
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >  	GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
> >  
> >  try_again:
> > @@ -1724,7 +1812,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
> >  	}
> >  
> >  	if (context_guc_id_invalid(ce)) {
> > -		ret = assign_guc_id(guc, &ce->guc_id, tasklet);
> > +		ret = assign_guc_id(guc, ce, tasklet);
> >  		if (unlikely(ret < 0))
> >  			goto out_unlock;
> >  	}
> > @@ -1770,6 +1858,7 @@ static void unpin_guc_id(struct intel_guc *guc,
> >  	unsigned long flags;
> >  
> >  	GEM_BUG_ON(atomic_read(&ce->guc_id_ref) < 0);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> >  
> >  	if (unlikely(context_guc_id_invalid(ce)))
> >  		return;
> > @@ -1781,7 +1870,8 @@ static void unpin_guc_id(struct intel_guc *guc,
> >  
> >  	if (!context_guc_id_invalid(ce) && !context_guc_id_stolen(ce) &&
> >  	    !atomic_read(&ce->guc_id_ref)) {
> > -		struct list_head *head = get_guc_id_list(guc, unpinned);
> > +		struct list_head *head =
> > +			get_guc_id_list(guc, ce->guc_number_children, unpinned);
> >  
> >  		list_add_tail(&ce->guc_id_link, head);
> >  	}
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > index 7069b7248f55..a5933e07bdd2 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > @@ -22,6 +22,16 @@ struct guc_virtual_engine {
> >  /*
> >   * Object which encapsulates the globally operated on i915_sched_engine +
> >   * the GuC submission state machine described in intel_guc_submission.c.
> > + *
> > + * Currently we have two instances of these per GuC. One for single-lrc and one
> > + * for multi-lrc submission. We split these into two submission engines as they
> > + * can operate in parallel allowing a blocking condition on one not to affect
> > + * the other. i.e. guc_ids are statically allocated between these two submission
> > + * modes. One mode may have guc_ids exhausted which requires blocking while the
> > + * other has plenty of guc_ids and can make forward progress.
> > + *
> > + * In the future if different submission use cases arise we can simply
> > + * instantiate another of these objects and assign it to the context.
> >   */
> >  struct guc_submit_engine {
> >  	struct i915_sched_engine sched_engine;
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine
  2021-08-09 15:35   ` Daniel Vetter
@ 2021-08-09 19:05     ` Matthew Brost
  2021-08-10  9:18       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 19:05 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 05:35:25PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:17PM -0700, Matthew Brost wrote:
> > The heartbeat uses a single instance of a GuC submit engine (GSE) to do
> > the hang check. As such if a different GSE's state machine hangs, the
> > heartbeat cannot detect this hang. Add timer to each GSE which in turn
> > can disable all submissions if it is hung.
> > 
> > Cc: John Harrison <John.C.Harrison@Intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++++
> >  .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
> >  2 files changed, 39 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index afb9b4bb8971..2d8296bcc583 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine *gse)
> >  	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> >  }
> >  
> > +/* 2 seconds seems like a reasonable timeout waiting for a G2H */
> > +#define MAX_TASKLET_BLOCKED_NS	2000000000
> >  static void set_tasklet_blocked(struct guc_submit_engine *gse)
> >  {
> >  	lockdep_assert_held(&gse->sched_engine.lock);
> > +	hrtimer_start_range_ns(&gse->hang_timer,
> > +			       ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
> > +			       HRTIMER_MODE_REL_PINNED);
> >  	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> 
> So with drm/scheduler the reset handling is assumed to be
> single-threaded, and there are quite complex rules around that. I've
> recently worked with Boris Brezillion to clarify all this a bit and
> improve docs. Does this all still work in that glorious future? Might be
> good to at least sprinkle some comments/thoughts around in the commit
> message about the envisaged future direction for all this stuff, to keep
> people in the loop. Especially future people.
> 
> Ofc plan is still to just largely land all this.
> 
> Also: set_bit is an unordered atomic, which means you need barriers, which
> means ... *insert the full rant about justifying/documenting lockless
> algorithms from earlier *
>

lockdep_assert_held(&gse->sched_engine.lock);

Not lockless. Also spin locks act as barriers, right?
 
> But I think this all falls out with the removal of the guc-id allocation
> scheme?

Yes, this patch is getting deleted.

Matt

> -Daniel
> 
> >  }
> >  
> >  static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
> >  {
> >  	lockdep_assert_held(&gse->sched_engine.lock);
> > +	hrtimer_cancel(&gse->hang_timer);
> >  	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> >  }
> >  
> > @@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc *guc)
> >  		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
> >  			GEM_BUG_ON(!guc->ct.enabled);
> >  			__tasklet_disable_sync_once(&sched_engine->tasklet);
> > +			hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
> >  			sched_engine->tasklet.callback = NULL;
> >  		}
> >  	}
> > @@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref *kref)
> >  	kfree(gse);
> >  }
> >  
> > +static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
> > +{
> > +	struct guc_submit_engine *gse =
> > +		container_of(hrtimer, struct guc_submit_engine, hang_timer);
> > +	struct intel_guc *guc = gse->sched_engine.private_data;
> > +
> > +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > +	if (guc->gse_hang_expected)
> > +		drm_dbg(&guc_to_gt(guc)->i915->drm,
> > +			"GSE[%i] hung, disabling submission", gse->id);
> > +	else
> > +		drm_err(&guc_to_gt(guc)->i915->drm,
> > +			"GSE[%i] hung, disabling submission", gse->id);
> > +#else
> > +	drm_err(&guc_to_gt(guc)->i915->drm,
> > +		"GSE[%i] hung, disabling submission", gse->id);
> > +#endif
> > +
> > +	/*
> > +	 * Tasklet not making forward progress, disable submission which in turn
> > +	 * will kick in the heartbeat to do a full GPU reset.
> > +	 */
> > +	disable_submission(guc);
> > +
> > +	return HRTIMER_NORESTART;
> > +}
> > +
> >  static void guc_submit_engine_init(struct intel_guc *guc,
> >  				   struct guc_submit_engine *gse,
> >  				   int id)
> > @@ -3767,6 +3801,8 @@ static void guc_submit_engine_init(struct intel_guc *guc,
> >  	sched_engine->retire_inflight_request_prio =
> >  		guc_retire_inflight_request_prio;
> >  	sched_engine->private_data = guc;
> > +	hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > +	gse->hang_timer.function = gse_hang;
> >  	gse->id = id;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > index a5933e07bdd2..eae2e9725ede 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > @@ -6,6 +6,8 @@
> >  #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
> >  #define _INTEL_GUC_SUBMISSION_TYPES_H_
> >  
> > +#include <linux/xarray.h>
> > +
> >  #include "gt/intel_engine_types.h"
> >  #include "gt/intel_context_types.h"
> >  #include "i915_scheduler_types.h"
> > @@ -41,6 +43,7 @@ struct guc_submit_engine {
> >  	unsigned long flags;
> >  	int total_num_rq_with_no_guc_id;
> >  	atomic_t num_guc_ids_not_ready;
> > +	struct hrtimer hang_timer;
> >  	int id;
> >  
> >  	/*
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 21/46] drm/i915/guc: Add guc_child_context_destroy
  2021-08-09 15:36   ` Daniel Vetter
@ 2021-08-09 19:06     ` Matthew Brost
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 19:06 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 05:36:12PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:18PM -0700, Matthew Brost wrote:
> > Since child contexts do not own the guc_ids or GuC context registration,
> > child contexts can simply be freed on destroy. Add
> > guc_child_context_destroy context operation to do this.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 2d8296bcc583..850edeff9230 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -2828,6 +2828,13 @@ static void destroy_worker_func(struct work_struct *w)
> >  		intel_gt_pm_unpark_work_add(gt, destroy_worker);
> >  }
> >  
> > +/* Future patches will use this function */
> > +__maybe_unused
> 
> Pure bikeshed, but for something this small just squash it in with the
> first user. This kinda does nothing alone.
> -Daniel
> 

Sure.

Matt

> > +static void guc_child_context_destroy(struct kref *kref)
> > +{
> > +	__guc_context_destroy(container_of(kref, struct intel_context, ref));
> > +}
> > +
> >  static void guc_context_destroy(struct kref *kref)
> >  {
> >  	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-09 16:36   ` Daniel Vetter
@ 2021-08-09 19:13     ` Matthew Brost
  2021-08-10  9:23       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 19:13 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 06:36:44PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:22PM -0700, Matthew Brost wrote:
> > Display the workqueue status in debugfs for GuC contexts that are in
> > parent-child relationship.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 +++++++++++++------
> >  1 file changed, 39 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 30df1c8db491..44a7582c9aed 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -4527,31 +4527,53 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> >  		gse_log_submission_info(guc->gse[i], p, i);
> >  }
> >  
> > +static inline void guc_log_context(struct drm_printer *p,
> > +				   struct intel_context *ce)
> > +{
> > +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > +		   ce->ring->head,
> > +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> > +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > +		   ce->ring->tail,
> > +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> > +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> > +		   atomic_read(&ce->pin_count));
> > +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > +		   atomic_read(&ce->guc_id_ref));
> > +	drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > +		   atomic_read(&ce->guc_num_rq_not_ready));
> > +	drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > +		   ce->guc_state.sched_state,
> > +		   atomic_read(&ce->guc_sched_state_no_lock));
> 
> It's all debugfs, but I think proper locking even there is good. It at
> least reduces the confusion when the locking scheme is largely
> undocumented. Also, given how much we use rcu for everything, it would be good
> to double-check all pointer dereferences are properly protected.
>

Not sure if I 100% follow this, but I don't think any of the pointers
dereferenced here are RCU protected. Certainly none of the GuC ones are.

Will double check before the next respin though.

> > +}
> > +
> >  void intel_guc_submission_print_context_info(struct intel_guc *guc,
> >  					     struct drm_printer *p)
> >  {
> >  	struct intel_context *ce;
> >  	unsigned long index;
> >  	xa_for_each(&guc->context_lookup, index, ce) {
> 
> xa_for_each doesn't provide any guarantees, so doesn't protect against
> concurrent removal or anything like that. We need to do better than that.

https://elixir.bootlin.com/linux/latest/source/include/linux/xarray.h#L498
'It is safe to modify the array during the iteration.'
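
If we want to be extra careful about contexts disappearing under the walk,
I suppose we could also take the xarray lock around it, roughly (sketch
only, skipping the priority / parallel bits):

	unsigned long flags;

	xa_lock_irqsave(&guc->context_lookup, flags);
	xa_for_each(&guc->context_lookup, index, ce)
		guc_log_context(p, ce);
	xa_unlock_irqrestore(&guc->context_lookup, flags);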

Matt

> -Daniel
> 
> > -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > -			   ce->ring->head,
> > -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> > -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > -			   ce->ring->tail,
> > -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> > -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> > -			   atomic_read(&ce->pin_count));
> > -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > -			   atomic_read(&ce->guc_id_ref));
> > -		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > -			   atomic_read(&ce->guc_num_rq_not_ready));
> > -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > -			   ce->guc_state.sched_state,
> > -			   atomic_read(&ce->guc_sched_state_no_lock));
> > +		GEM_BUG_ON(intel_context_is_child(ce));
> >  
> > +		guc_log_context(p, ce);
> >  		guc_log_context_priority(p, ce);
> > +
> > +		if (intel_context_is_parent(ce)) {
> > +			struct guc_process_desc *desc = __get_process_desc(ce);
> > +			struct intel_context *child;
> > +
> > +			drm_printf(p, "\t\tWQI Head: %u\n",
> > +				   READ_ONCE(desc->head));
> > +			drm_printf(p, "\t\tWQI Tail: %u\n",
> > +				   READ_ONCE(desc->tail));
> > +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> > +				   READ_ONCE(desc->wq_status));
> > +
> > +			for_each_child(ce, child)
> > +				guc_log_context(p, child);
> > +		}
> >  	}
> >  }
> >  
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts
  2021-08-09 17:17   ` Daniel Vetter
@ 2021-08-09 19:32     ` Matthew Brost
  2021-08-11  9:55       ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-09 19:32 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:17:27PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:43PM -0700, Matthew Brost wrote:
> > Some workloads use lots of contexts that continually pin / unpin
> > contexts. With GuC submission an unpin translates to a schedule disable
> > H2G which puts pressure on both the i915 and GuC. A schedule disable can
> > also block future requests from being submitted until the operation
> > completes. None of this is ideal.
> > 
> > Add a configurable, via debugfs, delay period before the schedule
> > disable is issued. Default delay period is 1 second. The delay period is
> > skipped if more than 3/4 of the guc_ids are in use.
> > 
> > This patch also updates the selftests to turn off this delay period as
> > this extra time would likely cause many selftests to fail. Follow up
> > patches will fix all the selftests and enable the delay period.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> I think this is more evidence that we should just pin/unpin context at
> create/destruction time. The current scheme doesn't really work that well
> and causes way more pain than benefits it seems.
> 

Well that choice is above my pay grade, but for what it is worth it
would simplify the GuC backend quite a bit if we perma-pin contexts. By
quite a bit, I actually mean a lot of complexity goes away.

In the meantime I think we probably need this code though to avoid
thrashing on the scheduling enable / disable.

Matt

> If anyone screams, and that's a big if aside of some igts, we can come up
> with a proper scheme to evict contexts without pin/unpin and layer hacks
> over that misdesign.
> -Daniel
> 
> > ---
> >  drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
> >  .../i915/gem/selftests/i915_gem_coherency.c   |   2 +-
> >  .../drm/i915/gem/selftests/i915_gem_dmabuf.c  |   2 +-
> >  .../drm/i915/gem/selftests/i915_gem_mman.c    |   2 +-
> >  .../drm/i915/gem/selftests/i915_gem_object.c  |   2 +-
> >  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> >  drivers/gpu/drm/i915/gt/intel_context.h       |   9 +
> >  drivers/gpu/drm/i915/gt/intel_context_types.h |   8 +
> >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +
> >  .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  28 ++
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 322 +++++++++++++++++-
> >  .../i915/gt/uc/selftest_guc_flow_control.c    |  19 +-
> >  drivers/gpu/drm/i915/i915_selftest.h          |   2 +
> >  drivers/gpu/drm/i915/i915_trace.h             |  10 +
> >  drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   2 +-
> >  drivers/gpu/drm/i915/selftests/i915_perf.c    |   2 +-
> >  drivers/gpu/drm/i915/selftests/i915_request.c |   2 +-
> >  drivers/gpu/drm/i915/selftests/i915_vma.c     |   2 +-
> >  18 files changed, 405 insertions(+), 20 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index b199d59bd2c4..1553287e5491 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -1298,7 +1298,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
> >  		int err;
> >  
> >  		/* serialises with execbuf */
> > -		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > +		intel_context_close(ce);
> >  		if (!intel_context_pin_if_active(ce))
> >  			continue;
> >  
> > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > index 13b088cc787e..a666d7e610f5 100644
> > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > @@ -434,5 +434,5 @@ int i915_gem_coherency_live_selftests(struct drm_i915_private *i915)
> >  		SUBTEST(igt_gem_coherency),
> >  	};
> >  
> > -	return i915_subtests(tests, i915);
> > +	return i915_live_subtests(tests, i915);
> >  }
> > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > index ffae7df5e4d7..2c92afa9d608 100644
> > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > @@ -474,5 +474,5 @@ int i915_gem_dmabuf_live_selftests(struct drm_i915_private *i915)
> >  		SUBTEST(igt_dmabuf_import_same_driver_lmem_smem),
> >  	};
> >  
> > -	return i915_subtests(tests, i915);
> > +	return i915_live_subtests(tests, i915);
> >  }
> > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > index b20f5621f62b..4745c78a48de 100644
> > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > @@ -1414,5 +1414,5 @@ int i915_gem_mman_live_selftests(struct drm_i915_private *i915)
> >  		SUBTEST(igt_mmap_gpu),
> >  	};
> >  
> > -	return i915_subtests(tests, i915);
> > +	return i915_live_subtests(tests, i915);
> >  }
> > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > index 740ee8086a27..ae1361c7c4cf 100644
> > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > @@ -95,5 +95,5 @@ int i915_gem_object_live_selftests(struct drm_i915_private *i915)
> >  		SUBTEST(igt_gem_huge),
> >  	};
> >  
> > -	return i915_subtests(tests, i915);
> > +	return i915_live_subtests(tests, i915);
> >  }
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index 8e90a4a0b7b0..96643040defd 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -472,6 +472,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> >  	ce->guc_id = GUC_INVALID_LRC_ID;
> >  	INIT_LIST_HEAD(&ce->guc_id_link);
> >  
> > +	INIT_LIST_HEAD(&ce->guc_sched_disable_link);
> > +
> >  	mutex_init(&ce->parallel_submit);
> >  	ce->fence_context = dma_fence_context_alloc(1);
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > index a302599e436a..f4c9036f7f03 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > @@ -215,6 +215,15 @@ static inline bool intel_context_is_barrier(const struct intel_context *ce)
> >  	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
> >  }
> >  
> > +static inline void intel_context_close(struct intel_context *ce)
> > +{
> > +	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > +
> > +	trace_intel_context_close(ce);
> > +	if (ce->ops->close)
> > +		ce->ops->close(ce);
> > +}
> > +
> >  static inline bool intel_context_is_closed(const struct intel_context *ce)
> >  {
> >  	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index 8af9ace4c052..53f00657a45c 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -11,6 +11,7 @@
> >  #include <linux/list.h>
> >  #include <linux/mutex.h>
> >  #include <linux/types.h>
> > +#include <linux/ktime.h>
> >  
> >  #include "i915_active_types.h"
> >  #include "i915_sw_fence.h"
> > @@ -38,6 +39,7 @@ struct intel_context_ops {
> >  	int (*alloc)(struct intel_context *ce);
> >  
> >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > +	void (*close)(struct intel_context *ce);
> >  
> >  	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> >  	int (*pin)(struct intel_context *ce);
> > @@ -203,6 +205,12 @@ struct intel_context {
> >  	 */
> >  	struct list_head guc_id_link;
> >  
> > +	/*
> > +	 * GuC schedule disable link / time
> > +	 */
> > +	struct list_head guc_sched_disable_link;
> > +	ktime_t guc_sched_disable_time;
> > +
> >  	/* GuC context blocked fence */
> >  	struct i915_sw_fence guc_blocked;
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 30a0f364db8f..90b5b657d411 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -60,6 +60,7 @@ struct intel_guc {
> >  	struct ida guc_ids;
> >  	u32 num_guc_ids;
> >  	u32 max_guc_ids;
> > +	u32 guc_ids_in_use[GUC_SUBMIT_ENGINE_MAX];
> >  	unsigned long *guc_ids_bitmap;
> >  #define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
> >  	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
> > @@ -69,6 +70,12 @@ struct intel_guc {
> >  	struct list_head destroyed_contexts;
> >  	struct intel_gt_pm_unpark_work destroy_worker;
> >  
> > +	spinlock_t sched_disable_lock;	/* protects schedule disable list */
> > +	struct list_head sched_disable_list;
> > +	struct hrtimer sched_disable_timer;
> > +#define SCHED_DISABLE_DELAY_NS	1000000000
> > +	u64 sched_disable_delay_ns;
> > +
> >  	bool submission_supported;
> >  	bool submission_selected;
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > index 7c479c5e7b3a..53a6f3da6cce 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > @@ -80,12 +80,40 @@ static int guc_num_id_set(void *data, u64 val)
> >  }
> >  DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
> >  
> > +static int guc_sched_disable_delay_ns_get(void *data, u64 *val)
> > +{
> > +	struct intel_guc *guc = data;
> > +
> > +	if (!intel_guc_submission_is_used(guc))
> > +		return -ENODEV;
> > +
> > +	*val = guc->sched_disable_delay_ns;
> > +
> > +	return 0;
> > +}
> > +
> > +static int guc_sched_disable_delay_ns_set(void *data, u64 val)
> > +{
> > +	struct intel_guc *guc = data;
> > +
> > +	if (!intel_guc_submission_is_used(guc))
> > +		return -ENODEV;
> > +
> > +	guc->sched_disable_delay_ns = val;
> > +
> > +	return 0;
> > +}
> > +DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ns_fops,
> > +			guc_sched_disable_delay_ns_get,
> > +			guc_sched_disable_delay_ns_set, "%lld\n");
> > +
> >  void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
> >  {
> >  	static const struct debugfs_gt_file files[] = {
> >  		{ "guc_info", &guc_info_fops, NULL },
> >  		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
> >  		{ "guc_num_id", &guc_num_id_fops, NULL },
> > +		{ "guc_sched_disable_delay_ns", &guc_sched_disable_delay_ns_fops, NULL },
> >  	};
> >  
> >  	if (!intel_guc_is_supported(guc))
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index cd1893edf43a..dc0d6a099bee 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -654,11 +654,15 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
> >  	return (timeout < 0) ? timeout : 0;
> >  }
> >  
> > +static void sched_disable_contexts_flush(struct intel_guc *guc);
> > +
> >  int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
> >  {
> >  	if (!intel_uc_uses_guc_submission(&guc_to_gt(guc)->uc))
> >  		return 0;
> >  
> > +	sched_disable_contexts_flush(guc);
> > +
> >  	return intel_guc_wait_for_pending_msg(guc,
> >  					      &guc->outstanding_submission_g2h,
> >  					      true, timeout);
> > @@ -1135,6 +1139,7 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce);
> >  static void guc_signal_context_fence(struct intel_context *ce);
> >  static void guc_cancel_context_requests(struct intel_context *ce);
> >  static void guc_blocked_fence_complete(struct intel_context *ce);
> > +static void sched_disable_context_delete(struct intel_context *ce);
> >  
> >  static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >  {
> > @@ -1160,6 +1165,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> >  		deregister = context_wait_for_deregister_to_register(ce);
> >  		banned = context_banned(ce);
> >  		init_sched_state(ce);
> > +		sched_disable_context_delete(ce);
> >  
> >  		if (pending_enable || destroyed || deregister) {
> >  			atomic_dec(&guc->outstanding_submission_g2h);
> > @@ -1299,6 +1305,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> >  
> >  	intel_gt_park_heartbeats(guc_to_gt(guc));
> >  	disable_submission(guc);
> > +	hrtimer_cancel(&guc->sched_disable_timer);
> >  	guc->interrupts.disable(guc);
> >  
> >  	/* Flush IRQ handler */
> > @@ -1656,6 +1663,8 @@ static void guc_lrcd_reg_fini(struct intel_guc *guc);
> >  
> >  static void destroy_worker_func(struct work_struct *w);
> >  
> > +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer);
> > +
> >  /*
> >   * Set up the memory resources to be shared with the GuC (via the GGTT)
> >   * at firmware loading time.
> > @@ -1687,6 +1696,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
> >  	INIT_LIST_HEAD(&guc->destroyed_contexts);
> >  	intel_gt_pm_unpark_work_init(&guc->destroy_worker, destroy_worker_func);
> >  
> > +	spin_lock_init(&guc->sched_disable_lock);
> > +	INIT_LIST_HEAD(&guc->sched_disable_list);
> > +	hrtimer_init(&guc->sched_disable_timer, CLOCK_MONOTONIC,
> > +		     HRTIMER_MODE_REL);
> > +	guc->sched_disable_timer.function = sched_disable_timer_func;
> > +	guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS;
> > +
> >  	return 0;
> >  }
> >  
> > @@ -1852,6 +1868,12 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >  	if (unlikely(ret < 0))
> >  		return ret;
> >  
> > +	if (intel_context_is_parent(ce))
> > +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> > +			order_base_2(ce->guc_number_children + 1);
> > +	else
> > +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]++;
> > +
> >  	ce->guc_id = ret;
> >  	return 0;
> >  }
> > @@ -1860,13 +1882,18 @@ static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> >  {
> >  	GEM_BUG_ON(intel_context_is_child(ce));
> >  	if (!context_guc_id_invalid(ce)) {
> > -		if (intel_context_is_parent(ce))
> > +		if (intel_context_is_parent(ce)) {
> > +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> > +				order_base_2(ce->guc_number_children + 1);
> >  			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
> >  					      order_base_2(ce->guc_number_children
> >  							   + 1));
> > -		else
> > +		} else {
> > +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]--;
> >  			ida_simple_remove(&guc->guc_ids, ce->guc_id);
> > +		}
> >  		clr_lrc_desc_registered(guc, ce->guc_id);
> > +
> >  		set_context_guc_id_invalid(ce);
> >  	}
> >  	if (!list_empty(&ce->guc_id_link))
> > @@ -1931,9 +1958,13 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
> >  			 * from another context that has more guc_id that itself.
> >  			 */
> >  			if (cn_o2 != ce_o2) {
> > +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> > +					order_base_2(cn->guc_number_children + 1);
> >  				bitmap_release_region(guc->guc_ids_bitmap,
> >  						      cn->guc_id,
> >  						      cn_o2);
> > +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> > +					order_base_2(ce->guc_number_children + 1);
> >  				bitmap_allocate_region(guc->guc_ids_bitmap,
> >  						       ce->guc_id,
> >  						       ce_o2);
> > @@ -2538,7 +2569,7 @@ static void guc_context_unpin(struct intel_context *ce)
> >  	__guc_context_unpin(ce);
> >  
> >  	if (likely(!intel_context_is_barrier(ce)))
> > -		intel_engine_pm_put(ce->engine);
> > +		intel_engine_pm_put_async(ce->engine);
> >  }
> >  
> >  static void guc_context_post_unpin(struct intel_context *ce)
> > @@ -2665,11 +2696,11 @@ static void guc_parent_context_unpin(struct intel_context *ce)
> >  
> >  	for_each_engine_masked(engine, ce->engine->gt,
> >  			       ce->engine->mask, tmp)
> > -		intel_engine_pm_put(engine);
> > +		intel_engine_pm_put_async(engine);
> >  	for_each_child(ce, child)
> >  		for_each_engine_masked(engine, child->engine->gt,
> >  				       child->engine->mask, tmp)
> > -			intel_engine_pm_put(engine);
> > +			intel_engine_pm_put_async(engine);
> >  }
> >  
> >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > @@ -2788,6 +2819,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> >  
> >  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >  
> > +	sched_disable_context_delete(ce);
> > +
> >  	with_intel_runtime_pm(runtime_pm, wakeref)
> >  		__guc_context_sched_disable(guc, ce, guc_id);
> >  
> > @@ -2914,8 +2947,202 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
> >  								     1);
> >  		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> >  	}
> > +
> > +	sched_disable_context_delete(ce);
> > +}
> > +
> > +#define next_sched_disable_time(guc, now, ce) \
> > +	(guc->sched_disable_delay_ns - \
> > +	 (ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)))
> > +static void ____sched_disable_context_delete(struct intel_guc *guc,
> > +					     struct intel_context *ce)
> > +{
> > +	bool is_first;
> > +
> > +	lockdep_assert_held(&guc->sched_disable_lock);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +	GEM_BUG_ON(list_empty(&ce->guc_sched_disable_link));
> > +
> > +	is_first = list_is_first(&ce->guc_sched_disable_link,
> > +				 &guc->sched_disable_list);
> > +	list_del_init(&ce->guc_sched_disable_link);
> > +	if (list_empty(&guc->sched_disable_list)) {
> > +		hrtimer_try_to_cancel(&guc->sched_disable_timer);
> > +	} else if (is_first) {
> > +		struct intel_context *first =
> > +			list_first_entry(&guc->sched_disable_list,
> > +					 typeof(*first),
> > +					 guc_sched_disable_link);
> > +		u64 next_time = next_sched_disable_time(guc, ktime_get(),
> > +							first);
> > +
> > +		hrtimer_start(&guc->sched_disable_timer,
> > +			      ns_to_ktime(next_time),
> > +			      HRTIMER_MODE_REL_PINNED);
> > +	}
> > +}
> > +
> > +static void __sched_disable_context_delete(struct intel_guc *guc,
> > +					   struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&guc->sched_disable_lock);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > +		intel_context_sched_disable_unpin(ce);
> > +		____sched_disable_context_delete(guc, ce);
> > +	}
> > +}
> > +
> > +static void sched_disable_context_delete(struct intel_context *ce)
> > +{
> > +	struct intel_guc *guc = ce_to_guc(ce);
> > +	unsigned long flags;
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > +		__sched_disable_context_delete(guc, ce);
> > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > +	}
> > +}
> > +
> > +static void sched_disable_context_add(struct intel_guc *guc,
> > +				      struct intel_context *ce)
> > +{
> > +	unsigned long flags;
> > +
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> > +
> > +	ce->guc_sched_disable_time = ktime_get();
> > +
> > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > +	if (list_empty(&guc->sched_disable_list))
> > +		hrtimer_start(&guc->sched_disable_timer,
> > +			      ns_to_ktime(guc->sched_disable_delay_ns),
> > +			      HRTIMER_MODE_REL_PINNED);
> > +	list_add_tail(&ce->guc_sched_disable_link, &guc->sched_disable_list);
> > +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > +}
> > +
> > +static void sched_disable_contexts_flush(struct intel_guc *guc)
> > +{
> > +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > +
> > +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> > +					 guc_sched_disable_link) {
> > +		intel_wakeref_t wakeref;
> > +		bool enabled;
> > +		u16 guc_id;
> > +
> > +		list_del_init(&ce->guc_sched_disable_link);
> > +
> > +		spin_lock(&ce->guc_state.lock);
> > +		enabled = context_enabled(ce);
> > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > +			if (enabled)
> > +				clr_context_enabled(ce);
> > +			spin_unlock(&ce->guc_state.lock);
> > +			intel_context_sched_disable_unpin(ce);
> > +			continue;
> > +		}
> > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > +			spin_unlock(&ce->guc_state.lock);
> > +			continue;
> > +		}
> > +		guc_id = prep_context_pending_disable(ce);
> > +		spin_unlock(&ce->guc_state.lock);
> > +
> > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > +			__guc_context_sched_disable(guc, ce, guc_id);
> > +	}
> > +
> > +	hrtimer_try_to_cancel(&guc->sched_disable_timer);
> > +
> > +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> >  }
> >  
> > +#define should_sched_be_disabled(guc, now, ce) \
> > +	((ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)) > \
> > +	(guc->sched_disable_delay_ns / 4) * 3)
> > +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer)
> > +{
> > +	struct intel_guc *guc = container_of(hrtimer, struct intel_guc,
> > +					     sched_disable_timer);
> > +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> > +	struct intel_context *ce, *cn;
> > +	unsigned long flags;
> > +	ktime_t now;
> > +
> > +	if (list_empty(&guc->sched_disable_list))
> > +		return HRTIMER_NORESTART;
> > +
> > +	now = ktime_get();
> > +
> > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > +
> > +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> > +					 guc_sched_disable_link) {
> > +		intel_wakeref_t wakeref;
> > +		bool enabled;
> > +		u16 guc_id;
> > +
> > +		/*
> > +		 * If a context has been waiting for 3/4 of its delay or more,
> > +		 * issue the schedule disable. Using this heuristic allows more
> > +		 * than 1 context to have its scheduling disabled when this
> > +		 * timer is run.
> > +		 */
> > +		if (!should_sched_be_disabled(guc, now, ce))
> > +			break;
> > +
> > +		list_del_init(&ce->guc_sched_disable_link);
> > +
> > +		spin_lock(&ce->guc_state.lock);
> > +		enabled = context_enabled(ce);
> > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > +			if (enabled)
> > +				clr_context_enabled(ce);
> > +			spin_unlock(&ce->guc_state.lock);
> > +			intel_context_sched_disable_unpin(ce);
> > +			continue;
> > +		}
> > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > +			spin_unlock(&ce->guc_state.lock);
> > +			continue;
> > +		}
> > +		guc_id = prep_context_pending_disable(ce);
> > +		spin_unlock(&ce->guc_state.lock);
> > +
> > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > +			__guc_context_sched_disable(guc, ce, guc_id);
> > +	}
> > +
> > +	if (!list_empty(&guc->sched_disable_list)) {
> > +		struct intel_context *first =
> > +			list_first_entry(&guc->sched_disable_list,
> > +					 typeof(*first),
> > +					 guc_sched_disable_link);
> > +		u64 next_time = next_sched_disable_time(guc, now, first);
> > +
> > +		hrtimer_forward(hrtimer, now, ns_to_ktime(next_time));
> > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > +
> > +		return HRTIMER_RESTART;
> > +	} else {
> > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > +
> > +		return HRTIMER_NORESTART;
> > +	}
> > +}
> > +
> > +#define guc_id_pressure(max, in_use)	(in_use > (max / 4) * 3)
> >  static void guc_context_sched_disable(struct intel_context *ce)
> >  {
> >  	struct intel_guc *guc = ce_to_guc(ce);
> > @@ -2924,8 +3151,14 @@ static void guc_context_sched_disable(struct intel_context *ce)
> >  	intel_wakeref_t wakeref;
> >  	u16 guc_id;
> >  	bool enabled;
> > +	int guc_id_index = intel_context_is_parent(ce) ?
> > +		GUC_SUBMIT_ENGINE_MULTI_LRC : GUC_SUBMIT_ENGINE_SINGLE_LRC;
> > +	int max_guc_ids = intel_context_is_parent(ce) ?
> > +	       NUMBER_MULTI_LRC_GUC_ID(guc) :
> > +	       guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
> >  
> >  	GEM_BUG_ON(intel_context_is_child(ce));
> > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> >  
> >  	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
> >  	    !lrc_desc_registered(guc, ce->guc_id)) {
> > @@ -2936,6 +3169,18 @@ static void guc_context_sched_disable(struct intel_context *ce)
> >  	if (!context_enabled(ce))
> >  		goto unpin;
> >  
> > +	/*
> > +	 * If no guc_id pressure and the context isn't closed we delay the
> > +	 * schedule disable to not to continuously disable / enable scheduling
> > +	 * putting pressure on both the i915 and GuC. Delay is configurable via
> > +	 * debugfs, default 1s.
> > +	 */
> > +	if (!guc_id_pressure(max_guc_ids, guc->guc_ids_in_use[guc_id_index]) &&
> > +	    !intel_context_is_closed(ce) && guc->sched_disable_delay_ns) {
> > +		sched_disable_context_add(guc, ce);
> > +		return;
> > +	}
> > +
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  
> >  	/*
> > @@ -3294,6 +3539,58 @@ static void remove_from_context(struct i915_request *rq)
> >  	i915_request_notify_execute_cb_imm(rq);
> >  }
> >  
> > +static void __guc_context_close(struct intel_guc *guc,
> > +				struct intel_context *ce)
> > +{
> > +	lockdep_assert_held(&guc->sched_disable_lock);
> > +	GEM_BUG_ON(intel_context_is_child(ce));
> > +
> > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > +		struct intel_runtime_pm *runtime_pm =
> > +			ce->engine->uncore->rpm;
> > +		intel_wakeref_t wakeref;
> > +		bool enabled;
> > +		u16 guc_id;
> > +
> > +		spin_lock(&ce->guc_state.lock);
> > +		enabled = context_enabled(ce);
> > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > +			if (enabled)
> > +				clr_context_enabled(ce);
> > +			spin_unlock(&ce->guc_state.lock);
> > +			intel_context_sched_disable_unpin(ce);
> > +			goto update_list;
> > +		}
> > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > +			spin_unlock(&ce->guc_state.lock);
> > +			goto update_list;
> > +		}
> > +		guc_id = prep_context_pending_disable(ce);
> > +		spin_unlock(&ce->guc_state.lock);
> > +
> > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > +			__guc_context_sched_disable(guc, ce, guc_id);
> > +update_list:
> > +		____sched_disable_context_delete(guc, ce);
> > +	}
> > +}
> > +
> > +static void guc_context_close(struct intel_context *ce)
> > +{
> > +	struct intel_guc *guc = ce_to_guc(ce);
> > +	unsigned long flags;
> > +
> > +	/*
> > +	 * If we close the context and a schedule disable is pending a delay, do
> > +	 * it immediately.
> > +	 */
> > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > +		__guc_context_close(guc, ce);
> > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > +	}
> > +}
> > +
> >  static struct intel_context *
> >  guc_create_parallel(struct intel_engine_cs **engines,
> >  		    unsigned int num_siblings,
> > @@ -3308,6 +3605,7 @@ static const struct intel_context_ops guc_context_ops = {
> >  	.post_unpin = guc_context_post_unpin,
> >  
> >  	.ban = guc_context_ban,
> > +	.close = guc_context_close,
> >  
> >  	.cancel_request = guc_context_cancel_request,
> >  
> > @@ -3538,6 +3836,10 @@ static int guc_request_alloc(struct i915_request *rq)
> >  
> >  	rq->reserved_space -= GUC_REQUEST_SIZE;
> >  
> > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link) &&
> > +		   atomic_read(&ce->pin_count) < 3);
> > +	sched_disable_context_delete(ce);
> > +
> >  	/*
> >  	 * guc_ids are exhausted or a heuristic is met indicating too many
> >  	 * guc_ids are waiting on requests with submission dependencies (not
> > @@ -3667,7 +3969,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> >  	__guc_context_unpin(ce);
> >  
> >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > -		intel_engine_pm_put(engine);
> > +		intel_engine_pm_put_async(engine);
> >  }
> >  
> >  static void guc_virtual_context_enter(struct intel_context *ce)
> > @@ -3708,6 +4010,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> >  	.post_unpin = guc_context_post_unpin,
> >  
> >  	.ban = guc_context_ban,
> > +	.close = guc_context_close,
> >  
> >  	.cancel_request = guc_context_cancel_request,
> >  
> > @@ -3819,6 +4122,7 @@ static const struct intel_context_ops virtual_parent_context_ops = {
> >  	.post_unpin = guc_parent_context_post_unpin,
> >  
> >  	.ban = guc_context_ban,
> > +	.close = guc_context_close,
> >  
> >  	.enter = guc_virtual_context_enter,
> >  	.exit = guc_virtual_context_exit,
> > @@ -4924,7 +5228,11 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> >  	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
> >  		   atomic_read(&guc->outstanding_submission_g2h));
> >  	drm_printf(p, "GuC Number GuC IDs: %d\n", guc->num_guc_ids);
> > -	drm_printf(p, "GuC Max Number GuC IDs: %d\n\n", guc->max_guc_ids);
> > +	drm_printf(p, "GuC Max Number GuC IDs: %d\n", guc->max_guc_ids);
> > +	drm_printf(p, "GuC single-lrc GuC IDs in use: %d\n",
> > +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]);
> > +	drm_printf(p, "GuC multi-lrc GuC IDs in use: %d\n",
> > +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC]);
> >  	drm_printf(p, "GuC max context registered: %u\n\n",
> >  		   guc->lrcd_reg.max_idx);
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > index 9cfecf9d368e..ad70b3159ce4 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > @@ -174,7 +174,8 @@ static int multi_lrc_not_blocked(struct intel_gt *gt, bool flow_control)
> >  #define NUM_RQ_PER_CONTEXT	2
> >  #define HEARTBEAT_INTERVAL	1500
> >  
> > -static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang)
> > +static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids,
> > +					bool hang, bool sched_disable_delay)
> >  {
> >  	struct intel_gt *gt = arg;
> >  	struct intel_guc *guc = &gt->uc.guc;
> > @@ -203,6 +204,9 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> >  	if (limit_guc_ids)
> >  		guc->num_guc_ids = NUM_GUC_ID;
> >  
> > +	if (sched_disable_delay)
> > +		guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS / 5;
> > +
> >  	ce = intel_context_create(intel_selftest_find_any_engine(gt));
> >  	if (IS_ERR(ce)) {
> >  		ret = PTR_ERR(ce);
> > @@ -391,6 +395,7 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> >  	guc->num_guc_ids = guc->max_guc_ids;
> >  	guc->gse_hang_expected = false;
> >  	guc->inject_bad_sched_disable = false;
> > +	guc->sched_disable_delay_ns = 0;
> >  	kfree(contexts);
> >  
> >  	return ret;
> > @@ -398,17 +403,22 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> >  
> >  static int intel_guc_flow_control_guc_ids(void *arg)
> >  {
> > -	return __intel_guc_flow_control_guc(arg, true, false);
> > +	return __intel_guc_flow_control_guc(arg, true, false, false);
> > +}
> > +
> > +static int intel_guc_flow_control_guc_ids_sched_disable_delay(void *arg)
> > +{
> > +	return __intel_guc_flow_control_guc(arg, true, false, true);
> >  }
> >  
> >  static int intel_guc_flow_control_lrcd_reg(void *arg)
> >  {
> > -	return __intel_guc_flow_control_guc(arg, false, false);
> > +	return __intel_guc_flow_control_guc(arg, false, false, false);
> >  }
> >  
> >  static int intel_guc_flow_control_hang_state_machine(void *arg)
> >  {
> > -	return __intel_guc_flow_control_guc(arg, true, true);
> > +	return __intel_guc_flow_control_guc(arg, true, true, false);
> >  }
> >  
> >  #define NUM_RQ_STRESS_CTBS	0x4000
> > @@ -861,6 +871,7 @@ int intel_guc_flow_control(struct drm_i915_private *i915)
> >  	static const struct i915_subtest tests[] = {
> >  		SUBTEST(intel_guc_flow_control_stress_ctbs),
> >  		SUBTEST(intel_guc_flow_control_guc_ids),
> > +		SUBTEST(intel_guc_flow_control_guc_ids_sched_disable_delay),
> >  		SUBTEST(intel_guc_flow_control_lrcd_reg),
> >  		SUBTEST(intel_guc_flow_control_hang_state_machine),
> >  		SUBTEST(intel_guc_flow_control_multi_lrc_guc_ids),
> > diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
> > index f54de0499be7..bf464db7affe 100644
> > --- a/drivers/gpu/drm/i915/i915_selftest.h
> > +++ b/drivers/gpu/drm/i915/i915_selftest.h
> > @@ -92,12 +92,14 @@ int __i915_subtests(const char *caller,
> >  			T, ARRAY_SIZE(T), data)
> >  #define i915_live_subtests(T, data) ({ \
> >  	typecheck(struct drm_i915_private *, data); \
> > +	(data)->gt.uc.guc.sched_disable_delay_ns = 0; \
> >  	__i915_subtests(__func__, \
> >  			__i915_live_setup, __i915_live_teardown, \
> >  			T, ARRAY_SIZE(T), data); \
> >  })
> >  #define intel_gt_live_subtests(T, data) ({ \
> >  	typecheck(struct intel_gt *, data); \
> > +	(data)->uc.guc.sched_disable_delay_ns = 0; \
> >  	__i915_subtests(__func__, \
> >  			__intel_gt_live_setup, __intel_gt_live_teardown, \
> >  			T, ARRAY_SIZE(T), data); \
> > diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> > index 806ad688274b..57ba7065d5ab 100644
> > --- a/drivers/gpu/drm/i915/i915_trace.h
> > +++ b/drivers/gpu/drm/i915/i915_trace.h
> > @@ -933,6 +933,11 @@ DEFINE_EVENT(intel_context, intel_context_reset,
> >  	     TP_ARGS(ce)
> >  );
> >  
> > +DEFINE_EVENT(intel_context, intel_context_close,
> > +	     TP_PROTO(struct intel_context *ce),
> > +	     TP_ARGS(ce)
> > +);
> > +
> >  DEFINE_EVENT(intel_context, intel_context_ban,
> >  	     TP_PROTO(struct intel_context *ce),
> >  	     TP_ARGS(ce)
> > @@ -1035,6 +1040,11 @@ trace_intel_context_reset(struct intel_context *ce)
> >  {
> >  }
> >  
> > +static inline void
> > +trace_intel_context_close(struct intel_context *ce)
> > +{
> > +}
> > +
> >  static inline void
> >  trace_intel_context_ban(struct intel_context *ce)
> >  {
> > diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > index f843a5040706..d54c280217fe 100644
> > --- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > +++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > @@ -2112,5 +2112,5 @@ int i915_gem_gtt_live_selftests(struct drm_i915_private *i915)
> >  
> >  	GEM_BUG_ON(offset_in_page(i915->ggtt.vm.total));
> >  
> > -	return i915_subtests(tests, i915);
> > +	return i915_live_subtests(tests, i915);
> >  }
> > diff --git a/drivers/gpu/drm/i915/selftests/i915_perf.c b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > index 9e9a6cb1d9e5..86bad00cca95 100644
> > --- a/drivers/gpu/drm/i915/selftests/i915_perf.c
> > +++ b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > @@ -431,7 +431,7 @@ int i915_perf_live_selftests(struct drm_i915_private *i915)
> >  	if (err)
> >  		return err;
> >  
> > -	err = i915_subtests(tests, i915);
> > +	err = i915_live_subtests(tests, i915);
> >  
> >  	destroy_empty_config(&i915->perf);
> >  
> > diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
> > index d67710d10615..afbf88865a8b 100644
> > --- a/drivers/gpu/drm/i915/selftests/i915_request.c
> > +++ b/drivers/gpu/drm/i915/selftests/i915_request.c
> > @@ -1693,7 +1693,7 @@ int i915_request_live_selftests(struct drm_i915_private *i915)
> >  	if (intel_gt_is_wedged(&i915->gt))
> >  		return 0;
> >  
> > -	return i915_subtests(tests, i915);
> > +	return i915_live_subtests(tests, i915);
> >  }
> >  
> >  static int switch_to_kernel_sync(struct intel_context *ce, int err)
> > diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > index dd0607254a95..f4b157451851 100644
> > --- a/drivers/gpu/drm/i915/selftests/i915_vma.c
> > +++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > @@ -1085,5 +1085,5 @@ int i915_vma_live_selftests(struct drm_i915_private *i915)
> >  		SUBTEST(igt_vma_remapped_gtt),
> >  	};
> >  
> > -	return i915_subtests(tests, i915);
> > +	return i915_live_subtests(tests, i915);
> >  }
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 10/46] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-08-09 18:11     ` Matthew Brost
@ 2021-08-10  6:43       ` Daniel Vetter
  2021-08-10 21:29         ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  6:43 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 06:11:37PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 04:23:42PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:07PM -0700, Matthew Brost wrote:
> > > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > > circuiting while a scheduling of user context could be enabled.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/Makefile                 |  1 +
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
> > >  2 files changed, 34 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > > index 903de270f2db..5e3a1e2095b0 100644
> > > --- a/drivers/gpu/drm/i915/Makefile
> > > +++ b/drivers/gpu/drm/i915/Makefile
> > > @@ -103,6 +103,7 @@ gt-y += \
> > >  	gt/intel_gt_clock_utils.o \
> > >  	gt/intel_gt_irq.o \
> > >  	gt/intel_gt_pm.o \
> > > +	gt/intel_gt_pm_unpark_work.o \
> > 
> > This file isn't here?
> > 
> 
> Yep, included this in the wrong patch. Should be in:
> https://patchwork.freedesktop.org/patch/448462/?series=92789&rev=2
> 
> > Also pm stuff tends to have very nasty locking requirements, doing special
> > stuff like this in the backend tends to lead to really big surprises. I
> > think two options to make sure our locking design stays consistent:
> > - Lift this to generic code.
> 
> Not sure I'm following this; intel_engine_pm_get/put are generic calls.
> Those calls should have all the correct annotations. If they don't we can
> add them.

But you only call them in the GuC backend, not in all of them. Which is an
inconsistency in locking, and unfortunately runtime pm is extremely nasty,
so having potentially very divergent locking behind the same interface in
the same driver is a recipe for an unmaintainable mess.

Iow, if the high-level code runs on execlist or the ringbuffer backend we
still need to go through at least the lockdep motions of what you're
adding here.

This is similar in spirit to all the might_sleep/might_lock calls we have
all over the kernel: in many cases the sleep or lock doesn't actually
happen, but we still need to check that it would be allowed, to keep the
design consistent.

So essentially, in intel_context_pin and all these functions, put an
intel_engine_pm_might_get (which compiles out without debugging enabled),
unconditionally, across all platforms and sched backends.
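
A rough sketch of what that could look like (the wakeref.mutex field is
an assumption here; no such helper exists in the series yet):

	/* Compiles out unless lockdep is enabled. */
	static inline void
	intel_engine_pm_might_get(struct intel_engine_cs *engine)
	{
		might_lock(&engine->wakeref.mutex);
	}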

In general I think backend specific locking (irrespective of what kind of
backend or interface you implement) is a pretty bad idea in the kernel,
and needs to be avoided if at all possible. Avoid here means "pull the
might_lock/might_sleep/might_whatever checks into generic code".
-Daniel

> Matt
> 
> > - expose some engine_pm_might_get/put() calls which do have the right set
> >   of might_lock annotations, and call those in the generic code.
> > 
> > Imo the worst kernel abstractions are those where all implementations
> > look&act the same, except for locking. Unfortunately i915-gem code is full
> > of this stuff, and we need to stop this by enlisting lockdep to check the
> > contracts for us.
> > -Daniel
> > 
> > >  	gt/intel_gt_pm_irq.o \
> > >  	gt/intel_gt_requests.o \
> > >  	gt/intel_gtt.o \
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index 7fe4d1559a81..c5d9548bfd00 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -2056,7 +2056,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
> > >  
> > >  static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > >  {
> > > -	return __guc_context_pin(ce, ce->engine, vaddr);
> > > +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > +
> > > +	if (likely(!ret && !intel_context_is_barrier(ce)))
> > > +		intel_engine_pm_get(ce->engine);
> > > +
> > > +	return ret;
> > >  }
> > >  
> > >  static void guc_context_unpin(struct intel_context *ce)
> > > @@ -2067,6 +2072,9 @@ static void guc_context_unpin(struct intel_context *ce)
> > >  
> > >  	unpin_guc_id(guc, ce, true);
> > >  	lrc_unpin(ce);
> > > +
> > > +	if (likely(!intel_context_is_barrier(ce)))
> > > +		intel_engine_pm_put(ce->engine);
> > >  }
> > >  
> > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > @@ -3002,8 +3010,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > >  static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > >  {
> > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > +	int ret = __guc_context_pin(ce, engine, vaddr);
> > > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > +
> > > +	if (likely(!ret))
> > > +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > +			intel_engine_pm_get(engine);
> > >  
> > > -	return __guc_context_pin(ce, engine, vaddr);
> > > +	return ret;
> > > +}
> > > +
> > > +static void guc_virtual_context_unpin(struct intel_context *ce)
> > > +{
> > > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > +	struct intel_engine_cs *engine;
> > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > +
> > > +	GEM_BUG_ON(context_enabled(ce));
> > > +	GEM_BUG_ON(intel_context_is_barrier(ce));
> > > +
> > > +	unpin_guc_id(guc, ce, true);
> > > +	lrc_unpin(ce);
> > > +
> > > +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > +		intel_engine_pm_put(engine);
> > >  }
> > >  
> > >  static void guc_virtual_context_enter(struct intel_context *ce)
> > > @@ -3040,7 +3070,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> > >  
> > >  	.pre_pin = guc_virtual_context_pre_pin,
> > >  	.pin = guc_virtual_context_pin,
> > > -	.unpin = guc_context_unpin,
> > > +	.unpin = guc_virtual_context_unpin,
> > >  	.post_unpin = guc_context_post_unpin,
> > >  
> > >  	.ban = guc_context_ban,
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 11/46] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-08-09 18:20     ` Matthew Brost
@ 2021-08-10  6:47       ` Daniel Vetter
  2021-08-11 17:47         ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  6:47 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 06:20:51PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 04:27:01PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:08PM -0700, Matthew Brost wrote:
> > > Calling switch_to_kernel_context isn't needed if the engine PM reference
> > > is taken while all contexts are pinned. By not calling
> > > switch_to_kernel_context we save on issuing a request to the engine.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_engine_pm.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > > index 1f07ac4e0672..58099de6bf07 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > > @@ -162,6 +162,10 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
> > >  	unsigned long flags;
> > >  	bool result = true;
> > >  
> > > +	/* No need to switch_to_kernel_context if GuC submission */
> > 
> > Maybe whack a big FIXME on here that we should unravel this properly.
> 
> Sure, can add a FIXME here.
> 
> > Currently the execlist backend assumptions are leaked all over the place,
> > leading to stuff like this. Which means extremely fragile code.
> >
> 
> Yes, this is something required for execlists that is implemented in what
> should be generic code.
> 
> > I currently don't have a great idea on how exactly we should do that, but
> > oh well.
> 
> Me either, it will be a process.
> 
> > 
> > btw just in case we ever want to make guc lrc properly evictable (which was
> > the og use-case for this function, way, way back), would we need to fully
> 
> Can you explain what you mean by fully evictable? Not getting what you
> mean in this context.
> 
> > unregister them from guc? At least I'm assuming there's no other trick
> 
> If scheduling is disabled on the context (currently done on unpin) you are
> free to move anything around as the GuC is guaranteed not to touch the
> context state. If on re-pin something has moved (e.g. the LRC vaddr is
> different), you need to unregister and re-register the context with the
> GuC.

So at that point GuC also guarantees that it's not left in the hw engine?
Execlist has this barrier request to fully unload the ctx from the hw, and
that's also why I came to the topic of OA.

> > like the below one.
> > 
> > Another aside: How does the perf/OA patching work on GuC?
> >
> 
> Not my area of expertise but perf is somewhat a WIP. The plan is for the
> GuC to write out some stats to HWSP I think? John Harrison is working to
> get this fully implemented.
> 
> OA is working afaik, with Umesh Nerlige Ramappa being the expert here.

I think it's OA that I'm thinking of here: We have code in i915_perf.c to
patch all the ctx currently in the system, so that they have a consistent
OA config. That's also relying on this barrier stuff, and I was wondering
how that will work with GuC.
-Daniel

> 
> Matt
> 
> > Anyway, patch looks legit:
> > 
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > 
> > 
> > > +	if (intel_engine_uses_guc(engine))
> > > +		return true;
> > > +
> > >  	/* GPU is pointing to the void, as good as in the kernel context. */
> > >  	if (intel_gt_is_wedged(engine->gt))
> > >  		return true;
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 13/46] drm/i915: Add logical engine mapping
  2021-08-09 18:28     ` Matthew Brost
@ 2021-08-10  6:49       ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  6:49 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 06:28:58PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 04:28:04PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:10PM -0700, Matthew Brost wrote:
> > > Add logical engine mapping. This is required for split-frame, as
> > > workloads need to be placed on engines in a logically contiguous manner.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 60 ++++++++++++++++---
> > >  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  1 +
> > >  .../drm/i915/gt/intel_execlists_submission.c  |  1 +
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  2 +-
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 +------
> > >  5 files changed, 56 insertions(+), 29 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > index 0d9105a31d84..4d790f9a65dd 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > @@ -290,7 +290,8 @@ static void nop_irq_handler(struct intel_engine_cs *engine, u16 iir)
> > >  	GEM_DEBUG_WARN_ON(iir);
> > >  }
> > >  
> > > -static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> > > +static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
> > > +			      u8 logical_instance)
> > >  {
> > >  	const struct engine_info *info = &intel_engines[id];
> > >  	struct drm_i915_private *i915 = gt->i915;
> > > @@ -334,6 +335,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> > >  
> > >  	engine->class = info->class;
> > >  	engine->instance = info->instance;
> > > +	engine->logical_mask = BIT(logical_instance);
> > >  	__sprint_engine_name(engine);
> > >  
> > >  	engine->props.heartbeat_interval_ms =
> > > @@ -572,6 +574,37 @@ static intel_engine_mask_t init_engine_mask(struct intel_gt *gt)
> > >  	return info->engine_mask;
> > >  }
> > >  
> > > +static void populate_logical_ids(struct intel_gt *gt, u8 *logical_ids,
> > > +				 u8 class, const u8 *map, u8 num_instances)
> > > +{
> > > +	int i, j;
> > > +	u8 current_logical_id = 0;
> > > +
> > > +	for (j = 0; j < num_instances; ++j) {
> > > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > > +			if (!HAS_ENGINE(gt, i) ||
> > > +			    intel_engines[i].class != class)
> > > +				continue;
> > > +
> > > +			if (intel_engines[i].instance == map[j]) {
> > > +				logical_ids[intel_engines[i].instance] =
> > > +					current_logical_id++;
> > > +				break;
> > > +			}
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static void setup_logical_ids(struct intel_gt *gt, u8 *logical_ids, u8 class)
> > > +{
> > > +	int i;
> > > +	u8 map[MAX_ENGINE_INSTANCE + 1];
> > > +
> > > +	for (i = 0; i < MAX_ENGINE_INSTANCE + 1; ++i)
> > > +		map[i] = i;
> > > +	populate_logical_ids(gt, logical_ids, class, map, ARRAY_SIZE(map));
> > > +}
> > > +
> > >  /**
> > >   * intel_engines_init_mmio() - allocate and prepare the Engine Command Streamers
> > >   * @gt: pointer to struct intel_gt
> > > @@ -583,7 +616,8 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> > >  	struct drm_i915_private *i915 = gt->i915;
> > >  	const unsigned int engine_mask = init_engine_mask(gt);
> > >  	unsigned int mask = 0;
> > > -	unsigned int i;
> > > +	unsigned int i, class;
> > > +	u8 logical_ids[MAX_ENGINE_INSTANCE + 1];
> > >  	int err;
> > >  
> > >  	drm_WARN_ON(&i915->drm, engine_mask == 0);
> > > @@ -593,15 +627,23 @@ int intel_engines_init_mmio(struct intel_gt *gt)
> > >  	if (i915_inject_probe_failure(i915))
> > >  		return -ENODEV;
> > >  
> > > -	for (i = 0; i < ARRAY_SIZE(intel_engines); i++) {
> > > -		if (!HAS_ENGINE(gt, i))
> > > -			continue;
> > > +	for (class = 0; class < MAX_ENGINE_CLASS + 1; ++class) {
> > > +		setup_logical_ids(gt, logical_ids, class);
> > >  
> > > -		err = intel_engine_setup(gt, i);
> > > -		if (err)
> > > -			goto cleanup;
> > > +		for (i = 0; i < ARRAY_SIZE(intel_engines); ++i) {
> > > +			u8 instance = intel_engines[i].instance;
> > > +
> > > +			if (intel_engines[i].class != class ||
> > > +			    !HAS_ENGINE(gt, i))
> > > +				continue;
> > >  
> > > -		mask |= BIT(i);
> > > +			err = intel_engine_setup(gt, i,
> > > +						 logical_ids[instance]);
> > > +			if (err)
> > > +				goto cleanup;
> > > +
> > > +			mask |= BIT(i);
> > > +		}
> > >  	}
> > >  
> > >  	/*
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > index ed91bcff20eb..85e5c9a9e502 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > @@ -266,6 +266,7 @@ struct intel_engine_cs {
> > >  	unsigned int guc_id;
> > >  
> > >  	intel_engine_mask_t mask;
> > > +	intel_engine_mask_t logical_mask;
> > 
> > Kerneldoc at least for new stuff. Bonus points if you get the
> > struct/header file up to speed (with dummy/fixme comments if need be) so
> 
> Sure can add Kerneldoc for new variables. Def don't have time to get all
> structs' kerneldoc up to speed at the moment as my backlog is about a mile
> long. Perhaps after we get all of GuC submission upstream I can take
> some time to go through all the structures and update the doc.

The idea isn't to add comments that are actually meaningful to all of
them, but just enough to be able to pull in the header without warnings.
Once you have that then any new addition will cause a warning in the doc
build, which CI iirc checks. And that's a pretty good baseline to have, and
hence why I think it'd be good to quickly go through the motions to add
these.
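
E.g. a minimal stub along these lines (wording is only illustrative) is
enough to keep the docs build quiet:

	/**
	 * @logical_mask: logical instance mask of this engine within its
	 * class (a single bit), exposed to userspace via the engine info
	 * query
	 */
	intel_engine_mask_t logical_mask;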

Really fixing this is likely a few years of work across all the structs,
because in many cases the locking/coherency design is somewhere between
very tricky and outright broken. Doing that in one go makes no sense.
-Daniel

> 
> Matt
> 
> > we can include it into our overall html hierarchy).
> > -Daniel
> > 
> > >  
> > >  	u8 class;
> > >  	u8 instance;
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > index de5f9c86b9a4..baa1797af1c8 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > @@ -3879,6 +3879,7 @@ execlists_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > >  
> > >  		ve->siblings[ve->num_siblings++] = sibling;
> > >  		ve->base.mask |= sibling->mask;
> > > +		ve->base.logical_mask |= sibling->logical_mask;
> > >  
> > >  		/*
> > >  		 * All physical engines must be compatible for their emission
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > index 6926919bcac6..9f5f43a16182 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > @@ -176,7 +176,7 @@ static void guc_mapping_table_init(struct intel_gt *gt,
> > >  	for_each_engine(engine, gt, id) {
> > >  		u8 guc_class = engine_class_to_guc_class(engine->class);
> > >  
> > > -		system_info->mapping_table[guc_class][engine->instance] =
> > > +		system_info->mapping_table[guc_class][ilog2(engine->logical_mask)] =
> > >  			engine->instance;
> > >  	}
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index 310116f40509..dec757d319a2 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -1795,23 +1795,6 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> > >  	return __guc_action_deregister_context(guc, guc_id, loop);
> > >  }
> > >  
> > > -static intel_engine_mask_t adjust_engine_mask(u8 class, intel_engine_mask_t mask)
> > > -{
> > > -	switch (class) {
> > > -	case RENDER_CLASS:
> > > -		return mask >> RCS0;
> > > -	case VIDEO_ENHANCEMENT_CLASS:
> > > -		return mask >> VECS0;
> > > -	case VIDEO_DECODE_CLASS:
> > > -		return mask >> VCS0;
> > > -	case COPY_ENGINE_CLASS:
> > > -		return mask >> BCS0;
> > > -	default:
> > > -		MISSING_CASE(class);
> > > -		return 0;
> > > -	}
> > > -}
> > > -
> > >  static void guc_context_policy_init(struct intel_engine_cs *engine,
> > >  				    struct guc_lrc_desc *desc)
> > >  {
> > > @@ -1952,8 +1935,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > >  
> > >  	desc = __get_lrc_desc(guc, ce->guc_lrcd_reg_idx);
> > >  	desc->engine_class = engine_class_to_guc_class(engine->class);
> > > -	desc->engine_submit_mask = adjust_engine_mask(engine->class,
> > > -						      engine->mask);
> > > +	desc->engine_submit_mask = engine->logical_mask;
> > >  	desc->hw_context_desc = ce->lrc.lrca;
> > >  	ce->guc_prio = map_i915_prio_to_guc_prio(prio);
> > >  	desc->priority = ce->guc_prio;
> > > @@ -3978,6 +3960,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > >  		}
> > >  
> > >  		ve->base.mask |= sibling->mask;
> > > +		ve->base.logical_mask |= sibling->logical_mask;
> > >  
> > >  		if (n != 0 && ve->base.class != sibling->class) {
> > >  			DRM_DEBUG("invalid mixing of engine class, sibling %d, already %d\n",
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 14/46] drm/i915: Expose logical engine instance to user
  2021-08-09 18:37     ` Matthew Brost
@ 2021-08-10  6:53       ` Daniel Vetter
  2021-08-11 17:55         ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  6:53 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 06:37:01PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 04:30:06PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:11PM -0700, Matthew Brost wrote:
> > > Expose logical engine instance to user via query engine info IOCTL. This
> > > is required for split-frame workloads as these need to be placed on
> > > engines in a logically contiguous order. The logical mapping can change
> > > based on fusing. Rather than requiring the user to have knowledge of the
> > > fusing, we simply expose the logical mapping via the existing query engine
> > > info IOCTL.
> > > 
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > 
> > Uapi must have a link to the userspace MR/patch set using this, and to the
> > igt patch set validating it.
> > 
> 
> Have an IGT:
> https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> 
> Not sure when the media UMD is going to be updated upstream to use this.
> Does that mean I can't merge this until the media UMD is ready? Seems
> like it but isn't that a circular dependency? How can the media team
> develop for a new uAPI that isn't in the kernel yet?

Yes and no. Full explainer here:

https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements

In the drm subsystem this is pretty much the only rule where if you break
it the book will be thrown at you with extreme prejudice.

Also wrt circular: If the umd aren't set up to test their branches against
kernel branches they need to fix their stuff. I know that internally
that's not been done, and it's a disaster, but in upstream there's no room
for excuses. Both kernel and userspace need to be in branches until it's
ready for merging.

> For what it is worth the downstream release is already using this.

Yeah which is another problem, shipping new uapi in downstream before it's
in upstream is decidedly not great.
-Daniel

> 
> Matt
> 
> > Ideally in each patch, since it's unfortunately way too hard to find the
> > cover letter later on.
> > 
> > Jason even went as far as making this a hard requirement because he wasted
> > a bit too much time trying to find the userspace for new uapi:
> > 
> > https://lore.kernel.org/dri-devel/20210804185704.624883-1-jason@jlekstrand.net/
> > 
> > Cheers, Daniel
> > 
> > > ---
> > >  drivers/gpu/drm/i915/i915_query.c | 2 ++
> > >  include/uapi/drm/i915_drm.h       | 8 +++++++-
> > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
> > > index e49da36c62fb..8a72923fbdba 100644
> > > --- a/drivers/gpu/drm/i915/i915_query.c
> > > +++ b/drivers/gpu/drm/i915/i915_query.c
> > > @@ -124,7 +124,9 @@ query_engine_info(struct drm_i915_private *i915,
> > >  	for_each_uabi_engine(engine, i915) {
> > >  		info.engine.engine_class = engine->uabi_class;
> > >  		info.engine.engine_instance = engine->uabi_instance;
> > > +		info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;
> > >  		info.capabilities = engine->uabi_capabilities;
> > > +		info.logical_instance = ilog2(engine->logical_mask);
> > >  
> > >  		if (copy_to_user(info_ptr, &info, sizeof(info)))
> > >  			return -EFAULT;
> > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > index 7f13d241417f..ef72e07fe08c 100644
> > > --- a/include/uapi/drm/i915_drm.h
> > > +++ b/include/uapi/drm/i915_drm.h
> > > @@ -2706,14 +2706,20 @@ struct drm_i915_engine_info {
> > >  
> > >  	/** @flags: Engine flags. */
> > >  	__u64 flags;
> > > +#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
> > >  
> > >  	/** @capabilities: Capabilities of this engine. */
> > >  	__u64 capabilities;
> > >  #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
> > >  #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
> > >  
> > > +	/** @logical_instance: Logical instance of engine */
> > > +	__u16 logical_instance;
> > > +
> > >  	/** @rsvd1: Reserved fields. */
> > > -	__u64 rsvd1[4];
> > > +	__u16 rsvd1[3];
> > > +	/** @rsvd2: Reserved fields. */
> > > +	__u64 rsvd2[3];
> > >  };
> > >  
> > >  /**
> > > -- 
> > > 2.28.0
> > > 
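
As a rough sketch of how userspace could consume the new field through the
existing engine-info query: the flag and logical_instance come from the uapi
diff above, the rest is the standard two-pass query pattern with error
handling trimmed, so treat it as illustrative only.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static void dump_logical_instances(int drm_fd)
{
	struct drm_i915_query_item item = {
		.query_id = DRM_I915_QUERY_ENGINE_INFO,
	};
	struct drm_i915_query query = {
		.num_items = 1,
		.items_ptr = (uintptr_t)&item,
	};
	struct drm_i915_query_engine_info *info;
	unsigned int i;

	/* First pass: the kernel reports the required buffer size. */
	if (ioctl(drm_fd, DRM_IOCTL_I915_QUERY, &query) || item.length <= 0)
		return;

	info = calloc(1, item.length);
	if (!info)
		return;
	item.data_ptr = (uintptr_t)info;

	/* Second pass: the kernel fills the buffer with engine info. */
	if (ioctl(drm_fd, DRM_IOCTL_I915_QUERY, &query) == 0) {
		for (i = 0; i < info->num_engines; i++) {
			const struct drm_i915_engine_info *e = &info->engines[i];

			if (e->flags & I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE)
				printf("class %u instance %u -> logical %u\n",
				       e->engine.engine_class,
				       e->engine.engine_instance,
				       e->logical_instance);
		}
	}

	free(info);
}
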
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship
  2021-08-09 18:44     ` Matthew Brost
@ 2021-08-10  8:45       ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  8:45 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 06:44:16PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 04:37:55PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:12PM -0700, Matthew Brost wrote:
> > > Introduce context parent-child relationship. Once this relationship is
> > > created all pinning / unpinning operations are directed to the parent
> > > context. The parent context is responsible for pinning all of its
> > > children and itself.
> > > 
> > > This is a precursor to the full GuC multi-lrc implementation but aligns
> > > to how the GuC multi-lrc interface is defined - a single H2G is used to
> > > register / deregister all of the contexts simultaneously.
> > > 
> > > Subsequent patches in the series will implement the pinning / unpinning
> > > operations for parent / child contexts.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_context.c       | 29 +++++++++++++++++++
> > >  drivers/gpu/drm/i915/gt/intel_context.h       | 18 ++++++++++++
> > >  drivers/gpu/drm/i915/gt/intel_context_types.h | 12 ++++++++
> > >  3 files changed, 59 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > index 745e84c72c90..8cb92b10b547 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > @@ -395,6 +395,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> > >  	spin_lock_init(&ce->guc_state.lock);
> > >  	INIT_LIST_HEAD(&ce->guc_state.fences);
> > >  
> > > +	INIT_LIST_HEAD(&ce->guc_child_list);
> > > +
> > >  	spin_lock_init(&ce->guc_active.lock);
> > >  	INIT_LIST_HEAD(&ce->guc_active.requests);
> > >  
> > > @@ -414,10 +416,17 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> > >  
> > >  void intel_context_fini(struct intel_context *ce)
> > >  {
> > > +	struct intel_context *child, *next;
> > > +
> > >  	if (ce->timeline)
> > >  		intel_timeline_put(ce->timeline);
> > >  	i915_vm_put(ce->vm);
> > >  
> > > +	/* Need to put the creation ref for the children */
> > > +	if (intel_context_is_parent(ce))
> > > +		for_each_child_safe(ce, child, next)
> > > +			intel_context_put(child);
> > > +
> > >  	mutex_destroy(&ce->pin_mutex);
> > >  	i915_active_fini(&ce->active);
> > >  }
> > > @@ -533,6 +542,26 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
> > >  	return active;
> > >  }
> > >  
> > > +void intel_context_bind_parent_child(struct intel_context *parent,
> > > +				     struct intel_context *child)
> > > +{
> > > +	/*
> > > +	 * Caller's responsibility to validate that this function is used
> > > +	 * correctly, but we use GEM_BUG_ON here to ensure that they do.
> > > +	 */
> > > +	GEM_BUG_ON(!intel_engine_uses_guc(parent->engine));
> > > +	GEM_BUG_ON(intel_context_is_pinned(parent));
> > > +	GEM_BUG_ON(intel_context_is_child(parent));
> > > +	GEM_BUG_ON(intel_context_is_pinned(child));
> > > +	GEM_BUG_ON(intel_context_is_child(child));
> > > +	GEM_BUG_ON(intel_context_is_parent(child));
> > > +
> > > +	parent->guc_number_children++;
> > > +	list_add_tail(&child->guc_child_link,
> > > +		      &parent->guc_child_list);
> > > +	child->parent = parent;
> > > +}
> > > +
> > >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > >  #include "selftest_context.c"
> > >  #endif
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > index c41098950746..ad6ce5ac4824 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > @@ -44,6 +44,24 @@ void intel_context_free(struct intel_context *ce);
> > >  int intel_context_reconfigure_sseu(struct intel_context *ce,
> > >  				   const struct intel_sseu sseu);
> > >  
> > > +static inline bool intel_context_is_child(struct intel_context *ce)
> > > +{
> > > +	return !!ce->parent;
> > > +}
> > > +
> > > +static inline bool intel_context_is_parent(struct intel_context *ce)
> > > +{
> > > +	return !!ce->guc_number_children;
> > > +}
> > > +
> > > +void intel_context_bind_parent_child(struct intel_context *parent,
> > > +				     struct intel_context *child);
> > > +
> > > +#define for_each_child(parent, ce)\
> > > +	list_for_each_entry(ce, &(parent)->guc_child_list, guc_child_link)
> > > +#define for_each_child_safe(parent, ce, cn)\
> > > +	list_for_each_entry_safe(ce, cn, &(parent)->guc_child_list, guc_child_link)
> > > +
> > >  /**
> > >   * intel_context_lock_pinned - Stablises the 'pinned' status of the HW context
> > >   * @ce - the context
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > index 2df79ba39867..66b22b370a72 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > @@ -202,6 +202,18 @@ struct intel_context {
> > >  	/* GuC context blocked fence */
> > >  	struct i915_sw_fence guc_blocked;
> > >  
> > > +	/* Head of children list or link in parent's children list */
> > 
> > Kerneldoc layout would be nice, plus explaining when exactly this is
> > set or the list is empty (e.g. guc_child_list is empty if and only if
> > guc_number_children > 0 and parent == NULL).
> > 
> 
> Sure.
> 
> > Also mentionting that these are invariant over the lifetime of the object
> > would be nice.
> >
> 
> Yes, this is a context creation setup step that is done exactly once and
> is invariant over the lifetime of these contexts.
> 
> > Finally some words on refcounting (like who holds a reference on whom and
> > how we guarantee that use-after-free doesn't go boom since you have links
> > both ways). It looks like parent holds a reference on the child, so how do
> > you make sure the child looking at the parent doesn't go boom?
> 
> I hadn't really thought about the child looking at the parent but I
> believe it is safe. The child only looks up the parent when submissions
> are in flight. We always have refs on the contexts when submissions are
> in flight so we should be good - e.g. the last ref to parent is dropped
> only after all submissions are done and the context is closed.

Yeah that's pretty much the only safe option I could come up with too.
Please
- document this
- enforce it with checks. I think a wrapper to get at the parent, which a)
  can fail and b) checks that the child request is not yet signalled
  should do. Something with try_get or whatever the name is to signal it
  can fail is best.

Then the rule is that a child request holds an implicit reference on the
parent as long as it's unsignalled, but not afterwards. It might also be
good to clear out the parent pointer before signalling the request. If
that races in funny ways there are definitely more problems.
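
A minimal sketch of the kind of wrapper meant here; the helper name and the
exact check are invented for illustration and are not taken from the series:

/*
 * Illustrative only: fetch a real reference to the parent of a child
 * context, relying on the rule that an unsignalled child request
 * implicitly keeps the parent alive. Returns NULL once the child's
 * request has signalled, i.e. when that implicit reference can no
 * longer be assumed.
 */
static struct intel_context *
intel_context_try_get_parent(struct intel_context *ce,
			     struct i915_request *rq)
{
	GEM_BUG_ON(!intel_context_is_child(ce));

	if (i915_request_completed(rq))
		return NULL;

	return intel_context_get(ce->parent);
}

If the parent pointer were additionally cleared before signalling, the same
helper could simply test ce->parent for NULL instead.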
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > > +	union {
> > > +		struct list_head guc_child_list;	/* parent */
> > > +		struct list_head guc_child_link;	/* child */
> > > +	};
> > > +
> > > +	/* Pointer to parent */
> > > +	struct intel_context *parent;
> > > +
> > > +	/* Number of children if parent */
> > > +	u8 guc_number_children;
> > > +
> > >  	/*
> > >  	 * GuC priority management
> > >  	 */
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  2021-08-09 18:58     ` Matthew Brost
@ 2021-08-10  8:53       ` Daniel Vetter
  2021-08-10  9:07         ` Daniel Vetter
  2021-08-11 18:23         ` Matthew Brost
  0 siblings, 2 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  8:53 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 06:58:23PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 05:17:34PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:13PM -0700, Matthew Brost wrote:
> > > Implement GuC parent-child context pin / unpin functions in which, if any
> > > context in the relationship is pinned, all the contexts are pinned. The
> > > parent owns most of the pinning / unpinning process and the children
> > > direct any pins / unpins to the parent.
> > > 
> > > Patch implements a number of unused functions that will be connected
> > > later in the series.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_context.c       | 187 ++++++++++++++++--
> > >  drivers/gpu/drm/i915/gt/intel_context.h       |  43 +---
> > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
> > >  .../drm/i915/gt/intel_execlists_submission.c  |  25 ++-
> > >  drivers/gpu/drm/i915/gt/intel_lrc.c           |  26 +--
> > >  drivers/gpu/drm/i915/gt/intel_lrc.h           |   6 +-
> > >  .../gpu/drm/i915/gt/intel_ring_submission.c   |   5 +-
> > >  drivers/gpu/drm/i915/gt/mock_engine.c         |   4 +-
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 183 +++++++++++++++--
> > >  9 files changed, 371 insertions(+), 112 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > index 8cb92b10b547..bb4c14656067 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > @@ -158,8 +158,8 @@ static void __ring_retire(struct intel_ring *ring)
> > >  	intel_ring_unpin(ring);
> > >  }
> > >  
> > > -static int intel_context_pre_pin(struct intel_context *ce,
> > > -				 struct i915_gem_ww_ctx *ww)
> > > +static int __intel_context_pre_pin(struct intel_context *ce,
> > > +				   struct i915_gem_ww_ctx *ww)
> > >  {
> > >  	int err;
> > >  
> > > @@ -190,7 +190,7 @@ static int intel_context_pre_pin(struct intel_context *ce,
> > >  	return err;
> > >  }
> > >  
> > > -static void intel_context_post_unpin(struct intel_context *ce)
> > > +static void __intel_context_post_unpin(struct intel_context *ce)
> > >  {
> > >  	if (ce->state)
> > >  		__context_unpin_state(ce->state);
> > > @@ -199,13 +199,85 @@ static void intel_context_post_unpin(struct intel_context *ce)
> > >  	__ring_retire(ce->ring);
> > >  }
> > >  
> > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > -			      struct i915_gem_ww_ctx *ww)
> > > +static int intel_context_pre_pin(struct intel_context *ce,
> > > +				 struct i915_gem_ww_ctx *ww)
> > >  {
> > > -	bool handoff = false;
> > > -	void *vaddr;
> > > +	struct intel_context *child;
> > > +	int err, i = 0;
> > > +
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +
> > > +	for_each_child(ce, child) {
> > > +		err = __intel_context_pre_pin(child, ww);
> > > +		if (unlikely(err))
> > > +			goto unwind;
> > > +		++i;
> > > +	}
> > > +
> > > +	err = __intel_context_pre_pin(ce, ww);
> > > +	if (unlikely(err))
> > > +		goto unwind;
> > > +
> > > +	return 0;
> > > +
> > > +unwind:
> > > +	for_each_child(ce, child) {
> > > +		if (!i--)
> > > +			break;
> > > +		__intel_context_post_unpin(child);
> > > +	}
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +static void intel_context_post_unpin(struct intel_context *ce)
> > > +{
> > > +	struct intel_context *child;
> > > +
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +
> > > +	for_each_child(ce, child)
> > > +		__intel_context_post_unpin(child);
> > > +
> > > +	__intel_context_post_unpin(ce);
> > > +}
> > > +
> > > +static int __do_ww_lock(struct intel_context *ce,
> > > +			struct i915_gem_ww_ctx *ww)
> > > +{
> > > +	int err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > +
> > > +	if (!err && ce->ring->vma->obj)
> > > +		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > +	if (!err && ce->state)
> > > +		err = i915_gem_object_lock(ce->state->obj, ww);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +static int do_ww_lock(struct intel_context *ce,
> > > +		      struct i915_gem_ww_ctx *ww)
> > > +{
> > > +	struct intel_context *child;
> > >  	int err = 0;
> > >  
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +
> > > +	for_each_child(ce, child) {
> > > +		err = __do_ww_lock(child, ww);
> > > +		if (unlikely(err))
> > > +			return err;
> > > +	}
> > > +
> > > +	return __do_ww_lock(ce, ww);
> > > +}
> > > +
> > > +static int __intel_context_do_pin_ww(struct intel_context *ce,
> > > +				     struct i915_gem_ww_ctx *ww)
> > > +{
> > > +	bool handoff = false;
> > > +	int err;
> > > +
> > >  	if (unlikely(!test_bit(CONTEXT_ALLOC_BIT, &ce->flags))) {
> > >  		err = intel_context_alloc_state(ce);
> > >  		if (err)
> > > @@ -217,14 +289,11 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > >  	 * refcount for __intel_context_active(), which prevent a lock
> > >  	 * inversion of ce->pin_mutex vs dma_resv_lock().
> > >  	 */
> > > +	err = do_ww_lock(ce, ww);
> > > +	if (err)
> > > +		return err;
> > >  
> > > -	err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > -	if (!err && ce->ring->vma->obj)
> > > -		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > -	if (!err && ce->state)
> > > -		err = i915_gem_object_lock(ce->state->obj, ww);
> > > -	if (!err)
> > > -		err = intel_context_pre_pin(ce, ww);
> > > +	err = intel_context_pre_pin(ce, ww);
> > >  	if (err)
> > >  		return err;
> > >  
> > > @@ -232,7 +301,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > >  	if (err)
> > >  		goto err_ctx_unpin;
> > >  
> > > -	err = ce->ops->pre_pin(ce, ww, &vaddr);
> > > +	err = ce->ops->pre_pin(ce, ww);
> > >  	if (err)
> > >  		goto err_release;
> > >  
> > > @@ -250,7 +319,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > >  		if (unlikely(err))
> > >  			goto err_unlock;
> > >  
> > > -		err = ce->ops->pin(ce, vaddr);
> > > +		err = ce->ops->pin(ce);
> > >  		if (err) {
> > >  			intel_context_active_release(ce);
> > >  			goto err_unlock;
> > > @@ -290,7 +359,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > >  	return err;
> > >  }
> > >  
> > > -int __intel_context_do_pin(struct intel_context *ce)
> > > +static int __intel_context_do_pin(struct intel_context *ce)
> > >  {
> > >  	struct i915_gem_ww_ctx ww;
> > >  	int err;
> > > @@ -337,7 +406,7 @@ static void __intel_context_retire(struct i915_active *active)
> > >  		 intel_context_get_avg_runtime_ns(ce));
> > >  
> > >  	set_bit(CONTEXT_VALID_BIT, &ce->flags);
> > > -	intel_context_post_unpin(ce);
> > > +	__intel_context_post_unpin(ce);
> > >  	intel_context_put(ce);
> > >  }
> > >  
> > > @@ -562,6 +631,88 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> > >  	child->parent = parent;
> > >  }
> > >  
> > > +static inline int ____intel_context_pin(struct intel_context *ce)
> > > +{
> > > +	if (likely(intel_context_pin_if_active(ce)))
> > > +		return 0;
> > > +
> > > +	return __intel_context_do_pin(ce);
> > > +}
> > > +
> > > +static inline int __intel_context_pin_ww(struct intel_context *ce,
> > > +					 struct i915_gem_ww_ctx *ww)
> > > +{
> > > +	if (likely(intel_context_pin_if_active(ce)))
> > > +		return 0;
> > > +
> > > +	return __intel_context_do_pin_ww(ce, ww);
> > > +}
> > > +
> > > +static inline void __intel_context_unpin(struct intel_context *ce)
> > > +{
> > > +	if (!ce->ops->sched_disable) {
> > > +		__intel_context_do_unpin(ce, 1);
> > > +	} else {
> > > +		/*
> > > +		 * Move ownership of this pin to the scheduling disable which is
> > > +		 * an async operation. When that operation completes the above
> > > +		 * intel_context_sched_disable_unpin is called potentially
> > > +		 * unpinning the context.
> > > +		 */
> > > +		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > +			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > 
> > Uh man lockless algorithms.
> > 
> > Unless this comes:
> > - with essentially an academic looking paper that describes the abstract
> >   model of the lockless algorithm and proves it against the linux kernel
> >   memory model.
> > 
> > - lockless stuff generally needs barriers, and those barriers must be all
> >   documented. This means a) a comment next to each barrier in the code b)
> >   pointing to its counterparty c) with the overall design also explained
> >   in the kerneldoc for those datastructres.
> > 
> >   If you don't know where your barriers are, see above point about "it
> >   should look more like an academic paper in the commit message"
> > 
> > - hard perf data about how this is absolutely required, based on a
> >   real-world use-case (which then sometimes justifies a microbenchmark
> >   metric for the details, but it always needs to be real-world based). And
> >   also a thorough explainer of how the perf issue isn't fixable through
> >   better design. If that's not doable, just protect the state machine with
> >   a big dumb lock and move on.
> > 
> > - Also, because the current code is in such bad shape wrt lockless
> >   algorithms and premature optimizations: Overall complexity should go
> >   down (it's way too high right now), so pay down your new lockless trick
> >   by removing one of the existing ones that we only have because we can.
> > 
> > Yes this is steep, but we're way out in the woods here and need to somehow
> > get back.
> 
> See below FIXME. At one point all of this was hidden in the backend but
> the dma-resv patches that landed upstream completely broke the layering,
> hence the need for the code here.
> 
> I guess I don't really understand what you mean when you say a lockless alg
> needs barriers - if the atomic functions are not really atomic, wouldn't
> the world be broken?

They're unordered atomics by default. Which means they're atomic in
themselves, but entirely unordered with anything else that's going on.
Except when you have one of the atomic ops which already guarantees a
barrier, or you manually add the barriers yourself. And yes there are
enormous amounts of bugs, and with our dGPU potentially running on non-IA
CPUs those bugs matter.

Note that in C++ atomics the default behaviour is strongly ordered atomics
with full barriers, because those are much easier to program
against. Kernel isn't like that and defaults to "you need to add all the
barriers yourself".
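
As a generic illustration of what that means in practice - this is not code
from the series, just the classic publish/consume pattern written with kernel
atomics, where both barriers have to be added by hand and each one commented
with its counterpart:

#include <linux/atomic.h>
#include <linux/types.h>

struct publish_example {
	int payload;
	atomic_t ready;
};

/* Writer: the data must be visible before the flag. */
static void publish(struct publish_example *p, int value)
{
	WRITE_ONCE(p->payload, value);
	smp_wmb(); /* pairs with smp_rmb() in consume() */
	atomic_set(&p->ready, 1); /* atomic_set() alone implies no ordering */
}

/* Reader: only look at the data once the flag is seen. */
static bool consume(struct publish_example *p, int *value)
{
	if (!atomic_read(&p->ready)) /* atomic_read() implies no ordering either */
		return false;

	smp_rmb(); /* pairs with smp_wmb() in publish() */
	*value = READ_ONCE(p->payload);
	return true;
}
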

I have a full length rant in the works and will work that through all
channels, but essentially locking is really hard to get right. And
lockless tricks practically need an academic paper with a formal
correctness proof against the linux memory model, or you do have bugs.

And I know that the current code is choke full of this stuff, so it's
tempting to just add more, but we really cant. The amount of locking
trickery we have in the codebase must go down substantially. My take is
that any code that adds anything tricky needs to fully justify it against
the above list, _and_ also clean up some of the existing nonsense so that
overall complexity doesn't increase.

I'll share the full length rant with you internally, it's not yet ready
for publishing (but that's planned too).


> Also here I don't think it is really as simple as grabbing a big dumb lock for
> a variety of reasons, at least with the current dynamic pin / unpin code
> in place. If we move to perma-pinned contexts this could be cleaned up
> then.

Yes it's a disaster, but we need to stop the bleeding. If perma-pinned
contexts can fix this I think we should do this asap. I'd say for parallel
contexts we should just do it outright (special case them or whatever) so
that we don't have to add even more very tricky code and tech debt.

Doable?

Cheers, Daniel


> 
> Matt
> 
> > -Daniel
> > 
> > > +				ce->ops->sched_disable(ce);
> > > +				break;
> > > +			}
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * FIXME: This is ugly, these branches are only needed for parallel contexts in
> > > + * GuC submission. Basically the idea is that if any of the contexts that are
> > > + * configured for parallel submission are pinned, all the contexts need to be
> > > + * pinned in order to register these contexts with the GuC. We are adding the
> > > + * layer here while it should probably be pushed to the backend via a vfunc. But
> > > + * since we already have ce->pin + a layer atop it is confusing. Definitely
> > > + * needs a bit of rework how to properly layer / structure this code path. What
> > > + * is in place works but is not ideal.
> > > + */
> > > +int intel_context_pin(struct intel_context *ce)
> > > +{
> > > +	if (intel_context_is_child(ce)) {
> > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > +			return ____intel_context_pin(ce->parent);
> > > +		else
> > > +			return 0;
> > > +	} else {
> > > +		return ____intel_context_pin(ce);
> > > +	}
> > > +}
> > > +
> > > +int intel_context_pin_ww(struct intel_context *ce,
> > > +			 struct i915_gem_ww_ctx *ww)
> > > +{
> > > +	if (intel_context_is_child(ce)) {
> > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > +			return __intel_context_pin_ww(ce->parent, ww);
> > > +		else
> > > +			return 0;
> > > +	} else {
> > > +		return __intel_context_pin_ww(ce, ww);
> > > +	}
> > > +}
> > > +
> > > +void intel_context_unpin(struct intel_context *ce)
> > > +{
> > > +	if (intel_context_is_child(ce)) {
> > > +		if (atomic_fetch_add(-1, &ce->pin_count) == 1)
> > > +			__intel_context_unpin(ce->parent);
> > > +	} else {
> > > +		__intel_context_unpin(ce);
> > > +	}
> > > +}
> > > +
> > >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > >  #include "selftest_context.c"
> > >  #endif
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > index ad6ce5ac4824..c208691fc87d 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > @@ -110,31 +110,15 @@ static inline void intel_context_unlock_pinned(struct intel_context *ce)
> > >  	mutex_unlock(&ce->pin_mutex);
> > >  }
> > >  
> > > -int __intel_context_do_pin(struct intel_context *ce);
> > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > -			      struct i915_gem_ww_ctx *ww);
> > > -
> > >  static inline bool intel_context_pin_if_active(struct intel_context *ce)
> > >  {
> > >  	return atomic_inc_not_zero(&ce->pin_count);
> > >  }
> > >  
> > > -static inline int intel_context_pin(struct intel_context *ce)
> > > -{
> > > -	if (likely(intel_context_pin_if_active(ce)))
> > > -		return 0;
> > > -
> > > -	return __intel_context_do_pin(ce);
> > > -}
> > > -
> > > -static inline int intel_context_pin_ww(struct intel_context *ce,
> > > -				       struct i915_gem_ww_ctx *ww)
> > > -{
> > > -	if (likely(intel_context_pin_if_active(ce)))
> > > -		return 0;
> > > +int intel_context_pin(struct intel_context *ce);
> > >  
> > > -	return __intel_context_do_pin_ww(ce, ww);
> > > -}
> > > +int intel_context_pin_ww(struct intel_context *ce,
> > > +			 struct i915_gem_ww_ctx *ww);
> > >  
> > >  static inline void __intel_context_pin(struct intel_context *ce)
> > >  {
> > > @@ -146,28 +130,11 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub);
> > >  
> > >  static inline void intel_context_sched_disable_unpin(struct intel_context *ce)
> > >  {
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > >  	__intel_context_do_unpin(ce, 2);
> > >  }
> > >  
> > > -static inline void intel_context_unpin(struct intel_context *ce)
> > > -{
> > > -	if (!ce->ops->sched_disable) {
> > > -		__intel_context_do_unpin(ce, 1);
> > > -	} else {
> > > -		/*
> > > -		 * Move ownership of this pin to the scheduling disable which is
> > > -		 * an async operation. When that operation completes the above
> > > -		 * intel_context_sched_disable_unpin is called potentially
> > > -		 * unpinning the context.
> > > -		 */
> > > -		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > -			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > > -				ce->ops->sched_disable(ce);
> > > -				break;
> > > -			}
> > > -		}
> > > -	}
> > > -}
> > > +void intel_context_unpin(struct intel_context *ce);
> > >  
> > >  void intel_context_enter_engine(struct intel_context *ce);
> > >  void intel_context_exit_engine(struct intel_context *ce);
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > index 66b22b370a72..eb82be15b7a2 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > @@ -39,8 +39,8 @@ struct intel_context_ops {
> > >  
> > >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > >  
> > > -	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
> > > -	int (*pin)(struct intel_context *ce, void *vaddr);
> > > +	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > > +	int (*pin)(struct intel_context *ce);
> > >  	void (*unpin)(struct intel_context *ce);
> > >  	void (*post_unpin)(struct intel_context *ce);
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > index baa1797af1c8..fc74ca28f245 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > @@ -2554,16 +2554,17 @@ static void execlists_submit_request(struct i915_request *request)
> > >  static int
> > >  __execlists_context_pre_pin(struct intel_context *ce,
> > >  			    struct intel_engine_cs *engine,
> > > -			    struct i915_gem_ww_ctx *ww, void **vaddr)
> > > +			    struct i915_gem_ww_ctx *ww)
> > >  {
> > >  	int err;
> > >  
> > > -	err = lrc_pre_pin(ce, engine, ww, vaddr);
> > > +	err = lrc_pre_pin(ce, engine, ww);
> > >  	if (err)
> > >  		return err;
> > >  
> > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags)) {
> > > -		lrc_init_state(ce, engine, *vaddr);
> > > +		lrc_init_state(ce, engine, ce->lrc_reg_state -
> > > +			       LRC_STATE_OFFSET / sizeof(*ce->lrc_reg_state));
> > >  
> > >  		 __i915_gem_object_flush_map(ce->state->obj, 0, engine->context_size);
> > >  	}
> > > @@ -2572,15 +2573,14 @@ __execlists_context_pre_pin(struct intel_context *ce,
> > >  }
> > >  
> > >  static int execlists_context_pre_pin(struct intel_context *ce,
> > > -				     struct i915_gem_ww_ctx *ww,
> > > -				     void **vaddr)
> > > +				     struct i915_gem_ww_ctx *ww)
> > >  {
> > > -	return __execlists_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > +	return __execlists_context_pre_pin(ce, ce->engine, ww);
> > >  }
> > >  
> > > -static int execlists_context_pin(struct intel_context *ce, void *vaddr)
> > > +static int execlists_context_pin(struct intel_context *ce)
> > >  {
> > > -	return lrc_pin(ce, ce->engine, vaddr);
> > > +	return lrc_pin(ce, ce->engine);
> > >  }
> > >  
> > >  static int execlists_context_alloc(struct intel_context *ce)
> > > @@ -3570,20 +3570,19 @@ static int virtual_context_alloc(struct intel_context *ce)
> > >  }
> > >  
> > >  static int virtual_context_pre_pin(struct intel_context *ce,
> > > -				   struct i915_gem_ww_ctx *ww,
> > > -				   void **vaddr)
> > > +				   struct i915_gem_ww_ctx *ww)
> > >  {
> > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > >  
> > >  	 /* Note: we must use a real engine class for setting up reg state */
> > > -	return __execlists_context_pre_pin(ce, ve->siblings[0], ww, vaddr);
> > > +	return __execlists_context_pre_pin(ce, ve->siblings[0], ww);
> > >  }
> > >  
> > > -static int virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > +static int virtual_context_pin(struct intel_context *ce)
> > >  {
> > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > >  
> > > -	return lrc_pin(ce, ve->siblings[0], vaddr);
> > > +	return lrc_pin(ce, ve->siblings[0]);
> > >  }
> > >  
> > >  static void virtual_context_enter(struct intel_context *ce)
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > index bb4af4977920..c466fc966005 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > @@ -947,30 +947,30 @@ void lrc_reset(struct intel_context *ce)
> > >  int
> > >  lrc_pre_pin(struct intel_context *ce,
> > >  	    struct intel_engine_cs *engine,
> > > -	    struct i915_gem_ww_ctx *ww,
> > > -	    void **vaddr)
> > > +	    struct i915_gem_ww_ctx *ww)
> > >  {
> > > +	void *vaddr;
> > >  	GEM_BUG_ON(!ce->state);
> > >  	GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
> > >  
> > > -	*vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > -					 i915_coherent_map_type(ce->engine->i915,
> > > -								ce->state->obj,
> > > -								false) |
> > > -					 I915_MAP_OVERRIDE);
> > > +	vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > +					i915_coherent_map_type(ce->engine->i915,
> > > +							       ce->state->obj,
> > > +							       false) |
> > > +					I915_MAP_OVERRIDE);
> > >  
> > > -	return PTR_ERR_OR_ZERO(*vaddr);
> > > +	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > +
> > > +	return PTR_ERR_OR_ZERO(vaddr);
> > >  }
> > >  
> > >  int
> > >  lrc_pin(struct intel_context *ce,
> > > -	struct intel_engine_cs *engine,
> > > -	void *vaddr)
> > > +	struct intel_engine_cs *engine)
> > >  {
> > > -	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > -
> > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags))
> > > -		lrc_init_state(ce, engine, vaddr);
> > > +		lrc_init_state(ce, engine,
> > > +			       (void *)ce->lrc_reg_state - LRC_STATE_OFFSET);
> > >  
> > >  	ce->lrc.lrca = lrc_update_regs(ce, engine, ce->ring->tail);
> > >  	return 0;
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > index 7f697845c4cf..837fcf00270d 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > @@ -38,12 +38,10 @@ void lrc_destroy(struct kref *kref);
> > >  int
> > >  lrc_pre_pin(struct intel_context *ce,
> > >  	    struct intel_engine_cs *engine,
> > > -	    struct i915_gem_ww_ctx *ww,
> > > -	    void **vaddr);
> > > +	    struct i915_gem_ww_ctx *ww);
> > >  int
> > >  lrc_pin(struct intel_context *ce,
> > > -	struct intel_engine_cs *engine,
> > > -	void *vaddr);
> > > +	struct intel_engine_cs *engine);
> > >  void lrc_unpin(struct intel_context *ce);
> > >  void lrc_post_unpin(struct intel_context *ce);
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > index 2958e2fae380..f4f301bfb9f7 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > @@ -472,8 +472,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
> > >  }
> > >  
> > >  static int ring_context_pre_pin(struct intel_context *ce,
> > > -				struct i915_gem_ww_ctx *ww,
> > > -				void **unused)
> > > +				struct i915_gem_ww_ctx *ww)
> > >  {
> > >  	struct i915_address_space *vm;
> > >  	int err = 0;
> > > @@ -576,7 +575,7 @@ static int ring_context_alloc(struct intel_context *ce)
> > >  	return 0;
> > >  }
> > >  
> > > -static int ring_context_pin(struct intel_context *ce, void *unused)
> > > +static int ring_context_pin(struct intel_context *ce)
> > >  {
> > >  	return 0;
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > index 2c1af030310c..826b5d7a4573 100644
> > > --- a/drivers/gpu/drm/i915/gt/mock_engine.c
> > > +++ b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > @@ -167,12 +167,12 @@ static int mock_context_alloc(struct intel_context *ce)
> > >  }
> > >  
> > >  static int mock_context_pre_pin(struct intel_context *ce,
> > > -				struct i915_gem_ww_ctx *ww, void **unused)
> > > +				struct i915_gem_ww_ctx *ww)
> > >  {
> > >  	return 0;
> > >  }
> > >  
> > > -static int mock_context_pin(struct intel_context *ce, void *unused)
> > > +static int mock_context_pin(struct intel_context *ce)
> > >  {
> > >  	return 0;
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index dec757d319a2..c5c73c42bcf7 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -1905,6 +1905,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > >  
> > >  	GEM_BUG_ON(!engine->mask);
> > >  	GEM_BUG_ON(context_guc_id_invalid(ce));
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > >  
> > >  	/*
> > >  	 * Ensure LRC + CT vmas are is same region as write barrier is done
> > > @@ -2008,15 +2009,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > >  
> > >  static int __guc_context_pre_pin(struct intel_context *ce,
> > >  				 struct intel_engine_cs *engine,
> > > -				 struct i915_gem_ww_ctx *ww,
> > > -				 void **vaddr)
> > > +				 struct i915_gem_ww_ctx *ww)
> > >  {
> > > -	return lrc_pre_pin(ce, engine, ww, vaddr);
> > > +	return lrc_pre_pin(ce, engine, ww);
> > >  }
> > >  
> > >  static int __guc_context_pin(struct intel_context *ce,
> > > -			     struct intel_engine_cs *engine,
> > > -			     void *vaddr)
> > > +			     struct intel_engine_cs *engine)
> > >  {
> > >  	if (i915_ggtt_offset(ce->state) !=
> > >  	    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK))
> > > @@ -2027,20 +2026,33 @@ static int __guc_context_pin(struct intel_context *ce,
> > >  	 * explaination of why.
> > >  	 */
> > >  
> > > -	return lrc_pin(ce, engine, vaddr);
> > > +	return lrc_pin(ce, engine);
> > > +}
> > > +
> > > +static void __guc_context_unpin(struct intel_context *ce)
> > > +{
> > > +	lrc_unpin(ce);
> > > +}
> > > +
> > > +static void __guc_context_post_unpin(struct intel_context *ce)
> > > +{
> > > +	lrc_post_unpin(ce);
> > >  }
> > >  
> > >  static int guc_context_pre_pin(struct intel_context *ce,
> > > -			       struct i915_gem_ww_ctx *ww,
> > > -			       void **vaddr)
> > > +			       struct i915_gem_ww_ctx *ww)
> > >  {
> > > -	return __guc_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > +	return __guc_context_pre_pin(ce, ce->engine, ww);
> > >  }
> > >  
> > > -static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > > +static int guc_context_pin(struct intel_context *ce)
> > >  {
> > > -	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > +	int ret;
> > >  
> > > +	GEM_BUG_ON(intel_context_is_parent(ce) ||
> > > +		   intel_context_is_child(ce));
> > > +
> > > +	ret = __guc_context_pin(ce, ce->engine);
> > >  	if (likely(!ret && !intel_context_is_barrier(ce)))
> > >  		intel_engine_pm_get(ce->engine);
> > >  
> > > @@ -2054,7 +2066,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > >  	GEM_BUG_ON(context_enabled(ce));
> > >  
> > >  	unpin_guc_id(guc, ce, true);
> > > -	lrc_unpin(ce);
> > > +	__guc_context_unpin(ce);
> > >  
> > >  	if (likely(!intel_context_is_barrier(ce)))
> > >  		intel_engine_pm_put(ce->engine);
> > > @@ -2062,7 +2074,141 @@ static void guc_context_unpin(struct intel_context *ce)
> > >  
> > >  static void guc_context_post_unpin(struct intel_context *ce)
> > >  {
> > > -	lrc_post_unpin(ce);
> > > +	__guc_context_post_unpin(ce);
> > > +}
> > > +
> > > +/* Future patches will use this function */
> > > +__maybe_unused
> > > +static int guc_parent_context_pre_pin(struct intel_context *ce,
> > > +				      struct i915_gem_ww_ctx *ww)
> > > +{
> > > +	struct intel_context *child;
> > > +	int err, i = 0, j = 0;
> > > +
> > > +	for_each_child(ce, child) {
> > > +		err = i915_active_acquire(&child->active);
> > > +		if (unlikely(err))
> > > +			goto unwind_active;
> > > +		++i;
> > > +	}
> > > +
> > > +	for_each_child(ce, child) {
> > > +		err = __guc_context_pre_pin(child, child->engine, ww);
> > > +		if (unlikely(err))
> > > +			goto unwind_pre_pin;
> > > +		++j;
> > > +	}
> > > +
> > > +	err = __guc_context_pre_pin(ce, ce->engine, ww);
> > > +	if (unlikely(err))
> > > +		goto unwind_pre_pin;
> > > +
> > > +	return 0;
> > > +
> > > +unwind_pre_pin:
> > > +	for_each_child(ce, child) {
> > > +		if (!j--)
> > > +			break;
> > > +		__guc_context_post_unpin(child);
> > > +	}
> > > +
> > > +unwind_active:
> > > +	for_each_child(ce, child) {
> > > +		if (!i--)
> > > +			break;
> > > +		i915_active_release(&child->active);
> > > +	}
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/* Future patches will use this function */
> > > +__maybe_unused
> > > +static void guc_parent_context_post_unpin(struct intel_context *ce)
> > > +{
> > > +	struct intel_context *child;
> > > +
> > > +	for_each_child(ce, child)
> > > +		__guc_context_post_unpin(child);
> > > +	__guc_context_post_unpin(ce);
> > > +
> > > +	for_each_child(ce, child) {
> > > +		intel_context_get(child);
> > > +		i915_active_release(&child->active);
> > > +		intel_context_put(child);
> > > +	}
> > > +}
> > > +
> > > +/* Future patches will use this function */
> > > +__maybe_unused
> > > +static int guc_parent_context_pin(struct intel_context *ce)
> > > +{
> > > +	int ret, i = 0, j = 0;
> > > +	struct intel_context *child;
> > > +	struct intel_engine_cs *engine;
> > > +	intel_engine_mask_t tmp;
> > > +
> > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > +
> > > +	for_each_child(ce, child) {
> > > +		ret = __guc_context_pin(child, child->engine);
> > > +		if (unlikely(ret))
> > > +			goto unwind_pin;
> > > +		++i;
> > > +	}
> > > +	ret = __guc_context_pin(ce, ce->engine);
> > > +	if (unlikely(ret))
> > > +		goto unwind_pin;
> > > +
> > > +	for_each_child(ce, child)
> > > +		if (test_bit(CONTEXT_LRCA_DIRTY, &child->flags)) {
> > > +			set_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
> > > +			break;
> > > +		}
> > > +
> > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > +			       ce->engine->mask, tmp)
> > > +		intel_engine_pm_get(engine);
> > > +	for_each_child(ce, child)
> > > +		for_each_engine_masked(engine, child->engine->gt,
> > > +				       child->engine->mask, tmp)
> > > +			intel_engine_pm_get(engine);
> > > +
> > > +	return 0;
> > > +
> > > +unwind_pin:
> > > +	for_each_child(ce, child) {
> > > +		if (++j > i)
> > > +			break;
> > > +		__guc_context_unpin(child);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/* Future patches will use this function */
> > > +__maybe_unused
> > > +static void guc_parent_context_unpin(struct intel_context *ce)
> > > +{
> > > +	struct intel_context *child;
> > > +	struct intel_engine_cs *engine;
> > > +	intel_engine_mask_t tmp;
> > > +
> > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > +	GEM_BUG_ON(context_enabled(ce));
> > > +
> > > +	unpin_guc_id(ce_to_guc(ce), ce, true);
> > > +	for_each_child(ce, child)
> > > +		__guc_context_unpin(child);
> > > +	__guc_context_unpin(ce);
> > > +
> > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > +			       ce->engine->mask, tmp)
> > > +		intel_engine_pm_put(engine);
> > > +	for_each_child(ce, child)
> > > +		for_each_engine_masked(engine, child->engine->gt,
> > > +				       child->engine->mask, tmp)
> > > +			intel_engine_pm_put(engine);
> > >  }
> > >  
> > >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > > @@ -2993,18 +3139,17 @@ static int guc_request_alloc(struct i915_request *rq)
> > >  }
> > >  
> > >  static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > > -				       struct i915_gem_ww_ctx *ww,
> > > -				       void **vaddr)
> > > +				       struct i915_gem_ww_ctx *ww)
> > >  {
> > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > >  
> > > -	return __guc_context_pre_pin(ce, engine, ww, vaddr);
> > > +	return __guc_context_pre_pin(ce, engine, ww);
> > >  }
> > >  
> > > -static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > +static int guc_virtual_context_pin(struct intel_context *ce)
> > >  {
> > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > -	int ret = __guc_context_pin(ce, engine, vaddr);
> > > +	int ret = __guc_context_pin(ce, engine);
> > >  	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > >  
> > >  	if (likely(!ret))
> > > @@ -3024,7 +3169,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> > >  	GEM_BUG_ON(intel_context_is_barrier(ce));
> > >  
> > >  	unpin_guc_id(guc, ce, true);
> > > -	lrc_unpin(ce);
> > > +	__guc_context_unpin(ce);
> > >  
> > >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > >  		intel_engine_pm_put(engine);
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  2021-08-10  8:53       ` Daniel Vetter
@ 2021-08-10  9:07         ` Daniel Vetter
  2021-08-11 18:06           ` Matthew Brost
  2021-08-11 18:23         ` Matthew Brost
  1 sibling, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  9:07 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Tue, Aug 10, 2021 at 10:53:37AM +0200, Daniel Vetter wrote:
> On Mon, Aug 09, 2021 at 06:58:23PM +0000, Matthew Brost wrote:
> > On Mon, Aug 09, 2021 at 05:17:34PM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 03, 2021 at 03:29:13PM -0700, Matthew Brost wrote:
> > > > Implement GuC parent-child context pin / unpin functions in which, if any
> > > > context in the relationship is pinned, all the contexts are pinned. The
> > > > parent owns most of the pinning / unpinning process and the children
> > > > direct any pins / unpins to the parent.
> > > > 
> > > > Patch implements a number of unused functions that will be connected
> > > > later in the series.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/intel_context.c       | 187 ++++++++++++++++--
> > > >  drivers/gpu/drm/i915/gt/intel_context.h       |  43 +---
> > > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
> > > >  .../drm/i915/gt/intel_execlists_submission.c  |  25 ++-
> > > >  drivers/gpu/drm/i915/gt/intel_lrc.c           |  26 +--
> > > >  drivers/gpu/drm/i915/gt/intel_lrc.h           |   6 +-
> > > >  .../gpu/drm/i915/gt/intel_ring_submission.c   |   5 +-
> > > >  drivers/gpu/drm/i915/gt/mock_engine.c         |   4 +-
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 183 +++++++++++++++--
> > > >  9 files changed, 371 insertions(+), 112 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > index 8cb92b10b547..bb4c14656067 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > @@ -158,8 +158,8 @@ static void __ring_retire(struct intel_ring *ring)
> > > >  	intel_ring_unpin(ring);
> > > >  }
> > > >  
> > > > -static int intel_context_pre_pin(struct intel_context *ce,
> > > > -				 struct i915_gem_ww_ctx *ww)
> > > > +static int __intel_context_pre_pin(struct intel_context *ce,
> > > > +				   struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	int err;
> > > >  
> > > > @@ -190,7 +190,7 @@ static int intel_context_pre_pin(struct intel_context *ce,
> > > >  	return err;
> > > >  }
> > > >  
> > > > -static void intel_context_post_unpin(struct intel_context *ce)
> > > > +static void __intel_context_post_unpin(struct intel_context *ce)
> > > >  {
> > > >  	if (ce->state)
> > > >  		__context_unpin_state(ce->state);
> > > > @@ -199,13 +199,85 @@ static void intel_context_post_unpin(struct intel_context *ce)
> > > >  	__ring_retire(ce->ring);
> > > >  }
> > > >  
> > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > -			      struct i915_gem_ww_ctx *ww)
> > > > +static int intel_context_pre_pin(struct intel_context *ce,
> > > > +				 struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > -	bool handoff = false;
> > > > -	void *vaddr;
> > > > +	struct intel_context *child;
> > > > +	int err, i = 0;
> > > > +
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		err = __intel_context_pre_pin(child, ww);
> > > > +		if (unlikely(err))
> > > > +			goto unwind;
> > > > +		++i;
> > > > +	}
> > > > +
> > > > +	err = __intel_context_pre_pin(ce, ww);
> > > > +	if (unlikely(err))
> > > > +		goto unwind;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +unwind:
> > > > +	for_each_child(ce, child) {
> > > > +		if (!i--)
> > > > +			break;
> > > > +		__intel_context_post_unpin(child);
> > > > +	}
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +static void intel_context_post_unpin(struct intel_context *ce)
> > > > +{
> > > > +	struct intel_context *child;
> > > > +
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	for_each_child(ce, child)
> > > > +		__intel_context_post_unpin(child);
> > > > +
> > > > +	__intel_context_post_unpin(ce);
> > > > +}
> > > > +
> > > > +static int __do_ww_lock(struct intel_context *ce,
> > > > +			struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	int err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > +
> > > > +	if (!err && ce->ring->vma->obj)
> > > > +		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > +	if (!err && ce->state)
> > > > +		err = i915_gem_object_lock(ce->state->obj, ww);
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +static int do_ww_lock(struct intel_context *ce,
> > > > +		      struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	struct intel_context *child;
> > > >  	int err = 0;
> > > >  
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		err = __do_ww_lock(child, ww);
> > > > +		if (unlikely(err))
> > > > +			return err;
> > > > +	}
> > > > +
> > > > +	return __do_ww_lock(ce, ww);
> > > > +}
> > > > +
> > > > +static int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > +				     struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	bool handoff = false;
> > > > +	int err;
> > > > +
> > > >  	if (unlikely(!test_bit(CONTEXT_ALLOC_BIT, &ce->flags))) {
> > > >  		err = intel_context_alloc_state(ce);
> > > >  		if (err)
> > > > @@ -217,14 +289,11 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >  	 * refcount for __intel_context_active(), which prevent a lock
> > > >  	 * inversion of ce->pin_mutex vs dma_resv_lock().
> > > >  	 */
> > > > +	err = do_ww_lock(ce, ww);
> > > > +	if (err)
> > > > +		return err;
> > > >  
> > > > -	err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > -	if (!err && ce->ring->vma->obj)
> > > > -		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > -	if (!err && ce->state)
> > > > -		err = i915_gem_object_lock(ce->state->obj, ww);
> > > > -	if (!err)
> > > > -		err = intel_context_pre_pin(ce, ww);
> > > > +	err = intel_context_pre_pin(ce, ww);
> > > >  	if (err)
> > > >  		return err;
> > > >  
> > > > @@ -232,7 +301,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >  	if (err)
> > > >  		goto err_ctx_unpin;
> > > >  
> > > > -	err = ce->ops->pre_pin(ce, ww, &vaddr);
> > > > +	err = ce->ops->pre_pin(ce, ww);
> > > >  	if (err)
> > > >  		goto err_release;
> > > >  
> > > > @@ -250,7 +319,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >  		if (unlikely(err))
> > > >  			goto err_unlock;
> > > >  
> > > > -		err = ce->ops->pin(ce, vaddr);
> > > > +		err = ce->ops->pin(ce);
> > > >  		if (err) {
> > > >  			intel_context_active_release(ce);
> > > >  			goto err_unlock;
> > > > @@ -290,7 +359,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >  	return err;
> > > >  }
> > > >  
> > > > -int __intel_context_do_pin(struct intel_context *ce)
> > > > +static int __intel_context_do_pin(struct intel_context *ce)
> > > >  {
> > > >  	struct i915_gem_ww_ctx ww;
> > > >  	int err;
> > > > @@ -337,7 +406,7 @@ static void __intel_context_retire(struct i915_active *active)
> > > >  		 intel_context_get_avg_runtime_ns(ce));
> > > >  
> > > >  	set_bit(CONTEXT_VALID_BIT, &ce->flags);
> > > > -	intel_context_post_unpin(ce);
> > > > +	__intel_context_post_unpin(ce);
> > > >  	intel_context_put(ce);
> > > >  }
> > > >  
> > > > @@ -562,6 +631,88 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> > > >  	child->parent = parent;
> > > >  }
> > > >  
> > > > +static inline int ____intel_context_pin(struct intel_context *ce)
> > > > +{
> > > > +	if (likely(intel_context_pin_if_active(ce)))
> > > > +		return 0;
> > > > +
> > > > +	return __intel_context_do_pin(ce);
> > > > +}
> > > > +
> > > > +static inline int __intel_context_pin_ww(struct intel_context *ce,
> > > > +					 struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	if (likely(intel_context_pin_if_active(ce)))
> > > > +		return 0;
> > > > +
> > > > +	return __intel_context_do_pin_ww(ce, ww);
> > > > +}
> > > > +
> > > > +static inline void __intel_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	if (!ce->ops->sched_disable) {
> > > > +		__intel_context_do_unpin(ce, 1);
> > > > +	} else {
> > > > +		/*
> > > > +		 * Move ownership of this pin to the scheduling disable which is
> > > > +		 * an async operation. When that operation completes the above
> > > > +		 * intel_context_sched_disable_unpin is called potentially
> > > > +		 * unpinning the context.
> > > > +		 */
> > > > +		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > +			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {

Just as an example of what I mean here on the code review side. This is an
endless loop, and you need to prove that there are no livelock or starvation
issues. Or explain how else you handle that if there is one.

Because unlike hand-rolled stuff, linux kernel spinlocks are not dumb
spinlocks but ticketed/queued locks, and therefore starvation proof. But
this stuff actually matters on today's multi-core and not-so-uniform (even
without fully NUMA) architectures.

Also I've just found another lockless retry loop which does actually
degenerate into a full endless loop (if you're sufficiently unlucky in
your races), so this really isn't academic at all.
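
For comparison, this is roughly the "big dumb lock" shape of the quoted
pin_count / sched_disable handoff. Every name here (the struct, the lock, the
callbacks) is invented for the sketch; it is not the series' actual code, just
an illustration of trading the cmpxchg retry loop for one short critical
section whose forward progress is trivial to argue about:

#include <linux/spinlock.h>
#include <linux/types.h>

/* All fields are stand-ins; lock initialisation etc. omitted. */
struct example_context {
	spinlock_t pin_lock;
	unsigned int pin_count;
	void (*sched_disable)(struct example_context *ce);
	void (*do_unpin)(struct example_context *ce);
};

static void example_context_unpin(struct example_context *ce)
{
	bool disable = false, last = false;

	spin_lock(&ce->pin_lock);
	if (ce->pin_count == 1 && ce->sched_disable) {
		/* Hand the final pin over to the async sched-disable path. */
		ce->pin_count = 2;
		disable = true;
	} else {
		last = !--ce->pin_count;
	}
	spin_unlock(&ce->pin_lock);

	if (disable)
		ce->sched_disable(ce);
	else if (last)
		ce->do_unpin(ce);
}
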
-Daniel

> > > 
> > > Uh man lockless algorithms.
> > > 
> > > Unless this comes:
> > > - with essentially an academic looking paper that describes the abstract
> > >   model of the lockless algorithm and proves it against the linux kernel
> > >   memory model.
> > > 
> > > - lockless stuff generally needs barriers, and those barriers must be all
> > >   documented. This means a) a comment next to each barrier in the code b)
> > >   pointing to its counterparty c) with the overall design also explained
> > >   in the kerneldoc for those datastructres.
> > > 
> > >   If you don't know where your barriers are, see above point about "it
> > >   should look more like an academic paper in the commit message"
> > > 
> > > - hard perf data about how this is absolutely required, based on a
> > >   real-world use-case (which then sometimes justifies a microbenchmark
> > >   metric for the details, but it always needs to be real-world based). And
> > >   also a thorough explainer of how the perf issue isn't fixable through
> > >   better design. If that's not doable, just protect the state machine with
> > >   a big dumb lock and move on.
> > > 
> > > - Also, because the current code is in such bad shape wrt lockless
> > >   algorithms and premature optimizations: Overall complexity should go
> > >   down (it's way too high right now), so pay down your new lockless trick
> > >   by removing one of the existing ones that we only have because we can.
> > > 
> > > Yes this is steep, but we're way out in the woods here and need to somehow
> > > get back.
> > 
> > See below FIXME. At one point all of this was hidden in the backend but
> > the dma-resv patches that landed upstream completely broke the layering,
> > hence the need for the code here.
> > 
> > I guess I don't really understand what you mean when you say a lockless alg
> > needs barriers - if the atomic functions are not really atomic, wouldn't
> > the world be broken?
> 
> They're unordered atomics by default. Which means they're atomic in
> themselves, but entirely unordered with anything else that's going on.
> Except when you have one of the atomic ops which already guarantee a
> barrier, or you manually add the barriers yourself. And yes there are
> enormous amounts of bugs, and with our dgpu potentially running on non-IA
> cpus those bugs matter.
> 
> Note that in C++ the default is sequentially consistent atomics with full
> barriers, because those are much easier to program against. The kernel
> isn't like that and defaults to "you need to add all the barriers
> yourself".
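
A toy example (completely made up, not from this patch) of what "every
barrier documented, pointing at its counterpart" looks like in practice:

#include <linux/atomic.h>

struct foo {
	int payload;
	atomic_t ready;
};

static void publish(struct foo *f, int v)
{
	f->payload = v;
	/*
	 * Release: pairs with the acquire in consume(), guarantees the
	 * payload store is visible before the ready flag is observed.
	 */
	atomic_set_release(&f->ready, 1);
}

static int consume(struct foo *f)
{
	/* Acquire: pairs with the release in publish(). */
	if (!atomic_read_acquire(&f->ready))
		return -EAGAIN;

	return f->payload;
}

With plain atomic_set()/atomic_read() the payload read in consume() could
legally observe stale data on weakly ordered machines.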
> 
> I have a full-length rant in the works and will work that through all
> channels, but essentially locking is really hard to get right. And
> lockless tricks practically need an academic paper with a formal
> correctness proof against the linux memory model, or you do have bugs.
> 
> And I know that the current code is chock-full of this stuff, so it's
> tempting to just add more, but we really can't. The amount of locking
> trickery we have in the codebase must go down substantially. My take is
> that any code that adds anything tricky needs to fully justify it against
> the above list, _and_ also clean up some of the existing nonsense so that
> overall complexity doesn't increase.
> 
> I'll share the full length rant with you internally, it's not yet ready
> for publishing (but that's planned too).
> 
> 
> > Also here I don't think it is really as simple as grabbing a big dumb lock,
> > for a variety of reasons, at least with the current dynamic pin / unpin code
> > in place. If we move to perma-pinned contexts this could be cleaned up
> > then.
> 
> Yes it's a disaster, but we need to stop the bleeding. If perma-pinned
> contexts can fix this I think we should do this asap. I'd say for parallel
> contexts we should just do it outright (special-case them or whatever) so
> that we don't have to add even more very tricky code and tech debt.
> 
> Doable?
> 
> Cheers, Daniel
> 
> 
> > 
> > Matt
> > 
> > > -Daniel
> > > 
> > > > +				ce->ops->sched_disable(ce);
> > > > +				break;
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +/*
> > > > + * FIXME: This is ugly, these branches are only needed for parallel contexts in
> > > > + * GuC submission. Basically the idea is that if any of the contexts that are
> > > > + * configured for parallel submission is pinned, all the contexts need to be
> > > > + * pinned in order to register them with the GuC. We are adding the
> > > > + * layer here while it should probably be pushed to the backend via a vfunc. But
> > > > + * since we already have ce->pin plus a layer atop it, this is confusing. Definitely
> > > > + * needs a bit of rework on how to properly layer / structure this code path. What
> > > > + * is in place works but is not ideal.
> > > > + */
> > > > +int intel_context_pin(struct intel_context *ce)
> > > > +{
> > > > +	if (intel_context_is_child(ce)) {
> > > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > > +			return ____intel_context_pin(ce->parent);
> > > > +		else
> > > > +			return 0;
> > > > +	} else {
> > > > +		return ____intel_context_pin(ce);
> > > > +	}
> > > > +}
> > > > +
> > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > +			 struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	if (intel_context_is_child(ce)) {
> > > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > > +			return __intel_context_pin_ww(ce->parent, ww);
> > > > +		else
> > > > +			return 0;
> > > > +	} else {
> > > > +		return __intel_context_pin_ww(ce, ww);
> > > > +	}
> > > > +}
> > > > +
> > > > +void intel_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	if (intel_context_is_child(ce)) {
> > > > +		if (atomic_fetch_add(-1, &ce->pin_count) == 1)
> > > > +			__intel_context_unpin(ce->parent);
> > > > +	} else {
> > > > +		__intel_context_unpin(ce);
> > > > +	}
> > > > +}
> > > > +
> > > >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > > >  #include "selftest_context.c"
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > index ad6ce5ac4824..c208691fc87d 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > @@ -110,31 +110,15 @@ static inline void intel_context_unlock_pinned(struct intel_context *ce)
> > > >  	mutex_unlock(&ce->pin_mutex);
> > > >  }
> > > >  
> > > > -int __intel_context_do_pin(struct intel_context *ce);
> > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > -			      struct i915_gem_ww_ctx *ww);
> > > > -
> > > >  static inline bool intel_context_pin_if_active(struct intel_context *ce)
> > > >  {
> > > >  	return atomic_inc_not_zero(&ce->pin_count);
> > > >  }
> > > >  
> > > > -static inline int intel_context_pin(struct intel_context *ce)
> > > > -{
> > > > -	if (likely(intel_context_pin_if_active(ce)))
> > > > -		return 0;
> > > > -
> > > > -	return __intel_context_do_pin(ce);
> > > > -}
> > > > -
> > > > -static inline int intel_context_pin_ww(struct intel_context *ce,
> > > > -				       struct i915_gem_ww_ctx *ww)
> > > > -{
> > > > -	if (likely(intel_context_pin_if_active(ce)))
> > > > -		return 0;
> > > > +int intel_context_pin(struct intel_context *ce);
> > > >  
> > > > -	return __intel_context_do_pin_ww(ce, ww);
> > > > -}
> > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > +			 struct i915_gem_ww_ctx *ww);
> > > >  
> > > >  static inline void __intel_context_pin(struct intel_context *ce)
> > > >  {
> > > > @@ -146,28 +130,11 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub);
> > > >  
> > > >  static inline void intel_context_sched_disable_unpin(struct intel_context *ce)
> > > >  {
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > >  	__intel_context_do_unpin(ce, 2);
> > > >  }
> > > >  
> > > > -static inline void intel_context_unpin(struct intel_context *ce)
> > > > -{
> > > > -	if (!ce->ops->sched_disable) {
> > > > -		__intel_context_do_unpin(ce, 1);
> > > > -	} else {
> > > > -		/*
> > > > -		 * Move ownership of this pin to the scheduling disable which is
> > > > -		 * an async operation. When that operation completes the above
> > > > -		 * intel_context_sched_disable_unpin is called potentially
> > > > -		 * unpinning the context.
> > > > -		 */
> > > > -		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > -			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > > > -				ce->ops->sched_disable(ce);
> > > > -				break;
> > > > -			}
> > > > -		}
> > > > -	}
> > > > -}
> > > > +void intel_context_unpin(struct intel_context *ce);
> > > >  
> > > >  void intel_context_enter_engine(struct intel_context *ce);
> > > >  void intel_context_exit_engine(struct intel_context *ce);
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > index 66b22b370a72..eb82be15b7a2 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > @@ -39,8 +39,8 @@ struct intel_context_ops {
> > > >  
> > > >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > > >  
> > > > -	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
> > > > -	int (*pin)(struct intel_context *ce, void *vaddr);
> > > > +	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > > > +	int (*pin)(struct intel_context *ce);
> > > >  	void (*unpin)(struct intel_context *ce);
> > > >  	void (*post_unpin)(struct intel_context *ce);
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > index baa1797af1c8..fc74ca28f245 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > @@ -2554,16 +2554,17 @@ static void execlists_submit_request(struct i915_request *request)
> > > >  static int
> > > >  __execlists_context_pre_pin(struct intel_context *ce,
> > > >  			    struct intel_engine_cs *engine,
> > > > -			    struct i915_gem_ww_ctx *ww, void **vaddr)
> > > > +			    struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	int err;
> > > >  
> > > > -	err = lrc_pre_pin(ce, engine, ww, vaddr);
> > > > +	err = lrc_pre_pin(ce, engine, ww);
> > > >  	if (err)
> > > >  		return err;
> > > >  
> > > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags)) {
> > > > -		lrc_init_state(ce, engine, *vaddr);
> > > > +		lrc_init_state(ce, engine, ce->lrc_reg_state -
> > > > +			       LRC_STATE_OFFSET / sizeof(*ce->lrc_reg_state));
> > > >  
> > > >  		 __i915_gem_object_flush_map(ce->state->obj, 0, engine->context_size);
> > > >  	}
> > > > @@ -2572,15 +2573,14 @@ __execlists_context_pre_pin(struct intel_context *ce,
> > > >  }
> > > >  
> > > >  static int execlists_context_pre_pin(struct intel_context *ce,
> > > > -				     struct i915_gem_ww_ctx *ww,
> > > > -				     void **vaddr)
> > > > +				     struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > -	return __execlists_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > +	return __execlists_context_pre_pin(ce, ce->engine, ww);
> > > >  }
> > > >  
> > > > -static int execlists_context_pin(struct intel_context *ce, void *vaddr)
> > > > +static int execlists_context_pin(struct intel_context *ce)
> > > >  {
> > > > -	return lrc_pin(ce, ce->engine, vaddr);
> > > > +	return lrc_pin(ce, ce->engine);
> > > >  }
> > > >  
> > > >  static int execlists_context_alloc(struct intel_context *ce)
> > > > @@ -3570,20 +3570,19 @@ static int virtual_context_alloc(struct intel_context *ce)
> > > >  }
> > > >  
> > > >  static int virtual_context_pre_pin(struct intel_context *ce,
> > > > -				   struct i915_gem_ww_ctx *ww,
> > > > -				   void **vaddr)
> > > > +				   struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > >  
> > > >  	 /* Note: we must use a real engine class for setting up reg state */
> > > > -	return __execlists_context_pre_pin(ce, ve->siblings[0], ww, vaddr);
> > > > +	return __execlists_context_pre_pin(ce, ve->siblings[0], ww);
> > > >  }
> > > >  
> > > > -static int virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > +static int virtual_context_pin(struct intel_context *ce)
> > > >  {
> > > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > >  
> > > > -	return lrc_pin(ce, ve->siblings[0], vaddr);
> > > > +	return lrc_pin(ce, ve->siblings[0]);
> > > >  }
> > > >  
> > > >  static void virtual_context_enter(struct intel_context *ce)
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > index bb4af4977920..c466fc966005 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > @@ -947,30 +947,30 @@ void lrc_reset(struct intel_context *ce)
> > > >  int
> > > >  lrc_pre_pin(struct intel_context *ce,
> > > >  	    struct intel_engine_cs *engine,
> > > > -	    struct i915_gem_ww_ctx *ww,
> > > > -	    void **vaddr)
> > > > +	    struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > +	void *vaddr;
> > > >  	GEM_BUG_ON(!ce->state);
> > > >  	GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
> > > >  
> > > > -	*vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > -					 i915_coherent_map_type(ce->engine->i915,
> > > > -								ce->state->obj,
> > > > -								false) |
> > > > -					 I915_MAP_OVERRIDE);
> > > > +	vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > +					i915_coherent_map_type(ce->engine->i915,
> > > > +							       ce->state->obj,
> > > > +							       false) |
> > > > +					I915_MAP_OVERRIDE);
> > > >  
> > > > -	return PTR_ERR_OR_ZERO(*vaddr);
> > > > +	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > +
> > > > +	return PTR_ERR_OR_ZERO(vaddr);
> > > >  }
> > > >  
> > > >  int
> > > >  lrc_pin(struct intel_context *ce,
> > > > -	struct intel_engine_cs *engine,
> > > > -	void *vaddr)
> > > > +	struct intel_engine_cs *engine)
> > > >  {
> > > > -	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > -
> > > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags))
> > > > -		lrc_init_state(ce, engine, vaddr);
> > > > +		lrc_init_state(ce, engine,
> > > > +			       (void *)ce->lrc_reg_state - LRC_STATE_OFFSET);
> > > >  
> > > >  	ce->lrc.lrca = lrc_update_regs(ce, engine, ce->ring->tail);
> > > >  	return 0;
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > index 7f697845c4cf..837fcf00270d 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > @@ -38,12 +38,10 @@ void lrc_destroy(struct kref *kref);
> > > >  int
> > > >  lrc_pre_pin(struct intel_context *ce,
> > > >  	    struct intel_engine_cs *engine,
> > > > -	    struct i915_gem_ww_ctx *ww,
> > > > -	    void **vaddr);
> > > > +	    struct i915_gem_ww_ctx *ww);
> > > >  int
> > > >  lrc_pin(struct intel_context *ce,
> > > > -	struct intel_engine_cs *engine,
> > > > -	void *vaddr);
> > > > +	struct intel_engine_cs *engine);
> > > >  void lrc_unpin(struct intel_context *ce);
> > > >  void lrc_post_unpin(struct intel_context *ce);
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > index 2958e2fae380..f4f301bfb9f7 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > @@ -472,8 +472,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
> > > >  }
> > > >  
> > > >  static int ring_context_pre_pin(struct intel_context *ce,
> > > > -				struct i915_gem_ww_ctx *ww,
> > > > -				void **unused)
> > > > +				struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	struct i915_address_space *vm;
> > > >  	int err = 0;
> > > > @@ -576,7 +575,7 @@ static int ring_context_alloc(struct intel_context *ce)
> > > >  	return 0;
> > > >  }
> > > >  
> > > > -static int ring_context_pin(struct intel_context *ce, void *unused)
> > > > +static int ring_context_pin(struct intel_context *ce)
> > > >  {
> > > >  	return 0;
> > > >  }
> > > > diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > index 2c1af030310c..826b5d7a4573 100644
> > > > --- a/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > +++ b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > @@ -167,12 +167,12 @@ static int mock_context_alloc(struct intel_context *ce)
> > > >  }
> > > >  
> > > >  static int mock_context_pre_pin(struct intel_context *ce,
> > > > -				struct i915_gem_ww_ctx *ww, void **unused)
> > > > +				struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	return 0;
> > > >  }
> > > >  
> > > > -static int mock_context_pin(struct intel_context *ce, void *unused)
> > > > +static int mock_context_pin(struct intel_context *ce)
> > > >  {
> > > >  	return 0;
> > > >  }
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index dec757d319a2..c5c73c42bcf7 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -1905,6 +1905,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > >  
> > > >  	GEM_BUG_ON(!engine->mask);
> > > >  	GEM_BUG_ON(context_guc_id_invalid(ce));
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > >  
> > > >  	/*
> > > > 	 * Ensure LRC + CT vmas are in same region as write barrier is done
> > > > @@ -2008,15 +2009,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > >  
> > > >  static int __guc_context_pre_pin(struct intel_context *ce,
> > > >  				 struct intel_engine_cs *engine,
> > > > -				 struct i915_gem_ww_ctx *ww,
> > > > -				 void **vaddr)
> > > > +				 struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > -	return lrc_pre_pin(ce, engine, ww, vaddr);
> > > > +	return lrc_pre_pin(ce, engine, ww);
> > > >  }
> > > >  
> > > >  static int __guc_context_pin(struct intel_context *ce,
> > > > -			     struct intel_engine_cs *engine,
> > > > -			     void *vaddr)
> > > > +			     struct intel_engine_cs *engine)
> > > >  {
> > > >  	if (i915_ggtt_offset(ce->state) !=
> > > >  	    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK))
> > > > @@ -2027,20 +2026,33 @@ static int __guc_context_pin(struct intel_context *ce,
> > > >  	 * explanation of why.
> > > >  	 */
> > > >  
> > > > -	return lrc_pin(ce, engine, vaddr);
> > > > +	return lrc_pin(ce, engine);
> > > > +}
> > > > +
> > > > +static void __guc_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	lrc_unpin(ce);
> > > > +}
> > > > +
> > > > +static void __guc_context_post_unpin(struct intel_context *ce)
> > > > +{
> > > > +	lrc_post_unpin(ce);
> > > >  }
> > > >  
> > > >  static int guc_context_pre_pin(struct intel_context *ce,
> > > > -			       struct i915_gem_ww_ctx *ww,
> > > > -			       void **vaddr)
> > > > +			       struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > -	return __guc_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > +	return __guc_context_pre_pin(ce, ce->engine, ww);
> > > >  }
> > > >  
> > > > -static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > > > +static int guc_context_pin(struct intel_context *ce)
> > > >  {
> > > > -	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > > +	int ret;
> > > >  
> > > > +	GEM_BUG_ON(intel_context_is_parent(ce) ||
> > > > +		   intel_context_is_child(ce));
> > > > +
> > > > +	ret = __guc_context_pin(ce, ce->engine);
> > > >  	if (likely(!ret && !intel_context_is_barrier(ce)))
> > > >  		intel_engine_pm_get(ce->engine);
> > > >  
> > > > @@ -2054,7 +2066,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > > >  	GEM_BUG_ON(context_enabled(ce));
> > > >  
> > > >  	unpin_guc_id(guc, ce, true);
> > > > -	lrc_unpin(ce);
> > > > +	__guc_context_unpin(ce);
> > > >  
> > > >  	if (likely(!intel_context_is_barrier(ce)))
> > > >  		intel_engine_pm_put(ce->engine);
> > > > @@ -2062,7 +2074,141 @@ static void guc_context_unpin(struct intel_context *ce)
> > > >  
> > > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > >  {
> > > > -	lrc_post_unpin(ce);
> > > > +	__guc_context_post_unpin(ce);
> > > > +}
> > > > +
> > > > +/* Future patches will use this function */
> > > > +__maybe_unused
> > > > +static int guc_parent_context_pre_pin(struct intel_context *ce,
> > > > +				      struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	struct intel_context *child;
> > > > +	int err, i = 0, j = 0;
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		err = i915_active_acquire(&child->active);
> > > > +		if (unlikely(err))
> > > > +			goto unwind_active;
> > > > +		++i;
> > > > +	}
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		err = __guc_context_pre_pin(child, child->engine, ww);
> > > > +		if (unlikely(err))
> > > > +			goto unwind_pre_pin;
> > > > +		++j;
> > > > +	}
> > > > +
> > > > +	err = __guc_context_pre_pin(ce, ce->engine, ww);
> > > > +	if (unlikely(err))
> > > > +		goto unwind_pre_pin;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +unwind_pre_pin:
> > > > +	for_each_child(ce, child) {
> > > > +		if (!j--)
> > > > +			break;
> > > > +		__guc_context_post_unpin(child);
> > > > +	}
> > > > +
> > > > +unwind_active:
> > > > +	for_each_child(ce, child) {
> > > > +		if (!i--)
> > > > +			break;
> > > > +		i915_active_release(&child->active);
> > > > +	}
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/* Future patches will use this function */
> > > > +__maybe_unused
> > > > +static void guc_parent_context_post_unpin(struct intel_context *ce)
> > > > +{
> > > > +	struct intel_context *child;
> > > > +
> > > > +	for_each_child(ce, child)
> > > > +		__guc_context_post_unpin(child);
> > > > +	__guc_context_post_unpin(ce);
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		intel_context_get(child);
> > > > +		i915_active_release(&child->active);
> > > > +		intel_context_put(child);
> > > > +	}
> > > > +}
> > > > +
> > > > +/* Future patches will use this function */
> > > > +__maybe_unused
> > > > +static int guc_parent_context_pin(struct intel_context *ce)
> > > > +{
> > > > +	int ret, i = 0, j = 0;
> > > > +	struct intel_context *child;
> > > > +	struct intel_engine_cs *engine;
> > > > +	intel_engine_mask_t tmp;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		ret = __guc_context_pin(child, child->engine);
> > > > +		if (unlikely(ret))
> > > > +			goto unwind_pin;
> > > > +		++i;
> > > > +	}
> > > > +	ret = __guc_context_pin(ce, ce->engine);
> > > > +	if (unlikely(ret))
> > > > +		goto unwind_pin;
> > > > +
> > > > +	for_each_child(ce, child)
> > > > +		if (test_bit(CONTEXT_LRCA_DIRTY, &child->flags)) {
> > > > +			set_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
> > > > +			break;
> > > > +		}
> > > > +
> > > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > > +			       ce->engine->mask, tmp)
> > > > +		intel_engine_pm_get(engine);
> > > > +	for_each_child(ce, child)
> > > > +		for_each_engine_masked(engine, child->engine->gt,
> > > > +				       child->engine->mask, tmp)
> > > > +			intel_engine_pm_get(engine);
> > > > +
> > > > +	return 0;
> > > > +
> > > > +unwind_pin:
> > > > +	for_each_child(ce, child) {
> > > > +		if (++j > i)
> > > > +			break;
> > > > +		__guc_context_unpin(child);
> > > > +	}
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +/* Future patches will use this function */
> > > > +__maybe_unused
> > > > +static void guc_parent_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	struct intel_context *child;
> > > > +	struct intel_engine_cs *engine;
> > > > +	intel_engine_mask_t tmp;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +	GEM_BUG_ON(context_enabled(ce));
> > > > +
> > > > +	unpin_guc_id(ce_to_guc(ce), ce, true);
> > > > +	for_each_child(ce, child)
> > > > +		__guc_context_unpin(child);
> > > > +	__guc_context_unpin(ce);
> > > > +
> > > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > > +			       ce->engine->mask, tmp)
> > > > +		intel_engine_pm_put(engine);
> > > > +	for_each_child(ce, child)
> > > > +		for_each_engine_masked(engine, child->engine->gt,
> > > > +				       child->engine->mask, tmp)
> > > > +			intel_engine_pm_put(engine);
> > > >  }
> > > >  
> > > >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > > > @@ -2993,18 +3139,17 @@ static int guc_request_alloc(struct i915_request *rq)
> > > >  }
> > > >  
> > > >  static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > > > -				       struct i915_gem_ww_ctx *ww,
> > > > -				       void **vaddr)
> > > > +				       struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > >  
> > > > -	return __guc_context_pre_pin(ce, engine, ww, vaddr);
> > > > +	return __guc_context_pre_pin(ce, engine, ww);
> > > >  }
> > > >  
> > > > -static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > +static int guc_virtual_context_pin(struct intel_context *ce)
> > > >  {
> > > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > -	int ret = __guc_context_pin(ce, engine, vaddr);
> > > > +	int ret = __guc_context_pin(ce, engine);
> > > >  	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > >  
> > > >  	if (likely(!ret))
> > > > @@ -3024,7 +3169,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> > > >  	GEM_BUG_ON(intel_context_is_barrier(ce));
> > > >  
> > > >  	unpin_guc_id(guc, ce, true);
> > > > -	lrc_unpin(ce);
> > > > +	__guc_context_unpin(ce);
> > > >  
> > > >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > >  		intel_engine_pm_put(engine);
> > > > -- 
> > > > 2.28.0
> > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 19/46] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
  2021-08-09 19:03     ` Matthew Brost
@ 2021-08-10  9:12       ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  9:12 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:03:12PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 05:31:38PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:16PM -0700, Matthew Brost wrote:
> > > Assign contexts in parent-child relationship consecutive guc_ids. This
> > > is accomplished by partitioning guc_id space between ones that need to
> > > be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
> > > available guc_ids). The consecutive search is implemented via the bitmap
> > > API.
> > > 
> > > This is a precursor to the full GuC multi-lrc implementation but aligns
> > > to how the GuC multi-lrc interface is defined - guc_ids must be consecutive
> > > when using the GuC multi-lrc interface.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_context.h       |   6 +
> > >  drivers/gpu/drm/i915/gt/intel_reset.c         |   3 +-
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +-
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 222 ++++++++++++------
> > >  .../i915/gt/uc/intel_guc_submission_types.h   |  10 +
> > >  5 files changed, 179 insertions(+), 69 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > index c208691fc87d..7ce3b3d2edb7 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > @@ -54,6 +54,12 @@ static inline bool intel_context_is_parent(struct intel_context *ce)
> > >  	return !!ce->guc_number_children;
> > >  }
> > >  
> > > +static inline struct intel_context *
> > > +intel_context_to_parent(struct intel_context *ce)
> > > +{
> > > +	return intel_context_is_child(ce) ? ce->parent : ce;
> > > +}
> > > +
> > >  void intel_context_bind_parent_child(struct intel_context *parent,
> > >  				     struct intel_context *child);
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > index ea763138197f..c3d4baa1b2b8 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > @@ -849,6 +849,7 @@ static void reset_finish(struct intel_gt *gt, intel_engine_mask_t awake)
> > >  
> > >  static void nop_submit_request(struct i915_request *request)
> > >  {
> > > +	struct intel_context *ce = intel_context_to_parent(request->context);
> > >  	RQ_TRACE(request, "-EIO\n");
> > >  
> > >  	/*
> > > @@ -857,7 +858,7 @@ static void nop_submit_request(struct i915_request *request)
> > >  	 * this for now.
> > >  	 */
> > >  	if (intel_engine_uses_guc(request->engine))
> > > -		intel_guc_decr_num_rq_not_ready(request->context);
> > > +		intel_guc_decr_num_rq_not_ready(ce);
> > >  
> > >  	request = i915_request_mark_eio(request);
> > >  	if (request) {
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > index c0c60ccabfa4..30a0f364db8f 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > @@ -24,6 +24,7 @@ struct __guc_ads_blob;
> > >  
> > >  enum {
> > >  	GUC_SUBMIT_ENGINE_SINGLE_LRC,
> > > +	GUC_SUBMIT_ENGINE_MULTI_LRC,
> > >  	GUC_SUBMIT_ENGINE_MAX
> > >  };
> > >  
> > > @@ -59,8 +60,10 @@ struct intel_guc {
> > >  	struct ida guc_ids;
> > >  	u32 num_guc_ids;
> > >  	u32 max_guc_ids;
> > > -	struct list_head guc_id_list_no_ref;
> > > -	struct list_head guc_id_list_unpinned;
> > > +	unsigned long *guc_ids_bitmap;
> > > +#define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
> > > +	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
> > > +	struct list_head guc_id_list_unpinned[MAX_GUC_ID_ORDER + 1];
> > 
> > Random new global lists definitely need kerneldoc about what is on them,
> > how they're linked, what their lifetime rules are and what locks we're
> > holding.
> > 
> > Leaving this all to reviewers to figure out, and worse, to future readers of
> > your code, is not kind.
> >
> 
> Got it.

I forgot the usual disclaimer: I know that the current code isn't
following this at all. But we have to start somewhere :-/
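
E.g. something in this direction for the new lists (wording and semantics
guessed from the code, just to show the shape I have in mind):

	/**
	 * @guc_id_list_unpinned: per-order lists of contexts that still own
	 *	a guc_id but are fully unpinned, so the id can be stolen
	 *	immediately. Contexts are linked through their guc_id_link
	 *	and the lists are protected by guc->contexts_lock.
	 */

plus the matching blurb for guc_id_list_no_ref and a pointer to the overall
state machine description.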

> > >  	spinlock_t destroy_lock;	/* protects list / worker */
> > >  	struct list_head destroyed_contexts;
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index f23dd716723f..afb9b4bb8971 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -169,6 +169,15 @@ static void clr_guc_ids_exhausted(struct guc_submit_engine *gse)
> > >  	clear_bit(GSE_STATE_GUC_IDS_EXHAUSTED, &gse->flags);
> > >  }
> > >  
> > > +/*
> > > + * We reserve 1/16 of the guc_ids for multi-lrc as these need to be contiguous
> > 
> > I think it'd be good to put down the reason here for why. Is this a
> > requirement of the guc interface, or just an artifact of our current
> > implementation? In the latter case also explain what exactly the
> > constraint is (but honestly I can't think of many reasons for that)
> 
> Multi-lrc guc_ids need to be sequential between the parent and children
> - this is a requirement of the GuC submission interface. Can explicitly
> state that here.

Ah yes that makes sense to document. That also gives us a very clear hint
at what the first step to fix any multi-lrc exhaustion issues probably is:
we need to scan the entire guc_id space for consecutive free spots.
Not sure xarray has support for that, but I know the guy who wrote it, so
the answer is at most a mail away :-)
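
Purely as an illustration of what that scan could look like (assuming the
entire guc_id space were tracked in a single bitmap, which is not how this
patch currently splits bitmap vs. ida):

static int find_consecutive_guc_ids(unsigned long *map, u32 num_guc_ids,
				    u32 count)
{
	unsigned long start;

	/* first fit of `count` consecutive zero bits, no alignment */
	start = bitmap_find_next_zero_area(map, num_guc_ids, 0, count, 0);
	if (start >= num_guc_ids)
		return -EAGAIN;

	bitmap_set(map, start, count);
	return start;
}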

This also means I'm a lot less worried about potentially walling ourselves
into a corner with multi-lrc guc_id exhaustion. Might be good to note that
in the commit message too.
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > > + * and a different allocation algorithm is used (bitmap vs. ida). We believe the
> > > + * number of multi-lrc contexts in use should be low and 1/16 should be
> > > + * sufficient. Minimum of 32 ids for multi-lrc.
> > > + */
> > > +#define NUMBER_MULTI_LRC_GUC_ID(guc) \
> > > +	((guc)->num_guc_ids / 16 > 32 ? (guc)->num_guc_ids / 16 : 32)
> > > +
> > >  /*
> > >   * Below is a set of functions which control the GuC scheduling state which do
> > >   * not require a lock as all state transitions are mutually exclusive. i.e. It
> > > @@ -405,16 +414,10 @@ static inline void decr_context_blocked(struct intel_context *ce)
> > >  	ce->guc_state.sched_state -= SCHED_STATE_BLOCKED;
> > >  }
> > >  
> > > -static inline struct intel_context *
> > > -to_parent(struct intel_context *ce)
> > > -{
> > > -	return intel_context_is_child(ce) ? ce->parent : ce;
> > > -}
> > > -
> > >  static inline struct intel_context *
> > >  request_to_scheduling_context(struct i915_request *rq)
> > >  {
> > > -	return to_parent(rq->context);
> > > +	return intel_context_to_parent(rq->context);
> > >  }
> > >  
> > >  static inline bool context_guc_id_invalid(struct intel_context *ce)
> > > @@ -1436,7 +1439,7 @@ static void destroy_worker_func(struct work_struct *w);
> > >   */
> > >  int intel_guc_submission_init(struct intel_guc *guc)
> > >  {
> > > -	int ret;
> > > +	int ret, i;
> > >  
> > >  	if (guc_submission_initialized(guc))
> > >  		return 0;
> > > @@ -1448,9 +1451,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > >  	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
> > >  
> > >  	spin_lock_init(&guc->contexts_lock);
> > > -	INIT_LIST_HEAD(&guc->guc_id_list_no_ref);
> > > -	INIT_LIST_HEAD(&guc->guc_id_list_unpinned);
> > > +	for (i = 0; i < MAX_GUC_ID_ORDER + 1; ++i) {
> > > +		INIT_LIST_HEAD(&guc->guc_id_list_no_ref[i]);
> > > +		INIT_LIST_HEAD(&guc->guc_id_list_unpinned[i]);
> > > +	}
> > >  	ida_init(&guc->guc_ids);
> > > +	guc->guc_ids_bitmap =
> > > +		bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
> > >  
> > >  	spin_lock_init(&guc->destroy_lock);
> > >  
> > > @@ -1476,6 +1483,8 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> > >  
> > >  		i915_sched_engine_put(sched_engine);
> > >  	}
> > > +
> > > +	bitmap_free(guc->guc_ids_bitmap);
> > >  }
> > >  
> > >  static inline void queue_request(struct i915_sched_engine *sched_engine,
> > > @@ -1499,11 +1508,13 @@ static inline void queue_request(struct i915_sched_engine *sched_engine,
> > >  static bool too_many_guc_ids_not_ready(struct guc_submit_engine *gse,
> > >  				       struct intel_context *ce)
> > >  {
> > > -	u32 available_guc_ids, guc_ids_consumed;
> > >  	struct intel_guc *guc = gse->sched_engine.private_data;
> > > +	u32 available_guc_ids = intel_context_is_parent(ce) ?
> > > +		NUMBER_MULTI_LRC_GUC_ID(guc) :
> > > +		guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
> > > +	u32 guc_ids_consumed = atomic_read(&gse->num_guc_ids_not_ready);
> > >  
> > > -	available_guc_ids = guc->num_guc_ids;
> > > -	guc_ids_consumed = atomic_read(&gse->num_guc_ids_not_ready);
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > >  
> > >  	if (TOO_MANY_GUC_IDS_NOT_READY(available_guc_ids, guc_ids_consumed)) {
> > >  		set_and_update_guc_ids_exhausted(gse);
> > > @@ -1517,17 +1528,26 @@ static void incr_num_rq_not_ready(struct intel_context *ce)
> > >  {
> > >  	struct guc_submit_engine *gse = ce_to_gse(ce);
> > >  
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +	GEM_BUG_ON(!intel_context_is_parent(ce) &&
> > > +		   ce->guc_number_children);
> > > +
> > >  	if (!atomic_fetch_add(1, &ce->guc_num_rq_not_ready))
> > > -		atomic_inc(&gse->num_guc_ids_not_ready);
> > > +		atomic_add(ce->guc_number_children + 1,
> > > +			   &gse->num_guc_ids_not_ready);
> > >  }
> > >  
> > >  void intel_guc_decr_num_rq_not_ready(struct intel_context *ce)
> > >  {
> > >  	struct guc_submit_engine *gse = ce_to_gse(ce);
> > >  
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +
> > >  	if (atomic_fetch_add(-1, &ce->guc_num_rq_not_ready) == 1) {
> > >  		GEM_BUG_ON(!atomic_read(&gse->num_guc_ids_not_ready));
> > > -		atomic_dec(&gse->num_guc_ids_not_ready);
> > > +
> > > +		atomic_sub(ce->guc_number_children + 1,
> > > +			   &gse->num_guc_ids_not_ready);
> > >  	}
> > >  }
> > >  
> > > @@ -1579,20 +1599,42 @@ static void guc_submit_request(struct i915_request *rq)
> > >  
> > >  	spin_unlock_irqrestore(&sched_engine->lock, flags);
> > >  
> > > -	intel_guc_decr_num_rq_not_ready(rq->context);
> > > +	intel_guc_decr_num_rq_not_ready(request_to_scheduling_context(rq));
> > >  }
> > >  
> > > -static int new_guc_id(struct intel_guc *guc)
> > > +static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > >  {
> > > -	return ida_simple_get(&guc->guc_ids, 0,
> > > -			      guc->num_guc_ids, GFP_KERNEL |
> > > -			      __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > +	int ret;
> > > +
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +
> > > +	if (intel_context_is_parent(ce))
> > > +		ret = bitmap_find_free_region(guc->guc_ids_bitmap,
> > > +					      NUMBER_MULTI_LRC_GUC_ID(guc),
> > > +					      order_base_2(ce->guc_number_children
> > > +							   + 1));
> > > +	else
> > > +		ret = ida_simple_get(&guc->guc_ids,
> > > +				     NUMBER_MULTI_LRC_GUC_ID(guc),
> > > +				     guc->num_guc_ids, GFP_KERNEL |
> > > +				     __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > +	if (unlikely(ret < 0))
> > > +		return ret;
> > > +
> > > +	ce->guc_id = ret;
> > > +	return 0;
> > >  }
> > >  
> > >  static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > >  {
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > >  	if (!context_guc_id_invalid(ce)) {
> > > -		ida_simple_remove(&guc->guc_ids, ce->guc_id);
> > > +		if (intel_context_is_parent(ce))
> > > +			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
> > > +					      order_base_2(ce->guc_number_children
> > > +							   + 1));
> > > +		else
> > > +			ida_simple_remove(&guc->guc_ids, ce->guc_id);
> > >  		clr_lrc_desc_registered(guc, ce->guc_id);
> > >  		set_context_guc_id_invalid(ce);
> > >  	}
> > > @@ -1604,6 +1646,8 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > >  {
> > >  	unsigned long flags;
> > >  
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +
> > >  	spin_lock_irqsave(&guc->contexts_lock, flags);
> > >  	__release_guc_id(guc, ce);
> > >  	spin_unlock_irqrestore(&guc->contexts_lock, flags);
> > > @@ -1618,54 +1662,93 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > >   * schedule disable H2G + a deregister H2G.
> > >   */
> > >  static struct list_head *get_guc_id_list(struct intel_guc *guc,
> > > +					 u8 number_children,
> > >  					 bool unpinned)
> > >  {
> > > +	GEM_BUG_ON(order_base_2(number_children + 1) > MAX_GUC_ID_ORDER);
> > > +
> > >  	if (unpinned)
> > > -		return &guc->guc_id_list_unpinned;
> > > +		return &guc->guc_id_list_unpinned[order_base_2(number_children + 1)];
> > >  	else
> > > -		return &guc->guc_id_list_no_ref;
> > > +		return &guc->guc_id_list_no_ref[order_base_2(number_children + 1)];
> > >  }
> > >  
> > > -static int steal_guc_id(struct intel_guc *guc, bool unpinned)
> > > +static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
> > > +			bool unpinned)
> > >  {
> > > -	struct intel_context *ce;
> > > -	int guc_id;
> > > -	struct list_head *guc_id_list = get_guc_id_list(guc, unpinned);
> > > +	struct intel_context *cn;
> > > +	u8 number_children = ce->guc_number_children;
> > >  
> > >  	lockdep_assert_held(&guc->contexts_lock);
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > >  
> > > -	if (!list_empty(guc_id_list)) {
> > > -		ce = list_first_entry(guc_id_list,
> > > -				      struct intel_context,
> > > -				      guc_id_link);
> > > +	do {
> > > +		struct list_head *guc_id_list =
> > > +			get_guc_id_list(guc, number_children, unpinned);
> > >  
> > > -		/* Ensure context getting stolen in expected state */
> > > -		GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
> > > -		GEM_BUG_ON(context_guc_id_invalid(ce));
> > > -		GEM_BUG_ON(context_guc_id_stolen(ce));
> > > +		if (!list_empty(guc_id_list)) {
> > > +			u8 cn_o2, ce_o2 =
> > > +				order_base_2(ce->guc_number_children + 1);
> > >  
> > > -		list_del_init(&ce->guc_id_link);
> > > -		guc_id = ce->guc_id;
> > > -		clr_context_registered(ce);
> > > +			cn = list_first_entry(guc_id_list,
> > > +					      struct intel_context,
> > > +					      guc_id_link);
> > > +			cn_o2 = order_base_2(cn->guc_number_children + 1);
> > > +
> > > +			/*
> > > +			 * Corner case where a multi-lrc context steals a guc_id
> > > +			 * from another context that has more guc_ids than itself.
> > > +			 */
> > > +			if (cn_o2 != ce_o2) {
> > > +				bitmap_release_region(guc->guc_ids_bitmap,
> > > +						      cn->guc_id,
> > > +						      cn_o2);
> > > +				bitmap_allocate_region(guc->guc_ids_bitmap,
> > > +						       ce->guc_id,
> > > +						       ce_o2);
> > > +			}
> > > +
> > > +			/* Ensure context getting stolen in expected state */
> > > +			GEM_BUG_ON(atomic_read(&cn->guc_id_ref));
> > > +			GEM_BUG_ON(context_guc_id_invalid(cn));
> > > +			GEM_BUG_ON(context_guc_id_stolen(cn));
> > > +			GEM_BUG_ON(ce_to_gse(ce) != ce_to_gse(cn));
> > > +
> > > +			list_del_init(&cn->guc_id_link);
> > > +			ce->guc_id = cn->guc_id;
> > > +
> > > +			/*
> > > +			 * If stealing from the pinned list, defer invalidating
> > > +			 * the guc_id until the retire workqueue processes this
> > > +			 * context.
> > > +			 */
> > > +			clr_context_registered(cn);
> > > +			if (!unpinned) {
> > > +				GEM_BUG_ON(ce_to_gse(cn)->stalled_context);
> > > +				ce_to_gse(cn)->stalled_context =
> > > +					intel_context_get(cn);
> > > +				set_context_guc_id_stolen(cn);
> > > +			} else {
> > > +				set_context_guc_id_invalid(cn);
> > > +			}
> > > +
> > > +			return 0;
> > > +		}
> > >  
> > >  		/*
> > > -		 * If stealing from the pinned list, defer invalidating
> > > -		 * the guc_id until the retire workqueue processes this
> > > -		 * context.
> > > +		 * When using multi-lrc we search the guc_id_lists with the
> > > +		 * least amount of guc_ids required first but will consume a
> > > +		 * larger block of guc_ids if necessary. 2x the children always
> > > +		 * moves you to the next list.
> > >  		 */
> > > -		if (!unpinned) {
> > > -			GEM_BUG_ON(ce_to_gse(ce)->stalled_context);
> > > +		if (!number_children ||
> > > +		    order_base_2(number_children + 1) == MAX_GUC_ID_ORDER)
> > > +			break;
> > >  
> > > -			ce_to_gse(ce)->stalled_context = intel_context_get(ce);
> > > -			set_context_guc_id_stolen(ce);
> > > -		} else {
> > > -			set_context_guc_id_invalid(ce);
> > > -		}
> > > +		number_children *= 2;
> > > +	} while (true);
> > >  
> > > -		return guc_id;
> > > -	} else {
> > > -		return -EAGAIN;
> > > -	}
> > > +	return -EAGAIN;
> > >  }
> > >  
> > >  enum {	/* Return values for pin_guc_id / assign_guc_id */
> > > @@ -1674,17 +1757,18 @@ enum {	/* Return values for pin_guc_id / assign_guc_id */
> > >  	NEW_GUC_ID_ENABLED	= 2,
> > >  };
> > >  
> > > -static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
> > > +static int assign_guc_id(struct intel_guc *guc, struct intel_context *ce,
> > > +			 bool tasklet)
> > >  {
> > >  	int ret;
> > >  
> > >  	lockdep_assert_held(&guc->contexts_lock);
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > >  
> > > -	ret = new_guc_id(guc);
> > > +	ret = new_guc_id(guc, ce);
> > >  	if (unlikely(ret < 0)) {
> > > -		ret = steal_guc_id(guc, true);
> > > -		if (ret >= 0) {
> > > -			*out = ret;
> > > +		ret = steal_guc_id(guc, ce, true);
> > > +		if (!ret) {
> > >  			ret = NEW_GUC_ID_DISABLED;
> > >  		} else if (ret < 0 && tasklet) {
> > >  			/*
> > > @@ -1692,15 +1776,18 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out, bool tasklet)
> > >  			 * enabled if guc_ids are exhausted and we are submitting
> > >  			 * from the tasklet.
> > >  			 */
> > > -			ret = steal_guc_id(guc, false);
> > > -			if (ret >= 0) {
> > > -				*out = ret;
> > > +			ret = steal_guc_id(guc, ce, false);
> > > +			if (!ret)
> > >  				ret = NEW_GUC_ID_ENABLED;
> > > -			}
> > >  		}
> > > -	} else {
> > > -		*out = ret;
> > > -		ret = SAME_GUC_ID;
> > > +	}
> > > +
> > > +	if (!(ret < 0) && intel_context_is_parent(ce)) {
> > > +		struct intel_context *child;
> > > +		int i = 1;
> > > +
> > > +		for_each_child(ce, child)
> > > +			child->guc_id = ce->guc_id + i++;
> > >  	}
> > >  
> > >  	return ret;
> > > @@ -1713,6 +1800,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
> > >  	int ret = 0;
> > >  	unsigned long flags, tries = PIN_GUC_ID_TRIES;
> > >  
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > >  	GEM_BUG_ON(atomic_read(&ce->guc_id_ref));
> > >  
> > >  try_again:
> > > @@ -1724,7 +1812,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce,
> > >  	}
> > >  
> > >  	if (context_guc_id_invalid(ce)) {
> > > -		ret = assign_guc_id(guc, &ce->guc_id, tasklet);
> > > +		ret = assign_guc_id(guc, ce, tasklet);
> > >  		if (unlikely(ret < 0))
> > >  			goto out_unlock;
> > >  	}
> > > @@ -1770,6 +1858,7 @@ static void unpin_guc_id(struct intel_guc *guc,
> > >  	unsigned long flags;
> > >  
> > >  	GEM_BUG_ON(atomic_read(&ce->guc_id_ref) < 0);
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > >  
> > >  	if (unlikely(context_guc_id_invalid(ce)))
> > >  		return;
> > > @@ -1781,7 +1870,8 @@ static void unpin_guc_id(struct intel_guc *guc,
> > >  
> > >  	if (!context_guc_id_invalid(ce) && !context_guc_id_stolen(ce) &&
> > >  	    !atomic_read(&ce->guc_id_ref)) {
> > > -		struct list_head *head = get_guc_id_list(guc, unpinned);
> > > +		struct list_head *head =
> > > +			get_guc_id_list(guc, ce->guc_number_children, unpinned);
> > >  
> > >  		list_add_tail(&ce->guc_id_link, head);
> > >  	}
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > > index 7069b7248f55..a5933e07bdd2 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > > @@ -22,6 +22,16 @@ struct guc_virtual_engine {
> > >  /*
> > >   * Object which encapsulates the globally operated on i915_sched_engine +
> > >   * the GuC submission state machine described in intel_guc_submission.c.
> > > + *
> > > + * Currently we have two instances of these per GuC. One for single-lrc and one
> > > + * for multi-lrc submission. We split these into two submission engines as they
> > > + * can operate in parallel allowing a blocking condition on one not to affect
> > > + * the other. i.e. guc_ids are statically allocated between these two submission
> > > + * modes. One mode may have guc_ids exhausted which requires blocking while the
> > > + * other has plenty of guc_ids and can make forward progress.
> > > + *
> > > + * In the future if different submission use cases arise we can simply
> > > + * instantiate another of these objects and assign it to the context.
> > >   */
> > >  struct guc_submit_engine {
> > >  	struct i915_sched_engine sched_engine;
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine
  2021-08-09 19:05     ` Matthew Brost
@ 2021-08-10  9:18       ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  9:18 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:05:58PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 05:35:25PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:17PM -0700, Matthew Brost wrote:
> > > The heartbeat uses a single instance of a GuC submit engine (GSE) to do
> > > the hang check. As such if a different GSE's state machine hangs, the
> > > heartbeat cannot detect this hang. Add a timer to each GSE which in turn
> > > can disable all submissions if it is hung.
> > > 
> > > Cc: John Harrison <John.C.Harrison@Intel.com>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++++
> > >  .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
> > >  2 files changed, 39 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index afb9b4bb8971..2d8296bcc583 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine *gse)
> > >  	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> > >  }
> > >  
> > > +/* 2 seconds seems like a reasonable timeout waiting for a G2H */
> > > +#define MAX_TASKLET_BLOCKED_NS	2000000000
> > >  static void set_tasklet_blocked(struct guc_submit_engine *gse)
> > >  {
> > >  	lockdep_assert_held(&gse->sched_engine.lock);
> > > +	hrtimer_start_range_ns(&gse->hang_timer,
> > > +			       ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
> > > +			       HRTIMER_MODE_REL_PINNED);
> > >  	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> > 
> > So with drm/scheduler the reset handling is assumed to be
> > single-threaded, and there are quite complex rules around that. I've
> > recently worked with Boris Brezillon to clarify all this a bit and
> > improve docs. Does this all still work in that glorious future? Might be
> > good to at least sprinkle some comments/thoughts around in the commit
> > message about the envisaged future direction for all this stuff, to keep
> > people in the loop. Especially future people.
> > 
> > Ofc plan is still to just largely land all this.
> > 
> > Also: set_bit is an unordered atomic, which means you need barriers, which
> > means ... *insert the full rant about justifying/documenting lockless
> > algorithms from earlier *
> >
> 
> lockdep_assert_held(&gse->sched_engine.lock);
> 
> Not lockless. Also spin locks act as barriers, right?

Well if that spinlock is protecting that bit then that's good, but then it
shouldn't be an atomic set_bit. In that case:
- either make the entire bitfield non-atomic so it's clear there's boring
  dumb locking going on
- or split out your new bit into a separate field so that there's no false
  sharing with the existing bitfield state machinery, and add a kernel doc
  to that field explaining the locking

> set_bit itself is atomic and unordered, so it means you need barriers and
> all that. If you don't have a lockless algorithm, don't use atomic bitops;
> they just confuse readers, because set_bit/test_bit set off all the warning
> bells.

And yes it's annoying that for bitops the atomic ones don't have an
atomic_ prefix. The non-atomic ones have a __ prefix. This is honestly why
I don't think we should use bitfields as much as we do, because the main
use-case for them is when you have bitfields which are longer than 64bits.
They come from the cpumask world, and linux supports a lot of cpus.

Open-coding non-atomic simple bitfields with the usual C operators is
perfectly fine and legible imo. But that part is maybe more a bikeshed.
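
Rough sketch of the first option above (the bool field is made up, and every
reader would obviously need to hold the same lock):

static void set_tasklet_blocked(struct guc_submit_engine *gse)
{
	lockdep_assert_held(&gse->sched_engine.lock);
	gse->tasklet_is_blocked = true;	/* plain store under the lock */
}

static bool tasklet_blocked(struct guc_submit_engine *gse)
{
	lockdep_assert_held(&gse->sched_engine.lock);
	return gse->tasklet_is_blocked;
}

No atomics, no barrier discussion, and lockdep documents the locking for
free.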

> > But I think this all falls out with the removal of the guc-id allocation
> > scheme?
> 
> Yes, this patch is getting deleted.

That works too :-)
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > >  }
> > >  
> > >  static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
> > >  {
> > >  	lockdep_assert_held(&gse->sched_engine.lock);
> > > +	hrtimer_cancel(&gse->hang_timer);
> > >  	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> > >  }
> > >  
> > > @@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc *guc)
> > >  		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
> > >  			GEM_BUG_ON(!guc->ct.enabled);
> > >  			__tasklet_disable_sync_once(&sched_engine->tasklet);
> > > +			hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
> > >  			sched_engine->tasklet.callback = NULL;
> > >  		}
> > >  	}
> > > @@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref *kref)
> > >  	kfree(gse);
> > >  }
> > >  
> > > +static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
> > > +{
> > > +	struct guc_submit_engine *gse =
> > > +		container_of(hrtimer, struct guc_submit_engine, hang_timer);
> > > +	struct intel_guc *guc = gse->sched_engine.private_data;
> > > +
> > > +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > > +	if (guc->gse_hang_expected)
> > > +		drm_dbg(&guc_to_gt(guc)->i915->drm,
> > > +			"GSE[%i] hung, disabling submission", gse->id);
> > > +	else
> > > +		drm_err(&guc_to_gt(guc)->i915->drm,
> > > +			"GSE[%i] hung, disabling submission", gse->id);
> > > +#else
> > > +	drm_err(&guc_to_gt(guc)->i915->drm,
> > > +		"GSE[%i] hung, disabling submission", gse->id);
> > > +#endif
> > > +
> > > +	/*
> > > +	 * Tasklet not making forward progress, disable submission which in turn
> > > +	 * will kick in the heartbeat to do a full GPU reset.
> > > +	 */
> > > +	disable_submission(guc);
> > > +
> > > +	return HRTIMER_NORESTART;
> > > +}
> > > +
> > >  static void guc_submit_engine_init(struct intel_guc *guc,
> > >  				   struct guc_submit_engine *gse,
> > >  				   int id)
> > > @@ -3767,6 +3801,8 @@ static void guc_submit_engine_init(struct intel_guc *guc,
> > >  	sched_engine->retire_inflight_request_prio =
> > >  		guc_retire_inflight_request_prio;
> > >  	sched_engine->private_data = guc;
> > > +	hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > > +	gse->hang_timer.function = gse_hang;
> > >  	gse->id = id;
> > >  }
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > > index a5933e07bdd2..eae2e9725ede 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > > @@ -6,6 +6,8 @@
> > >  #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
> > >  #define _INTEL_GUC_SUBMISSION_TYPES_H_
> > >  
> > > +#include <linux/xarray.h>
> > > +
> > >  #include "gt/intel_engine_types.h"
> > >  #include "gt/intel_context_types.h"
> > >  #include "i915_scheduler_types.h"
> > > @@ -41,6 +43,7 @@ struct guc_submit_engine {
> > >  	unsigned long flags;
> > >  	int total_num_rq_with_no_guc_id;
> > >  	atomic_t num_guc_ids_not_ready;
> > > +	struct hrtimer hang_timer;
> > >  	int id;
> > >  
> > >  	/*
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-09 19:13     ` Matthew Brost
@ 2021-08-10  9:23       ` Daniel Vetter
  2021-08-10  9:27         ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  9:23 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:13:11PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 06:36:44PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:22PM -0700, Matthew Brost wrote:
> > > Display the workqueue status in debugfs for GuC contexts that are in
> > > parent-child relationship.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 +++++++++++++------
> > >  1 file changed, 39 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index 30df1c8db491..44a7582c9aed 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -4527,31 +4527,53 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> > >  		gse_log_submission_info(guc->gse[i], p, i);
> > >  }
> > >  
> > > +static inline void guc_log_context(struct drm_printer *p,
> > > +				   struct intel_context *ce)
> > > +{
> > > +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > +		   ce->ring->head,
> > > +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > +		   ce->ring->tail,
> > > +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > +		   atomic_read(&ce->pin_count));
> > > +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > +		   atomic_read(&ce->guc_id_ref));
> > > +	drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > +		   atomic_read(&ce->guc_num_rq_not_ready));
> > > +	drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > +		   ce->guc_state.sched_state,
> > > +		   atomic_read(&ce->guc_sched_state_no_lock));
> > 
> > It's all debugfs, but I think proper locking even there is good. It at
> > least reduces the confusion when the locking scheme is largely
> > undocumented. Also, given how much we use rcu for everything, it would be
> > good to double-check that all pointer dereferences are properly protected.
> >
> 
> Not sure if I 100% follow this but I don't think any of the pointers
> > dereferenced here are RCU protected. Certainly none of the GuC ones are.
> 
> > Will double-check before the next respin though.
> 
> > > +}
> > > +
> > >  void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > >  					     struct drm_printer *p)
> > >  {
> > >  	struct intel_context *ce;
> > >  	unsigned long index;
> > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > 
> > xa_for_each doesn't provide any guarantees, so it doesn't protect against
> > concurrent removal or anything like that. We need to do better than that.
> 
> https://elixir.bootlin.com/linux/latest/source/include/linux/xarray.h#L498
> 'It is safe to modify the array during the iteration.'

The xarray. Not the thing you're dereferencing, because the xarray only
stores pointers, not your data structure. So yeah, the correct statement is
that it doesn't provide you any guarantees beyond "the iterator won't be
confused if the xarray itself is modified during iteration". Which isn't
what you need here; you need a lot more.
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > > -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > -			   ce->ring->head,
> > > -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > -			   ce->ring->tail,
> > > -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > -			   atomic_read(&ce->pin_count));
> > > -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > -			   atomic_read(&ce->guc_id_ref));
> > > -		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > -			   atomic_read(&ce->guc_num_rq_not_ready));
> > > -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > -			   ce->guc_state.sched_state,
> > > -			   atomic_read(&ce->guc_sched_state_no_lock));
> > > +		GEM_BUG_ON(intel_context_is_child(ce));
> > >  
> > > +		guc_log_context(p, ce);
> > >  		guc_log_context_priority(p, ce);
> > > +
> > > +		if (intel_context_is_parent(ce)) {
> > > +			struct guc_process_desc *desc = __get_process_desc(ce);
> > > +			struct intel_context *child;
> > > +
> > > +			drm_printf(p, "\t\tWQI Head: %u\n",
> > > +				   READ_ONCE(desc->head));
> > > +			drm_printf(p, "\t\tWQI Tail: %u\n",
> > > +				   READ_ONCE(desc->tail));
> > > +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> > > +				   READ_ONCE(desc->wq_status));
> > > +
> > > +			for_each_child(ce, child)
> > > +				guc_log_context(p, child);
> > > +		}
> > >  	}
> > >  }
> > >  
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-10  9:23       ` Daniel Vetter
@ 2021-08-10  9:27         ` Daniel Vetter
  2021-08-10 17:29           ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-10  9:27 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Tue, Aug 10, 2021 at 11:23:39AM +0200, Daniel Vetter wrote:
> On Mon, Aug 09, 2021 at 07:13:11PM +0000, Matthew Brost wrote:
> > On Mon, Aug 09, 2021 at 06:36:44PM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 03, 2021 at 03:29:22PM -0700, Matthew Brost wrote:
> > > > Display the workqueue status in debugfs for GuC contexts that are in
> > > > parent-child relationship.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 +++++++++++++------
> > > >  1 file changed, 39 insertions(+), 17 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 30df1c8db491..44a7582c9aed 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -4527,31 +4527,53 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> > > >  		gse_log_submission_info(guc->gse[i], p, i);
> > > >  }
> > > >  
> > > > +static inline void guc_log_context(struct drm_printer *p,
> > > > +				   struct intel_context *ce)
> > > > +{
> > > > +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > > +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > > +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > > +		   ce->ring->head,
> > > > +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > > +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > > +		   ce->ring->tail,
> > > > +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > > +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > > +		   atomic_read(&ce->pin_count));
> > > > +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > > +		   atomic_read(&ce->guc_id_ref));
> > > > +	drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > > +		   atomic_read(&ce->guc_num_rq_not_ready));
> > > > +	drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > > +		   ce->guc_state.sched_state,
> > > > +		   atomic_read(&ce->guc_sched_state_no_lock));
> > > 
> > > It's all debugfs, but I think proper locking even there is good. It at
> > > least reduces the confusion when the locking scheme is largely
> > > undocumented. Also, given how much we use RCU for everything, it would be
> > > good to double-check that all pointer dereferences are properly protected.
> > >
> > 
> > Not sure if I 100% follow this, but I don't think any of the pointers
> > dereferenced here are RCU protected. Certainly none of the GuC ones are.
> > 
> > Will double-check before the next respin though.
> > 
> > > > +}
> > > > +
> > > >  void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > > >  					     struct drm_printer *p)
> > > >  {
> > > >  	struct intel_context *ce;
> > > >  	unsigned long index;
> > > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > 
> > > xa_for_each doesn't provide any guarantees, so it doesn't protect against
> > > concurrent removal or anything like that. We need to do better than that.
> > 
> > https://elixir.bootlin.com/linux/latest/source/include/linux/xarray.h#L498
> > 'It is safe to modify the array during the iteration.'
> 
> The xarray. Not the thing you're dereferencing, because the xarray only
> stores pointers, not your data structure. So yeah, the correct statement is
> that it doesn't provide you any guarantees beyond "the iterator won't be
> confused if the xarray itself is modified during iteration". Which isn't
> what you need here; you need a lot more.

Or spelled out: the pointer you get could become meaningless immediately,
before you can even look at it, due to a concurrent removal/release. All
xa_for_each guarantees you is that on the next round you get the next
pointer, until you've got them all (plus/minus concurrent changes). But that
next pointer could have become meaningless right away too.

So you need your own locking to make use of the pointers you got, and to
make sure they haven't already become meaningless before your loop body has
even started.

One of the reasons why I think this is so important is that debugfs files
nest a lot of loops fairly often, so they're a good cheat-sheet for the
locking if it happens to be undocumented (which also shouldn't be the case).
Ofc if there's no locking in debugfs, no cheat-sheet :-)
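
Roughly what I have in mind, as a completely untested sketch. It assumes
it's ok to grab an intel_context reference from this path and that ce->ref
is the right kref to use for that (both need double-checking):

	struct intel_context *ce;
	unsigned long index;
	unsigned long flags;

	xa_lock_irqsave(&guc->context_lookup, flags);
	xa_for_each(&guc->context_lookup, index, ce) {
		/* pin ce so it can't be released while we print it */
		if (!kref_get_unless_zero(&ce->ref))
			continue;
		xa_unlock_irqrestore(&guc->context_lookup, flags);

		guc_log_context(p, ce);

		intel_context_put(ce);
		xa_lock_irqsave(&guc->context_lookup, flags);
	}
	xa_unlock_irqrestore(&guc->context_lookup, flags);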

Cheers, Daniel

> -Daniel
> 
> > 
> > Matt
> > 
> > > -Daniel
> > > 
> > > > -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > > -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > > -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > > -			   ce->ring->head,
> > > > -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > > -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > > -			   ce->ring->tail,
> > > > -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > > -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > > -			   atomic_read(&ce->pin_count));
> > > > -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > > -			   atomic_read(&ce->guc_id_ref));
> > > > -		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > > -			   atomic_read(&ce->guc_num_rq_not_ready));
> > > > -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > > -			   ce->guc_state.sched_state,
> > > > -			   atomic_read(&ce->guc_sched_state_no_lock));
> > > > +		GEM_BUG_ON(intel_context_is_child(ce));
> > > >  
> > > > +		guc_log_context(p, ce);
> > > >  		guc_log_context_priority(p, ce);
> > > > +
> > > > +		if (intel_context_is_parent(ce)) {
> > > > +			struct guc_process_desc *desc = __get_process_desc(ce);
> > > > +			struct intel_context *child;
> > > > +
> > > > +			drm_printf(p, "\t\tWQI Head: %u\n",
> > > > +				   READ_ONCE(desc->head));
> > > > +			drm_printf(p, "\t\tWQI Tail: %u\n",
> > > > +				   READ_ONCE(desc->tail));
> > > > +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> > > > +				   READ_ONCE(desc->wq_status));
> > > > +
> > > > +			for_each_child(ce, child)
> > > > +				guc_log_context(p, child);
> > > > +		}
> > > >  	}
> > > >  }
> > > >  
> > > > -- 
> > > > 2.28.0
> > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-10  9:27         ` Daniel Vetter
@ 2021-08-10 17:29           ` Matthew Brost
  2021-08-11 10:04             ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-10 17:29 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Tue, Aug 10, 2021 at 11:27:31AM +0200, Daniel Vetter wrote:
> On Tue, Aug 10, 2021 at 11:23:39AM +0200, Daniel Vetter wrote:
> > On Mon, Aug 09, 2021 at 07:13:11PM +0000, Matthew Brost wrote:
> > > On Mon, Aug 09, 2021 at 06:36:44PM +0200, Daniel Vetter wrote:
> > > > On Tue, Aug 03, 2021 at 03:29:22PM -0700, Matthew Brost wrote:
> > > > > Display the workqueue status in debugfs for GuC contexts that are in
> > > > > parent-child relationship.
> > > > > 
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 +++++++++++++------
> > > > >  1 file changed, 39 insertions(+), 17 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > index 30df1c8db491..44a7582c9aed 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > @@ -4527,31 +4527,53 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> > > > >  		gse_log_submission_info(guc->gse[i], p, i);
> > > > >  }
> > > > >  
> > > > > +static inline void guc_log_context(struct drm_printer *p,
> > > > > +				   struct intel_context *ce)
> > > > > +{
> > > > > +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > > > +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > > > +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > > > +		   ce->ring->head,
> > > > > +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > > > +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > > > +		   ce->ring->tail,
> > > > > +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > > > +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > > > +		   atomic_read(&ce->pin_count));
> > > > > +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > > > +		   atomic_read(&ce->guc_id_ref));
> > > > > +	drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > > > +		   atomic_read(&ce->guc_num_rq_not_ready));
> > > > > +	drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > > > +		   ce->guc_state.sched_state,
> > > > > +		   atomic_read(&ce->guc_sched_state_no_lock));
> > > > 
> > > > It's all debugfs, but I think proper locking even there is good. It at
> > > > least reduces the confusion when the locking scheme is largely
> > > > undocumented. Also, given how much we use RCU for everything, it would be
> > > > good to double-check that all pointer dereferences are properly protected.
> > > >
> > > 
> > > Not sure if I 100% follow this, but I don't think any of the pointers
> > > dereferenced here are RCU protected. Certainly none of the GuC ones are.
> > > 
> > > Will double-check before the next respin though.
> > > 
> > > > > +}
> > > > > +
> > > > >  void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > > > >  					     struct drm_printer *p)
> > > > >  {
> > > > >  	struct intel_context *ce;
> > > > >  	unsigned long index;
> > > > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > > 
> > > > xa_for_each doesn't provide any guarantees, so it doesn't protect against
> > > > concurrent removal or anything like that. We need to do better than that.
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/xarray.h#L498
> > > 'It is safe to modify the array during the iteration.'
> > 
> > The xarray. Not the thing you're dereferencing, because the xarray only
> > stores pointers, not your data structure. So yeah, the correct statement is
> > that it doesn't provide you any guarantees beyond "the iterator won't be
> > confused if the xarray itself is modified during iteration". Which isn't
> > what you need here; you need a lot more.
> 
> Or spelled out: the pointer you get could become meaningless immediately,
> before you can even look at it, due to a concurrent removal/release. All
> xa_for_each guarantees you is that on the next round you get the next
> pointer, until you've got them all (plus/minus concurrent changes). But that
> next pointer could have become meaningless right away too.
> 
> So you need your own locking to make use of the pointers you got, and to
> make sure they haven't already become meaningless before your loop body has
> even started.
> 

Ok, I think I see your point. Likely whenever we do an xa_for_each over
&guc->context_lookup we should just grab its lock, since if a context is
still in the xarray we hold a reference to the object being looked up. Also,
every time we use xa_for_each on &guc->context_lookup it is a corner case
where it is ok to block anyone else from using this (e.g. during a reset,
checking debugfs, etc...). Does that sound correct?
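
Something like this untested sketch is what I'm thinking of, assuming the
destroy path always erases the context from guc->context_lookup (under the
same lock) before the context can be freed:

	unsigned long index;
	unsigned long flags;
	struct intel_context *ce;

	xa_lock_irqsave(&guc->context_lookup, flags);
	xa_for_each(&guc->context_lookup, index, ce) {
		/*
		 * Holding the xarray lock blocks concurrent removal, so ce
		 * can't disappear underneath us while we print it.
		 */
		guc_log_context(p, ce);
		guc_log_context_priority(p, ce);
	}
	xa_unlock_irqrestore(&guc->context_lookup, flags);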

Matt

> One of the reasons why I think this is so important is that debugfs files
> nest a lot of loops fairly often, so they're a good cheat-sheet for the
> locking if it happens to be undocumented (which also shouldn't be the case).
> Ofc if there's no locking in debugfs, no cheat-sheet :-)
> 
> Cheers, Daniel
> 
> > -Daniel
> > 
> > > 
> > > Matt
> > > 
> > > > -Daniel
> > > > 
> > > > > -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > > > -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > > > -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > > > -			   ce->ring->head,
> > > > > -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > > > -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > > > -			   ce->ring->tail,
> > > > > -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > > > -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > > > -			   atomic_read(&ce->pin_count));
> > > > > -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > > > -			   atomic_read(&ce->guc_id_ref));
> > > > > -		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > > > -			   atomic_read(&ce->guc_num_rq_not_ready));
> > > > > -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > > > -			   ce->guc_state.sched_state,
> > > > > -			   atomic_read(&ce->guc_sched_state_no_lock));
> > > > > +		GEM_BUG_ON(intel_context_is_child(ce));
> > > > >  
> > > > > +		guc_log_context(p, ce);
> > > > >  		guc_log_context_priority(p, ce);
> > > > > +
> > > > > +		if (intel_context_is_parent(ce)) {
> > > > > +			struct guc_process_desc *desc = __get_process_desc(ce);
> > > > > +			struct intel_context *child;
> > > > > +
> > > > > +			drm_printf(p, "\t\tWQI Head: %u\n",
> > > > > +				   READ_ONCE(desc->head));
> > > > > +			drm_printf(p, "\t\tWQI Tail: %u\n",
> > > > > +				   READ_ONCE(desc->tail));
> > > > > +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> > > > > +				   READ_ONCE(desc->wq_status));
> > > > > +
> > > > > +			for_each_child(ce, child)
> > > > > +				guc_log_context(p, child);
> > > > > +		}
> > > > >  	}
> > > > >  }
> > > > >  
> > > > > -- 
> > > > > 2.28.0
> > > > > 
> > > > 
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 10/46] drm/i915/guc: Take engine PM when a context is pinned with GuC submission
  2021-08-10  6:43       ` Daniel Vetter
@ 2021-08-10 21:29         ` Matthew Brost
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-10 21:29 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Tue, Aug 10, 2021 at 08:43:50AM +0200, Daniel Vetter wrote:
> On Mon, Aug 09, 2021 at 06:11:37PM +0000, Matthew Brost wrote:
> > On Mon, Aug 09, 2021 at 04:23:42PM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 03, 2021 at 03:29:07PM -0700, Matthew Brost wrote:
> > > > Taking a PM reference to prevent intel_gt_wait_for_idle from short
> > > > circuiting while scheduling of a user context could be enabled.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/Makefile                 |  1 +
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++--
> > > >  2 files changed, 34 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > > > index 903de270f2db..5e3a1e2095b0 100644
> > > > --- a/drivers/gpu/drm/i915/Makefile
> > > > +++ b/drivers/gpu/drm/i915/Makefile
> > > > @@ -103,6 +103,7 @@ gt-y += \
> > > >  	gt/intel_gt_clock_utils.o \
> > > >  	gt/intel_gt_irq.o \
> > > >  	gt/intel_gt_pm.o \
> > > > +	gt/intel_gt_pm_unpark_work.o \
> > > 
> > > This file isn't here?
> > > 
> > 
> > Yep, included this in the wrong patch. Should be in:
> > https://patchwork.freedesktop.org/patch/448462/?series=92789&rev=2
> > 
> > > Also pm stuff tends to have very nasty locking requirements, doing special
> > > stuff like this in the backend tends to lead to really big surprises. I
> > > think two options to make sure our locking design stays consistent:
> > > - Lift this to generic code.
> > 
> > Not sure I'm following this, intel_engine_pm_get/put are generic calls.
> > Those calls should have all the correct annotations. If they don't we can
> > add them.
> 
> But you only call them in the GuC backend, not in all of them. Which is an
> inconsistency in locking, and unfortunately runtime pm is extremely nasty,
> so having potentially very divergent locking behind the same interface in
> the same driver is a recipe for an unmaintainable mess.
> 
> Iow, if the high-level code runs on execlist or the ringbuffer backend we
> still need to go through at least the lockdep motions of what you're
> adding here.
> 
> This is similar in spirit to all the might_sleep/might_lock calls we have
> all over the kernel where in many cases something doesn't happen, but we
> need to make sure it's allowed to have a consistent design.
> 
> So essentially in intel_context_pin and all these functions put an
> intel_engine_pm_might_get (which compiles out without debugging enabled),
> unconditionally, across all platforms and sched backends.
> 

Ok, I see your point here. We currently don't have an
intel_engine_pm_might_get, but I think this translates to roughly:

might_lock(engine_pm_wf_mutex)
intel_gt_pm_might_get

Will dig in a bit more and add the annotations in the next rev.
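
For reference, this is roughly what I'd start from (sketch only -
intel_gt_pm_might_get doesn't exist yet, and the exact name of the engine
wakeref mutex needs double-checking against the real code):

static inline void intel_engine_pm_might_get(struct intel_engine_cs *engine)
{
	/* tell lockdep we could take the engine wakeref mutex here */
	might_lock(&engine->wakeref.mutex);
	/* plus the GT-level equivalent, once such a helper exists */
	intel_gt_pm_might_get(engine->gt);
}

The idea would then be to call this unconditionally from the generic pin
paths so execlists / ring submission get the same lockdep coverage.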

Matt

> In general I think backend specific locking (irrespective of what kind of
> backend or interface you implement) is a pretty bad idea in the kernel,
> and needs to be avoided if at all possible. Avoid here means "pull the
> might_lock/might_sleep/might_whatever checks into generic code".
> -Daniel
> 
> > Matt
> > 
> > > - expose some engine_pm_might_get/put() calls which do have the right set
> > >   of might_lock annotations, and call those in the generic code.
> > > 
> > > Imo the worst kernel abstractions are those where all implementations
> > > look&act the same, except for locking. Unfortunately i915-gem code is full
> > > of this stuff, and we need to stop this by enlisting lockdep to check the
> > > contracts for us.
> > > -Daniel
> > > 
> > > >  	gt/intel_gt_pm_irq.o \
> > > >  	gt/intel_gt_requests.o \
> > > >  	gt/intel_gtt.o \
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 7fe4d1559a81..c5d9548bfd00 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -2056,7 +2056,12 @@ static int guc_context_pre_pin(struct intel_context *ce,
> > > >  
> > > >  static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > > >  {
> > > > -	return __guc_context_pin(ce, ce->engine, vaddr);
> > > > +	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > > +
> > > > +	if (likely(!ret && !intel_context_is_barrier(ce)))
> > > > +		intel_engine_pm_get(ce->engine);
> > > > +
> > > > +	return ret;
> > > >  }
> > > >  
> > > >  static void guc_context_unpin(struct intel_context *ce)
> > > > @@ -2067,6 +2072,9 @@ static void guc_context_unpin(struct intel_context *ce)
> > > >  
> > > >  	unpin_guc_id(guc, ce, true);
> > > >  	lrc_unpin(ce);
> > > > +
> > > > +	if (likely(!intel_context_is_barrier(ce)))
> > > > +		intel_engine_pm_put(ce->engine);
> > > >  }
> > > >  
> > > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > > @@ -3002,8 +3010,30 @@ static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > > >  static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > >  {
> > > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > +	int ret = __guc_context_pin(ce, engine, vaddr);
> > > > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > > +
> > > > +	if (likely(!ret))
> > > > +		for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > +			intel_engine_pm_get(engine);
> > > >  
> > > > -	return __guc_context_pin(ce, engine, vaddr);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static void guc_virtual_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > > +	struct intel_engine_cs *engine;
> > > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > > +
> > > > +	GEM_BUG_ON(context_enabled(ce));
> > > > +	GEM_BUG_ON(intel_context_is_barrier(ce));
> > > > +
> > > > +	unpin_guc_id(guc, ce, true);
> > > > +	lrc_unpin(ce);
> > > > +
> > > > +	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > +		intel_engine_pm_put(engine);
> > > >  }
> > > >  
> > > >  static void guc_virtual_context_enter(struct intel_context *ce)
> > > > @@ -3040,7 +3070,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> > > >  
> > > >  	.pre_pin = guc_virtual_context_pre_pin,
> > > >  	.pin = guc_virtual_context_pin,
> > > > -	.unpin = guc_context_unpin,
> > > > +	.unpin = guc_virtual_context_unpin,
> > > >  	.post_unpin = guc_context_post_unpin,
> > > >  
> > > >  	.ban = guc_context_ban,
> > > > -- 
> > > > 2.28.0
> > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts
  2021-08-09 19:32     ` Matthew Brost
@ 2021-08-11  9:55       ` Daniel Vetter
  2021-08-11 17:43         ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-11  9:55 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Mon, Aug 09, 2021 at 07:32:26PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 07:17:27PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:43PM -0700, Matthew Brost wrote:
> > > Some workloads use lots of contexts that continually pin / unpin
> > > contexts. With GuC submission an unpin translates to a schedule disable
> > > H2G which puts pressure on both the i915 and GuC. A schedule disable can
> > > also block future requests from being submitted until the operation
> > > completes. None of this is ideal.
> > > 
> > > Add a configurable, via debugfs, delay period before the schedule
> > > disable is issued. Default delay period is 1 second. The delay period is
> > > skipped if more than 3/4 of the guc_ids are in use.
> > > 
> > > This patch also updates the selftests to turn off this delay period as
> > > this extra time would likely cause many selftests to fail. Follow up
> > > patches will fix all the selftests and enable the delay period.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > 
> > I think this is more evidence that we should just pin/unpin contexts at
> > create/destruction time. The current scheme doesn't really work that well
> > and causes way more pain than benefit, it seems.
> > 
> 
> Well that choice is above my pay grade, but for what it is worth it
> would simplify the GuC backend quite a bit if we perma-pin contexts. By
> quite a bit, I actually mean a lot of complexity goes away.
> 
> In the meantime I think we probably need this code though, to avoid
> thrashing on the scheduling enable / disable.

The trouble is that you muck around with the context close state bit,
which is one of these lockless trickeries where my cursory analysis (just
a few days in total of randomly stumbling over it when reading other code)
strongly suggests it's busted.

I really don't want to build more on top, especially not without careful
review and all that.

Also since this is a perf claim, the commit message needs some numbers.

Finally even if we decide to make contexts properly evictable, we need a
different scheme anyway. As you realized the current active tracking is
kinda backwards because it unpins immediately when no longer in use.
-Daniel

> 
> Matt
> 
> > If anyone screams, and that's a big if aside from some igts, we can come up
> > with a proper scheme to evict contexts without pin/unpin, instead of
> > layering hacks over that misdesign.
> > -Daniel
> > 
> > > ---
> > >  drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
> > >  .../i915/gem/selftests/i915_gem_coherency.c   |   2 +-
> > >  .../drm/i915/gem/selftests/i915_gem_dmabuf.c  |   2 +-
> > >  .../drm/i915/gem/selftests/i915_gem_mman.c    |   2 +-
> > >  .../drm/i915/gem/selftests/i915_gem_object.c  |   2 +-
> > >  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> > >  drivers/gpu/drm/i915/gt/intel_context.h       |   9 +
> > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   8 +
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +
> > >  .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  28 ++
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 322 +++++++++++++++++-
> > >  .../i915/gt/uc/selftest_guc_flow_control.c    |  19 +-
> > >  drivers/gpu/drm/i915/i915_selftest.h          |   2 +
> > >  drivers/gpu/drm/i915/i915_trace.h             |  10 +
> > >  drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   2 +-
> > >  drivers/gpu/drm/i915/selftests/i915_perf.c    |   2 +-
> > >  drivers/gpu/drm/i915/selftests/i915_request.c |   2 +-
> > >  drivers/gpu/drm/i915/selftests/i915_vma.c     |   2 +-
> > >  18 files changed, 405 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > index b199d59bd2c4..1553287e5491 100644
> > > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > @@ -1298,7 +1298,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
> > >  		int err;
> > >  
> > >  		/* serialises with execbuf */
> > > -		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > > +		intel_context_close(ce);
> > >  		if (!intel_context_pin_if_active(ce))
> > >  			continue;
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > > index 13b088cc787e..a666d7e610f5 100644
> > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > > @@ -434,5 +434,5 @@ int i915_gem_coherency_live_selftests(struct drm_i915_private *i915)
> > >  		SUBTEST(igt_gem_coherency),
> > >  	};
> > >  
> > > -	return i915_subtests(tests, i915);
> > > +	return i915_live_subtests(tests, i915);
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > > index ffae7df5e4d7..2c92afa9d608 100644
> > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > > @@ -474,5 +474,5 @@ int i915_gem_dmabuf_live_selftests(struct drm_i915_private *i915)
> > >  		SUBTEST(igt_dmabuf_import_same_driver_lmem_smem),
> > >  	};
> > >  
> > > -	return i915_subtests(tests, i915);
> > > +	return i915_live_subtests(tests, i915);
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > > index b20f5621f62b..4745c78a48de 100644
> > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > > @@ -1414,5 +1414,5 @@ int i915_gem_mman_live_selftests(struct drm_i915_private *i915)
> > >  		SUBTEST(igt_mmap_gpu),
> > >  	};
> > >  
> > > -	return i915_subtests(tests, i915);
> > > +	return i915_live_subtests(tests, i915);
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > > index 740ee8086a27..ae1361c7c4cf 100644
> > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > > @@ -95,5 +95,5 @@ int i915_gem_object_live_selftests(struct drm_i915_private *i915)
> > >  		SUBTEST(igt_gem_huge),
> > >  	};
> > >  
> > > -	return i915_subtests(tests, i915);
> > > +	return i915_live_subtests(tests, i915);
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > index 8e90a4a0b7b0..96643040defd 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > @@ -472,6 +472,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> > >  	ce->guc_id = GUC_INVALID_LRC_ID;
> > >  	INIT_LIST_HEAD(&ce->guc_id_link);
> > >  
> > > +	INIT_LIST_HEAD(&ce->guc_sched_disable_link);
> > > +
> > >  	mutex_init(&ce->parallel_submit);
> > >  	ce->fence_context = dma_fence_context_alloc(1);
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > index a302599e436a..f4c9036f7f03 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > @@ -215,6 +215,15 @@ static inline bool intel_context_is_barrier(const struct intel_context *ce)
> > >  	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
> > >  }
> > >  
> > > +static inline void intel_context_close(struct intel_context *ce)
> > > +{
> > > +	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > > +
> > > +	trace_intel_context_close(ce);
> > > +	if (ce->ops->close)
> > > +		ce->ops->close(ce);
> > > +}
> > > +
> > >  static inline bool intel_context_is_closed(const struct intel_context *ce)
> > >  {
> > >  	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > index 8af9ace4c052..53f00657a45c 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > @@ -11,6 +11,7 @@
> > >  #include <linux/list.h>
> > >  #include <linux/mutex.h>
> > >  #include <linux/types.h>
> > > +#include <linux/ktime.h>
> > >  
> > >  #include "i915_active_types.h"
> > >  #include "i915_sw_fence.h"
> > > @@ -38,6 +39,7 @@ struct intel_context_ops {
> > >  	int (*alloc)(struct intel_context *ce);
> > >  
> > >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > > +	void (*close)(struct intel_context *ce);
> > >  
> > >  	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > >  	int (*pin)(struct intel_context *ce);
> > > @@ -203,6 +205,12 @@ struct intel_context {
> > >  	 */
> > >  	struct list_head guc_id_link;
> > >  
> > > +	/*
> > > +	 * GuC schedule disable link / time
> > > +	 */
> > > +	struct list_head guc_sched_disable_link;
> > > +	ktime_t guc_sched_disable_time;
> > > +
> > >  	/* GuC context blocked fence */
> > >  	struct i915_sw_fence guc_blocked;
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > index 30a0f364db8f..90b5b657d411 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > @@ -60,6 +60,7 @@ struct intel_guc {
> > >  	struct ida guc_ids;
> > >  	u32 num_guc_ids;
> > >  	u32 max_guc_ids;
> > > +	u32 guc_ids_in_use[GUC_SUBMIT_ENGINE_MAX];
> > >  	unsigned long *guc_ids_bitmap;
> > >  #define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
> > >  	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
> > > @@ -69,6 +70,12 @@ struct intel_guc {
> > >  	struct list_head destroyed_contexts;
> > >  	struct intel_gt_pm_unpark_work destroy_worker;
> > >  
> > > +	spinlock_t sched_disable_lock;	/* protects schedule disable list */
> > > +	struct list_head sched_disable_list;
> > > +	struct hrtimer sched_disable_timer;
> > > +#define SCHED_DISABLE_DELAY_NS	1000000000
> > > +	u64 sched_disable_delay_ns;
> > > +
> > >  	bool submission_supported;
> > >  	bool submission_selected;
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > > index 7c479c5e7b3a..53a6f3da6cce 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > > @@ -80,12 +80,40 @@ static int guc_num_id_set(void *data, u64 val)
> > >  }
> > >  DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
> > >  
> > > +static int guc_sched_disable_delay_ns_get(void *data, u64 *val)
> > > +{
> > > +	struct intel_guc *guc = data;
> > > +
> > > +	if (!intel_guc_submission_is_used(guc))
> > > +		return -ENODEV;
> > > +
> > > +	*val = guc->sched_disable_delay_ns;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int guc_sched_disable_delay_ns_set(void *data, u64 val)
> > > +{
> > > +	struct intel_guc *guc = data;
> > > +
> > > +	if (!intel_guc_submission_is_used(guc))
> > > +		return -ENODEV;
> > > +
> > > +	guc->sched_disable_delay_ns = val;
> > > +
> > > +	return 0;
> > > +}
> > > +DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ns_fops,
> > > +			guc_sched_disable_delay_ns_get,
> > > +			guc_sched_disable_delay_ns_set, "%lld\n");
> > > +
> > >  void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
> > >  {
> > >  	static const struct debugfs_gt_file files[] = {
> > >  		{ "guc_info", &guc_info_fops, NULL },
> > >  		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
> > >  		{ "guc_num_id", &guc_num_id_fops, NULL },
> > > +		{ "guc_sched_disable_delay_ns", &guc_sched_disable_delay_ns_fops, NULL },
> > >  	};
> > >  
> > >  	if (!intel_guc_is_supported(guc))
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index cd1893edf43a..dc0d6a099bee 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -654,11 +654,15 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
> > >  	return (timeout < 0) ? timeout : 0;
> > >  }
> > >  
> > > +static void sched_disable_contexts_flush(struct intel_guc *guc);
> > > +
> > >  int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
> > >  {
> > >  	if (!intel_uc_uses_guc_submission(&guc_to_gt(guc)->uc))
> > >  		return 0;
> > >  
> > > +	sched_disable_contexts_flush(guc);
> > > +
> > >  	return intel_guc_wait_for_pending_msg(guc,
> > >  					      &guc->outstanding_submission_g2h,
> > >  					      true, timeout);
> > > @@ -1135,6 +1139,7 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce);
> > >  static void guc_signal_context_fence(struct intel_context *ce);
> > >  static void guc_cancel_context_requests(struct intel_context *ce);
> > >  static void guc_blocked_fence_complete(struct intel_context *ce);
> > > +static void sched_disable_context_delete(struct intel_context *ce);
> > >  
> > >  static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > >  {
> > > @@ -1160,6 +1165,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > >  		deregister = context_wait_for_deregister_to_register(ce);
> > >  		banned = context_banned(ce);
> > >  		init_sched_state(ce);
> > > +		sched_disable_context_delete(ce);
> > >  
> > >  		if (pending_enable || destroyed || deregister) {
> > >  			atomic_dec(&guc->outstanding_submission_g2h);
> > > @@ -1299,6 +1305,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > >  
> > >  	intel_gt_park_heartbeats(guc_to_gt(guc));
> > >  	disable_submission(guc);
> > > +	hrtimer_cancel(&guc->sched_disable_timer);
> > >  	guc->interrupts.disable(guc);
> > >  
> > >  	/* Flush IRQ handler */
> > > @@ -1656,6 +1663,8 @@ static void guc_lrcd_reg_fini(struct intel_guc *guc);
> > >  
> > >  static void destroy_worker_func(struct work_struct *w);
> > >  
> > > +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer);
> > > +
> > >  /*
> > >   * Set up the memory resources to be shared with the GuC (via the GGTT)
> > >   * at firmware loading time.
> > > @@ -1687,6 +1696,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > >  	INIT_LIST_HEAD(&guc->destroyed_contexts);
> > >  	intel_gt_pm_unpark_work_init(&guc->destroy_worker, destroy_worker_func);
> > >  
> > > +	spin_lock_init(&guc->sched_disable_lock);
> > > +	INIT_LIST_HEAD(&guc->sched_disable_list);
> > > +	hrtimer_init(&guc->sched_disable_timer, CLOCK_MONOTONIC,
> > > +		     HRTIMER_MODE_REL);
> > > +	guc->sched_disable_timer.function = sched_disable_timer_func;
> > > +	guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS;
> > > +
> > >  	return 0;
> > >  }
> > >  
> > > @@ -1852,6 +1868,12 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > >  	if (unlikely(ret < 0))
> > >  		return ret;
> > >  
> > > +	if (intel_context_is_parent(ce))
> > > +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> > > +			order_base_2(ce->guc_number_children + 1);
> > > +	else
> > > +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]++;
> > > +
> > >  	ce->guc_id = ret;
> > >  	return 0;
> > >  }
> > > @@ -1860,13 +1882,18 @@ static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > >  {
> > >  	GEM_BUG_ON(intel_context_is_child(ce));
> > >  	if (!context_guc_id_invalid(ce)) {
> > > -		if (intel_context_is_parent(ce))
> > > +		if (intel_context_is_parent(ce)) {
> > > +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> > > +				order_base_2(ce->guc_number_children + 1);
> > >  			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
> > >  					      order_base_2(ce->guc_number_children
> > >  							   + 1));
> > > -		else
> > > +		} else {
> > > +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]--;
> > >  			ida_simple_remove(&guc->guc_ids, ce->guc_id);
> > > +		}
> > >  		clr_lrc_desc_registered(guc, ce->guc_id);
> > > +
> > >  		set_context_guc_id_invalid(ce);
> > >  	}
> > >  	if (!list_empty(&ce->guc_id_link))
> > > @@ -1931,9 +1958,13 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
> > >  			 * from another context that has more guc_id that itself.
> > >  			 */
> > >  			if (cn_o2 != ce_o2) {
> > > +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> > > +					order_base_2(cn->guc_number_children + 1);
> > >  				bitmap_release_region(guc->guc_ids_bitmap,
> > >  						      cn->guc_id,
> > >  						      cn_o2);
> > > +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> > > +					order_base_2(ce->guc_number_children + 1);
> > >  				bitmap_allocate_region(guc->guc_ids_bitmap,
> > >  						       ce->guc_id,
> > >  						       ce_o2);
> > > @@ -2538,7 +2569,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > >  	__guc_context_unpin(ce);
> > >  
> > >  	if (likely(!intel_context_is_barrier(ce)))
> > > -		intel_engine_pm_put(ce->engine);
> > > +		intel_engine_pm_put_async(ce->engine);
> > >  }
> > >  
> > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > @@ -2665,11 +2696,11 @@ static void guc_parent_context_unpin(struct intel_context *ce)
> > >  
> > >  	for_each_engine_masked(engine, ce->engine->gt,
> > >  			       ce->engine->mask, tmp)
> > > -		intel_engine_pm_put(engine);
> > > +		intel_engine_pm_put_async(engine);
> > >  	for_each_child(ce, child)
> > >  		for_each_engine_masked(engine, child->engine->gt,
> > >  				       child->engine->mask, tmp)
> > > -			intel_engine_pm_put(engine);
> > > +			intel_engine_pm_put_async(engine);
> > >  }
> > >  
> > >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > > @@ -2788,6 +2819,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> > >  
> > >  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > >  
> > > +	sched_disable_context_delete(ce);
> > > +
> > >  	with_intel_runtime_pm(runtime_pm, wakeref)
> > >  		__guc_context_sched_disable(guc, ce, guc_id);
> > >  
> > > @@ -2914,8 +2947,202 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
> > >  								     1);
> > >  		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > >  	}
> > > +
> > > +	sched_disable_context_delete(ce);
> > > +}
> > > +
> > > +#define next_sched_disable_time(guc, now, ce) \
> > > +	(guc->sched_disable_delay_ns - \
> > > +	 (ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)))
> > > +static void ____sched_disable_context_delete(struct intel_guc *guc,
> > > +					     struct intel_context *ce)
> > > +{
> > > +	bool is_first;
> > > +
> > > +	lockdep_assert_held(&guc->sched_disable_lock);
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +	GEM_BUG_ON(list_empty(&ce->guc_sched_disable_link));
> > > +
> > > +	is_first = list_is_first(&ce->guc_sched_disable_link,
> > > +				 &guc->sched_disable_list);
> > > +	list_del_init(&ce->guc_sched_disable_link);
> > > +	if (list_empty(&guc->sched_disable_list)) {
> > > +		hrtimer_try_to_cancel(&guc->sched_disable_timer);
> > > +	} else if (is_first) {
> > > +		struct intel_context *first =
> > > +			list_first_entry(&guc->sched_disable_list,
> > > +					 typeof(*first),
> > > +					 guc_sched_disable_link);
> > > +		u64 next_time = next_sched_disable_time(guc, ktime_get(),
> > > +							first);
> > > +
> > > +		hrtimer_start(&guc->sched_disable_timer,
> > > +			      ns_to_ktime(next_time),
> > > +			      HRTIMER_MODE_REL_PINNED);
> > > +	}
> > > +}
> > > +
> > > +static void __sched_disable_context_delete(struct intel_guc *guc,
> > > +					   struct intel_context *ce)
> > > +{
> > > +	lockdep_assert_held(&guc->sched_disable_lock);
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +
> > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > +		intel_context_sched_disable_unpin(ce);
> > > +		____sched_disable_context_delete(guc, ce);
> > > +	}
> > > +}
> > > +
> > > +static void sched_disable_context_delete(struct intel_context *ce)
> > > +{
> > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > +	unsigned long flags;
> > > +
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +
> > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > +		__sched_disable_context_delete(guc, ce);
> > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > +	}
> > > +}
> > > +
> > > +static void sched_disable_context_add(struct intel_guc *guc,
> > > +				      struct intel_context *ce)
> > > +{
> > > +	unsigned long flags;
> > > +
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> > > +
> > > +	ce->guc_sched_disable_time = ktime_get();
> > > +
> > > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > +	if (list_empty(&guc->sched_disable_list))
> > > +		hrtimer_start(&guc->sched_disable_timer,
> > > +			      ns_to_ktime(guc->sched_disable_delay_ns),
> > > +			      HRTIMER_MODE_REL_PINNED);
> > > +	list_add_tail(&ce->guc_sched_disable_link, &guc->sched_disable_list);
> > > +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > +}
> > > +
> > > +static void sched_disable_contexts_flush(struct intel_guc *guc)
> > > +{
> > > +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> > > +	struct intel_context *ce, *cn;
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > +
> > > +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> > > +					 guc_sched_disable_link) {
> > > +		intel_wakeref_t wakeref;
> > > +		bool enabled;
> > > +		u16 guc_id;
> > > +
> > > +		list_del_init(&ce->guc_sched_disable_link);
> > > +
> > > +		spin_lock(&ce->guc_state.lock);
> > > +		enabled = context_enabled(ce);
> > > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > > +			if (enabled)
> > > +				clr_context_enabled(ce);
> > > +			spin_unlock(&ce->guc_state.lock);
> > > +			intel_context_sched_disable_unpin(ce);
> > > +			continue;
> > > +		}
> > > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > > +			spin_unlock(&ce->guc_state.lock);
> > > +			continue;
> > > +		}
> > > +		guc_id = prep_context_pending_disable(ce);
> > > +		spin_unlock(&ce->guc_state.lock);
> > > +
> > > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > > +			__guc_context_sched_disable(guc, ce, guc_id);
> > > +	}
> > > +
> > > +	hrtimer_try_to_cancel(&guc->sched_disable_timer);
> > > +
> > > +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > >  }
> > >  
> > > +#define should_sched_be_disabled(guc, now, ce) \
> > > +	((ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)) > \
> > > +	(guc->sched_disable_delay_ns / 4) * 3)
> > > +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer)
> > > +{
> > > +	struct intel_guc *guc = container_of(hrtimer, struct intel_guc,
> > > +					     sched_disable_timer);
> > > +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> > > +	struct intel_context *ce, *cn;
> > > +	unsigned long flags;
> > > +	ktime_t now;
> > > +
> > > +	if (list_empty(&guc->sched_disable_list))
> > > +		return HRTIMER_NORESTART;
> > > +
> > > +	now = ktime_get();
> > > +
> > > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > +
> > > +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> > > +					 guc_sched_disable_link) {
> > > +		intel_wakeref_t wakeref;
> > > +		bool enabled;
> > > +		u16 guc_id;
> > > +
> > > +		/*
> > > +		 * If a context has been waiting for 3/4 of its delay or more,
> > > +		 * issue the schedule disable. Using this heuristic allows more
> > > +		 * than 1 context to have its scheduling disabled when this
> > > +		 * timer is run.
> > > +		 */
> > > +		if (!should_sched_be_disabled(guc, now, ce))
> > > +			break;
> > > +
> > > +		list_del_init(&ce->guc_sched_disable_link);
> > > +
> > > +		spin_lock(&ce->guc_state.lock);
> > > +		enabled = context_enabled(ce);
> > > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > > +			if (enabled)
> > > +				clr_context_enabled(ce);
> > > +			spin_unlock(&ce->guc_state.lock);
> > > +			intel_context_sched_disable_unpin(ce);
> > > +			continue;
> > > +		}
> > > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > > +			spin_unlock(&ce->guc_state.lock);
> > > +			continue;
> > > +		}
> > > +		guc_id = prep_context_pending_disable(ce);
> > > +		spin_unlock(&ce->guc_state.lock);
> > > +
> > > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > > +			__guc_context_sched_disable(guc, ce, guc_id);
> > > +	}
> > > +
> > > +	if (!list_empty(&guc->sched_disable_list)) {
> > > +		struct intel_context *first =
> > > +			list_first_entry(&guc->sched_disable_list,
> > > +					 typeof(*first),
> > > +					 guc_sched_disable_link);
> > > +		u64 next_time = next_sched_disable_time(guc, now, first);
> > > +
> > > +		hrtimer_forward(hrtimer, now, ns_to_ktime(next_time));
> > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > +
> > > +		return HRTIMER_RESTART;
> > > +	} else {
> > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > +
> > > +		return HRTIMER_NORESTART;
> > > +	}
> > > +}
> > > +
> > > +#define guc_id_pressure(max, in_use)	(in_use > (max / 4) * 3)
> > >  static void guc_context_sched_disable(struct intel_context *ce)
> > >  {
> > >  	struct intel_guc *guc = ce_to_guc(ce);
> > > @@ -2924,8 +3151,14 @@ static void guc_context_sched_disable(struct intel_context *ce)
> > >  	intel_wakeref_t wakeref;
> > >  	u16 guc_id;
> > >  	bool enabled;
> > > +	int guc_id_index = intel_context_is_parent(ce) ?
> > > +		GUC_SUBMIT_ENGINE_MULTI_LRC : GUC_SUBMIT_ENGINE_SINGLE_LRC;
> > > +	int max_guc_ids = intel_context_is_parent(ce) ?
> > > +	       NUMBER_MULTI_LRC_GUC_ID(guc) :
> > > +	       guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
> > >  
> > >  	GEM_BUG_ON(intel_context_is_child(ce));
> > > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> > >  
> > >  	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
> > >  	    !lrc_desc_registered(guc, ce->guc_id)) {
> > > @@ -2936,6 +3169,18 @@ static void guc_context_sched_disable(struct intel_context *ce)
> > >  	if (!context_enabled(ce))
> > >  		goto unpin;
> > >  
> > > +	/*
> > > +	 * If no guc_id pressure and the context isn't closed we delay the
> > > +	 * schedule disable to not to continuously disable / enable scheduling
> > > +	 * putting pressure on both the i915 and GuC. Delay is configurable via
> > > +	 * debugfs, default 1s.
> > > +	 */
> > > +	if (!guc_id_pressure(max_guc_ids, guc->guc_ids_in_use[guc_id_index]) &&
> > > +	    !intel_context_is_closed(ce) && guc->sched_disable_delay_ns) {
> > > +		sched_disable_context_add(guc, ce);
> > > +		return;
> > > +	}
> > > +
> > >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > >  
> > >  	/*
> > > @@ -3294,6 +3539,58 @@ static void remove_from_context(struct i915_request *rq)
> > >  	i915_request_notify_execute_cb_imm(rq);
> > >  }
> > >  
> > > +static void __guc_context_close(struct intel_guc *guc,
> > > +				struct intel_context *ce)
> > > +{
> > > +	lockdep_assert_held(&guc->sched_disable_lock);
> > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > +
> > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > +		struct intel_runtime_pm *runtime_pm =
> > > +			ce->engine->uncore->rpm;
> > > +		intel_wakeref_t wakeref;
> > > +		bool enabled;
> > > +		u16 guc_id;
> > > +
> > > +		spin_lock(&ce->guc_state.lock);
> > > +		enabled = context_enabled(ce);
> > > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > > +			if (enabled)
> > > +				clr_context_enabled(ce);
> > > +			spin_unlock(&ce->guc_state.lock);
> > > +			intel_context_sched_disable_unpin(ce);
> > > +			goto update_list;
> > > +		}
> > > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > > +			spin_unlock(&ce->guc_state.lock);
> > > +			goto update_list;
> > > +		}
> > > +		guc_id = prep_context_pending_disable(ce);
> > > +		spin_unlock(&ce->guc_state.lock);
> > > +
> > > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > > +			__guc_context_sched_disable(guc, ce, guc_id);
> > > +update_list:
> > > +		____sched_disable_context_delete(guc, ce);
> > > +	}
> > > +}
> > > +
> > > +static void guc_context_close(struct intel_context *ce)
> > > +{
> > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > +	unsigned long flags;
> > > +
> > > +	/*
> > > +	 * If we close the context and a schedule disable is pending a delay, do
> > > +	 * it immediately.
> > > +	 */
> > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > +		__guc_context_close(guc, ce);
> > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > +	}
> > > +}
> > > +
> > >  static struct intel_context *
> > >  guc_create_parallel(struct intel_engine_cs **engines,
> > >  		    unsigned int num_siblings,
> > > @@ -3308,6 +3605,7 @@ static const struct intel_context_ops guc_context_ops = {
> > >  	.post_unpin = guc_context_post_unpin,
> > >  
> > >  	.ban = guc_context_ban,
> > > +	.close = guc_context_close,
> > >  
> > >  	.cancel_request = guc_context_cancel_request,
> > >  
> > > @@ -3538,6 +3836,10 @@ static int guc_request_alloc(struct i915_request *rq)
> > >  
> > >  	rq->reserved_space -= GUC_REQUEST_SIZE;
> > >  
> > > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link) &&
> > > +		   atomic_read(&ce->pin_count) < 3);
> > > +	sched_disable_context_delete(ce);
> > > +
> > >  	/*
> > >  	 * guc_ids are exhausted or a heuristic is met indicating too many
> > >  	 * guc_ids are waiting on requests with submission dependencies (not
> > > @@ -3667,7 +3969,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> > >  	__guc_context_unpin(ce);
> > >  
> > >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > -		intel_engine_pm_put(engine);
> > > +		intel_engine_pm_put_async(engine);
> > >  }
> > >  
> > >  static void guc_virtual_context_enter(struct intel_context *ce)
> > > @@ -3708,6 +4010,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> > >  	.post_unpin = guc_context_post_unpin,
> > >  
> > >  	.ban = guc_context_ban,
> > > +	.close = guc_context_close,
> > >  
> > >  	.cancel_request = guc_context_cancel_request,
> > >  
> > > @@ -3819,6 +4122,7 @@ static const struct intel_context_ops virtual_parent_context_ops = {
> > >  	.post_unpin = guc_parent_context_post_unpin,
> > >  
> > >  	.ban = guc_context_ban,
> > > +	.close = guc_context_close,
> > >  
> > >  	.enter = guc_virtual_context_enter,
> > >  	.exit = guc_virtual_context_exit,
> > > @@ -4924,7 +5228,11 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> > >  	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
> > >  		   atomic_read(&guc->outstanding_submission_g2h));
> > >  	drm_printf(p, "GuC Number GuC IDs: %d\n", guc->num_guc_ids);
> > > -	drm_printf(p, "GuC Max Number GuC IDs: %d\n\n", guc->max_guc_ids);
> > > +	drm_printf(p, "GuC Max Number GuC IDs: %d\n", guc->max_guc_ids);
> > > +	drm_printf(p, "GuC single-lrc GuC IDs in use: %d\n",
> > > +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]);
> > > +	drm_printf(p, "GuC multi-lrc GuC IDs in use: %d\n",
> > > +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC]);
> > >  	drm_printf(p, "GuC max context registered: %u\n\n",
> > >  		   guc->lrcd_reg.max_idx);
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > > index 9cfecf9d368e..ad70b3159ce4 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > > @@ -174,7 +174,8 @@ static int multi_lrc_not_blocked(struct intel_gt *gt, bool flow_control)
> > >  #define NUM_RQ_PER_CONTEXT	2
> > >  #define HEARTBEAT_INTERVAL	1500
> > >  
> > > -static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang)
> > > +static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids,
> > > +					bool hang, bool sched_disable_delay)
> > >  {
> > >  	struct intel_gt *gt = arg;
> > >  	struct intel_guc *guc = &gt->uc.guc;
> > > @@ -203,6 +204,9 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> > >  	if (limit_guc_ids)
> > >  		guc->num_guc_ids = NUM_GUC_ID;
> > >  
> > > +	if (sched_disable_delay)
> > > +		guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS / 5;
> > > +
> > >  	ce = intel_context_create(intel_selftest_find_any_engine(gt));
> > >  	if (IS_ERR(ce)) {
> > >  		ret = PTR_ERR(ce);
> > > @@ -391,6 +395,7 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> > >  	guc->num_guc_ids = guc->max_guc_ids;
> > >  	guc->gse_hang_expected = false;
> > >  	guc->inject_bad_sched_disable = false;
> > > +	guc->sched_disable_delay_ns = 0;
> > >  	kfree(contexts);
> > >  
> > >  	return ret;
> > > @@ -398,17 +403,22 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> > >  
> > >  static int intel_guc_flow_control_guc_ids(void *arg)
> > >  {
> > > -	return __intel_guc_flow_control_guc(arg, true, false);
> > > +	return __intel_guc_flow_control_guc(arg, true, false, false);
> > > +}
> > > +
> > > +static int intel_guc_flow_control_guc_ids_sched_disable_delay(void *arg)
> > > +{
> > > +	return __intel_guc_flow_control_guc(arg, true, false, true);
> > >  }
> > >  
> > >  static int intel_guc_flow_control_lrcd_reg(void *arg)
> > >  {
> > > -	return __intel_guc_flow_control_guc(arg, false, false);
> > > +	return __intel_guc_flow_control_guc(arg, false, false, false);
> > >  }
> > >  
> > >  static int intel_guc_flow_control_hang_state_machine(void *arg)
> > >  {
> > > -	return __intel_guc_flow_control_guc(arg, true, true);
> > > +	return __intel_guc_flow_control_guc(arg, true, true, false);
> > >  }
> > >  
> > >  #define NUM_RQ_STRESS_CTBS	0x4000
> > > @@ -861,6 +871,7 @@ int intel_guc_flow_control(struct drm_i915_private *i915)
> > >  	static const struct i915_subtest tests[] = {
> > >  		SUBTEST(intel_guc_flow_control_stress_ctbs),
> > >  		SUBTEST(intel_guc_flow_control_guc_ids),
> > > +		SUBTEST(intel_guc_flow_control_guc_ids_sched_disable_delay),
> > >  		SUBTEST(intel_guc_flow_control_lrcd_reg),
> > >  		SUBTEST(intel_guc_flow_control_hang_state_machine),
> > >  		SUBTEST(intel_guc_flow_control_multi_lrc_guc_ids),
> > > diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
> > > index f54de0499be7..bf464db7affe 100644
> > > --- a/drivers/gpu/drm/i915/i915_selftest.h
> > > +++ b/drivers/gpu/drm/i915/i915_selftest.h
> > > @@ -92,12 +92,14 @@ int __i915_subtests(const char *caller,
> > >  			T, ARRAY_SIZE(T), data)
> > >  #define i915_live_subtests(T, data) ({ \
> > >  	typecheck(struct drm_i915_private *, data); \
> > > +	(data)->gt.uc.guc.sched_disable_delay_ns = 0; \
> > >  	__i915_subtests(__func__, \
> > >  			__i915_live_setup, __i915_live_teardown, \
> > >  			T, ARRAY_SIZE(T), data); \
> > >  })
> > >  #define intel_gt_live_subtests(T, data) ({ \
> > >  	typecheck(struct intel_gt *, data); \
> > > +	(data)->uc.guc.sched_disable_delay_ns = 0; \
> > >  	__i915_subtests(__func__, \
> > >  			__intel_gt_live_setup, __intel_gt_live_teardown, \
> > >  			T, ARRAY_SIZE(T), data); \
> > > diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> > > index 806ad688274b..57ba7065d5ab 100644
> > > --- a/drivers/gpu/drm/i915/i915_trace.h
> > > +++ b/drivers/gpu/drm/i915/i915_trace.h
> > > @@ -933,6 +933,11 @@ DEFINE_EVENT(intel_context, intel_context_reset,
> > >  	     TP_ARGS(ce)
> > >  );
> > >  
> > > +DEFINE_EVENT(intel_context, intel_context_close,
> > > +	     TP_PROTO(struct intel_context *ce),
> > > +	     TP_ARGS(ce)
> > > +);
> > > +
> > >  DEFINE_EVENT(intel_context, intel_context_ban,
> > >  	     TP_PROTO(struct intel_context *ce),
> > >  	     TP_ARGS(ce)
> > > @@ -1035,6 +1040,11 @@ trace_intel_context_reset(struct intel_context *ce)
> > >  {
> > >  }
> > >  
> > > +static inline void
> > > +trace_intel_context_close(struct intel_context *ce)
> > > +{
> > > +}
> > > +
> > >  static inline void
> > >  trace_intel_context_ban(struct intel_context *ce)
> > >  {
> > > diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > > index f843a5040706..d54c280217fe 100644
> > > --- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > > +++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > > @@ -2112,5 +2112,5 @@ int i915_gem_gtt_live_selftests(struct drm_i915_private *i915)
> > >  
> > >  	GEM_BUG_ON(offset_in_page(i915->ggtt.vm.total));
> > >  
> > > -	return i915_subtests(tests, i915);
> > > +	return i915_live_subtests(tests, i915);
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/selftests/i915_perf.c b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > > index 9e9a6cb1d9e5..86bad00cca95 100644
> > > --- a/drivers/gpu/drm/i915/selftests/i915_perf.c
> > > +++ b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > > @@ -431,7 +431,7 @@ int i915_perf_live_selftests(struct drm_i915_private *i915)
> > >  	if (err)
> > >  		return err;
> > >  
> > > -	err = i915_subtests(tests, i915);
> > > +	err = i915_live_subtests(tests, i915);
> > >  
> > >  	destroy_empty_config(&i915->perf);
> > >  
> > > diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
> > > index d67710d10615..afbf88865a8b 100644
> > > --- a/drivers/gpu/drm/i915/selftests/i915_request.c
> > > +++ b/drivers/gpu/drm/i915/selftests/i915_request.c
> > > @@ -1693,7 +1693,7 @@ int i915_request_live_selftests(struct drm_i915_private *i915)
> > >  	if (intel_gt_is_wedged(&i915->gt))
> > >  		return 0;
> > >  
> > > -	return i915_subtests(tests, i915);
> > > +	return i915_live_subtests(tests, i915);
> > >  }
> > >  
> > >  static int switch_to_kernel_sync(struct intel_context *ce, int err)
> > > diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > > index dd0607254a95..f4b157451851 100644
> > > --- a/drivers/gpu/drm/i915/selftests/i915_vma.c
> > > +++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > > @@ -1085,5 +1085,5 @@ int i915_vma_live_selftests(struct drm_i915_private *i915)
> > >  		SUBTEST(igt_vma_remapped_gtt),
> > >  	};
> > >  
> > > -	return i915_subtests(tests, i915);
> > > +	return i915_live_subtests(tests, i915);
> > >  }
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-10 17:29           ` Matthew Brost
@ 2021-08-11 10:04             ` Daniel Vetter
  2021-08-11 17:35               ` Matthew Brost
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-11 10:04 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Tue, Aug 10, 2021 at 05:29:46PM +0000, Matthew Brost wrote:
> On Tue, Aug 10, 2021 at 11:27:31AM +0200, Daniel Vetter wrote:
> > On Tue, Aug 10, 2021 at 11:23:39AM +0200, Daniel Vetter wrote:
> > > On Mon, Aug 09, 2021 at 07:13:11PM +0000, Matthew Brost wrote:
> > > > On Mon, Aug 09, 2021 at 06:36:44PM +0200, Daniel Vetter wrote:
> > > > > On Tue, Aug 03, 2021 at 03:29:22PM -0700, Matthew Brost wrote:
> > > > > > Display the workqueue status in debugfs for GuC contexts that are in
> > > > > > parent-child relationship.
> > > > > > 
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > ---
> > > > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 +++++++++++++------
> > > > > >  1 file changed, 39 insertions(+), 17 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > index 30df1c8db491..44a7582c9aed 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > @@ -4527,31 +4527,53 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> > > > > >  		gse_log_submission_info(guc->gse[i], p, i);
> > > > > >  }
> > > > > >  
> > > > > > +static inline void guc_log_context(struct drm_printer *p,
> > > > > > +				   struct intel_context *ce)
> > > > > > +{
> > > > > > +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > > > > +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > > > > +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > > > > +		   ce->ring->head,
> > > > > > +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > > > > +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > > > > +		   ce->ring->tail,
> > > > > > +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > > > > +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > > > > +		   atomic_read(&ce->pin_count));
> > > > > > +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > > > > +		   atomic_read(&ce->guc_id_ref));
> > > > > > +	drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > > > > +		   atomic_read(&ce->guc_num_rq_not_ready));
> > > > > > +	drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > > > > +		   ce->guc_state.sched_state,
> > > > > > +		   atomic_read(&ce->guc_sched_state_no_lock));
> > > > > 
> > > > > It's all debugfs, but I think proper locking even there is good. It at
> > > > > least reduces the confusion when the locking scheme is largely
> > > > > undocumented. Also, given how much we use rcu for everything, it would be good
> > > > > to double-check that all pointer dereferences are properly protected.
> > > > >
> > > > 
> > > > Not sure if I 100% follow this, but I don't think any of the pointers
> > > > dereferenced here are RCU protected. Certainly none of the GuC ones are.
> > > > 
> > > > Will double-check before the next respin though.
> > > > 
> > > > > > +}
> > > > > > +
> > > > > >  void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > > > > >  					     struct drm_printer *p)
> > > > > >  {
> > > > > >  	struct intel_context *ce;
> > > > > >  	unsigned long index;
> > > > > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > > > 
> > > > > xa_for_each doesn't provide any guarantees, so doesn't protect against
> > > > > concurrent removal or anything like that. We need to do better than that.
> > > > 
> > > > https://elixir.bootlin.com/linux/latest/source/include/linux/xarray.h#L498
> > > > 'It is safe to modify the array during the iteration.'
> > > 
> > > The xarray. Not the thing you're dereferencing, because the xarray only
> > > stores pointers, not your data structure. So yeah, the correct statement is
> > > that it doesn't provide you any guarantees beyond "the iterator won't be
> > > confused if the xarray itself is modified during iteration". Which isn't
> > > what you need here, you need a lot more.
> > 
> > Or spelled out: The pointer you get could become immediately meaningless,
> > before you can look at it, due to a concurrent removal/release. All the
> > xa_for_each guarantees you is that on the next round you get the next
> > pointer, until you got them all (plus/minus concurrent changes). But that
> > next pointer could have become meaningless right away too.
> > 
> > So you need your own locking to make use of these pointers you got and
> > make sure they're not immediately meaningless before your loop body even
> > started.
> > 
> 
> Ok, I think I see your point. Likely whenever we do an xa_for_each over
> &guc->context_lookup we should just grab its lock, since if an object is
> in the xarray we hold a reference to it. Also, every time we use
> xa_for_each on &guc->context_lookup it is a corner case where it is ok
> to block anyone else from using it (e.g. during a reset, checking
> debugfs, etc.). Does that sound correct?

Yup, generally the simplest is to just hold the lock for the
list/xarray/whatever to keep the object alive. Next up in complexity is to
grab a temporary reference. This is usually required if the next step is
taking a mutex, and your lookup lock is a spinlock. Or if you have some
other locking inversion.

And yes anywhere in debugfs, or anywhere else where performance doesn't
matter just use proper locking, no tricks with rcu or lockless or
whatever.
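
Just to spell that pattern out, below is a minimal sketch of what it could
look like for the context dump (not tested; it assumes the xarray's internal
xa_lock is a suitable lookup lock here and that the context kref can be
taken with kref_get_unless_zero(&ce->ref); guc_log_context() is the helper
from the patch, the wrapper name is made up):

/*
 * Sketch only: iterate guc->context_lookup under its xa_lock, take a
 * temporary reference before dropping the lock for anything that might
 * sleep, then re-acquire the lock to continue the walk.
 */
static void guc_print_context_info_locked(struct intel_guc *guc,
					  struct drm_printer *p)
{
	struct intel_context *ce;
	unsigned long index;
	unsigned long flags;

	xa_lock_irqsave(&guc->context_lookup, flags);
	xa_for_each(&guc->context_lookup, index, ce) {
		/* Skip contexts that are already on their way out. */
		if (!kref_get_unless_zero(&ce->ref))
			continue;

		/* Drop the lock for the body; the reference keeps ce alive. */
		xa_unlock_irqrestore(&guc->context_lookup, flags);

		guc_log_context(p, ce);	/* may sleep / take other locks */

		intel_context_put(ce);
		xa_lock_irqsave(&guc->context_lookup, flags);
	}
	xa_unlock_irqrestore(&guc->context_lookup, flags);
}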

Finally a word on gpu reset: It is currently not annotated, but gpu reset
is a dma_fence signalling critical section (if we fail to get through gpu
reset, dma_fences are potentially stuck). That means any lock you take in
gpu reset is very encumbered, so needs an audit to make sure you're not
creating an inversion anywhere. While I bring this up, I noticed you're
using i915_sw_fence instead of dma_fence directly in a bunch of places in
GuC code. We're kinda aiming to get rid of i915_sw_fence (and maybe move
the remaining useful bits into drivers/dma-buf/), so using less of the
i915-NIH-isms would be really good in general. There's unfortunately way
too much of that too.
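
For reference, the lockdep annotation for a fence signalling critical
section looks roughly like the sketch below; where exactly the markers go
in the reset path is the open question, and the function name here is made
up:

#include <linux/dma-fence.h>

/*
 * Sketch: wrap the reset work in a dma_fence signalling critical section
 * so lockdep can flag any lock that is also held while waiting on fences.
 */
static void reset_fence_critical_section(void)
{
	bool cookie;

	cookie = dma_fence_begin_signalling();

	/* ... the actual gt reset work goes here ... */

	dma_fence_end_signalling(cookie);
}
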
-Daniel

> 
> Matt
> 
> > One of the reasons why I think this is so important is that debugfs files
> > nest a lot of loops fairly often, so are good cheat-sheet for the locking
> > if it happens to be undocumented (which also shouldn't be the case). Ofc
> > if there's no locking in debugfs, no cheat-sheet :-)
> > 
> > Cheers, Daniel
> > 
> > > -Daniel
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > > -Daniel
> > > > > 
> > > > > > -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > > > > -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > > > > -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > > > > -			   ce->ring->head,
> > > > > > -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > > > > -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > > > > -			   ce->ring->tail,
> > > > > > -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > > > > -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > > > > -			   atomic_read(&ce->pin_count));
> > > > > > -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > > > > -			   atomic_read(&ce->guc_id_ref));
> > > > > > -		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > > > > -			   atomic_read(&ce->guc_num_rq_not_ready));
> > > > > > -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > > > > -			   ce->guc_state.sched_state,
> > > > > > -			   atomic_read(&ce->guc_sched_state_no_lock));
> > > > > > +		GEM_BUG_ON(intel_context_is_child(ce));
> > > > > >  
> > > > > > +		guc_log_context(p, ce);
> > > > > >  		guc_log_context_priority(p, ce);
> > > > > > +
> > > > > > +		if (intel_context_is_parent(ce)) {
> > > > > > +			struct guc_process_desc *desc = __get_process_desc(ce);
> > > > > > +			struct intel_context *child;
> > > > > > +
> > > > > > +			drm_printf(p, "\t\tWQI Head: %u\n",
> > > > > > +				   READ_ONCE(desc->head));
> > > > > > +			drm_printf(p, "\t\tWQI Tail: %u\n",
> > > > > > +				   READ_ONCE(desc->tail));
> > > > > > +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> > > > > > +				   READ_ONCE(desc->wq_status));
> > > > > > +
> > > > > > +			for_each_child(ce, child)
> > > > > > +				guc_log_context(p, child);
> > > > > > +		}
> > > > > >  	}
> > > > > >  }
> > > > > >  
> > > > > > -- 
> > > > > > 2.28.0
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc
  2021-08-11 10:04             ` Daniel Vetter
@ 2021-08-11 17:35               ` Matthew Brost
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-11 17:35 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Wed, Aug 11, 2021 at 12:04:04PM +0200, Daniel Vetter wrote:
> On Tue, Aug 10, 2021 at 05:29:46PM +0000, Matthew Brost wrote:
> > On Tue, Aug 10, 2021 at 11:27:31AM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 10, 2021 at 11:23:39AM +0200, Daniel Vetter wrote:
> > > > On Mon, Aug 09, 2021 at 07:13:11PM +0000, Matthew Brost wrote:
> > > > > On Mon, Aug 09, 2021 at 06:36:44PM +0200, Daniel Vetter wrote:
> > > > > > On Tue, Aug 03, 2021 at 03:29:22PM -0700, Matthew Brost wrote:
> > > > > > > Display the workqueue status in debugfs for GuC contexts that are in
> > > > > > > parent-child relationship.
> > > > > > > 
> > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > ---
> > > > > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 +++++++++++++------
> > > > > > >  1 file changed, 39 insertions(+), 17 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > index 30df1c8db491..44a7582c9aed 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > @@ -4527,31 +4527,53 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> > > > > > >  		gse_log_submission_info(guc->gse[i], p, i);
> > > > > > >  }
> > > > > > >  
> > > > > > > +static inline void guc_log_context(struct drm_printer *p,
> > > > > > > +				   struct intel_context *ce)
> > > > > > > +{
> > > > > > > +	drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > > > > > +	drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > > > > > +	drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > > > > > +		   ce->ring->head,
> > > > > > > +		   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > > > > > +	drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > > > > > +		   ce->ring->tail,
> > > > > > > +		   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > > > > > +	drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > > > > > +		   atomic_read(&ce->pin_count));
> > > > > > > +	drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > > > > > +		   atomic_read(&ce->guc_id_ref));
> > > > > > > +	drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > > > > > +		   atomic_read(&ce->guc_num_rq_not_ready));
> > > > > > > +	drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > > > > > +		   ce->guc_state.sched_state,
> > > > > > > +		   atomic_read(&ce->guc_sched_state_no_lock));
> > > > > > 
> > > > > > It's all debugfs, but I think proper locking even there is good. It at
> > > > > > least reduces the confusion when the locking scheme is largely
> > > > > > undocumented. Also, given how much we use rcu for everything, it would be good
> > > > > > to double-check that all pointer dereferences are properly protected.
> > > > > >
> > > > > 
> > > > > Not sure if I 100% follow this, but I don't think any of the pointers
> > > > > dereferenced here are RCU protected. Certainly none of the GuC ones are.
> > > > > 
> > > > > Will double-check before the next respin though.
> > > > > 
> > > > > > > +}
> > > > > > > +
> > > > > > >  void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > > > > > >  					     struct drm_printer *p)
> > > > > > >  {
> > > > > > >  	struct intel_context *ce;
> > > > > > >  	unsigned long index;
> > > > > > >  	xa_for_each(&guc->context_lookup, index, ce) {
> > > > > > 
> > > > > > xa_for_each doesn't provide any guarantees, so doesn't protect against
> > > > > > concurrent removal or anything like that. We need to do better than that.
> > > > > 
> > > > > https://elixir.bootlin.com/linux/latest/source/include/linux/xarray.h#L498
> > > > > 'It is safe to modify the array during the iteration.'
> > > > 
> > > > The xarray. Not the thing you're dereferencing, because the xarray only
> > > > stores pointers, not your data structure. So yeah, the correct statement is
> > > > that it doesn't provide you any guarantees beyond "the iterator won't be
> > > > confused if the xarray itself is modified during iteration". Which isn't
> > > > what you need here, you need a lot more.
> > > 
> > > Or spelled out: The pointer you get could become immediately meaningless,
> > > before you can look at it, due to a concurrent removal/release. All the
> > > xa_for_each guarantees you is that on the next round you get the next
> > > pointer, until you got them all (plus/minus concurrent changes). But that
> > > next pointer could have become meaningless right away too.
> > > 
> > > So you need your own locking to make use of these pointers you got and
> > > make sure they're not immediately meaningless before your loop body even
> > > started.
> > > 
> > 
> > Ok, I think I see your point. Likely whenever we do an xa_for_each over
> > &guc->context_lookup we should just grab its lock, since if an object is
> > in the xarray we hold a reference to it. Also, every time we use
> > xa_for_each on &guc->context_lookup it is a corner case where it is ok
> > to block anyone else from using it (e.g. during a reset, checking
> > debugfs, etc.). Does that sound correct?
> 
> Yup, generally the simplest is to just hold the lock for the
> list/xarray/whatever to keep the object alive. Next up in complexity is to
> grab a temporary reference. This is usually required if the next step is
> taking a mutex, and your lookup lock is a spinlock. Or if you have some
> other locking inversion.
> 
> And yes anywhere in debugfs, or anywhere else where performance doesn't
> matter just use proper locking, no tricks with rcu or lockless or
> whatever.
> 
> Finally a word on gpu reset: It is currently not annotated, but gpu reset
> is a dma_fence signalling critical section (if we fail to get through gpu
> reset, dma_fences are potentially stuck). That means any lock you take in
> gpu reset is very encumbered, so needs an audit to make sure you're not
> creating an inversion anywhere. While I bring this up, I noticed you're
> using i915_sw_fence instead of dma_fence directly in a bunch of places in
> GuC code. We're kinda aiming to get rid of i915_sw_fence (and maybe move
> the remaining useful bits into drivers/dma-buf/), so using less of the
> i915-NIH-isms would be really good in general. There's unfortunately way
> too much of that too.

Yes, I'm aware that getting rid of i915_sw_fence is a long-term goal. That
is going to take quite a while to unwind.

Matt 

> -Daniel
> 
> > 
> > Matt
> > 
> > > One of the reasons why I think this is so important is that debugfs files
> > > nest a lot of loops fairly often, so are good cheat-sheet for the locking
> > > if it happens to be undocumented (which also shouldn't be the case). Ofc
> > > if there's no locking in debugfs, no cheat-sheet :-)
> > > 
> > > Cheers, Daniel
> > > 
> > > > -Daniel
> > > > 
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > -Daniel
> > > > > > 
> > > > > > > -		drm_printf(p, "GuC lrc descriptor %u:\n", ce->guc_id);
> > > > > > > -		drm_printf(p, "\tHW Context Desc: 0x%08x\n", ce->lrc.lrca);
> > > > > > > -		drm_printf(p, "\t\tLRC Head: Internal %u, Memory %u\n",
> > > > > > > -			   ce->ring->head,
> > > > > > > -			   ce->lrc_reg_state[CTX_RING_HEAD]);
> > > > > > > -		drm_printf(p, "\t\tLRC Tail: Internal %u, Memory %u\n",
> > > > > > > -			   ce->ring->tail,
> > > > > > > -			   ce->lrc_reg_state[CTX_RING_TAIL]);
> > > > > > > -		drm_printf(p, "\t\tContext Pin Count: %u\n",
> > > > > > > -			   atomic_read(&ce->pin_count));
> > > > > > > -		drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
> > > > > > > -			   atomic_read(&ce->guc_id_ref));
> > > > > > > -		drm_printf(p, "\t\tNumber Requests Not Ready: %u\n",
> > > > > > > -			   atomic_read(&ce->guc_num_rq_not_ready));
> > > > > > > -		drm_printf(p, "\t\tSchedule State: 0x%x, 0x%x\n\n",
> > > > > > > -			   ce->guc_state.sched_state,
> > > > > > > -			   atomic_read(&ce->guc_sched_state_no_lock));
> > > > > > > +		GEM_BUG_ON(intel_context_is_child(ce));
> > > > > > >  
> > > > > > > +		guc_log_context(p, ce);
> > > > > > >  		guc_log_context_priority(p, ce);
> > > > > > > +
> > > > > > > +		if (intel_context_is_parent(ce)) {
> > > > > > > +			struct guc_process_desc *desc = __get_process_desc(ce);
> > > > > > > +			struct intel_context *child;
> > > > > > > +
> > > > > > > +			drm_printf(p, "\t\tWQI Head: %u\n",
> > > > > > > +				   READ_ONCE(desc->head));
> > > > > > > +			drm_printf(p, "\t\tWQI Tail: %u\n",
> > > > > > > +				   READ_ONCE(desc->tail));
> > > > > > > +			drm_printf(p, "\t\tWQI Status: %u\n\n",
> > > > > > > +				   READ_ONCE(desc->wq_status));
> > > > > > > +
> > > > > > > +			for_each_child(ce, child)
> > > > > > > +				guc_log_context(p, child);
> > > > > > > +		}
> > > > > > >  	}
> > > > > > >  }
> > > > > > >  
> > > > > > > -- 
> > > > > > > 2.28.0
> > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch
> > > > 
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts
  2021-08-11  9:55       ` Daniel Vetter
@ 2021-08-11 17:43         ` Matthew Brost
  2021-08-12 14:04           ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-11 17:43 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Wed, Aug 11, 2021 at 11:55:48AM +0200, Daniel Vetter wrote:
> On Mon, Aug 09, 2021 at 07:32:26PM +0000, Matthew Brost wrote:
> > On Mon, Aug 09, 2021 at 07:17:27PM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 03, 2021 at 03:29:43PM -0700, Matthew Brost wrote:
> > > > Some workloads use lots of contexts that continually pin / unpin
> > > > contexts. With GuC submission an unpin translates to a schedule disable
> > > > H2G which puts pressure on both the i915 and GuC. A schedule disable can
> > > > also block future requests from being submitted until the operation
> > > > completes. None of this is ideal.
> > > > 
> > > > Add a configurable, via debugfs, delay period before the schedule
> > > > disable is issued. Default delay period is 1 second. The delay period is
> > > > skipped if more than 3/4 of the guc_ids are in use.
> > > > 
> > > > This patch also updates the selftests to turn off this delay period as
> > > > this extra time would likely cause many selftests to fail. Follow up
> > > > patches will fix all the selftests and enable the delay period.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > 
> > > I think this is more evidence that we should just pin/unpin contexts at
> > > create/destruction time. The current scheme doesn't really work that well
> > > and causes way more pain than benefit, it seems.
> > > 
> > 
> > Well that choice is above my pay grade, but for what it is worth it
> > would simplify the GuC backend quite a bit if we perma-pin contexts. By
> > quite a bit, I actually mean a lot of complexity goes away.
> > 
> > In the meantime I think we probably need this code though to avoid
> > thrashing on the scheduling enable / disable.
> 
> The trouble is that you muck around with the context close state bit,

This really doesn't mess with this bit any more than what is already there,
it just adds a callback to the backend.

> which is one of these lockless trickeries where my cursory analysis (just
> a few days in total of randomly stumbling over it when reading other code)
> strongly suggests it's busted.
> 
> I really don't want to build more on top, especially not without careful
> review and all that.
> 
> Also since this is a perf claim, the commit message needs some numbers.
>

This was basically just visual inspection of an ftrace capture of a media
workload that uses lots of contexts. The contexts were repeatedly pinned /
unpinned. Disabling / enabling scheduling is a rather expensive operation,
so we really shouldn't be doing it all the time. Inspecting an ftrace
capture after this change showed that all of this unnecessary traffic
went away.
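
For anyone skimming, the unpin-time decision the patch makes boils down to
roughly the below (distilled from guc_context_sched_disable() in the diff
quoted further down; the helper name is made up):

/*
 * Defer the schedule disable H2G unless guc_ids are under pressure
 * (more than 3/4 in use), the context has been closed, or the delay
 * has been set to 0 via debugfs.
 */
static bool defer_sched_disable(struct intel_guc *guc,
				struct intel_context *ce,
				u32 max_guc_ids, u32 in_use)
{
	bool pressure = in_use > (max_guc_ids / 4) * 3;

	return !pressure && !intel_context_is_closed(ce) &&
	       guc->sched_disable_delay_ns;
}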

> Finally even if we decide to make contexts properly evictable, we need a
> different scheme anyway. As you realized the current active tracking is
> kinda backwards because it unpins immediately when no longer in use.

Right, this basically just works around the fact that contexts are
immediately unpinned when not in use. As stated before, if we perma-pin
contexts all of this goes away.

Matt

> -Daniel
> 
> > 
> > Matt
> > 
> > > If anyone screams, and that's a big if aside of some igts, we can come up
> > > with a proper scheme to evict contexts without pin/unpin and layer hacks
> > > over that misdesign.
> > > -Daniel
> > > 
> > > > ---
> > > >  drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
> > > >  .../i915/gem/selftests/i915_gem_coherency.c   |   2 +-
> > > >  .../drm/i915/gem/selftests/i915_gem_dmabuf.c  |   2 +-
> > > >  .../drm/i915/gem/selftests/i915_gem_mman.c    |   2 +-
> > > >  .../drm/i915/gem/selftests/i915_gem_object.c  |   2 +-
> > > >  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> > > >  drivers/gpu/drm/i915/gt/intel_context.h       |   9 +
> > > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   8 +
> > > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  28 ++
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 322 +++++++++++++++++-
> > > >  .../i915/gt/uc/selftest_guc_flow_control.c    |  19 +-
> > > >  drivers/gpu/drm/i915/i915_selftest.h          |   2 +
> > > >  drivers/gpu/drm/i915/i915_trace.h             |  10 +
> > > >  drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   2 +-
> > > >  drivers/gpu/drm/i915/selftests/i915_perf.c    |   2 +-
> > > >  drivers/gpu/drm/i915/selftests/i915_request.c |   2 +-
> > > >  drivers/gpu/drm/i915/selftests/i915_vma.c     |   2 +-
> > > >  18 files changed, 405 insertions(+), 20 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > > index b199d59bd2c4..1553287e5491 100644
> > > > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > > @@ -1298,7 +1298,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
> > > >  		int err;
> > > >  
> > > >  		/* serialises with execbuf */
> > > > -		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > > > +		intel_context_close(ce);
> > > >  		if (!intel_context_pin_if_active(ce))
> > > >  			continue;
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > > > index 13b088cc787e..a666d7e610f5 100644
> > > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > > > @@ -434,5 +434,5 @@ int i915_gem_coherency_live_selftests(struct drm_i915_private *i915)
> > > >  		SUBTEST(igt_gem_coherency),
> > > >  	};
> > > >  
> > > > -	return i915_subtests(tests, i915);
> > > > +	return i915_live_subtests(tests, i915);
> > > >  }
> > > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > > > index ffae7df5e4d7..2c92afa9d608 100644
> > > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > > > @@ -474,5 +474,5 @@ int i915_gem_dmabuf_live_selftests(struct drm_i915_private *i915)
> > > >  		SUBTEST(igt_dmabuf_import_same_driver_lmem_smem),
> > > >  	};
> > > >  
> > > > -	return i915_subtests(tests, i915);
> > > > +	return i915_live_subtests(tests, i915);
> > > >  }
> > > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > > > index b20f5621f62b..4745c78a48de 100644
> > > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > > > @@ -1414,5 +1414,5 @@ int i915_gem_mman_live_selftests(struct drm_i915_private *i915)
> > > >  		SUBTEST(igt_mmap_gpu),
> > > >  	};
> > > >  
> > > > -	return i915_subtests(tests, i915);
> > > > +	return i915_live_subtests(tests, i915);
> > > >  }
> > > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > > > index 740ee8086a27..ae1361c7c4cf 100644
> > > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > > > @@ -95,5 +95,5 @@ int i915_gem_object_live_selftests(struct drm_i915_private *i915)
> > > >  		SUBTEST(igt_gem_huge),
> > > >  	};
> > > >  
> > > > -	return i915_subtests(tests, i915);
> > > > +	return i915_live_subtests(tests, i915);
> > > >  }
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > index 8e90a4a0b7b0..96643040defd 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > @@ -472,6 +472,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> > > >  	ce->guc_id = GUC_INVALID_LRC_ID;
> > > >  	INIT_LIST_HEAD(&ce->guc_id_link);
> > > >  
> > > > +	INIT_LIST_HEAD(&ce->guc_sched_disable_link);
> > > > +
> > > >  	mutex_init(&ce->parallel_submit);
> > > >  	ce->fence_context = dma_fence_context_alloc(1);
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > index a302599e436a..f4c9036f7f03 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > @@ -215,6 +215,15 @@ static inline bool intel_context_is_barrier(const struct intel_context *ce)
> > > >  	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
> > > >  }
> > > >  
> > > > +static inline void intel_context_close(struct intel_context *ce)
> > > > +{
> > > > +	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > > > +
> > > > +	trace_intel_context_close(ce);
> > > > +	if (ce->ops->close)
> > > > +		ce->ops->close(ce);
> > > > +}
> > > > +
> > > >  static inline bool intel_context_is_closed(const struct intel_context *ce)
> > > >  {
> > > >  	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > index 8af9ace4c052..53f00657a45c 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > @@ -11,6 +11,7 @@
> > > >  #include <linux/list.h>
> > > >  #include <linux/mutex.h>
> > > >  #include <linux/types.h>
> > > > +#include <linux/ktime.h>
> > > >  
> > > >  #include "i915_active_types.h"
> > > >  #include "i915_sw_fence.h"
> > > > @@ -38,6 +39,7 @@ struct intel_context_ops {
> > > >  	int (*alloc)(struct intel_context *ce);
> > > >  
> > > >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > > > +	void (*close)(struct intel_context *ce);
> > > >  
> > > >  	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > > >  	int (*pin)(struct intel_context *ce);
> > > > @@ -203,6 +205,12 @@ struct intel_context {
> > > >  	 */
> > > >  	struct list_head guc_id_link;
> > > >  
> > > > +	/*
> > > > +	 * GuC schedule disable link / time
> > > > +	 */
> > > > +	struct list_head guc_sched_disable_link;
> > > > +	ktime_t guc_sched_disable_time;
> > > > +
> > > >  	/* GuC context blocked fence */
> > > >  	struct i915_sw_fence guc_blocked;
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > index 30a0f364db8f..90b5b657d411 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > @@ -60,6 +60,7 @@ struct intel_guc {
> > > >  	struct ida guc_ids;
> > > >  	u32 num_guc_ids;
> > > >  	u32 max_guc_ids;
> > > > +	u32 guc_ids_in_use[GUC_SUBMIT_ENGINE_MAX];
> > > >  	unsigned long *guc_ids_bitmap;
> > > >  #define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
> > > >  	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
> > > > @@ -69,6 +70,12 @@ struct intel_guc {
> > > >  	struct list_head destroyed_contexts;
> > > >  	struct intel_gt_pm_unpark_work destroy_worker;
> > > >  
> > > > +	spinlock_t sched_disable_lock;	/* protects schedule disable list */
> > > > +	struct list_head sched_disable_list;
> > > > +	struct hrtimer sched_disable_timer;
> > > > +#define SCHED_DISABLE_DELAY_NS	1000000000
> > > > +	u64 sched_disable_delay_ns;
> > > > +
> > > >  	bool submission_supported;
> > > >  	bool submission_selected;
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > > > index 7c479c5e7b3a..53a6f3da6cce 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > > > @@ -80,12 +80,40 @@ static int guc_num_id_set(void *data, u64 val)
> > > >  }
> > > >  DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
> > > >  
> > > > +static int guc_sched_disable_delay_ns_get(void *data, u64 *val)
> > > > +{
> > > > +	struct intel_guc *guc = data;
> > > > +
> > > > +	if (!intel_guc_submission_is_used(guc))
> > > > +		return -ENODEV;
> > > > +
> > > > +	*val = guc->sched_disable_delay_ns;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int guc_sched_disable_delay_ns_set(void *data, u64 val)
> > > > +{
> > > > +	struct intel_guc *guc = data;
> > > > +
> > > > +	if (!intel_guc_submission_is_used(guc))
> > > > +		return -ENODEV;
> > > > +
> > > > +	guc->sched_disable_delay_ns = val;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ns_fops,
> > > > +			guc_sched_disable_delay_ns_get,
> > > > +			guc_sched_disable_delay_ns_set, "%lld\n");
> > > > +
> > > >  void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
> > > >  {
> > > >  	static const struct debugfs_gt_file files[] = {
> > > >  		{ "guc_info", &guc_info_fops, NULL },
> > > >  		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
> > > >  		{ "guc_num_id", &guc_num_id_fops, NULL },
> > > > +		{ "guc_sched_disable_delay_ns", &guc_sched_disable_delay_ns_fops, NULL },
> > > >  	};
> > > >  
> > > >  	if (!intel_guc_is_supported(guc))
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index cd1893edf43a..dc0d6a099bee 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -654,11 +654,15 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
> > > >  	return (timeout < 0) ? timeout : 0;
> > > >  }
> > > >  
> > > > +static void sched_disable_contexts_flush(struct intel_guc *guc);
> > > > +
> > > >  int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
> > > >  {
> > > >  	if (!intel_uc_uses_guc_submission(&guc_to_gt(guc)->uc))
> > > >  		return 0;
> > > >  
> > > > +	sched_disable_contexts_flush(guc);
> > > > +
> > > >  	return intel_guc_wait_for_pending_msg(guc,
> > > >  					      &guc->outstanding_submission_g2h,
> > > >  					      true, timeout);
> > > > @@ -1135,6 +1139,7 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce);
> > > >  static void guc_signal_context_fence(struct intel_context *ce);
> > > >  static void guc_cancel_context_requests(struct intel_context *ce);
> > > >  static void guc_blocked_fence_complete(struct intel_context *ce);
> > > > +static void sched_disable_context_delete(struct intel_context *ce);
> > > >  
> > > >  static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > > >  {
> > > > @@ -1160,6 +1165,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > > >  		deregister = context_wait_for_deregister_to_register(ce);
> > > >  		banned = context_banned(ce);
> > > >  		init_sched_state(ce);
> > > > +		sched_disable_context_delete(ce);
> > > >  
> > > >  		if (pending_enable || destroyed || deregister) {
> > > >  			atomic_dec(&guc->outstanding_submission_g2h);
> > > > @@ -1299,6 +1305,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > > >  
> > > >  	intel_gt_park_heartbeats(guc_to_gt(guc));
> > > >  	disable_submission(guc);
> > > > +	hrtimer_cancel(&guc->sched_disable_timer);
> > > >  	guc->interrupts.disable(guc);
> > > >  
> > > >  	/* Flush IRQ handler */
> > > > @@ -1656,6 +1663,8 @@ static void guc_lrcd_reg_fini(struct intel_guc *guc);
> > > >  
> > > >  static void destroy_worker_func(struct work_struct *w);
> > > >  
> > > > +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer);
> > > > +
> > > >  /*
> > > >   * Set up the memory resources to be shared with the GuC (via the GGTT)
> > > >   * at firmware loading time.
> > > > @@ -1687,6 +1696,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > > >  	INIT_LIST_HEAD(&guc->destroyed_contexts);
> > > >  	intel_gt_pm_unpark_work_init(&guc->destroy_worker, destroy_worker_func);
> > > >  
> > > > +	spin_lock_init(&guc->sched_disable_lock);
> > > > +	INIT_LIST_HEAD(&guc->sched_disable_list);
> > > > +	hrtimer_init(&guc->sched_disable_timer, CLOCK_MONOTONIC,
> > > > +		     HRTIMER_MODE_REL);
> > > > +	guc->sched_disable_timer.function = sched_disable_timer_func;
> > > > +	guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS;
> > > > +
> > > >  	return 0;
> > > >  }
> > > >  
> > > > @@ -1852,6 +1868,12 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >  	if (unlikely(ret < 0))
> > > >  		return ret;
> > > >  
> > > > +	if (intel_context_is_parent(ce))
> > > > +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> > > > +			order_base_2(ce->guc_number_children + 1);
> > > > +	else
> > > > +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]++;
> > > > +
> > > >  	ce->guc_id = ret;
> > > >  	return 0;
> > > >  }
> > > > @@ -1860,13 +1882,18 @@ static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > >  {
> > > >  	GEM_BUG_ON(intel_context_is_child(ce));
> > > >  	if (!context_guc_id_invalid(ce)) {
> > > > -		if (intel_context_is_parent(ce))
> > > > +		if (intel_context_is_parent(ce)) {
> > > > +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> > > > +				order_base_2(ce->guc_number_children + 1);
> > > >  			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
> > > >  					      order_base_2(ce->guc_number_children
> > > >  							   + 1));
> > > > -		else
> > > > +		} else {
> > > > +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]--;
> > > >  			ida_simple_remove(&guc->guc_ids, ce->guc_id);
> > > > +		}
> > > >  		clr_lrc_desc_registered(guc, ce->guc_id);
> > > > +
> > > >  		set_context_guc_id_invalid(ce);
> > > >  	}
> > > >  	if (!list_empty(&ce->guc_id_link))
> > > > @@ -1931,9 +1958,13 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
> > > >  			 * from another context that has more guc_id that itself.
> > > >  			 */
> > > >  			if (cn_o2 != ce_o2) {
> > > > +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> > > > +					order_base_2(cn->guc_number_children + 1);
> > > >  				bitmap_release_region(guc->guc_ids_bitmap,
> > > >  						      cn->guc_id,
> > > >  						      cn_o2);
> > > > +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> > > > +					order_base_2(ce->guc_number_children + 1);
> > > >  				bitmap_allocate_region(guc->guc_ids_bitmap,
> > > >  						       ce->guc_id,
> > > >  						       ce_o2);
> > > > @@ -2538,7 +2569,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > > >  	__guc_context_unpin(ce);
> > > >  
> > > >  	if (likely(!intel_context_is_barrier(ce)))
> > > > -		intel_engine_pm_put(ce->engine);
> > > > +		intel_engine_pm_put_async(ce->engine);
> > > >  }
> > > >  
> > > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > > @@ -2665,11 +2696,11 @@ static void guc_parent_context_unpin(struct intel_context *ce)
> > > >  
> > > >  	for_each_engine_masked(engine, ce->engine->gt,
> > > >  			       ce->engine->mask, tmp)
> > > > -		intel_engine_pm_put(engine);
> > > > +		intel_engine_pm_put_async(engine);
> > > >  	for_each_child(ce, child)
> > > >  		for_each_engine_masked(engine, child->engine->gt,
> > > >  				       child->engine->mask, tmp)
> > > > -			intel_engine_pm_put(engine);
> > > > +			intel_engine_pm_put_async(engine);
> > > >  }
> > > >  
> > > >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > > > @@ -2788,6 +2819,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> > > >  
> > > >  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > >  
> > > > +	sched_disable_context_delete(ce);
> > > > +
> > > >  	with_intel_runtime_pm(runtime_pm, wakeref)
> > > >  		__guc_context_sched_disable(guc, ce, guc_id);
> > > >  
> > > > @@ -2914,8 +2947,202 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
> > > >  								     1);
> > > >  		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > >  	}
> > > > +
> > > > +	sched_disable_context_delete(ce);
> > > > +}
> > > > +
> > > > +#define next_sched_disable_time(guc, now, ce) \
> > > > +	(guc->sched_disable_delay_ns - \
> > > > +	 (ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)))
> > > > +static void ____sched_disable_context_delete(struct intel_guc *guc,
> > > > +					     struct intel_context *ce)
> > > > +{
> > > > +	bool is_first;
> > > > +
> > > > +	lockdep_assert_held(&guc->sched_disable_lock);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +	GEM_BUG_ON(list_empty(&ce->guc_sched_disable_link));
> > > > +
> > > > +	is_first = list_is_first(&ce->guc_sched_disable_link,
> > > > +				 &guc->sched_disable_list);
> > > > +	list_del_init(&ce->guc_sched_disable_link);
> > > > +	if (list_empty(&guc->sched_disable_list)) {
> > > > +		hrtimer_try_to_cancel(&guc->sched_disable_timer);
> > > > +	} else if (is_first) {
> > > > +		struct intel_context *first =
> > > > +			list_first_entry(&guc->sched_disable_list,
> > > > +					 typeof(*first),
> > > > +					 guc_sched_disable_link);
> > > > +		u64 next_time = next_sched_disable_time(guc, ktime_get(),
> > > > +							first);
> > > > +
> > > > +		hrtimer_start(&guc->sched_disable_timer,
> > > > +			      ns_to_ktime(next_time),
> > > > +			      HRTIMER_MODE_REL_PINNED);
> > > > +	}
> > > > +}
> > > > +
> > > > +static void __sched_disable_context_delete(struct intel_guc *guc,
> > > > +					   struct intel_context *ce)
> > > > +{
> > > > +	lockdep_assert_held(&guc->sched_disable_lock);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > > +		intel_context_sched_disable_unpin(ce);
> > > > +		____sched_disable_context_delete(guc, ce);
> > > > +	}
> > > > +}
> > > > +
> > > > +static void sched_disable_context_delete(struct intel_context *ce)
> > > > +{
> > > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > > +	unsigned long flags;
> > > > +
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > > +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > +		__sched_disable_context_delete(guc, ce);
> > > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > +	}
> > > > +}
> > > > +
> > > > +static void sched_disable_context_add(struct intel_guc *guc,
> > > > +				      struct intel_context *ce)
> > > > +{
> > > > +	unsigned long flags;
> > > > +
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> > > > +
> > > > +	ce->guc_sched_disable_time = ktime_get();
> > > > +
> > > > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > +	if (list_empty(&guc->sched_disable_list))
> > > > +		hrtimer_start(&guc->sched_disable_timer,
> > > > +			      ns_to_ktime(guc->sched_disable_delay_ns),
> > > > +			      HRTIMER_MODE_REL_PINNED);
> > > > +	list_add_tail(&ce->guc_sched_disable_link, &guc->sched_disable_list);
> > > > +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > +}
> > > > +
> > > > +static void sched_disable_contexts_flush(struct intel_guc *guc)
> > > > +{
> > > > +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> > > > +	struct intel_context *ce, *cn;
> > > > +	unsigned long flags;
> > > > +
> > > > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > +
> > > > +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> > > > +					 guc_sched_disable_link) {
> > > > +		intel_wakeref_t wakeref;
> > > > +		bool enabled;
> > > > +		u16 guc_id;
> > > > +
> > > > +		list_del_init(&ce->guc_sched_disable_link);
> > > > +
> > > > +		spin_lock(&ce->guc_state.lock);
> > > > +		enabled = context_enabled(ce);
> > > > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > > > +			if (enabled)
> > > > +				clr_context_enabled(ce);
> > > > +			spin_unlock(&ce->guc_state.lock);
> > > > +			intel_context_sched_disable_unpin(ce);
> > > > +			continue;
> > > > +		}
> > > > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > > > +			spin_unlock(&ce->guc_state.lock);
> > > > +			continue;
> > > > +		}
> > > > +		guc_id = prep_context_pending_disable(ce);
> > > > +		spin_unlock(&ce->guc_state.lock);
> > > > +
> > > > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > > > +			__guc_context_sched_disable(guc, ce, guc_id);
> > > > +	}
> > > > +
> > > > +	hrtimer_try_to_cancel(&guc->sched_disable_timer);
> > > > +
> > > > +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > >  }
> > > >  
> > > > +#define should_sched_be_disabled(guc, now, ce) \
> > > > +	((ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)) > \
> > > > +	(guc->sched_disable_delay_ns / 4) * 3)
> > > > +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer)
> > > > +{
> > > > +	struct intel_guc *guc = container_of(hrtimer, struct intel_guc,
> > > > +					     sched_disable_timer);
> > > > +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> > > > +	struct intel_context *ce, *cn;
> > > > +	unsigned long flags;
> > > > +	ktime_t now;
> > > > +
> > > > +	if (list_empty(&guc->sched_disable_list))
> > > > +		return HRTIMER_NORESTART;
> > > > +
> > > > +	now = ktime_get();
> > > > +
> > > > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > +
> > > > +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> > > > +					 guc_sched_disable_link) {
> > > > +		intel_wakeref_t wakeref;
> > > > +		bool enabled;
> > > > +		u16 guc_id;
> > > > +
> > > > +		/*
> > > > +		 * If a context has been waiting for 3/4 of its delay or more,
> > > > +		 * issue the schedule disable. Using this heuristic allows more
> > > > +		 * than 1 context to have its scheduling disabled when this
> > > > +		 * timer is run.
> > > > +		 */
> > > > +		if (!should_sched_be_disabled(guc, now, ce))
> > > > +			break;
> > > > +
> > > > +		list_del_init(&ce->guc_sched_disable_link);
> > > > +
> > > > +		spin_lock(&ce->guc_state.lock);
> > > > +		enabled = context_enabled(ce);
> > > > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > > > +			if (enabled)
> > > > +				clr_context_enabled(ce);
> > > > +			spin_unlock(&ce->guc_state.lock);
> > > > +			intel_context_sched_disable_unpin(ce);
> > > > +			continue;
> > > > +		}
> > > > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > > > +			spin_unlock(&ce->guc_state.lock);
> > > > +			continue;
> > > > +		}
> > > > +		guc_id = prep_context_pending_disable(ce);
> > > > +		spin_unlock(&ce->guc_state.lock);
> > > > +
> > > > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > > > +			__guc_context_sched_disable(guc, ce, guc_id);
> > > > +	}
> > > > +
> > > > +	if (!list_empty(&guc->sched_disable_list)) {
> > > > +		struct intel_context *first =
> > > > +			list_first_entry(&guc->sched_disable_list,
> > > > +					 typeof(*first),
> > > > +					 guc_sched_disable_link);
> > > > +		u64 next_time = next_sched_disable_time(guc, now, first);
> > > > +
> > > > +		hrtimer_forward(hrtimer, now, ns_to_ktime(next_time));
> > > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > +
> > > > +		return HRTIMER_RESTART;
> > > > +	} else {
> > > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > +
> > > > +		return HRTIMER_NORESTART;
> > > > +	}
> > > > +}
> > > > +
> > > > +#define guc_id_pressure(max, in_use)	(in_use > (max / 4) * 3)
> > > >  static void guc_context_sched_disable(struct intel_context *ce)
> > > >  {
> > > >  	struct intel_guc *guc = ce_to_guc(ce);
> > > > @@ -2924,8 +3151,14 @@ static void guc_context_sched_disable(struct intel_context *ce)
> > > >  	intel_wakeref_t wakeref;
> > > >  	u16 guc_id;
> > > >  	bool enabled;
> > > > +	int guc_id_index = intel_context_is_parent(ce) ?
> > > > +		GUC_SUBMIT_ENGINE_MULTI_LRC : GUC_SUBMIT_ENGINE_SINGLE_LRC;
> > > > +	int max_guc_ids = intel_context_is_parent(ce) ?
> > > > +	       NUMBER_MULTI_LRC_GUC_ID(guc) :
> > > > +	       guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
> > > >  
> > > >  	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> > > >  
> > > >  	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
> > > >  	    !lrc_desc_registered(guc, ce->guc_id)) {
> > > > @@ -2936,6 +3169,18 @@ static void guc_context_sched_disable(struct intel_context *ce)
> > > >  	if (!context_enabled(ce))
> > > >  		goto unpin;
> > > >  
> > > > +	/*
> > > > +	 * If no guc_id pressure and the context isn't closed we delay the
> > > > +	 * schedule disable to not to continuously disable / enable scheduling
> > > > +	 * putting pressure on both the i915 and GuC. Delay is configurable via
> > > > +	 * debugfs, default 1s.
> > > > +	 */
> > > > +	if (!guc_id_pressure(max_guc_ids, guc->guc_ids_in_use[guc_id_index]) &&
> > > > +	    !intel_context_is_closed(ce) && guc->sched_disable_delay_ns) {
> > > > +		sched_disable_context_add(guc, ce);
> > > > +		return;
> > > > +	}
> > > > +
> > > >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > > >  
> > > >  	/*
> > > > @@ -3294,6 +3539,58 @@ static void remove_from_context(struct i915_request *rq)
> > > >  	i915_request_notify_execute_cb_imm(rq);
> > > >  }
> > > >  
> > > > +static void __guc_context_close(struct intel_guc *guc,
> > > > +				struct intel_context *ce)
> > > > +{
> > > > +	lockdep_assert_held(&guc->sched_disable_lock);
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > > +		struct intel_runtime_pm *runtime_pm =
> > > > +			ce->engine->uncore->rpm;
> > > > +		intel_wakeref_t wakeref;
> > > > +		bool enabled;
> > > > +		u16 guc_id;
> > > > +
> > > > +		spin_lock(&ce->guc_state.lock);
> > > > +		enabled = context_enabled(ce);
> > > > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > > > +			if (enabled)
> > > > +				clr_context_enabled(ce);
> > > > +			spin_unlock(&ce->guc_state.lock);
> > > > +			intel_context_sched_disable_unpin(ce);
> > > > +			goto update_list;
> > > > +		}
> > > > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > > > +			spin_unlock(&ce->guc_state.lock);
> > > > +			goto update_list;
> > > > +		}
> > > > +		guc_id = prep_context_pending_disable(ce);
> > > > +		spin_unlock(&ce->guc_state.lock);
> > > > +
> > > > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > > > +			__guc_context_sched_disable(guc, ce, guc_id);
> > > > +update_list:
> > > > +		____sched_disable_context_delete(guc, ce);
> > > > +	}
> > > > +}
> > > > +
> > > > +static void guc_context_close(struct intel_context *ce)
> > > > +{
> > > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > > +	unsigned long flags;
> > > > +
> > > > +	/*
> > > > +	 * If we close the context and a schedule disable is pending a delay, do
> > > > +	 * it immediately.
> > > > +	 */
> > > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > > +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > +		__guc_context_close(guc, ce);
> > > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > +	}
> > > > +}
> > > > +
> > > >  static struct intel_context *
> > > >  guc_create_parallel(struct intel_engine_cs **engines,
> > > >  		    unsigned int num_siblings,
> > > > @@ -3308,6 +3605,7 @@ static const struct intel_context_ops guc_context_ops = {
> > > >  	.post_unpin = guc_context_post_unpin,
> > > >  
> > > >  	.ban = guc_context_ban,
> > > > +	.close = guc_context_close,
> > > >  
> > > >  	.cancel_request = guc_context_cancel_request,
> > > >  
> > > > @@ -3538,6 +3836,10 @@ static int guc_request_alloc(struct i915_request *rq)
> > > >  
> > > >  	rq->reserved_space -= GUC_REQUEST_SIZE;
> > > >  
> > > > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link) &&
> > > > +		   atomic_read(&ce->pin_count) < 3);
> > > > +	sched_disable_context_delete(ce);
> > > > +
> > > >  	/*
> > > >  	 * guc_ids are exhausted or a heuristic is met indicating too many
> > > >  	 * guc_ids are waiting on requests with submission dependencies (not
> > > > @@ -3667,7 +3969,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> > > >  	__guc_context_unpin(ce);
> > > >  
> > > >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > -		intel_engine_pm_put(engine);
> > > > +		intel_engine_pm_put_async(engine);
> > > >  }
> > > >  
> > > >  static void guc_virtual_context_enter(struct intel_context *ce)
> > > > @@ -3708,6 +4010,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> > > >  	.post_unpin = guc_context_post_unpin,
> > > >  
> > > >  	.ban = guc_context_ban,
> > > > +	.close = guc_context_close,
> > > >  
> > > >  	.cancel_request = guc_context_cancel_request,
> > > >  
> > > > @@ -3819,6 +4122,7 @@ static const struct intel_context_ops virtual_parent_context_ops = {
> > > >  	.post_unpin = guc_parent_context_post_unpin,
> > > >  
> > > >  	.ban = guc_context_ban,
> > > > +	.close = guc_context_close,
> > > >  
> > > >  	.enter = guc_virtual_context_enter,
> > > >  	.exit = guc_virtual_context_exit,
> > > > @@ -4924,7 +5228,11 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> > > >  	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
> > > >  		   atomic_read(&guc->outstanding_submission_g2h));
> > > >  	drm_printf(p, "GuC Number GuC IDs: %d\n", guc->num_guc_ids);
> > > > -	drm_printf(p, "GuC Max Number GuC IDs: %d\n\n", guc->max_guc_ids);
> > > > +	drm_printf(p, "GuC Max Number GuC IDs: %d\n", guc->max_guc_ids);
> > > > +	drm_printf(p, "GuC single-lrc GuC IDs in use: %d\n",
> > > > +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]);
> > > > +	drm_printf(p, "GuC multi-lrc GuC IDs in use: %d\n",
> > > > +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC]);
> > > >  	drm_printf(p, "GuC max context registered: %u\n\n",
> > > >  		   guc->lrcd_reg.max_idx);
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > > > index 9cfecf9d368e..ad70b3159ce4 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > > > @@ -174,7 +174,8 @@ static int multi_lrc_not_blocked(struct intel_gt *gt, bool flow_control)
> > > >  #define NUM_RQ_PER_CONTEXT	2
> > > >  #define HEARTBEAT_INTERVAL	1500
> > > >  
> > > > -static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang)
> > > > +static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids,
> > > > +					bool hang, bool sched_disable_delay)
> > > >  {
> > > >  	struct intel_gt *gt = arg;
> > > >  	struct intel_guc *guc = &gt->uc.guc;
> > > > @@ -203,6 +204,9 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> > > >  	if (limit_guc_ids)
> > > >  		guc->num_guc_ids = NUM_GUC_ID;
> > > >  
> > > > +	if (sched_disable_delay)
> > > > +		guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS / 5;
> > > > +
> > > >  	ce = intel_context_create(intel_selftest_find_any_engine(gt));
> > > >  	if (IS_ERR(ce)) {
> > > >  		ret = PTR_ERR(ce);
> > > > @@ -391,6 +395,7 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> > > >  	guc->num_guc_ids = guc->max_guc_ids;
> > > >  	guc->gse_hang_expected = false;
> > > >  	guc->inject_bad_sched_disable = false;
> > > > +	guc->sched_disable_delay_ns = 0;
> > > >  	kfree(contexts);
> > > >  
> > > >  	return ret;
> > > > @@ -398,17 +403,22 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> > > >  
> > > >  static int intel_guc_flow_control_guc_ids(void *arg)
> > > >  {
> > > > -	return __intel_guc_flow_control_guc(arg, true, false);
> > > > +	return __intel_guc_flow_control_guc(arg, true, false, false);
> > > > +}
> > > > +
> > > > +static int intel_guc_flow_control_guc_ids_sched_disable_delay(void *arg)
> > > > +{
> > > > +	return __intel_guc_flow_control_guc(arg, true, false, true);
> > > >  }
> > > >  
> > > >  static int intel_guc_flow_control_lrcd_reg(void *arg)
> > > >  {
> > > > -	return __intel_guc_flow_control_guc(arg, false, false);
> > > > +	return __intel_guc_flow_control_guc(arg, false, false, false);
> > > >  }
> > > >  
> > > >  static int intel_guc_flow_control_hang_state_machine(void *arg)
> > > >  {
> > > > -	return __intel_guc_flow_control_guc(arg, true, true);
> > > > +	return __intel_guc_flow_control_guc(arg, true, true, false);
> > > >  }
> > > >  
> > > >  #define NUM_RQ_STRESS_CTBS	0x4000
> > > > @@ -861,6 +871,7 @@ int intel_guc_flow_control(struct drm_i915_private *i915)
> > > >  	static const struct i915_subtest tests[] = {
> > > >  		SUBTEST(intel_guc_flow_control_stress_ctbs),
> > > >  		SUBTEST(intel_guc_flow_control_guc_ids),
> > > > +		SUBTEST(intel_guc_flow_control_guc_ids_sched_disable_delay),
> > > >  		SUBTEST(intel_guc_flow_control_lrcd_reg),
> > > >  		SUBTEST(intel_guc_flow_control_hang_state_machine),
> > > >  		SUBTEST(intel_guc_flow_control_multi_lrc_guc_ids),
> > > > diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
> > > > index f54de0499be7..bf464db7affe 100644
> > > > --- a/drivers/gpu/drm/i915/i915_selftest.h
> > > > +++ b/drivers/gpu/drm/i915/i915_selftest.h
> > > > @@ -92,12 +92,14 @@ int __i915_subtests(const char *caller,
> > > >  			T, ARRAY_SIZE(T), data)
> > > >  #define i915_live_subtests(T, data) ({ \
> > > >  	typecheck(struct drm_i915_private *, data); \
> > > > +	(data)->gt.uc.guc.sched_disable_delay_ns = 0; \
> > > >  	__i915_subtests(__func__, \
> > > >  			__i915_live_setup, __i915_live_teardown, \
> > > >  			T, ARRAY_SIZE(T), data); \
> > > >  })
> > > >  #define intel_gt_live_subtests(T, data) ({ \
> > > >  	typecheck(struct intel_gt *, data); \
> > > > +	(data)->uc.guc.sched_disable_delay_ns = 0; \
> > > >  	__i915_subtests(__func__, \
> > > >  			__intel_gt_live_setup, __intel_gt_live_teardown, \
> > > >  			T, ARRAY_SIZE(T), data); \
> > > > diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> > > > index 806ad688274b..57ba7065d5ab 100644
> > > > --- a/drivers/gpu/drm/i915/i915_trace.h
> > > > +++ b/drivers/gpu/drm/i915/i915_trace.h
> > > > @@ -933,6 +933,11 @@ DEFINE_EVENT(intel_context, intel_context_reset,
> > > >  	     TP_ARGS(ce)
> > > >  );
> > > >  
> > > > +DEFINE_EVENT(intel_context, intel_context_close,
> > > > +	     TP_PROTO(struct intel_context *ce),
> > > > +	     TP_ARGS(ce)
> > > > +);
> > > > +
> > > >  DEFINE_EVENT(intel_context, intel_context_ban,
> > > >  	     TP_PROTO(struct intel_context *ce),
> > > >  	     TP_ARGS(ce)
> > > > @@ -1035,6 +1040,11 @@ trace_intel_context_reset(struct intel_context *ce)
> > > >  {
> > > >  }
> > > >  
> > > > +static inline void
> > > > +trace_intel_context_close(struct intel_context *ce)
> > > > +{
> > > > +}
> > > > +
> > > >  static inline void
> > > >  trace_intel_context_ban(struct intel_context *ce)
> > > >  {
> > > > diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > > > index f843a5040706..d54c280217fe 100644
> > > > --- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > > > +++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > > > @@ -2112,5 +2112,5 @@ int i915_gem_gtt_live_selftests(struct drm_i915_private *i915)
> > > >  
> > > >  	GEM_BUG_ON(offset_in_page(i915->ggtt.vm.total));
> > > >  
> > > > -	return i915_subtests(tests, i915);
> > > > +	return i915_live_subtests(tests, i915);
> > > >  }
> > > > diff --git a/drivers/gpu/drm/i915/selftests/i915_perf.c b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > > > index 9e9a6cb1d9e5..86bad00cca95 100644
> > > > --- a/drivers/gpu/drm/i915/selftests/i915_perf.c
> > > > +++ b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > > > @@ -431,7 +431,7 @@ int i915_perf_live_selftests(struct drm_i915_private *i915)
> > > >  	if (err)
> > > >  		return err;
> > > >  
> > > > -	err = i915_subtests(tests, i915);
> > > > +	err = i915_live_subtests(tests, i915);
> > > >  
> > > >  	destroy_empty_config(&i915->perf);
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
> > > > index d67710d10615..afbf88865a8b 100644
> > > > --- a/drivers/gpu/drm/i915/selftests/i915_request.c
> > > > +++ b/drivers/gpu/drm/i915/selftests/i915_request.c
> > > > @@ -1693,7 +1693,7 @@ int i915_request_live_selftests(struct drm_i915_private *i915)
> > > >  	if (intel_gt_is_wedged(&i915->gt))
> > > >  		return 0;
> > > >  
> > > > -	return i915_subtests(tests, i915);
> > > > +	return i915_live_subtests(tests, i915);
> > > >  }
> > > >  
> > > >  static int switch_to_kernel_sync(struct intel_context *ce, int err)
> > > > diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > > > index dd0607254a95..f4b157451851 100644
> > > > --- a/drivers/gpu/drm/i915/selftests/i915_vma.c
> > > > +++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > > > @@ -1085,5 +1085,5 @@ int i915_vma_live_selftests(struct drm_i915_private *i915)
> > > >  		SUBTEST(igt_vma_remapped_gtt),
> > > >  	};
> > > >  
> > > > -	return i915_subtests(tests, i915);
> > > > +	return i915_live_subtests(tests, i915);
> > > >  }
> > > > -- 
> > > > 2.28.0
> > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 11/46] drm/i915/guc: Don't call switch_to_kernel_context with GuC submission
  2021-08-10  6:47       ` Daniel Vetter
@ 2021-08-11 17:47         ` Matthew Brost
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-11 17:47 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Tue, Aug 10, 2021 at 08:47:10AM +0200, Daniel Vetter wrote:
> On Mon, Aug 09, 2021 at 06:20:51PM +0000, Matthew Brost wrote:
> > On Mon, Aug 09, 2021 at 04:27:01PM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 03, 2021 at 03:29:08PM -0700, Matthew Brost wrote:
> > > > Calling switch_to_kernel_context isn't needed if the engine PM reference
> > > > is taken while all contexts are pinned. By not calling
> > > > switch_to_kernel_context we save on issuing a request to the engine.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/intel_engine_pm.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > > > index 1f07ac4e0672..58099de6bf07 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> > > > @@ -162,6 +162,10 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
> > > >  	unsigned long flags;
> > > >  	bool result = true;
> > > >  
> > > > +	/* No need to switch_to_kernel_context if GuC submission */
> > > 
> > > Maybe whack a big FIXME on here that we should unravel this properly.
> > 
> > Sure, can add a FIXME here.
> > 
> > > Currently the execlist backend assumptions are leaked all over the place,
> > > leading to stuff like this. Which means extremely fragile code.
> > >
> > 
> > Yes, this is something required for execlists that ended up implemented in
> > what should be generic code. 
> > 
> > > I currently don't have a great idea on how exactly we should do that, but
> > > oh well.
> > 
> > Me neither, it will be a process.
> > 
> > > 
> > > btw just in case we ever want to make guc lrc properly evictable (which was
> > > the og use-case for this function, way, way back), would we need to fully
> > 
> > Can you explain what you mean by fully evictable? Not getting what you
> > mean in this context.
> > 
> > > unregister them from guc? At least I'm assuming there's no other trick
> > 
> > If scheduling is disabled on the context (currently done on unpin) you are
> > free to move anything around as the GuC is guaranteed not to touch the
> > context state. If on re-pin something has moved (e.g. the LRC vaddr is
> > different), you need to unregister and re-register the context with the
> > GuC.
> 
> So at that point GuC also guarantees that it's not left in the hw engine?
> Execlist has this barrier request to fully unload the ctx from the hw, and
> that's also why I came to the topic of OA.
> 
> > > like the below one.
> > > 
> > > Another aside: How does the perf/OA patching work on GuC?
> > >
> > 
> > Not my area of expertise, but perf is somewhat a WIP. The plan is for the
> > GuC to write out some stats to HWSP I think? John Harrison is working to
> > get this fully implemented.
> > 
> > OA is working afaik, with Umesh Nerlige Ramappa being the expert here.
> 
> I think it's OA that I'm thinking of here: We have code in i915_perf.c to
> patch all the ctx currently in the system, so that they have a consistent
> OA config. That's also relying on this barrier stuff, and I was wondering
> how that will work with GuC.
> -Daniel
> 

Not an OA expert at all, but having glanced at the code I don't see anything in
there that prevents it from working with GuC submission. We certainly have
this working internally. If you have questions about this I'd reach out to
Umesh Nerlige Ramappa as he likely has the answers.

Matt

> > 
> > Matt
> > 
> > > Anyway, patch looks legit:
> > > 
> > > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > 
> > > 
> > > > +	if (intel_engine_uses_guc(engine))
> > > > +		return true;
> > > > +
> > > >  	/* GPU is pointing to the void, as good as in the kernel context. */
> > > >  	if (intel_gt_is_wedged(engine->gt))
> > > >  		return true;
> > > > -- 
> > > > 2.28.0
> > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 14/46] drm/i915: Expose logical engine instance to user
  2021-08-10  6:53       ` Daniel Vetter
@ 2021-08-11 17:55         ` Matthew Brost
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-11 17:55 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Tue, Aug 10, 2021 at 08:53:16AM +0200, Daniel Vetter wrote:
> On Mon, Aug 09, 2021 at 06:37:01PM +0000, Matthew Brost wrote:
> > On Mon, Aug 09, 2021 at 04:30:06PM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 03, 2021 at 03:29:11PM -0700, Matthew Brost wrote:
> > > > Expose logical engine instance to user via query engine info IOCTL. This
> > > > is required for split-frame workloads as these need to be placed on
> > > > engines in a logically contiguous order. The logical mapping can change
> > > > based on fusing. Rather than having the user have knowledge of the fusing we
> > > > simply expose the logical mapping with the existing query engine
> > > > info IOCTL.
> > > > 
> > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
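
For reference, a userspace consumer would use the standard two-pass
DRM_I915_QUERY_ENGINE_INFO query and only trust the new field when the flag is
set. Roughly, an illustrative sketch assuming the uAPI lands as proposed here
(not the Media UMD's actual code, with error handling trimmed):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Hypothetical helper: dump the logical instance of every engine. */
static void print_logical_instances(int fd)
{
	struct drm_i915_query_item item = {
		.query_id = DRM_I915_QUERY_ENGINE_INFO,
	};
	struct drm_i915_query query = {
		.num_items = 1,
		.items_ptr = (uintptr_t)&item,
	};
	struct drm_i915_query_engine_info *info;
	unsigned int i;

	/* First pass: the kernel reports the required size in item.length. */
	if (ioctl(fd, DRM_IOCTL_I915_QUERY, &query) || item.length <= 0)
		return;

	info = calloc(1, item.length);
	if (!info)
		return;
	item.data_ptr = (uintptr_t)info;

	/* Second pass: the kernel fills the buffer. */
	if (ioctl(fd, DRM_IOCTL_I915_QUERY, &query) == 0) {
		for (i = 0; i < info->num_engines; i++) {
			const struct drm_i915_engine_info *e = &info->engines[i];

			/* logical_instance is only valid if the flag is set. */
			if (e->flags & I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE)
				printf("class %u instance %u -> logical %u\n",
				       e->engine.engine_class,
				       e->engine.engine_instance,
				       e->logical_instance);
		}
	}
	free(info);
}
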
> > > 
> > > Uapi must have a link to the userspace MR/patch set using this, and to the
> > > igt patch set validating it.
> > > 
> > 
> > Have an IGT:
> > https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
> > 
> > Not sure when the media UMD is going to be updated upstream to use this.
> > Does that mean I can't merge this until the media UMD is ready? Seems
> > like it but isn't that a circular dependency? How can the media team
> > develop for a new uAPI that isn't in the kernel yet?
> 
> Yes and no. Full explainer here:
> 
> https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> 
> In the drm subsystem this is pretty much the only rule where if you break
> it the book will be thrown at you with extreme prejudice.
>

Well I don't want a book thrown at me; I'm new here and trying to play by the
rules.

> Also wrt circular: If the UMDs aren't set up to test their branches against
> kernel branches they need to fix their stuff. I know that internally
> that's not been done, and it's a disaster, but in upstream there's no room
> for excuses. Both kernel and userspace need to be in branches until it's
> ready for merging.
> 

Ok, looks like I have a few things to learn. I'll coordinate with the
media team on this. Likely won't have links to the UMD in the next spin
but I'll have a branch for them to prep their patches on.

Matt

> > For what it is worth the downstream release is already using this.
> 
> Yeah which is another problem, shipping new uapi in downstream before it's
> in upstream is decidedly not great.
> -Daniel
> 
> > 
> > Matt
> > 
> > > Ideally in each patch, since unfortunately it's way too hard to find the
> > > cover letter later on.
> > > 
> > > Jason even went as far as making this a hard requirement because he wasted
> > > a bit too much time trying to find the userspace for new uapi:
> > > 
> > > https://lore.kernel.org/dri-devel/20210804185704.624883-1-jason@jlekstrand.net/
> > > 
> > > Cheers, Daniel
> > > 
> > > >---
> > > >  drivers/gpu/drm/i915/i915_query.c | 2 ++
> > > >  include/uapi/drm/i915_drm.h       | 8 +++++++-
> > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
> > > > index e49da36c62fb..8a72923fbdba 100644
> > > > --- a/drivers/gpu/drm/i915/i915_query.c
> > > > +++ b/drivers/gpu/drm/i915/i915_query.c
> > > > @@ -124,7 +124,9 @@ query_engine_info(struct drm_i915_private *i915,
> > > >  	for_each_uabi_engine(engine, i915) {
> > > >  		info.engine.engine_class = engine->uabi_class;
> > > >  		info.engine.engine_instance = engine->uabi_instance;
> > > > +		info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;
> > > >  		info.capabilities = engine->uabi_capabilities;
> > > > +		info.logical_instance = ilog2(engine->logical_mask);
> > > >  
> > > >  		if (copy_to_user(info_ptr, &info, sizeof(info)))
> > > >  			return -EFAULT;
> > > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > > index 7f13d241417f..ef72e07fe08c 100644
> > > > --- a/include/uapi/drm/i915_drm.h
> > > > +++ b/include/uapi/drm/i915_drm.h
> > > > @@ -2706,14 +2706,20 @@ struct drm_i915_engine_info {
> > > >  
> > > >  	/** @flags: Engine flags. */
> > > >  	__u64 flags;
> > > > +#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
> > > >  
> > > >  	/** @capabilities: Capabilities of this engine. */
> > > >  	__u64 capabilities;
> > > >  #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
> > > >  #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
> > > >  
> > > > +	/** @logical_instance: Logical instance of engine */
> > > > +	__u16 logical_instance;
> > > > +
> > > >  	/** @rsvd1: Reserved fields. */
> > > > -	__u64 rsvd1[4];
> > > > +	__u16 rsvd1[3];
> > > > +	/** @rsvd2: Reserved fields. */
> > > > +	__u64 rsvd2[3];
> > > >  };
> > > >  
> > > >  /**
> > > > -- 
> > > > 2.28.0
> > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  2021-08-10  9:07         ` Daniel Vetter
@ 2021-08-11 18:06           ` Matthew Brost
  2021-08-12 14:45             ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Brost @ 2021-08-11 18:06 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Tue, Aug 10, 2021 at 11:07:55AM +0200, Daniel Vetter wrote:
> On Tue, Aug 10, 2021 at 10:53:37AM +0200, Daniel Vetter wrote:
> > On Mon, Aug 09, 2021 at 06:58:23PM +0000, Matthew Brost wrote:
> > > On Mon, Aug 09, 2021 at 05:17:34PM +0200, Daniel Vetter wrote:
> > > > On Tue, Aug 03, 2021 at 03:29:13PM -0700, Matthew Brost wrote:
> > > > > Implement GuC parent-child context pin / unpin functions in which, if any
> > > > > context in the relationship is pinned, all the contexts are pinned. The
> > > > > parent owns most of the pinning / unpinning process and the children
> > > > > direct any pins / unpins to the parent.
> > > > > 
> > > > > Patch implements a number of unused functions that will be connected
> > > > > later in the series.
> > > > > 
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/gt/intel_context.c       | 187 ++++++++++++++++--
> > > > >  drivers/gpu/drm/i915/gt/intel_context.h       |  43 +---
> > > > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
> > > > >  .../drm/i915/gt/intel_execlists_submission.c  |  25 ++-
> > > > >  drivers/gpu/drm/i915/gt/intel_lrc.c           |  26 +--
> > > > >  drivers/gpu/drm/i915/gt/intel_lrc.h           |   6 +-
> > > > >  .../gpu/drm/i915/gt/intel_ring_submission.c   |   5 +-
> > > > >  drivers/gpu/drm/i915/gt/mock_engine.c         |   4 +-
> > > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 183 +++++++++++++++--
> > > > >  9 files changed, 371 insertions(+), 112 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > index 8cb92b10b547..bb4c14656067 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > @@ -158,8 +158,8 @@ static void __ring_retire(struct intel_ring *ring)
> > > > >  	intel_ring_unpin(ring);
> > > > >  }
> > > > >  
> > > > > -static int intel_context_pre_pin(struct intel_context *ce,
> > > > > -				 struct i915_gem_ww_ctx *ww)
> > > > > +static int __intel_context_pre_pin(struct intel_context *ce,
> > > > > +				   struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > >  	int err;
> > > > >  
> > > > > @@ -190,7 +190,7 @@ static int intel_context_pre_pin(struct intel_context *ce,
> > > > >  	return err;
> > > > >  }
> > > > >  
> > > > > -static void intel_context_post_unpin(struct intel_context *ce)
> > > > > +static void __intel_context_post_unpin(struct intel_context *ce)
> > > > >  {
> > > > >  	if (ce->state)
> > > > >  		__context_unpin_state(ce->state);
> > > > > @@ -199,13 +199,85 @@ static void intel_context_post_unpin(struct intel_context *ce)
> > > > >  	__ring_retire(ce->ring);
> > > > >  }
> > > > >  
> > > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > -			      struct i915_gem_ww_ctx *ww)
> > > > > +static int intel_context_pre_pin(struct intel_context *ce,
> > > > > +				 struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > > -	bool handoff = false;
> > > > > -	void *vaddr;
> > > > > +	struct intel_context *child;
> > > > > +	int err, i = 0;
> > > > > +
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > +
> > > > > +	for_each_child(ce, child) {
> > > > > +		err = __intel_context_pre_pin(child, ww);
> > > > > +		if (unlikely(err))
> > > > > +			goto unwind;
> > > > > +		++i;
> > > > > +	}
> > > > > +
> > > > > +	err = __intel_context_pre_pin(ce, ww);
> > > > > +	if (unlikely(err))
> > > > > +		goto unwind;
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +unwind:
> > > > > +	for_each_child(ce, child) {
> > > > > +		if (!i--)
> > > > > +			break;
> > > > > +		__intel_context_post_unpin(ce);
> > > > > +	}
> > > > > +
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +static void intel_context_post_unpin(struct intel_context *ce)
> > > > > +{
> > > > > +	struct intel_context *child;
> > > > > +
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > +
> > > > > +	for_each_child(ce, child)
> > > > > +		__intel_context_post_unpin(child);
> > > > > +
> > > > > +	__intel_context_post_unpin(ce);
> > > > > +}
> > > > > +
> > > > > +static int __do_ww_lock(struct intel_context *ce,
> > > > > +			struct i915_gem_ww_ctx *ww)
> > > > > +{
> > > > > +	int err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > > +
> > > > > +	if (!err && ce->ring->vma->obj)
> > > > > +		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > > +	if (!err && ce->state)
> > > > > +		err = i915_gem_object_lock(ce->state->obj, ww);
> > > > > +
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +static int do_ww_lock(struct intel_context *ce,
> > > > > +		      struct i915_gem_ww_ctx *ww)
> > > > > +{
> > > > > +	struct intel_context *child;
> > > > >  	int err = 0;
> > > > >  
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > +
> > > > > +	for_each_child(ce, child) {
> > > > > +		err = __do_ww_lock(child, ww);
> > > > > +		if (unlikely(err))
> > > > > +			return err;
> > > > > +	}
> > > > > +
> > > > > +	return __do_ww_lock(ce, ww);
> > > > > +}
> > > > > +
> > > > > +static int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > +				     struct i915_gem_ww_ctx *ww)
> > > > > +{
> > > > > +	bool handoff = false;
> > > > > +	int err;
> > > > > +
> > > > >  	if (unlikely(!test_bit(CONTEXT_ALLOC_BIT, &ce->flags))) {
> > > > >  		err = intel_context_alloc_state(ce);
> > > > >  		if (err)
> > > > > @@ -217,14 +289,11 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > >  	 * refcount for __intel_context_active(), which prevent a lock
> > > > >  	 * inversion of ce->pin_mutex vs dma_resv_lock().
> > > > >  	 */
> > > > > +	err = do_ww_lock(ce, ww);
> > > > > +	if (err)
> > > > > +		return err;
> > > > >  
> > > > > -	err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > > -	if (!err && ce->ring->vma->obj)
> > > > > -		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > > -	if (!err && ce->state)
> > > > > -		err = i915_gem_object_lock(ce->state->obj, ww);
> > > > > -	if (!err)
> > > > > -		err = intel_context_pre_pin(ce, ww);
> > > > > +	err = intel_context_pre_pin(ce, ww);
> > > > >  	if (err)
> > > > >  		return err;
> > > > >  
> > > > > @@ -232,7 +301,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > >  	if (err)
> > > > >  		goto err_ctx_unpin;
> > > > >  
> > > > > -	err = ce->ops->pre_pin(ce, ww, &vaddr);
> > > > > +	err = ce->ops->pre_pin(ce, ww);
> > > > >  	if (err)
> > > > >  		goto err_release;
> > > > >  
> > > > > @@ -250,7 +319,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > >  		if (unlikely(err))
> > > > >  			goto err_unlock;
> > > > >  
> > > > > -		err = ce->ops->pin(ce, vaddr);
> > > > > +		err = ce->ops->pin(ce);
> > > > >  		if (err) {
> > > > >  			intel_context_active_release(ce);
> > > > >  			goto err_unlock;
> > > > > @@ -290,7 +359,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > >  	return err;
> > > > >  }
> > > > >  
> > > > > -int __intel_context_do_pin(struct intel_context *ce)
> > > > > +static int __intel_context_do_pin(struct intel_context *ce)
> > > > >  {
> > > > >  	struct i915_gem_ww_ctx ww;
> > > > >  	int err;
> > > > > @@ -337,7 +406,7 @@ static void __intel_context_retire(struct i915_active *active)
> > > > >  		 intel_context_get_avg_runtime_ns(ce));
> > > > >  
> > > > >  	set_bit(CONTEXT_VALID_BIT, &ce->flags);
> > > > > -	intel_context_post_unpin(ce);
> > > > > +	__intel_context_post_unpin(ce);
> > > > >  	intel_context_put(ce);
> > > > >  }
> > > > >  
> > > > > @@ -562,6 +631,88 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> > > > >  	child->parent = parent;
> > > > >  }
> > > > >  
> > > > > +static inline int ____intel_context_pin(struct intel_context *ce)
> > > > > +{
> > > > > +	if (likely(intel_context_pin_if_active(ce)))
> > > > > +		return 0;
> > > > > +
> > > > > +	return __intel_context_do_pin(ce);
> > > > > +}
> > > > > +
> > > > > +static inline int __intel_context_pin_ww(struct intel_context *ce,
> > > > > +					 struct i915_gem_ww_ctx *ww)
> > > > > +{
> > > > > +	if (likely(intel_context_pin_if_active(ce)))
> > > > > +		return 0;
> > > > > +
> > > > > +	return __intel_context_do_pin_ww(ce, ww);
> > > > > +}
> > > > > +
> > > > > +static inline void __intel_context_unpin(struct intel_context *ce)
> > > > > +{
> > > > > +	if (!ce->ops->sched_disable) {
> > > > > +		__intel_context_do_unpin(ce, 1);
> > > > > +	} else {
> > > > > +		/*
> > > > > +		 * Move ownership of this pin to the scheduling disable which is
> > > > > +		 * an async operation. When that operation completes the above
> > > > > +		 * intel_context_sched_disable_unpin is called potentially
> > > > > +		 * unpinning the context.
> > > > > +		 */
> > > > > +		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > > +			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> 
> Just as an example of what I mean here on the code review side. This is an
> endless loop, and you need to prove that there's no livelock or starvation
> issues. Or explain how else you handle that if there is one.
> 

If we pop into the while loop the pin_count is 1, so in all likelihood the
following if evaluates to true and the loop is broken. The only way it
evaluates to false is if the context gets pinned between the while & the if,
in which case on the next pass the while statement should evaluate to false,
breaking the loop, unless of course the context gets unpinned again... In
practice this should be at most 3 atomic operations before the loop is
broken.

Matt
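
For readers following along, this is the loop in question from the patch, with
the reasoning above spelled out as comments (annotation only, no functional
change):

	/*
	 * pin_count == 1 means the last real pin is going away; the value 2
	 * is the reference handed to the async sched_disable, dropped later
	 * by intel_context_sched_disable_unpin().
	 */
	while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
		/*
		 * pin_count was exactly 1: try to convert that last pin into
		 * the sched_disable reference. On success we are done.
		 */
		if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
			ce->ops->sched_disable(ce);
			break;
		}
		/*
		 * The cmpxchg lost a race with a concurrent pin; go around
		 * again. Unless the count has dropped back to 1 in the
		 * meantime, atomic_add_unless() will now simply drop our pin
		 * and terminate the loop.
		 */
	}
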

> Because unlike hand-rolled stuff linux kernel spinlocks are not dumb
> spinlocks, but ticketed/queued locks and therefore starvation proof. But
> this stuff actually matters on today's multi-core and not-so-uniform (even
> without fully NUMA) architectures.
> 
> Also I've just found another lockless retry loop which does actually
> degenerate into a full endless loop (if you're sufficiently unlucky in
> your races), so this really isn't academic at all.
> -Daniel
> 
> > > > 
> > > > Uh man lockless algorithms.
> > > > 
> > > > Unless this comes:
> > > > - with essentially an academic looking paper that describes the abstract
> > > >   model of the lockless algorithm and proves it against the linux kernel
> > > >   memory model.
> > > > 
> > > > - lockless stuff generally needs barriers, and those barriers must all be
> > > >   documented. This means a) a comment next to each barrier in the code b)
> > > >   pointing to its counterparty c) with the overall design also explained
> > > >   in the kerneldoc for those data structures.
> > > > 
> > > >   If you don't know where your barriers are, see above point about "it
> > > >   should look more like an academic paper in the commit message"
> > > > 
> > > > - hard perf data about how this is absolutely required, based on a
> > > >   real-world use-case (which then sometimes justifies a microbenchmark
> > > >   metric for the details, but it always needs to be real-world based). And
> > > >   also a thorough explainer of how the perf issue isn't fixable through
> > > >   better design. If that's not doable, just protect the state machine with
> > > >   a big dumb lock and move on.
> > > > 
> > > > - Also, because the current code is in such bad shape wrt lockless
> > > >   algorithms and premature optimizations: Overall complexity should go
> > > >   down (it's way too high right now), so pay down your new lockless trick
> > > >   by removing one of the existing ones that we only have because we can.
> > > > 
> > > > Yes this is steep, but we're way out in the woods here and need to somehow
> > > > get back.
> > > 
> > > See below FIXME. At one point all of this was hidden in the backend but
> > > the dma-resv patches that landed upstream completely broke the layering,
> > > hence the need for the code here.
> > > 
> > > I guess I don't really understand what you mean when you say a lockless alg
> > > needs barriers; if the atomic functions are not really atomic, wouldn't
> > > the world be broken?
> > 
> > They are unordered atomics by default. Which means they're atomic in
> > themselves, but entirely unordered with respect to anything else that's
> > going on. Except when you use one of the atomic ops which already
> > guarantees a barrier, or you manually add the barriers yourself. And yes
> > there are enormous amounts of bugs, and with our dgpu potentially running
> > on non-IA CPUs those bugs matter.
> > 
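As an illustration of the "comment on every barrier, pointing at its
counterpart" rule mentioned above, the usual publish/consume pairing looks
something like this (generic example, not code from this series; struct foo
and the two helpers are placeholders):

#include <linux/errno.h>
#include <asm/barrier.h>

struct foo {
	int payload;
	int ready;
};

/*
 * Writer: publish the payload, then the flag. The release barrier in
 * smp_store_release() orders the payload store before the flag store;
 * pairs with the smp_load_acquire() in consume() below.
 */
static void publish(struct foo *f, int value)
{
	f->payload = value;
	smp_store_release(&f->ready, 1);
}

/*
 * Reader: the acquire barrier in smp_load_acquire() pairs with the
 * smp_store_release() in publish() above, so observing ready == 1
 * guarantees the payload written before it is visible too.
 */
static int consume(struct foo *f)
{
	if (!smp_load_acquire(&f->ready))
		return -EAGAIN;

	return f->payload;
}
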
> > Note that C++ atomics default to strongly ordered operations
> > with full barriers, because those are much easier to program
> > against. The kernel isn't like that and defaults to "you need to add all the
> > barriers yourself".
> > 
> > I have a full length rant in the works and will work that through all
> > channels, but essentially locking is really hard to get right. And
> > lockless tricks practically need an academic paper with a formal
> > correctness proof against the linux memory model, or you do have bugs.
> > 
> > And I know that the current code is chock full of this stuff, so it's
> > tempting to just add more, but we really can't. The amount of locking
> > trickery we have in the codebase must go down substantially. My take is
> > that any code that adds anything tricky needs to fully justify it against
> > the above list, _and_ also clean up some of the existing nonsense so that
> > overall complexity doesn't increase.
> > 
> > I'll share the full length rant with you internally, it's not yet ready
> > for publishing (but that's planned too).
> > 
> > 
> > > Also here I don't think it is really as simple as grabbing a big dumb lock,
> > > for a variety of reasons, at least with the current dynamic pin / unpin code
> > > in place. If we move to perma-pinned contexts this could be cleaned up
> > > then.
> > 
> > Yes it's a disaster, but we need to stop the bleeding. If perma-pinned
> > contexts can fix this I think we should do this asap. I'd say for parallel
> > contexts we should just do it outright (special case them or whatever) so
> > that we don't have to add even more very tricky code and tech debt.
> > 
> > Doable?
> > 
> > Cheers, Daniel
> > 
> > 
> > > 
> > > Matt
> > > 
> > > > -Daniel
> > > > 
> > > > > +				ce->ops->sched_disable(ce);
> > > > > +				break;
> > > > > +			}
> > > > > +		}
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * FIXME: This is ugly, these branches are only needed for parallel contexts in
> > > > > + * GuC submission. Basically the idea is if any of the contexts, that are
> > > > > + * configured for parallel submission, are pinned all the contexts need to be
> > > > > + * pinned in order to register these contexts with the GuC. We are adding the
> > > > > + * layer here while it should probably be pushed to the backend via a vfunc. But
> > > > > + * since we already have ce->pin + a layer atop it is confusing. Definitely
> > > > > + * needs a bit of rework how to properly layer / structure this code path. What
> > > > > + * is in place works but is not ideal.
> > > > > + */
> > > > > +int intel_context_pin(struct intel_context *ce)
> > > > > +{
> > > > > +	if (intel_context_is_child(ce)) {
> > > > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > > > +			return ____intel_context_pin(ce->parent);
> > > > > +		else
> > > > > +			return 0;
> > > > > +	} else {
> > > > > +		return ____intel_context_pin(ce);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > > +			 struct i915_gem_ww_ctx *ww)
> > > > > +{
> > > > > +	if (intel_context_is_child(ce)) {
> > > > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > > > +			return __intel_context_pin_ww(ce->parent, ww);
> > > > > +		else
> > > > > +			return 0;
> > > > > +	} else {
> > > > > +		return __intel_context_pin_ww(ce, ww);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +void intel_context_unpin(struct intel_context *ce)
> > > > > +{
> > > > > +	if (intel_context_is_child(ce)) {
> > > > > +		if (atomic_fetch_add(-1, &ce->pin_count) == 1)
> > > > > +			__intel_context_unpin(ce->parent);
> > > > > +	} else {
> > > > > +		__intel_context_unpin(ce);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > > > >  #include "selftest_context.c"
> > > > >  #endif
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > index ad6ce5ac4824..c208691fc87d 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > @@ -110,31 +110,15 @@ static inline void intel_context_unlock_pinned(struct intel_context *ce)
> > > > >  	mutex_unlock(&ce->pin_mutex);
> > > > >  }
> > > > >  
> > > > > -int __intel_context_do_pin(struct intel_context *ce);
> > > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > -			      struct i915_gem_ww_ctx *ww);
> > > > > -
> > > > >  static inline bool intel_context_pin_if_active(struct intel_context *ce)
> > > > >  {
> > > > >  	return atomic_inc_not_zero(&ce->pin_count);
> > > > >  }
> > > > >  
> > > > > -static inline int intel_context_pin(struct intel_context *ce)
> > > > > -{
> > > > > -	if (likely(intel_context_pin_if_active(ce)))
> > > > > -		return 0;
> > > > > -
> > > > > -	return __intel_context_do_pin(ce);
> > > > > -}
> > > > > -
> > > > > -static inline int intel_context_pin_ww(struct intel_context *ce,
> > > > > -				       struct i915_gem_ww_ctx *ww)
> > > > > -{
> > > > > -	if (likely(intel_context_pin_if_active(ce)))
> > > > > -		return 0;
> > > > > +int intel_context_pin(struct intel_context *ce);
> > > > >  
> > > > > -	return __intel_context_do_pin_ww(ce, ww);
> > > > > -}
> > > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > > +			 struct i915_gem_ww_ctx *ww);
> > > > >  
> > > > >  static inline void __intel_context_pin(struct intel_context *ce)
> > > > >  {
> > > > > @@ -146,28 +130,11 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub);
> > > > >  
> > > > >  static inline void intel_context_sched_disable_unpin(struct intel_context *ce)
> > > > >  {
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > >  	__intel_context_do_unpin(ce, 2);
> > > > >  }
> > > > >  
> > > > > -static inline void intel_context_unpin(struct intel_context *ce)
> > > > > -{
> > > > > -	if (!ce->ops->sched_disable) {
> > > > > -		__intel_context_do_unpin(ce, 1);
> > > > > -	} else {
> > > > > -		/*
> > > > > -		 * Move ownership of this pin to the scheduling disable which is
> > > > > -		 * an async operation. When that operation completes the above
> > > > > -		 * intel_context_sched_disable_unpin is called potentially
> > > > > -		 * unpinning the context.
> > > > > -		 */
> > > > > -		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > > -			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > > > > -				ce->ops->sched_disable(ce);
> > > > > -				break;
> > > > > -			}
> > > > > -		}
> > > > > -	}
> > > > > -}
> > > > > +void intel_context_unpin(struct intel_context *ce);
> > > > >  
> > > > >  void intel_context_enter_engine(struct intel_context *ce);
> > > > >  void intel_context_exit_engine(struct intel_context *ce);
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > index 66b22b370a72..eb82be15b7a2 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > @@ -39,8 +39,8 @@ struct intel_context_ops {
> > > > >  
> > > > >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > > > >  
> > > > > -	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
> > > > > -	int (*pin)(struct intel_context *ce, void *vaddr);
> > > > > +	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > > > > +	int (*pin)(struct intel_context *ce);
> > > > >  	void (*unpin)(struct intel_context *ce);
> > > > >  	void (*post_unpin)(struct intel_context *ce);
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > index baa1797af1c8..fc74ca28f245 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > @@ -2554,16 +2554,17 @@ static void execlists_submit_request(struct i915_request *request)
> > > > >  static int
> > > > >  __execlists_context_pre_pin(struct intel_context *ce,
> > > > >  			    struct intel_engine_cs *engine,
> > > > > -			    struct i915_gem_ww_ctx *ww, void **vaddr)
> > > > > +			    struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > >  	int err;
> > > > >  
> > > > > -	err = lrc_pre_pin(ce, engine, ww, vaddr);
> > > > > +	err = lrc_pre_pin(ce, engine, ww);
> > > > >  	if (err)
> > > > >  		return err;
> > > > >  
> > > > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags)) {
> > > > > -		lrc_init_state(ce, engine, *vaddr);
> > > > > +		lrc_init_state(ce, engine, ce->lrc_reg_state -
> > > > > +			       LRC_STATE_OFFSET / sizeof(*ce->lrc_reg_state));
> > > > >  
> > > > >  		 __i915_gem_object_flush_map(ce->state->obj, 0, engine->context_size);
> > > > >  	}
> > > > > @@ -2572,15 +2573,14 @@ __execlists_context_pre_pin(struct intel_context *ce,
> > > > >  }
> > > > >  
> > > > >  static int execlists_context_pre_pin(struct intel_context *ce,
> > > > > -				     struct i915_gem_ww_ctx *ww,
> > > > > -				     void **vaddr)
> > > > > +				     struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > > -	return __execlists_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > > +	return __execlists_context_pre_pin(ce, ce->engine, ww);
> > > > >  }
> > > > >  
> > > > > -static int execlists_context_pin(struct intel_context *ce, void *vaddr)
> > > > > +static int execlists_context_pin(struct intel_context *ce)
> > > > >  {
> > > > > -	return lrc_pin(ce, ce->engine, vaddr);
> > > > > +	return lrc_pin(ce, ce->engine);
> > > > >  }
> > > > >  
> > > > >  static int execlists_context_alloc(struct intel_context *ce)
> > > > > @@ -3570,20 +3570,19 @@ static int virtual_context_alloc(struct intel_context *ce)
> > > > >  }
> > > > >  
> > > > >  static int virtual_context_pre_pin(struct intel_context *ce,
> > > > > -				   struct i915_gem_ww_ctx *ww,
> > > > > -				   void **vaddr)
> > > > > +				   struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > > >  
> > > > >  	 /* Note: we must use a real engine class for setting up reg state */
> > > > > -	return __execlists_context_pre_pin(ce, ve->siblings[0], ww, vaddr);
> > > > > +	return __execlists_context_pre_pin(ce, ve->siblings[0], ww);
> > > > >  }
> > > > >  
> > > > > -static int virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > > +static int virtual_context_pin(struct intel_context *ce)
> > > > >  {
> > > > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > > >  
> > > > > -	return lrc_pin(ce, ve->siblings[0], vaddr);
> > > > > +	return lrc_pin(ce, ve->siblings[0]);
> > > > >  }
> > > > >  
> > > > >  static void virtual_context_enter(struct intel_context *ce)
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > > index bb4af4977920..c466fc966005 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > > @@ -947,30 +947,30 @@ void lrc_reset(struct intel_context *ce)
> > > > >  int
> > > > >  lrc_pre_pin(struct intel_context *ce,
> > > > >  	    struct intel_engine_cs *engine,
> > > > > -	    struct i915_gem_ww_ctx *ww,
> > > > > -	    void **vaddr)
> > > > > +	    struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > > +	void *vaddr;
> > > > >  	GEM_BUG_ON(!ce->state);
> > > > >  	GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
> > > > >  
> > > > > -	*vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > > -					 i915_coherent_map_type(ce->engine->i915,
> > > > > -								ce->state->obj,
> > > > > -								false) |
> > > > > -					 I915_MAP_OVERRIDE);
> > > > > +	vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > > +					i915_coherent_map_type(ce->engine->i915,
> > > > > +							       ce->state->obj,
> > > > > +							       false) |
> > > > > +					I915_MAP_OVERRIDE);
> > > > >  
> > > > > -	return PTR_ERR_OR_ZERO(*vaddr);
> > > > > +	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > > +
> > > > > +	return PTR_ERR_OR_ZERO(vaddr);
> > > > >  }
> > > > >  
> > > > >  int
> > > > >  lrc_pin(struct intel_context *ce,
> > > > > -	struct intel_engine_cs *engine,
> > > > > -	void *vaddr)
> > > > > +	struct intel_engine_cs *engine)
> > > > >  {
> > > > > -	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > > -
> > > > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags))
> > > > > -		lrc_init_state(ce, engine, vaddr);
> > > > > +		lrc_init_state(ce, engine,
> > > > > +			       (void *)ce->lrc_reg_state - LRC_STATE_OFFSET);
> > > > >  
> > > > >  	ce->lrc.lrca = lrc_update_regs(ce, engine, ce->ring->tail);
> > > > >  	return 0;
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > > index 7f697845c4cf..837fcf00270d 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > > @@ -38,12 +38,10 @@ void lrc_destroy(struct kref *kref);
> > > > >  int
> > > > >  lrc_pre_pin(struct intel_context *ce,
> > > > >  	    struct intel_engine_cs *engine,
> > > > > -	    struct i915_gem_ww_ctx *ww,
> > > > > -	    void **vaddr);
> > > > > +	    struct i915_gem_ww_ctx *ww);
> > > > >  int
> > > > >  lrc_pin(struct intel_context *ce,
> > > > > -	struct intel_engine_cs *engine,
> > > > > -	void *vaddr);
> > > > > +	struct intel_engine_cs *engine);
> > > > >  void lrc_unpin(struct intel_context *ce);
> > > > >  void lrc_post_unpin(struct intel_context *ce);
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > > index 2958e2fae380..f4f301bfb9f7 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > > @@ -472,8 +472,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
> > > > >  }
> > > > >  
> > > > >  static int ring_context_pre_pin(struct intel_context *ce,
> > > > > -				struct i915_gem_ww_ctx *ww,
> > > > > -				void **unused)
> > > > > +				struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > >  	struct i915_address_space *vm;
> > > > >  	int err = 0;
> > > > > @@ -576,7 +575,7 @@ static int ring_context_alloc(struct intel_context *ce)
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > -static int ring_context_pin(struct intel_context *ce, void *unused)
> > > > > +static int ring_context_pin(struct intel_context *ce)
> > > > >  {
> > > > >  	return 0;
> > > > >  }
> > > > > diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > > index 2c1af030310c..826b5d7a4573 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > > @@ -167,12 +167,12 @@ static int mock_context_alloc(struct intel_context *ce)
> > > > >  }
> > > > >  
> > > > >  static int mock_context_pre_pin(struct intel_context *ce,
> > > > > -				struct i915_gem_ww_ctx *ww, void **unused)
> > > > > +				struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > -static int mock_context_pin(struct intel_context *ce, void *unused)
> > > > > +static int mock_context_pin(struct intel_context *ce)
> > > > >  {
> > > > >  	return 0;
> > > > >  }
> > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > index dec757d319a2..c5c73c42bcf7 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > @@ -1905,6 +1905,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > > >  
> > > > >  	GEM_BUG_ON(!engine->mask);
> > > > >  	GEM_BUG_ON(context_guc_id_invalid(ce));
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > >  
> > > > >  	/*
> > > > >  	 * Ensure LRC + CT vmas are is same region as write barrier is done
> > > > > @@ -2008,15 +2009,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > > >  
> > > > >  static int __guc_context_pre_pin(struct intel_context *ce,
> > > > >  				 struct intel_engine_cs *engine,
> > > > > -				 struct i915_gem_ww_ctx *ww,
> > > > > -				 void **vaddr)
> > > > > +				 struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > > -	return lrc_pre_pin(ce, engine, ww, vaddr);
> > > > > +	return lrc_pre_pin(ce, engine, ww);
> > > > >  }
> > > > >  
> > > > >  static int __guc_context_pin(struct intel_context *ce,
> > > > > -			     struct intel_engine_cs *engine,
> > > > > -			     void *vaddr)
> > > > > +			     struct intel_engine_cs *engine)
> > > > >  {
> > > > >  	if (i915_ggtt_offset(ce->state) !=
> > > > >  	    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK))
> > > > > @@ -2027,20 +2026,33 @@ static int __guc_context_pin(struct intel_context *ce,
> > > > >  	 * explaination of why.
> > > > >  	 */
> > > > >  
> > > > > -	return lrc_pin(ce, engine, vaddr);
> > > > > +	return lrc_pin(ce, engine);
> > > > > +}
> > > > > +
> > > > > +static void __guc_context_unpin(struct intel_context *ce)
> > > > > +{
> > > > > +	lrc_unpin(ce);
> > > > > +}
> > > > > +
> > > > > +static void __guc_context_post_unpin(struct intel_context *ce)
> > > > > +{
> > > > > +	lrc_post_unpin(ce);
> > > > >  }
> > > > >  
> > > > >  static int guc_context_pre_pin(struct intel_context *ce,
> > > > > -			       struct i915_gem_ww_ctx *ww,
> > > > > -			       void **vaddr)
> > > > > +			       struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > > -	return __guc_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > > +	return __guc_context_pre_pin(ce, ce->engine, ww);
> > > > >  }
> > > > >  
> > > > > -static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > > > > +static int guc_context_pin(struct intel_context *ce)
> > > > >  {
> > > > > -	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > > > +	int ret;
> > > > >  
> > > > > +	GEM_BUG_ON(intel_context_is_parent(ce) ||
> > > > > +		   intel_context_is_child(ce));
> > > > > +
> > > > > +	ret = __guc_context_pin(ce, ce->engine);
> > > > >  	if (likely(!ret && !intel_context_is_barrier(ce)))
> > > > >  		intel_engine_pm_get(ce->engine);
> > > > >  
> > > > > @@ -2054,7 +2066,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > > > >  	GEM_BUG_ON(context_enabled(ce));
> > > > >  
> > > > >  	unpin_guc_id(guc, ce, true);
> > > > > -	lrc_unpin(ce);
> > > > > +	__guc_context_unpin(ce);
> > > > >  
> > > > >  	if (likely(!intel_context_is_barrier(ce)))
> > > > >  		intel_engine_pm_put(ce->engine);
> > > > > @@ -2062,7 +2074,141 @@ static void guc_context_unpin(struct intel_context *ce)
> > > > >  
> > > > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > > >  {
> > > > > -	lrc_post_unpin(ce);
> > > > > +	__guc_context_post_unpin(ce);
> > > > > +}
> > > > > +
> > > > > +/* Future patches will use this function */
> > > > > +__maybe_unused
> > > > > +static int guc_parent_context_pre_pin(struct intel_context *ce,
> > > > > +				      struct i915_gem_ww_ctx *ww)
> > > > > +{
> > > > > +	struct intel_context *child;
> > > > > +	int err, i = 0, j = 0;
> > > > > +
> > > > > +	for_each_child(ce, child) {
> > > > > +		err = i915_active_acquire(&child->active);
> > > > > +		if (unlikely(err))
> > > > > +			goto unwind_active;
> > > > > +		++i;
> > > > > +	}
> > > > > +
> > > > > +	for_each_child(ce, child) {
> > > > > +		err = __guc_context_pre_pin(child, child->engine, ww);
> > > > > +		if (unlikely(err))
> > > > > +			goto unwind_pre_pin;
> > > > > +		++j;
> > > > > +	}
> > > > > +
> > > > > +	err = __guc_context_pre_pin(ce, ce->engine, ww);
> > > > > +	if (unlikely(err))
> > > > > +		goto unwind_pre_pin;
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +unwind_pre_pin:
> > > > > +	for_each_child(ce, child) {
> > > > > +		if (!j--)
> > > > > +			break;
> > > > > +		__guc_context_post_unpin(child);
> > > > > +	}
> > > > > +
> > > > > +unwind_active:
> > > > > +	for_each_child(ce, child) {
> > > > > +		if (!i--)
> > > > > +			break;
> > > > > +		i915_active_release(&child->active);
> > > > > +	}
> > > > > +
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/* Future patches will use this function */
> > > > > +__maybe_unused
> > > > > +static void guc_parent_context_post_unpin(struct intel_context *ce)
> > > > > +{
> > > > > +	struct intel_context *child;
> > > > > +
> > > > > +	for_each_child(ce, child)
> > > > > +		__guc_context_post_unpin(child);
> > > > > +	__guc_context_post_unpin(ce);
> > > > > +
> > > > > +	for_each_child(ce, child) {
> > > > > +		intel_context_get(child);
> > > > > +		i915_active_release(&child->active);
> > > > > +		intel_context_put(child);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/* Future patches will use this function */
> > > > > +__maybe_unused
> > > > > +static int guc_parent_context_pin(struct intel_context *ce)
> > > > > +{
> > > > > +	int ret, i = 0, j = 0;
> > > > > +	struct intel_context *child;
> > > > > +	struct intel_engine_cs *engine;
> > > > > +	intel_engine_mask_t tmp;
> > > > > +
> > > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > > +
> > > > > +	for_each_child(ce, child) {
> > > > > +		ret = __guc_context_pin(child, child->engine);
> > > > > +		if (unlikely(ret))
> > > > > +			goto unwind_pin;
> > > > > +		++i;
> > > > > +	}
> > > > > +	ret = __guc_context_pin(ce, ce->engine);
> > > > > +	if (unlikely(ret))
> > > > > +		goto unwind_pin;
> > > > > +
> > > > > +	for_each_child(ce, child)
> > > > > +		if (test_bit(CONTEXT_LRCA_DIRTY, &child->flags)) {
> > > > > +			set_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
> > > > > +			break;
> > > > > +		}
> > > > > +
> > > > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > > > +			       ce->engine->mask, tmp)
> > > > > +		intel_engine_pm_get(engine);
> > > > > +	for_each_child(ce, child)
> > > > > +		for_each_engine_masked(engine, child->engine->gt,
> > > > > +				       child->engine->mask, tmp)
> > > > > +			intel_engine_pm_get(engine);
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +unwind_pin:
> > > > > +	for_each_child(ce, child) {
> > > > > +		if (++j > i)
> > > > > +			break;
> > > > > +		__guc_context_unpin(child);
> > > > > +	}
> > > > > +
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +/* Future patches will use this function */
> > > > > +__maybe_unused
> > > > > +static void guc_parent_context_unpin(struct intel_context *ce)
> > > > > +{
> > > > > +	struct intel_context *child;
> > > > > +	struct intel_engine_cs *engine;
> > > > > +	intel_engine_mask_t tmp;
> > > > > +
> > > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > > +	GEM_BUG_ON(context_enabled(ce));
> > > > > +
> > > > > +	unpin_guc_id(ce_to_guc(ce), ce, true);
> > > > > +	for_each_child(ce, child)
> > > > > +		__guc_context_unpin(child);
> > > > > +	__guc_context_unpin(ce);
> > > > > +
> > > > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > > > +			       ce->engine->mask, tmp)
> > > > > +		intel_engine_pm_put(engine);
> > > > > +	for_each_child(ce, child)
> > > > > +		for_each_engine_masked(engine, child->engine->gt,
> > > > > +				       child->engine->mask, tmp)
> > > > > +			intel_engine_pm_put(engine);
> > > > >  }
> > > > >  
> > > > >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > > > > @@ -2993,18 +3139,17 @@ static int guc_request_alloc(struct i915_request *rq)
> > > > >  }
> > > > >  
> > > > >  static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > > > > -				       struct i915_gem_ww_ctx *ww,
> > > > > -				       void **vaddr)
> > > > > +				       struct i915_gem_ww_ctx *ww)
> > > > >  {
> > > > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > >  
> > > > > -	return __guc_context_pre_pin(ce, engine, ww, vaddr);
> > > > > +	return __guc_context_pre_pin(ce, engine, ww);
> > > > >  }
> > > > >  
> > > > > -static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > > +static int guc_virtual_context_pin(struct intel_context *ce)
> > > > >  {
> > > > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > > -	int ret = __guc_context_pin(ce, engine, vaddr);
> > > > > +	int ret = __guc_context_pin(ce, engine);
> > > > >  	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > > >  
> > > > >  	if (likely(!ret))
> > > > > @@ -3024,7 +3169,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> > > > >  	GEM_BUG_ON(intel_context_is_barrier(ce));
> > > > >  
> > > > >  	unpin_guc_id(guc, ce, true);
> > > > > -	lrc_unpin(ce);
> > > > > +	__guc_context_unpin(ce);
> > > > >  
> > > > >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > >  		intel_engine_pm_put(engine);
> > > > > -- 
> > > > > 2.28.0
> > > > > 
> > > > 
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  2021-08-10  8:53       ` Daniel Vetter
  2021-08-10  9:07         ` Daniel Vetter
@ 2021-08-11 18:23         ` Matthew Brost
  1 sibling, 0 replies; 111+ messages in thread
From: Matthew Brost @ 2021-08-11 18:23 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Tue, Aug 10, 2021 at 10:53:37AM +0200, Daniel Vetter wrote:
> On Mon, Aug 09, 2021 at 06:58:23PM +0000, Matthew Brost wrote:
> > On Mon, Aug 09, 2021 at 05:17:34PM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 03, 2021 at 03:29:13PM -0700, Matthew Brost wrote:
> > > > Implement GuC parent-child context pin / unpin functions such that if
> > > > any context in the relationship is pinned, all the contexts are pinned.
> > > > The parent owns most of the pinning / unpinning process and the children
> > > > direct any pins / unpins to the parent.
> > > > 
> > > > Patch implements a number of unused functions that will be connected
> > > > later in the series.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/intel_context.c       | 187 ++++++++++++++++--
> > > >  drivers/gpu/drm/i915/gt/intel_context.h       |  43 +---
> > > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
> > > >  .../drm/i915/gt/intel_execlists_submission.c  |  25 ++-
> > > >  drivers/gpu/drm/i915/gt/intel_lrc.c           |  26 +--
> > > >  drivers/gpu/drm/i915/gt/intel_lrc.h           |   6 +-
> > > >  .../gpu/drm/i915/gt/intel_ring_submission.c   |   5 +-
> > > >  drivers/gpu/drm/i915/gt/mock_engine.c         |   4 +-
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 183 +++++++++++++++--
> > > >  9 files changed, 371 insertions(+), 112 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > index 8cb92b10b547..bb4c14656067 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > @@ -158,8 +158,8 @@ static void __ring_retire(struct intel_ring *ring)
> > > >  	intel_ring_unpin(ring);
> > > >  }
> > > >  
> > > > -static int intel_context_pre_pin(struct intel_context *ce,
> > > > -				 struct i915_gem_ww_ctx *ww)
> > > > +static int __intel_context_pre_pin(struct intel_context *ce,
> > > > +				   struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	int err;
> > > >  
> > > > @@ -190,7 +190,7 @@ static int intel_context_pre_pin(struct intel_context *ce,
> > > >  	return err;
> > > >  }
> > > >  
> > > > -static void intel_context_post_unpin(struct intel_context *ce)
> > > > +static void __intel_context_post_unpin(struct intel_context *ce)
> > > >  {
> > > >  	if (ce->state)
> > > >  		__context_unpin_state(ce->state);
> > > > @@ -199,13 +199,85 @@ static void intel_context_post_unpin(struct intel_context *ce)
> > > >  	__ring_retire(ce->ring);
> > > >  }
> > > >  
> > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > -			      struct i915_gem_ww_ctx *ww)
> > > > +static int intel_context_pre_pin(struct intel_context *ce,
> > > > +				 struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > -	bool handoff = false;
> > > > -	void *vaddr;
> > > > +	struct intel_context *child;
> > > > +	int err, i = 0;
> > > > +
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		err = __intel_context_pre_pin(child, ww);
> > > > +		if (unlikely(err))
> > > > +			goto unwind;
> > > > +		++i;
> > > > +	}
> > > > +
> > > > +	err = __intel_context_pre_pin(ce, ww);
> > > > +	if (unlikely(err))
> > > > +		goto unwind;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +unwind:
> > > > +	for_each_child(ce, child) {
> > > > +		if (!i--)
> > > > +			break;
> > > > +		__intel_context_post_unpin(child);
> > > > +	}
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +static void intel_context_post_unpin(struct intel_context *ce)
> > > > +{
> > > > +	struct intel_context *child;
> > > > +
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	for_each_child(ce, child)
> > > > +		__intel_context_post_unpin(child);
> > > > +
> > > > +	__intel_context_post_unpin(ce);
> > > > +}
> > > > +
> > > > +static int __do_ww_lock(struct intel_context *ce,
> > > > +			struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	int err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > +
> > > > +	if (!err && ce->ring->vma->obj)
> > > > +		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > +	if (!err && ce->state)
> > > > +		err = i915_gem_object_lock(ce->state->obj, ww);
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +static int do_ww_lock(struct intel_context *ce,
> > > > +		      struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	struct intel_context *child;
> > > >  	int err = 0;
> > > >  
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		err = __do_ww_lock(child, ww);
> > > > +		if (unlikely(err))
> > > > +			return err;
> > > > +	}
> > > > +
> > > > +	return __do_ww_lock(ce, ww);
> > > > +}
> > > > +
> > > > +static int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > +				     struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	bool handoff = false;
> > > > +	int err;
> > > > +
> > > >  	if (unlikely(!test_bit(CONTEXT_ALLOC_BIT, &ce->flags))) {
> > > >  		err = intel_context_alloc_state(ce);
> > > >  		if (err)
> > > > @@ -217,14 +289,11 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >  	 * refcount for __intel_context_active(), which prevent a lock
> > > >  	 * inversion of ce->pin_mutex vs dma_resv_lock().
> > > >  	 */
> > > > +	err = do_ww_lock(ce, ww);
> > > > +	if (err)
> > > > +		return err;
> > > >  
> > > > -	err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > -	if (!err && ce->ring->vma->obj)
> > > > -		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > -	if (!err && ce->state)
> > > > -		err = i915_gem_object_lock(ce->state->obj, ww);
> > > > -	if (!err)
> > > > -		err = intel_context_pre_pin(ce, ww);
> > > > +	err = intel_context_pre_pin(ce, ww);
> > > >  	if (err)
> > > >  		return err;
> > > >  
> > > > @@ -232,7 +301,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >  	if (err)
> > > >  		goto err_ctx_unpin;
> > > >  
> > > > -	err = ce->ops->pre_pin(ce, ww, &vaddr);
> > > > +	err = ce->ops->pre_pin(ce, ww);
> > > >  	if (err)
> > > >  		goto err_release;
> > > >  
> > > > @@ -250,7 +319,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >  		if (unlikely(err))
> > > >  			goto err_unlock;
> > > >  
> > > > -		err = ce->ops->pin(ce, vaddr);
> > > > +		err = ce->ops->pin(ce);
> > > >  		if (err) {
> > > >  			intel_context_active_release(ce);
> > > >  			goto err_unlock;
> > > > @@ -290,7 +359,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > >  	return err;
> > > >  }
> > > >  
> > > > -int __intel_context_do_pin(struct intel_context *ce)
> > > > +static int __intel_context_do_pin(struct intel_context *ce)
> > > >  {
> > > >  	struct i915_gem_ww_ctx ww;
> > > >  	int err;
> > > > @@ -337,7 +406,7 @@ static void __intel_context_retire(struct i915_active *active)
> > > >  		 intel_context_get_avg_runtime_ns(ce));
> > > >  
> > > >  	set_bit(CONTEXT_VALID_BIT, &ce->flags);
> > > > -	intel_context_post_unpin(ce);
> > > > +	__intel_context_post_unpin(ce);
> > > >  	intel_context_put(ce);
> > > >  }
> > > >  
> > > > @@ -562,6 +631,88 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> > > >  	child->parent = parent;
> > > >  }
> > > >  
> > > > +static inline int ____intel_context_pin(struct intel_context *ce)
> > > > +{
> > > > +	if (likely(intel_context_pin_if_active(ce)))
> > > > +		return 0;
> > > > +
> > > > +	return __intel_context_do_pin(ce);
> > > > +}
> > > > +
> > > > +static inline int __intel_context_pin_ww(struct intel_context *ce,
> > > > +					 struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	if (likely(intel_context_pin_if_active(ce)))
> > > > +		return 0;
> > > > +
> > > > +	return __intel_context_do_pin_ww(ce, ww);
> > > > +}
> > > > +
> > > > +static inline void __intel_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	if (!ce->ops->sched_disable) {
> > > > +		__intel_context_do_unpin(ce, 1);
> > > > +	} else {
> > > > +		/*
> > > > +		 * Move ownership of this pin to the scheduling disable which is
> > > > +		 * an async operation. When that operation completes the above
> > > > +		 * intel_context_sched_disable_unpin is called potentially
> > > > +		 * unpinning the context.
> > > > +		 */
> > > > +		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > +			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > > 
> > > Uh man lockless algorithms.
> > > 
> > > Unless this comes:
> > > - with essentially an academic-looking paper that describes the abstract
> > >   model of the lockless algorithm and proves it against the Linux kernel
> > >   memory model.
> > > 
> > > - lockless stuff generally needs barriers, and those barriers must all be
> > >   documented. This means a) a comment next to each barrier in the code b)
> > >   pointing to its counterpart c) with the overall design also explained
> > >   in the kerneldoc for those data structures.
> > > 
> > >   If you don't know where your barriers are, see above point about "it
> > >   should look more like an academic paper in the commit message"
> > > 
> > > - hard perf data about how this is absolutely required, based on a
> > >   real-world use-case (which then sometimes justifies a microbenchmark
> > >   metric for the details, but it always needs to be real-world based). And
> > >   also a thorough explainer of how the perf issue isn't fixable through
> > >   better design. If that's not doable, just protect the state machine with
> > >   a big dumb lock and move on.
> > > 
> > > - Also, because the current code is in such bad shape wrt lockless
> > >   algorithms and premature optimizations: Overall complexity should go
> > >   down (it's way too high right now), so pay down your new lockless trick
> > >   by removing one of the existing ones that we only have because we can.
> > > 
> > > Yes this is steep, but we're way out in the woods here and need to somehow
> > > get back.
> > 
> > See below FIXME. At one point all of this was hidden in the backend but
> > the dma-resv patches that landed upstream completely broke the layering,
> > hence the need for the code here.
> > 
> > I guess I don't really understand what you mean when you say a lockless
> > algorithm needs barriers; if the atomic functions are not really atomic,
> > wouldn't the world be broken?
> 
> They are unordered atomics by default. Which means they're atomic in
> themselves, but entirely unordered with respect to anything else that's
> going on, except when you use one of the atomic ops that already
> guarantees a barrier, or you manually add the barriers yourself. And yes,
> there are enormous amounts of bugs, and with our dGPU potentially running
> on non-IA CPUs those bugs matter.
> 

Here I don't think it matters if they are unordered relative to anything
else. We really only care about the pin_count and the resulting value of
the atomic operation.
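
To make that concrete, here is a rough annotated sketch (not the patch
itself) of the pin_count protocol under discussion, assuming the values
mean 0 = unpinned, 1 = last user pin, and 2 = pin handed off to a pending
schedule disable:

static void example_context_unpin(struct intel_context *ce)
{
	/* Drop a reference as long as that would not take pin_count to 0. */
	while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
		/*
		 * Last pin: hand ownership to the async schedule disable by
		 * moving 1 -> 2; the G2H completion later drops both counts
		 * via intel_context_sched_disable_unpin().
		 */
		if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
			ce->ops->sched_disable(ce);
			break;
		}
	}
}

Either the add_unless succeeds (someone else still holds a pin) or the
cmpxchg succeeds (we were the last pin and the disable now owns it), so
the correctness argument here is about the returned values rather than
ordering against other memory accesses.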

> Note that C++ atomics default to strongly ordered atomics with full
> barriers, because those are much easier to program against. The kernel
> isn't like that and defaults to "you need to add all the barriers
> yourself".
> 
> I have a full length rant in the works and will work that through all
> channels, but essentially locking is really hard to get right. And
> lockless tricks practically need an academic paper with a formal
> correctness proof against the linux memory model, or you do have bugs.
> 
> And I know that the current code is chock-full of this stuff, so it's
> tempting to just add more, but we really can't. The amount of locking
> trickery we have in the codebase must go down substantially. My take is
> that any code that adds anything tricky needs to fully justify it against
> the above list, _and_ also clean up some of the existing nonsense so that
> overall complexity doesn't increase.
> 
> I'll share the full length rant with you internally, it's not yet ready
> for publishing (but that's planned too).
> 

Sure, we can chat about this. I am new here and basically taught myself
to be a kernel developer by looking at the i915, which probably wasn't
the best way to learn.

> 
> > Also here I don't think it is really as simple as grabbing a big dumb lock,
> > for a variety of reasons, at least with the current dynamic pin / unpin code
> > in place. If we move to perma-pinned contexts this could be cleaned up
> > then.
> 
> Yes it's a disaster, but we need to stop the bleeding. If perma-pinned
> contexts can fix this I think we should do it asap. I'd say for parallel
> contexts we should just do it outright (special-case them or whatever) so
> that we don't have to add even more very tricky code and tech debt.
> 
> Doable?

I think it is doable to perma-pin parallel contexts; regular contexts are
not so doable as that is a much larger rework. I actually like this, as I
can drop a few other things in the parallel submission code if we move
to perma-pinned contexts.

The only potential issue I see is running on a system with a limited
number of guc_ids, lots of engine instances, and the media UMD creating
more parallel contexts than it really needs. This isn't a blocker as it
isn't a concern for upstream yet and is a workable problem one way or
another.
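
As a strawman, a minimal sketch of what the perma-pin could look like for
a parallel context (the helper name is made up for illustration, called
once after the parent-child relationship is set up, with a matching
intel_context_unpin() at destruction):

static int perma_pin_parallel_context(struct intel_context *parent)
{
	/*
	 * Children redirect their pins to the parent, so one long-lived
	 * pin on the parent keeps the whole relationship pinned and
	 * registered with the GuC for the lifetime of the context.
	 */
	return intel_context_pin(parent);
}

The trade-off is that the guc_ids for the relationship are then held for
as long as the context exists, which is exactly the exhaustion concern
mentioned above.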

Matt

> 
> Cheers, Daniel
> 
> 
> > 
> > Matt
> > 
> > > -Daniel
> > > 
> > > > +				ce->ops->sched_disable(ce);
> > > > +				break;
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +/*
> > > > + * FIXME: This is ugly, these branches are only needed for parallel contexts in
> > > > + * GuC submission. Basically the idea is if any of the contexts, that are
> > > > + * configured for parallel submission, are pinned all the contexts need to be
> > > > + * pinned in order to register these contexts with the GuC. We are adding the
> > > > + * layer here while it should probably be pushed to the backend via a vfunc. But
> > > > + * since we already have ce->pin + a layer atop it is confusing. Definitely
> > > > + * needs a bit of rework how to properly layer / structure this code path. What
> > > > + * is in place works but is not ideal.
> > > > + */
> > > > +int intel_context_pin(struct intel_context *ce)
> > > > +{
> > > > +	if (intel_context_is_child(ce)) {
> > > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > > +			return ____intel_context_pin(ce->parent);
> > > > +		else
> > > > +			return 0;
> > > > +	} else {
> > > > +		return ____intel_context_pin(ce);
> > > > +	}
> > > > +}
> > > > +
> > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > +			 struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	if (intel_context_is_child(ce)) {
> > > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > > +			return __intel_context_pin_ww(ce->parent, ww);
> > > > +		else
> > > > +			return 0;
> > > > +	} else {
> > > > +		return __intel_context_pin_ww(ce, ww);
> > > > +	}
> > > > +}
> > > > +
> > > > +void intel_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	if (intel_context_is_child(ce)) {
> > > > +		if (atomic_fetch_add(-1, &ce->pin_count) == 1)
> > > > +			__intel_context_unpin(ce->parent);
> > > > +	} else {
> > > > +		__intel_context_unpin(ce);
> > > > +	}
> > > > +}
> > > > +
> > > >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > > >  #include "selftest_context.c"
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > index ad6ce5ac4824..c208691fc87d 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > @@ -110,31 +110,15 @@ static inline void intel_context_unlock_pinned(struct intel_context *ce)
> > > >  	mutex_unlock(&ce->pin_mutex);
> > > >  }
> > > >  
> > > > -int __intel_context_do_pin(struct intel_context *ce);
> > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > -			      struct i915_gem_ww_ctx *ww);
> > > > -
> > > >  static inline bool intel_context_pin_if_active(struct intel_context *ce)
> > > >  {
> > > >  	return atomic_inc_not_zero(&ce->pin_count);
> > > >  }
> > > >  
> > > > -static inline int intel_context_pin(struct intel_context *ce)
> > > > -{
> > > > -	if (likely(intel_context_pin_if_active(ce)))
> > > > -		return 0;
> > > > -
> > > > -	return __intel_context_do_pin(ce);
> > > > -}
> > > > -
> > > > -static inline int intel_context_pin_ww(struct intel_context *ce,
> > > > -				       struct i915_gem_ww_ctx *ww)
> > > > -{
> > > > -	if (likely(intel_context_pin_if_active(ce)))
> > > > -		return 0;
> > > > +int intel_context_pin(struct intel_context *ce);
> > > >  
> > > > -	return __intel_context_do_pin_ww(ce, ww);
> > > > -}
> > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > +			 struct i915_gem_ww_ctx *ww);
> > > >  
> > > >  static inline void __intel_context_pin(struct intel_context *ce)
> > > >  {
> > > > @@ -146,28 +130,11 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub);
> > > >  
> > > >  static inline void intel_context_sched_disable_unpin(struct intel_context *ce)
> > > >  {
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > >  	__intel_context_do_unpin(ce, 2);
> > > >  }
> > > >  
> > > > -static inline void intel_context_unpin(struct intel_context *ce)
> > > > -{
> > > > -	if (!ce->ops->sched_disable) {
> > > > -		__intel_context_do_unpin(ce, 1);
> > > > -	} else {
> > > > -		/*
> > > > -		 * Move ownership of this pin to the scheduling disable which is
> > > > -		 * an async operation. When that operation completes the above
> > > > -		 * intel_context_sched_disable_unpin is called potentially
> > > > -		 * unpinning the context.
> > > > -		 */
> > > > -		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > -			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > > > -				ce->ops->sched_disable(ce);
> > > > -				break;
> > > > -			}
> > > > -		}
> > > > -	}
> > > > -}
> > > > +void intel_context_unpin(struct intel_context *ce);
> > > >  
> > > >  void intel_context_enter_engine(struct intel_context *ce);
> > > >  void intel_context_exit_engine(struct intel_context *ce);
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > index 66b22b370a72..eb82be15b7a2 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > @@ -39,8 +39,8 @@ struct intel_context_ops {
> > > >  
> > > >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > > >  
> > > > -	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
> > > > -	int (*pin)(struct intel_context *ce, void *vaddr);
> > > > +	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > > > +	int (*pin)(struct intel_context *ce);
> > > >  	void (*unpin)(struct intel_context *ce);
> > > >  	void (*post_unpin)(struct intel_context *ce);
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > index baa1797af1c8..fc74ca28f245 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > @@ -2554,16 +2554,17 @@ static void execlists_submit_request(struct i915_request *request)
> > > >  static int
> > > >  __execlists_context_pre_pin(struct intel_context *ce,
> > > >  			    struct intel_engine_cs *engine,
> > > > -			    struct i915_gem_ww_ctx *ww, void **vaddr)
> > > > +			    struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	int err;
> > > >  
> > > > -	err = lrc_pre_pin(ce, engine, ww, vaddr);
> > > > +	err = lrc_pre_pin(ce, engine, ww);
> > > >  	if (err)
> > > >  		return err;
> > > >  
> > > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags)) {
> > > > -		lrc_init_state(ce, engine, *vaddr);
> > > > +		lrc_init_state(ce, engine, ce->lrc_reg_state -
> > > > +			       LRC_STATE_OFFSET / sizeof(*ce->lrc_reg_state));
> > > >  
> > > >  		 __i915_gem_object_flush_map(ce->state->obj, 0, engine->context_size);
> > > >  	}
> > > > @@ -2572,15 +2573,14 @@ __execlists_context_pre_pin(struct intel_context *ce,
> > > >  }
> > > >  
> > > >  static int execlists_context_pre_pin(struct intel_context *ce,
> > > > -				     struct i915_gem_ww_ctx *ww,
> > > > -				     void **vaddr)
> > > > +				     struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > -	return __execlists_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > +	return __execlists_context_pre_pin(ce, ce->engine, ww);
> > > >  }
> > > >  
> > > > -static int execlists_context_pin(struct intel_context *ce, void *vaddr)
> > > > +static int execlists_context_pin(struct intel_context *ce)
> > > >  {
> > > > -	return lrc_pin(ce, ce->engine, vaddr);
> > > > +	return lrc_pin(ce, ce->engine);
> > > >  }
> > > >  
> > > >  static int execlists_context_alloc(struct intel_context *ce)
> > > > @@ -3570,20 +3570,19 @@ static int virtual_context_alloc(struct intel_context *ce)
> > > >  }
> > > >  
> > > >  static int virtual_context_pre_pin(struct intel_context *ce,
> > > > -				   struct i915_gem_ww_ctx *ww,
> > > > -				   void **vaddr)
> > > > +				   struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > >  
> > > >  	 /* Note: we must use a real engine class for setting up reg state */
> > > > -	return __execlists_context_pre_pin(ce, ve->siblings[0], ww, vaddr);
> > > > +	return __execlists_context_pre_pin(ce, ve->siblings[0], ww);
> > > >  }
> > > >  
> > > > -static int virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > +static int virtual_context_pin(struct intel_context *ce)
> > > >  {
> > > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > >  
> > > > -	return lrc_pin(ce, ve->siblings[0], vaddr);
> > > > +	return lrc_pin(ce, ve->siblings[0]);
> > > >  }
> > > >  
> > > >  static void virtual_context_enter(struct intel_context *ce)
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > index bb4af4977920..c466fc966005 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > @@ -947,30 +947,30 @@ void lrc_reset(struct intel_context *ce)
> > > >  int
> > > >  lrc_pre_pin(struct intel_context *ce,
> > > >  	    struct intel_engine_cs *engine,
> > > > -	    struct i915_gem_ww_ctx *ww,
> > > > -	    void **vaddr)
> > > > +	    struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > +	void *vaddr;
> > > >  	GEM_BUG_ON(!ce->state);
> > > >  	GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
> > > >  
> > > > -	*vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > -					 i915_coherent_map_type(ce->engine->i915,
> > > > -								ce->state->obj,
> > > > -								false) |
> > > > -					 I915_MAP_OVERRIDE);
> > > > +	vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > +					i915_coherent_map_type(ce->engine->i915,
> > > > +							       ce->state->obj,
> > > > +							       false) |
> > > > +					I915_MAP_OVERRIDE);
> > > >  
> > > > -	return PTR_ERR_OR_ZERO(*vaddr);
> > > > +	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > +
> > > > +	return PTR_ERR_OR_ZERO(vaddr);
> > > >  }
> > > >  
> > > >  int
> > > >  lrc_pin(struct intel_context *ce,
> > > > -	struct intel_engine_cs *engine,
> > > > -	void *vaddr)
> > > > +	struct intel_engine_cs *engine)
> > > >  {
> > > > -	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > -
> > > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags))
> > > > -		lrc_init_state(ce, engine, vaddr);
> > > > +		lrc_init_state(ce, engine,
> > > > +			       (void *)ce->lrc_reg_state - LRC_STATE_OFFSET);
> > > >  
> > > >  	ce->lrc.lrca = lrc_update_regs(ce, engine, ce->ring->tail);
> > > >  	return 0;
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > index 7f697845c4cf..837fcf00270d 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > @@ -38,12 +38,10 @@ void lrc_destroy(struct kref *kref);
> > > >  int
> > > >  lrc_pre_pin(struct intel_context *ce,
> > > >  	    struct intel_engine_cs *engine,
> > > > -	    struct i915_gem_ww_ctx *ww,
> > > > -	    void **vaddr);
> > > > +	    struct i915_gem_ww_ctx *ww);
> > > >  int
> > > >  lrc_pin(struct intel_context *ce,
> > > > -	struct intel_engine_cs *engine,
> > > > -	void *vaddr);
> > > > +	struct intel_engine_cs *engine);
> > > >  void lrc_unpin(struct intel_context *ce);
> > > >  void lrc_post_unpin(struct intel_context *ce);
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > index 2958e2fae380..f4f301bfb9f7 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > @@ -472,8 +472,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
> > > >  }
> > > >  
> > > >  static int ring_context_pre_pin(struct intel_context *ce,
> > > > -				struct i915_gem_ww_ctx *ww,
> > > > -				void **unused)
> > > > +				struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	struct i915_address_space *vm;
> > > >  	int err = 0;
> > > > @@ -576,7 +575,7 @@ static int ring_context_alloc(struct intel_context *ce)
> > > >  	return 0;
> > > >  }
> > > >  
> > > > -static int ring_context_pin(struct intel_context *ce, void *unused)
> > > > +static int ring_context_pin(struct intel_context *ce)
> > > >  {
> > > >  	return 0;
> > > >  }
> > > > diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > index 2c1af030310c..826b5d7a4573 100644
> > > > --- a/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > +++ b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > @@ -167,12 +167,12 @@ static int mock_context_alloc(struct intel_context *ce)
> > > >  }
> > > >  
> > > >  static int mock_context_pre_pin(struct intel_context *ce,
> > > > -				struct i915_gem_ww_ctx *ww, void **unused)
> > > > +				struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	return 0;
> > > >  }
> > > >  
> > > > -static int mock_context_pin(struct intel_context *ce, void *unused)
> > > > +static int mock_context_pin(struct intel_context *ce)
> > > >  {
> > > >  	return 0;
> > > >  }
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index dec757d319a2..c5c73c42bcf7 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -1905,6 +1905,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > >  
> > > >  	GEM_BUG_ON(!engine->mask);
> > > >  	GEM_BUG_ON(context_guc_id_invalid(ce));
> > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > >  
> > > >  	/*
> > > >  	 * Ensure LRC + CT vmas are is same region as write barrier is done
> > > > @@ -2008,15 +2009,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > >  
> > > >  static int __guc_context_pre_pin(struct intel_context *ce,
> > > >  				 struct intel_engine_cs *engine,
> > > > -				 struct i915_gem_ww_ctx *ww,
> > > > -				 void **vaddr)
> > > > +				 struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > -	return lrc_pre_pin(ce, engine, ww, vaddr);
> > > > +	return lrc_pre_pin(ce, engine, ww);
> > > >  }
> > > >  
> > > >  static int __guc_context_pin(struct intel_context *ce,
> > > > -			     struct intel_engine_cs *engine,
> > > > -			     void *vaddr)
> > > > +			     struct intel_engine_cs *engine)
> > > >  {
> > > >  	if (i915_ggtt_offset(ce->state) !=
> > > >  	    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK))
> > > > @@ -2027,20 +2026,33 @@ static int __guc_context_pin(struct intel_context *ce,
> > > >  	 * explaination of why.
> > > >  	 */
> > > >  
> > > > -	return lrc_pin(ce, engine, vaddr);
> > > > +	return lrc_pin(ce, engine);
> > > > +}
> > > > +
> > > > +static void __guc_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	lrc_unpin(ce);
> > > > +}
> > > > +
> > > > +static void __guc_context_post_unpin(struct intel_context *ce)
> > > > +{
> > > > +	lrc_post_unpin(ce);
> > > >  }
> > > >  
> > > >  static int guc_context_pre_pin(struct intel_context *ce,
> > > > -			       struct i915_gem_ww_ctx *ww,
> > > > -			       void **vaddr)
> > > > +			       struct i915_gem_ww_ctx *ww)
> > > >  {
> > > > -	return __guc_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > +	return __guc_context_pre_pin(ce, ce->engine, ww);
> > > >  }
> > > >  
> > > > -static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > > > +static int guc_context_pin(struct intel_context *ce)
> > > >  {
> > > > -	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > > +	int ret;
> > > >  
> > > > +	GEM_BUG_ON(intel_context_is_parent(ce) ||
> > > > +		   intel_context_is_child(ce));
> > > > +
> > > > +	ret = __guc_context_pin(ce, ce->engine);
> > > >  	if (likely(!ret && !intel_context_is_barrier(ce)))
> > > >  		intel_engine_pm_get(ce->engine);
> > > >  
> > > > @@ -2054,7 +2066,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > > >  	GEM_BUG_ON(context_enabled(ce));
> > > >  
> > > >  	unpin_guc_id(guc, ce, true);
> > > > -	lrc_unpin(ce);
> > > > +	__guc_context_unpin(ce);
> > > >  
> > > >  	if (likely(!intel_context_is_barrier(ce)))
> > > >  		intel_engine_pm_put(ce->engine);
> > > > @@ -2062,7 +2074,141 @@ static void guc_context_unpin(struct intel_context *ce)
> > > >  
> > > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > >  {
> > > > -	lrc_post_unpin(ce);
> > > > +	__guc_context_post_unpin(ce);
> > > > +}
> > > > +
> > > > +/* Future patches will use this function */
> > > > +__maybe_unused
> > > > +static int guc_parent_context_pre_pin(struct intel_context *ce,
> > > > +				      struct i915_gem_ww_ctx *ww)
> > > > +{
> > > > +	struct intel_context *child;
> > > > +	int err, i = 0, j = 0;
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		err = i915_active_acquire(&child->active);
> > > > +		if (unlikely(err))
> > > > +			goto unwind_active;
> > > > +		++i;
> > > > +	}
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		err = __guc_context_pre_pin(child, child->engine, ww);
> > > > +		if (unlikely(err))
> > > > +			goto unwind_pre_pin;
> > > > +		++j;
> > > > +	}
> > > > +
> > > > +	err = __guc_context_pre_pin(ce, ce->engine, ww);
> > > > +	if (unlikely(err))
> > > > +		goto unwind_pre_pin;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +unwind_pre_pin:
> > > > +	for_each_child(ce, child) {
> > > > +		if (!j--)
> > > > +			break;
> > > > +		__guc_context_post_unpin(child);
> > > > +	}
> > > > +
> > > > +unwind_active:
> > > > +	for_each_child(ce, child) {
> > > > +		if (!i--)
> > > > +			break;
> > > > +		i915_active_release(&child->active);
> > > > +	}
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/* Future patches will use this function */
> > > > +__maybe_unused
> > > > +static void guc_parent_context_post_unpin(struct intel_context *ce)
> > > > +{
> > > > +	struct intel_context *child;
> > > > +
> > > > +	for_each_child(ce, child)
> > > > +		__guc_context_post_unpin(child);
> > > > +	__guc_context_post_unpin(ce);
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		intel_context_get(child);
> > > > +		i915_active_release(&child->active);
> > > > +		intel_context_put(child);
> > > > +	}
> > > > +}
> > > > +
> > > > +/* Future patches will use this function */
> > > > +__maybe_unused
> > > > +static int guc_parent_context_pin(struct intel_context *ce)
> > > > +{
> > > > +	int ret, i = 0, j = 0;
> > > > +	struct intel_context *child;
> > > > +	struct intel_engine_cs *engine;
> > > > +	intel_engine_mask_t tmp;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	for_each_child(ce, child) {
> > > > +		ret = __guc_context_pin(child, child->engine);
> > > > +		if (unlikely(ret))
> > > > +			goto unwind_pin;
> > > > +		++i;
> > > > +	}
> > > > +	ret = __guc_context_pin(ce, ce->engine);
> > > > +	if (unlikely(ret))
> > > > +		goto unwind_pin;
> > > > +
> > > > +	for_each_child(ce, child)
> > > > +		if (test_bit(CONTEXT_LRCA_DIRTY, &child->flags)) {
> > > > +			set_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
> > > > +			break;
> > > > +		}
> > > > +
> > > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > > +			       ce->engine->mask, tmp)
> > > > +		intel_engine_pm_get(engine);
> > > > +	for_each_child(ce, child)
> > > > +		for_each_engine_masked(engine, child->engine->gt,
> > > > +				       child->engine->mask, tmp)
> > > > +			intel_engine_pm_get(engine);
> > > > +
> > > > +	return 0;
> > > > +
> > > > +unwind_pin:
> > > > +	for_each_child(ce, child) {
> > > > +		if (++j > i)
> > > > +			break;
> > > > +		__guc_context_unpin(child);
> > > > +	}
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +/* Future patches will use this function */
> > > > +__maybe_unused
> > > > +static void guc_parent_context_unpin(struct intel_context *ce)
> > > > +{
> > > > +	struct intel_context *child;
> > > > +	struct intel_engine_cs *engine;
> > > > +	intel_engine_mask_t tmp;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +	GEM_BUG_ON(context_enabled(ce));
> > > > +
> > > > +	unpin_guc_id(ce_to_guc(ce), ce, true);
> > > > +	for_each_child(ce, child)
> > > > +		__guc_context_unpin(child);
> > > > +	__guc_context_unpin(ce);
> > > > +
> > > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > > +			       ce->engine->mask, tmp)
> > > > +		intel_engine_pm_put(engine);
> > > > +	for_each_child(ce, child)
> > > > +		for_each_engine_masked(engine, child->engine->gt,
> > > > +				       child->engine->mask, tmp)
> > > > +			intel_engine_pm_put(engine);
> > > >  }
> > > >  
> > > >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > > > @@ -2993,18 +3139,17 @@ static int guc_request_alloc(struct i915_request *rq)
> > > >  }
> > > >  
> > > >  static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > > > -				       struct i915_gem_ww_ctx *ww,
> > > > -				       void **vaddr)
> > > > +				       struct i915_gem_ww_ctx *ww)
> > > >  {
> > > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > >  
> > > > -	return __guc_context_pre_pin(ce, engine, ww, vaddr);
> > > > +	return __guc_context_pre_pin(ce, engine, ww);
> > > >  }
> > > >  
> > > > -static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > +static int guc_virtual_context_pin(struct intel_context *ce)
> > > >  {
> > > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > -	int ret = __guc_context_pin(ce, engine, vaddr);
> > > > +	int ret = __guc_context_pin(ce, engine);
> > > >  	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > >  
> > > >  	if (likely(!ret))
> > > > @@ -3024,7 +3169,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> > > >  	GEM_BUG_ON(intel_context_is_barrier(ce));
> > > >  
> > > >  	unpin_guc_id(guc, ce, true);
> > > > -	lrc_unpin(ce);
> > > > +	__guc_context_unpin(ce);
> > > >  
> > > >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > >  		intel_engine_pm_put(engine);
> > > > -- 
> > > > 2.28.0
> > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts
  2021-08-11 17:43         ` Matthew Brost
@ 2021-08-12 14:04           ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-12 14:04 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Wed, Aug 11, 2021 at 05:43:23PM +0000, Matthew Brost wrote:
> On Wed, Aug 11, 2021 at 11:55:48AM +0200, Daniel Vetter wrote:
> > On Mon, Aug 09, 2021 at 07:32:26PM +0000, Matthew Brost wrote:
> > > On Mon, Aug 09, 2021 at 07:17:27PM +0200, Daniel Vetter wrote:
> > > > On Tue, Aug 03, 2021 at 03:29:43PM -0700, Matthew Brost wrote:
> > > > > Some workloads use lots of contexts and continually pin / unpin
> > > > > them. With GuC submission an unpin translates to a schedule disable
> > > > > H2G which puts pressure on both the i915 and GuC. A schedule disable can
> > > > > also block future requests from being submitted until the operation
> > > > > completes. None of this is ideal.
> > > > > 
> > > > > Add a configurable, via debugfs, delay period before the schedule
> > > > > disable is issued. Default delay period is 1 second. The delay period is
> > > > > skipped if more than 3/4 of the guc_ids are in use.
> > > > > 
> > > > > This patch also updates the selftests to turn off this delay period as
> > > > > this extra time would likely cause many selftests to fail. Follow up
> > > > > patches will fix all the selftests and enable the delay period.
> > > > > 
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > 
> > > > I think this is more evidence that we should just pin/unpin contexts at
> > > > creation/destruction time. The current scheme doesn't really work that well
> > > > and causes way more pain than benefit, it seems.
> > > > 
> > > 
> > > Well that choice is above my pay grade, but for what it is worth it
> > > would simplify the GuC backend quite a bit if we perma-pin contexts. By
> > > quite a bit, I actually mean a lot of complexity goes away.
> > > 
> > > In the meantime I think we probably need this code though, to avoid
> > > thrashing on the scheduling enable / disable.
> > 
> > The trouble is that you muck around with the context close state bit,
> 
> This really doesn't mess with this bit any more than what is already
> there; it just adds a callback to the backend.
> 
> > which is one of these lockless trickeries where my cursory analysis (just
> > a few days in total of randomly stumbling over it when reading other code)
> > strongly suggests it's busted.
> > 
> > I really don't want to build more on top, especially not without careful
> > review and all that.
> > 
> > Also since this is a perf claim, the commit message needs some numbers.
> >
> 
> This was basically just visual inspection of an ftrace capture of a media
> workload that uses lots of contexts. The contexts were repeatedly pinned /
> unpinned. Disabling / enabling scheduling is a rather expensive operation,
> so we really shouldn't be doing it all the time. We visually inspected an
> ftrace capture after this change and all this unnecessary traffic was
> gone.

That's the kind of stuff that should be included in the commit message,
ideally with some numbers (like how many you manage to remove, or whatever
metric you picked; something quick done with grep and line counting is
good enough).

> > Finally even if we decide to make contexts properly evictable, we need a
> > different scheme anyway. As you realized the current active tracking is
> > kinda backwards because it unpins immediately when no longer in use.
> 
> Right, this basically just works around the fact that contexts are
> immediately unpinned when not in use. As stated before if we perma-pin
> contexts all this goes away.

Yeah, sounds all good then. Well, more or less.
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > > 
> > > Matt
> > > 
> > > > If anyone screams, and that's a big if aside from some IGTs, we can come up
> > > > with a proper scheme to evict contexts without pin/unpin and layer hacks
> > > > over that misdesign.
> > > > -Daniel
> > > > 
> > > > > ---
> > > > >  drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
> > > > >  .../i915/gem/selftests/i915_gem_coherency.c   |   2 +-
> > > > >  .../drm/i915/gem/selftests/i915_gem_dmabuf.c  |   2 +-
> > > > >  .../drm/i915/gem/selftests/i915_gem_mman.c    |   2 +-
> > > > >  .../drm/i915/gem/selftests/i915_gem_object.c  |   2 +-
> > > > >  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
> > > > >  drivers/gpu/drm/i915/gt/intel_context.h       |   9 +
> > > > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   8 +
> > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +
> > > > >  .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  28 ++
> > > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 322 +++++++++++++++++-
> > > > >  .../i915/gt/uc/selftest_guc_flow_control.c    |  19 +-
> > > > >  drivers/gpu/drm/i915/i915_selftest.h          |   2 +
> > > > >  drivers/gpu/drm/i915/i915_trace.h             |  10 +
> > > > >  drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   2 +-
> > > > >  drivers/gpu/drm/i915/selftests/i915_perf.c    |   2 +-
> > > > >  drivers/gpu/drm/i915/selftests/i915_request.c |   2 +-
> > > > >  drivers/gpu/drm/i915/selftests/i915_vma.c     |   2 +-
> > > > >  18 files changed, 405 insertions(+), 20 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > > > index b199d59bd2c4..1553287e5491 100644
> > > > > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > > > @@ -1298,7 +1298,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
> > > > >  		int err;
> > > > >  
> > > > >  		/* serialises with execbuf */
> > > > > -		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > > > > +		intel_context_close(ce);
> > > > >  		if (!intel_context_pin_if_active(ce))
> > > > >  			continue;
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > > > > index 13b088cc787e..a666d7e610f5 100644
> > > > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > > > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> > > > > @@ -434,5 +434,5 @@ int i915_gem_coherency_live_selftests(struct drm_i915_private *i915)
> > > > >  		SUBTEST(igt_gem_coherency),
> > > > >  	};
> > > > >  
> > > > > -	return i915_subtests(tests, i915);
> > > > > +	return i915_live_subtests(tests, i915);
> > > > >  }
> > > > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > > > > index ffae7df5e4d7..2c92afa9d608 100644
> > > > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > > > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> > > > > @@ -474,5 +474,5 @@ int i915_gem_dmabuf_live_selftests(struct drm_i915_private *i915)
> > > > >  		SUBTEST(igt_dmabuf_import_same_driver_lmem_smem),
> > > > >  	};
> > > > >  
> > > > > -	return i915_subtests(tests, i915);
> > > > > +	return i915_live_subtests(tests, i915);
> > > > >  }
> > > > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > > > > index b20f5621f62b..4745c78a48de 100644
> > > > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > > > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > > > > @@ -1414,5 +1414,5 @@ int i915_gem_mman_live_selftests(struct drm_i915_private *i915)
> > > > >  		SUBTEST(igt_mmap_gpu),
> > > > >  	};
> > > > >  
> > > > > -	return i915_subtests(tests, i915);
> > > > > +	return i915_live_subtests(tests, i915);
> > > > >  }
> > > > > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > > > > index 740ee8086a27..ae1361c7c4cf 100644
> > > > > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > > > > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> > > > > @@ -95,5 +95,5 @@ int i915_gem_object_live_selftests(struct drm_i915_private *i915)
> > > > >  		SUBTEST(igt_gem_huge),
> > > > >  	};
> > > > >  
> > > > > -	return i915_subtests(tests, i915);
> > > > > +	return i915_live_subtests(tests, i915);
> > > > >  }
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > index 8e90a4a0b7b0..96643040defd 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > @@ -472,6 +472,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
> > > > >  	ce->guc_id = GUC_INVALID_LRC_ID;
> > > > >  	INIT_LIST_HEAD(&ce->guc_id_link);
> > > > >  
> > > > > +	INIT_LIST_HEAD(&ce->guc_sched_disable_link);
> > > > > +
> > > > >  	mutex_init(&ce->parallel_submit);
> > > > >  	ce->fence_context = dma_fence_context_alloc(1);
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > index a302599e436a..f4c9036f7f03 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > @@ -215,6 +215,15 @@ static inline bool intel_context_is_barrier(const struct intel_context *ce)
> > > > >  	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
> > > > >  }
> > > > >  
> > > > > +static inline void intel_context_close(struct intel_context *ce)
> > > > > +{
> > > > > +	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > > > > +
> > > > > +	trace_intel_context_close(ce);
> > > > > +	if (ce->ops->close)
> > > > > +		ce->ops->close(ce);
> > > > > +}
> > > > > +
> > > > >  static inline bool intel_context_is_closed(const struct intel_context *ce)
> > > > >  {
> > > > >  	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > index 8af9ace4c052..53f00657a45c 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > @@ -11,6 +11,7 @@
> > > > >  #include <linux/list.h>
> > > > >  #include <linux/mutex.h>
> > > > >  #include <linux/types.h>
> > > > > +#include <linux/ktime.h>
> > > > >  
> > > > >  #include "i915_active_types.h"
> > > > >  #include "i915_sw_fence.h"
> > > > > @@ -38,6 +39,7 @@ struct intel_context_ops {
> > > > >  	int (*alloc)(struct intel_context *ce);
> > > > >  
> > > > >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > > > > +	void (*close)(struct intel_context *ce);
> > > > >  
> > > > >  	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > > > >  	int (*pin)(struct intel_context *ce);
> > > > > @@ -203,6 +205,12 @@ struct intel_context {
> > > > >  	 */
> > > > >  	struct list_head guc_id_link;
> > > > >  
> > > > > +	/*
> > > > > +	 * GuC schedule disable link / time
> > > > > +	 */
> > > > > +	struct list_head guc_sched_disable_link;
> > > > > +	ktime_t guc_sched_disable_time;
> > > > > +
> > > > >  	/* GuC context blocked fence */
> > > > >  	struct i915_sw_fence guc_blocked;
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > index 30a0f364db8f..90b5b657d411 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > @@ -60,6 +60,7 @@ struct intel_guc {
> > > > >  	struct ida guc_ids;
> > > > >  	u32 num_guc_ids;
> > > > >  	u32 max_guc_ids;
> > > > > +	u32 guc_ids_in_use[GUC_SUBMIT_ENGINE_MAX];
> > > > >  	unsigned long *guc_ids_bitmap;
> > > > >  #define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
> > > > >  	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
> > > > > @@ -69,6 +70,12 @@ struct intel_guc {
> > > > >  	struct list_head destroyed_contexts;
> > > > >  	struct intel_gt_pm_unpark_work destroy_worker;
> > > > >  
> > > > > +	spinlock_t sched_disable_lock;	/* protects schedule disable list */
> > > > > +	struct list_head sched_disable_list;
> > > > > +	struct hrtimer sched_disable_timer;
> > > > > +#define SCHED_DISABLE_DELAY_NS	1000000000
> > > > > +	u64 sched_disable_delay_ns;
> > > > > +
> > > > >  	bool submission_supported;
> > > > >  	bool submission_selected;
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > > > > index 7c479c5e7b3a..53a6f3da6cce 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> > > > > @@ -80,12 +80,40 @@ static int guc_num_id_set(void *data, u64 val)
> > > > >  }
> > > > >  DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
> > > > >  
> > > > > +static int guc_sched_disable_delay_ns_get(void *data, u64 *val)
> > > > > +{
> > > > > +	struct intel_guc *guc = data;
> > > > > +
> > > > > +	if (!intel_guc_submission_is_used(guc))
> > > > > +		return -ENODEV;
> > > > > +
> > > > > +	*val = guc->sched_disable_delay_ns;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static int guc_sched_disable_delay_ns_set(void *data, u64 val)
> > > > > +{
> > > > > +	struct intel_guc *guc = data;
> > > > > +
> > > > > +	if (!intel_guc_submission_is_used(guc))
> > > > > +		return -ENODEV;
> > > > > +
> > > > > +	guc->sched_disable_delay_ns = val;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ns_fops,
> > > > > +			guc_sched_disable_delay_ns_get,
> > > > > +			guc_sched_disable_delay_ns_set, "%lld\n");
> > > > > +
> > > > >  void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
> > > > >  {
> > > > >  	static const struct debugfs_gt_file files[] = {
> > > > >  		{ "guc_info", &guc_info_fops, NULL },
> > > > >  		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
> > > > >  		{ "guc_num_id", &guc_num_id_fops, NULL },
> > > > > +		{ "guc_sched_disable_delay_ns", &guc_sched_disable_delay_ns_fops, NULL },
> > > > >  	};
> > > > >  
> > > > >  	if (!intel_guc_is_supported(guc))
> > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > index cd1893edf43a..dc0d6a099bee 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > @@ -654,11 +654,15 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
> > > > >  	return (timeout < 0) ? timeout : 0;
> > > > >  }
> > > > >  
> > > > > +static void sched_disable_contexts_flush(struct intel_guc *guc);
> > > > > +
> > > > >  int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
> > > > >  {
> > > > >  	if (!intel_uc_uses_guc_submission(&guc_to_gt(guc)->uc))
> > > > >  		return 0;
> > > > >  
> > > > > +	sched_disable_contexts_flush(guc);
> > > > > +
> > > > >  	return intel_guc_wait_for_pending_msg(guc,
> > > > >  					      &guc->outstanding_submission_g2h,
> > > > >  					      true, timeout);
> > > > > @@ -1135,6 +1139,7 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce);
> > > > >  static void guc_signal_context_fence(struct intel_context *ce);
> > > > >  static void guc_cancel_context_requests(struct intel_context *ce);
> > > > >  static void guc_blocked_fence_complete(struct intel_context *ce);
> > > > > +static void sched_disable_context_delete(struct intel_context *ce);
> > > > >  
> > > > >  static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > > > >  {
> > > > > @@ -1160,6 +1165,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
> > > > >  		deregister = context_wait_for_deregister_to_register(ce);
> > > > >  		banned = context_banned(ce);
> > > > >  		init_sched_state(ce);
> > > > > +		sched_disable_context_delete(ce);
> > > > >  
> > > > >  		if (pending_enable || destroyed || deregister) {
> > > > >  			atomic_dec(&guc->outstanding_submission_g2h);
> > > > > @@ -1299,6 +1305,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > > > >  
> > > > >  	intel_gt_park_heartbeats(guc_to_gt(guc));
> > > > >  	disable_submission(guc);
> > > > > +	hrtimer_cancel(&guc->sched_disable_timer);
> > > > >  	guc->interrupts.disable(guc);
> > > > >  
> > > > >  	/* Flush IRQ handler */
> > > > > @@ -1656,6 +1663,8 @@ static void guc_lrcd_reg_fini(struct intel_guc *guc);
> > > > >  
> > > > >  static void destroy_worker_func(struct work_struct *w);
> > > > >  
> > > > > +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer);
> > > > > +
> > > > >  /*
> > > > >   * Set up the memory resources to be shared with the GuC (via the GGTT)
> > > > >   * at firmware loading time.
> > > > > @@ -1687,6 +1696,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > > > >  	INIT_LIST_HEAD(&guc->destroyed_contexts);
> > > > >  	intel_gt_pm_unpark_work_init(&guc->destroy_worker, destroy_worker_func);
> > > > >  
> > > > > +	spin_lock_init(&guc->sched_disable_lock);
> > > > > +	INIT_LIST_HEAD(&guc->sched_disable_list);
> > > > > +	hrtimer_init(&guc->sched_disable_timer, CLOCK_MONOTONIC,
> > > > > +		     HRTIMER_MODE_REL);
> > > > > +	guc->sched_disable_timer.function = sched_disable_timer_func;
> > > > > +	guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS;
> > > > > +
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > @@ -1852,6 +1868,12 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > > >  	if (unlikely(ret < 0))
> > > > >  		return ret;
> > > > >  
> > > > > +	if (intel_context_is_parent(ce))
> > > > > +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> > > > > +			order_base_2(ce->guc_number_children + 1);
> > > > > +	else
> > > > > +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]++;
> > > > > +
> > > > >  	ce->guc_id = ret;
> > > > >  	return 0;
> > > > >  }
> > > > > @@ -1860,13 +1882,18 @@ static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
> > > > >  {
> > > > >  	GEM_BUG_ON(intel_context_is_child(ce));
> > > > >  	if (!context_guc_id_invalid(ce)) {
> > > > > -		if (intel_context_is_parent(ce))
> > > > > +		if (intel_context_is_parent(ce)) {
> > > > > +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> > > > > +				order_base_2(ce->guc_number_children + 1);
> > > > >  			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
> > > > >  					      order_base_2(ce->guc_number_children
> > > > >  							   + 1));
> > > > > -		else
> > > > > +		} else {
> > > > > +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]--;
> > > > >  			ida_simple_remove(&guc->guc_ids, ce->guc_id);
> > > > > +		}
> > > > >  		clr_lrc_desc_registered(guc, ce->guc_id);
> > > > > +
> > > > >  		set_context_guc_id_invalid(ce);
> > > > >  	}
> > > > >  	if (!list_empty(&ce->guc_id_link))
> > > > > @@ -1931,9 +1958,13 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
> > > > >  			 * from another context that has more guc_id that itself.
> > > > >  			 */
> > > > >  			if (cn_o2 != ce_o2) {
> > > > > +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> > > > > +					order_base_2(cn->guc_number_children + 1);
> > > > >  				bitmap_release_region(guc->guc_ids_bitmap,
> > > > >  						      cn->guc_id,
> > > > >  						      cn_o2);
> > > > > +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> > > > > +					order_base_2(ce->guc_number_children + 1);
> > > > >  				bitmap_allocate_region(guc->guc_ids_bitmap,
> > > > >  						       ce->guc_id,
> > > > >  						       ce_o2);
> > > > > @@ -2538,7 +2569,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > > > >  	__guc_context_unpin(ce);
> > > > >  
> > > > >  	if (likely(!intel_context_is_barrier(ce)))
> > > > > -		intel_engine_pm_put(ce->engine);
> > > > > +		intel_engine_pm_put_async(ce->engine);
> > > > >  }
> > > > >  
> > > > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > > > @@ -2665,11 +2696,11 @@ static void guc_parent_context_unpin(struct intel_context *ce)
> > > > >  
> > > > >  	for_each_engine_masked(engine, ce->engine->gt,
> > > > >  			       ce->engine->mask, tmp)
> > > > > -		intel_engine_pm_put(engine);
> > > > > +		intel_engine_pm_put_async(engine);
> > > > >  	for_each_child(ce, child)
> > > > >  		for_each_engine_masked(engine, child->engine->gt,
> > > > >  				       child->engine->mask, tmp)
> > > > > -			intel_engine_pm_put(engine);
> > > > > +			intel_engine_pm_put_async(engine);
> > > > >  }
> > > > >  
> > > > >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > > > > @@ -2788,6 +2819,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
> > > > >  
> > > > >  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > > >  
> > > > > +	sched_disable_context_delete(ce);
> > > > > +
> > > > >  	with_intel_runtime_pm(runtime_pm, wakeref)
> > > > >  		__guc_context_sched_disable(guc, ce, guc_id);
> > > > >  
> > > > > @@ -2914,8 +2947,202 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
> > > > >  								     1);
> > > > >  		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > > >  	}
> > > > > +
> > > > > +	sched_disable_context_delete(ce);
> > > > > +}
> > > > > +
> > > > > +#define next_sched_disable_time(guc, now, ce) \
> > > > > +	(guc->sched_disable_delay_ns - \
> > > > > +	 (ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)))
> > > > > +static void ____sched_disable_context_delete(struct intel_guc *guc,
> > > > > +					     struct intel_context *ce)
> > > > > +{
> > > > > +	bool is_first;
> > > > > +
> > > > > +	lockdep_assert_held(&guc->sched_disable_lock);
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > +	GEM_BUG_ON(list_empty(&ce->guc_sched_disable_link));
> > > > > +
> > > > > +	is_first = list_is_first(&ce->guc_sched_disable_link,
> > > > > +				 &guc->sched_disable_list);
> > > > > +	list_del_init(&ce->guc_sched_disable_link);
> > > > > +	if (list_empty(&guc->sched_disable_list)) {
> > > > > +		hrtimer_try_to_cancel(&guc->sched_disable_timer);
> > > > > +	} else if (is_first) {
> > > > > +		struct intel_context *first =
> > > > > +			list_first_entry(&guc->sched_disable_list,
> > > > > +					 typeof(*first),
> > > > > +					 guc_sched_disable_link);
> > > > > +		u64 next_time = next_sched_disable_time(guc, ktime_get(),
> > > > > +							first);
> > > > > +
> > > > > +		hrtimer_start(&guc->sched_disable_timer,
> > > > > +			      ns_to_ktime(next_time),
> > > > > +			      HRTIMER_MODE_REL_PINNED);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static void __sched_disable_context_delete(struct intel_guc *guc,
> > > > > +					   struct intel_context *ce)
> > > > > +{
> > > > > +	lockdep_assert_held(&guc->sched_disable_lock);
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > +
> > > > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > > > +		intel_context_sched_disable_unpin(ce);
> > > > > +		____sched_disable_context_delete(guc, ce);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static void sched_disable_context_delete(struct intel_context *ce)
> > > > > +{
> > > > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > > > +	unsigned long flags;
> > > > > +
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > +
> > > > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > > > +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > > +		__sched_disable_context_delete(guc, ce);
> > > > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static void sched_disable_context_add(struct intel_guc *guc,
> > > > > +				      struct intel_context *ce)
> > > > > +{
> > > > > +	unsigned long flags;
> > > > > +
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> > > > > +
> > > > > +	ce->guc_sched_disable_time = ktime_get();
> > > > > +
> > > > > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > > +	if (list_empty(&guc->sched_disable_list))
> > > > > +		hrtimer_start(&guc->sched_disable_timer,
> > > > > +			      ns_to_ktime(guc->sched_disable_delay_ns),
> > > > > +			      HRTIMER_MODE_REL_PINNED);
> > > > > +	list_add_tail(&ce->guc_sched_disable_link, &guc->sched_disable_list);
> > > > > +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > > +}
> > > > > +
> > > > > +static void sched_disable_contexts_flush(struct intel_guc *guc)
> > > > > +{
> > > > > +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> > > > > +	struct intel_context *ce, *cn;
> > > > > +	unsigned long flags;
> > > > > +
> > > > > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > > +
> > > > > +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> > > > > +					 guc_sched_disable_link) {
> > > > > +		intel_wakeref_t wakeref;
> > > > > +		bool enabled;
> > > > > +		u16 guc_id;
> > > > > +
> > > > > +		list_del_init(&ce->guc_sched_disable_link);
> > > > > +
> > > > > +		spin_lock(&ce->guc_state.lock);
> > > > > +		enabled = context_enabled(ce);
> > > > > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > > > > +			if (enabled)
> > > > > +				clr_context_enabled(ce);
> > > > > +			spin_unlock(&ce->guc_state.lock);
> > > > > +			intel_context_sched_disable_unpin(ce);
> > > > > +			continue;
> > > > > +		}
> > > > > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > > > > +			spin_unlock(&ce->guc_state.lock);
> > > > > +			continue;
> > > > > +		}
> > > > > +		guc_id = prep_context_pending_disable(ce);
> > > > > +		spin_unlock(&ce->guc_state.lock);
> > > > > +
> > > > > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > > > > +			__guc_context_sched_disable(guc, ce, guc_id);
> > > > > +	}
> > > > > +
> > > > > +	hrtimer_try_to_cancel(&guc->sched_disable_timer);
> > > > > +
> > > > > +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > >  }
> > > > >  
> > > > > +#define should_sched_be_disabled(guc, now, ce) \
> > > > > +	((ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)) > \
> > > > > +	(guc->sched_disable_delay_ns / 4) * 3)
> > > > > +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer)
> > > > > +{
> > > > > +	struct intel_guc *guc = container_of(hrtimer, struct intel_guc,
> > > > > +					     sched_disable_timer);
> > > > > +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> > > > > +	struct intel_context *ce, *cn;
> > > > > +	unsigned long flags;
> > > > > +	ktime_t now;
> > > > > +
> > > > > +	if (list_empty(&guc->sched_disable_list))
> > > > > +		return HRTIMER_NORESTART;
> > > > > +
> > > > > +	now = ktime_get();
> > > > > +
> > > > > +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > > +
> > > > > +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> > > > > +					 guc_sched_disable_link) {
> > > > > +		intel_wakeref_t wakeref;
> > > > > +		bool enabled;
> > > > > +		u16 guc_id;
> > > > > +
> > > > > +		/*
> > > > > +		 * If a context has been waiting for 3/4 of its delay or more,
> > > > > +		 * issue the schedule disable. Using this heuristic allows more
> > > > > +		 * than 1 context to have its scheduling disabled when this
> > > > > +		 * timer is run.
> > > > > +		 */
> > > > > +		if (!should_sched_be_disabled(guc, now, ce))
> > > > > +			break;
> > > > > +
> > > > > +		list_del_init(&ce->guc_sched_disable_link);
> > > > > +
> > > > > +		spin_lock(&ce->guc_state.lock);
> > > > > +		enabled = context_enabled(ce);
> > > > > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > > > > +			if (enabled)
> > > > > +				clr_context_enabled(ce);
> > > > > +			spin_unlock(&ce->guc_state.lock);
> > > > > +			intel_context_sched_disable_unpin(ce);
> > > > > +			continue;
> > > > > +		}
> > > > > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > > > > +			spin_unlock(&ce->guc_state.lock);
> > > > > +			continue;
> > > > > +		}
> > > > > +		guc_id = prep_context_pending_disable(ce);
> > > > > +		spin_unlock(&ce->guc_state.lock);
> > > > > +
> > > > > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > > > > +			__guc_context_sched_disable(guc, ce, guc_id);
> > > > > +	}
> > > > > +
> > > > > +	if (!list_empty(&guc->sched_disable_list)) {
> > > > > +		struct intel_context *first =
> > > > > +			list_first_entry(&guc->sched_disable_list,
> > > > > +					 typeof(*first),
> > > > > +					 guc_sched_disable_link);
> > > > > +		u64 next_time = next_sched_disable_time(guc, now, first);
> > > > > +
> > > > > +		hrtimer_forward(hrtimer, now, ns_to_ktime(next_time));
> > > > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > > +
> > > > > +		return HRTIMER_RESTART;
> > > > > +	} else {
> > > > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > > +
> > > > > +		return HRTIMER_NORESTART;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +#define guc_id_pressure(max, in_use)	(in_use > (max / 4) * 3)
> > > > >  static void guc_context_sched_disable(struct intel_context *ce)
> > > > >  {
> > > > >  	struct intel_guc *guc = ce_to_guc(ce);
> > > > > @@ -2924,8 +3151,14 @@ static void guc_context_sched_disable(struct intel_context *ce)
> > > > >  	intel_wakeref_t wakeref;
> > > > >  	u16 guc_id;
> > > > >  	bool enabled;
> > > > > +	int guc_id_index = intel_context_is_parent(ce) ?
> > > > > +		GUC_SUBMIT_ENGINE_MULTI_LRC : GUC_SUBMIT_ENGINE_SINGLE_LRC;
> > > > > +	int max_guc_ids = intel_context_is_parent(ce) ?
> > > > > +	       NUMBER_MULTI_LRC_GUC_ID(guc) :
> > > > > +	       guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
> > > > >  
> > > > >  	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> > > > >  
> > > > >  	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
> > > > >  	    !lrc_desc_registered(guc, ce->guc_id)) {
> > > > > @@ -2936,6 +3169,18 @@ static void guc_context_sched_disable(struct intel_context *ce)
> > > > >  	if (!context_enabled(ce))
> > > > >  		goto unpin;
> > > > >  
> > > > > +	/*
> > > > > +	 * If no guc_id pressure and the context isn't closed we delay the
> > > > > +	 * schedule disable so as not to continuously disable / enable scheduling
> > > > > +	 * putting pressure on both the i915 and GuC. Delay is configurable via
> > > > > +	 * debugfs, default 1s.
> > > > > +	 */
> > > > > +	if (!guc_id_pressure(max_guc_ids, guc->guc_ids_in_use[guc_id_index]) &&
> > > > > +	    !intel_context_is_closed(ce) && guc->sched_disable_delay_ns) {
> > > > > +		sched_disable_context_add(guc, ce);
> > > > > +		return;
> > > > > +	}
> > > > > +
> > > > >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > > > >  
> > > > >  	/*
> > > > > @@ -3294,6 +3539,58 @@ static void remove_from_context(struct i915_request *rq)
> > > > >  	i915_request_notify_execute_cb_imm(rq);
> > > > >  }
> > > > >  
> > > > > +static void __guc_context_close(struct intel_guc *guc,
> > > > > +				struct intel_context *ce)
> > > > > +{
> > > > > +	lockdep_assert_held(&guc->sched_disable_lock);
> > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > +
> > > > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > > > +		struct intel_runtime_pm *runtime_pm =
> > > > > +			ce->engine->uncore->rpm;
> > > > > +		intel_wakeref_t wakeref;
> > > > > +		bool enabled;
> > > > > +		u16 guc_id;
> > > > > +
> > > > > +		spin_lock(&ce->guc_state.lock);
> > > > > +		enabled = context_enabled(ce);
> > > > > +		if (unlikely(!enabled || submission_disabled(guc))) {
> > > > > +			if (enabled)
> > > > > +				clr_context_enabled(ce);
> > > > > +			spin_unlock(&ce->guc_state.lock);
> > > > > +			intel_context_sched_disable_unpin(ce);
> > > > > +			goto update_list;
> > > > > +		}
> > > > > +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> > > > > +			spin_unlock(&ce->guc_state.lock);
> > > > > +			goto update_list;
> > > > > +		}
> > > > > +		guc_id = prep_context_pending_disable(ce);
> > > > > +		spin_unlock(&ce->guc_state.lock);
> > > > > +
> > > > > +		with_intel_runtime_pm(runtime_pm, wakeref)
> > > > > +			__guc_context_sched_disable(guc, ce, guc_id);
> > > > > +update_list:
> > > > > +		____sched_disable_context_delete(guc, ce);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static void guc_context_close(struct intel_context *ce)
> > > > > +{
> > > > > +	struct intel_guc *guc = ce_to_guc(ce);
> > > > > +	unsigned long flags;
> > > > > +
> > > > > +	/*
> > > > > +	 * If we close the context and a schedule disable is pending a delay, do
> > > > > +	 * it immediately.
> > > > > +	 */
> > > > > +	if (!list_empty(&ce->guc_sched_disable_link)) {
> > > > > +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> > > > > +		__guc_context_close(guc, ce);
> > > > > +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > >  static struct intel_context *
> > > > >  guc_create_parallel(struct intel_engine_cs **engines,
> > > > >  		    unsigned int num_siblings,
> > > > > @@ -3308,6 +3605,7 @@ static const struct intel_context_ops guc_context_ops = {
> > > > >  	.post_unpin = guc_context_post_unpin,
> > > > >  
> > > > >  	.ban = guc_context_ban,
> > > > > +	.close = guc_context_close,
> > > > >  
> > > > >  	.cancel_request = guc_context_cancel_request,
> > > > >  
> > > > > @@ -3538,6 +3836,10 @@ static int guc_request_alloc(struct i915_request *rq)
> > > > >  
> > > > >  	rq->reserved_space -= GUC_REQUEST_SIZE;
> > > > >  
> > > > > +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link) &&
> > > > > +		   atomic_read(&ce->pin_count) < 3);
> > > > > +	sched_disable_context_delete(ce);
> > > > > +
> > > > >  	/*
> > > > >  	 * guc_ids are exhausted or a heuristic is met indicating too many
> > > > >  	 * guc_ids are waiting on requests with submission dependencies (not
> > > > > @@ -3667,7 +3969,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> > > > >  	__guc_context_unpin(ce);
> > > > >  
> > > > >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > > -		intel_engine_pm_put(engine);
> > > > > +		intel_engine_pm_put_async(engine);
> > > > >  }
> > > > >  
> > > > >  static void guc_virtual_context_enter(struct intel_context *ce)
> > > > > @@ -3708,6 +4010,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> > > > >  	.post_unpin = guc_context_post_unpin,
> > > > >  
> > > > >  	.ban = guc_context_ban,
> > > > > +	.close = guc_context_close,
> > > > >  
> > > > >  	.cancel_request = guc_context_cancel_request,
> > > > >  
> > > > > @@ -3819,6 +4122,7 @@ static const struct intel_context_ops virtual_parent_context_ops = {
> > > > >  	.post_unpin = guc_parent_context_post_unpin,
> > > > >  
> > > > >  	.ban = guc_context_ban,
> > > > > +	.close = guc_context_close,
> > > > >  
> > > > >  	.enter = guc_virtual_context_enter,
> > > > >  	.exit = guc_virtual_context_exit,
> > > > > @@ -4924,7 +5228,11 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
> > > > >  	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
> > > > >  		   atomic_read(&guc->outstanding_submission_g2h));
> > > > >  	drm_printf(p, "GuC Number GuC IDs: %d\n", guc->num_guc_ids);
> > > > > -	drm_printf(p, "GuC Max Number GuC IDs: %d\n\n", guc->max_guc_ids);
> > > > > +	drm_printf(p, "GuC Max Number GuC IDs: %d\n", guc->max_guc_ids);
> > > > > +	drm_printf(p, "GuC single-lrc GuC IDs in use: %d\n",
> > > > > +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]);
> > > > > +	drm_printf(p, "GuC multi-lrc GuC IDs in use: %d\n",
> > > > > +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC]);
> > > > >  	drm_printf(p, "GuC max context registered: %u\n\n",
> > > > >  		   guc->lrcd_reg.max_idx);
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > > > > index 9cfecf9d368e..ad70b3159ce4 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> > > > > @@ -174,7 +174,8 @@ static int multi_lrc_not_blocked(struct intel_gt *gt, bool flow_control)
> > > > >  #define NUM_RQ_PER_CONTEXT	2
> > > > >  #define HEARTBEAT_INTERVAL	1500
> > > > >  
> > > > > -static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang)
> > > > > +static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids,
> > > > > +					bool hang, bool sched_disable_delay)
> > > > >  {
> > > > >  	struct intel_gt *gt = arg;
> > > > >  	struct intel_guc *guc = &gt->uc.guc;
> > > > > @@ -203,6 +204,9 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> > > > >  	if (limit_guc_ids)
> > > > >  		guc->num_guc_ids = NUM_GUC_ID;
> > > > >  
> > > > > +	if (sched_disable_delay)
> > > > > +		guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS / 5;
> > > > > +
> > > > >  	ce = intel_context_create(intel_selftest_find_any_engine(gt));
> > > > >  	if (IS_ERR(ce)) {
> > > > >  		ret = PTR_ERR(ce);
> > > > > @@ -391,6 +395,7 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> > > > >  	guc->num_guc_ids = guc->max_guc_ids;
> > > > >  	guc->gse_hang_expected = false;
> > > > >  	guc->inject_bad_sched_disable = false;
> > > > > +	guc->sched_disable_delay_ns = 0;
> > > > >  	kfree(contexts);
> > > > >  
> > > > >  	return ret;
> > > > > @@ -398,17 +403,22 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
> > > > >  
> > > > >  static int intel_guc_flow_control_guc_ids(void *arg)
> > > > >  {
> > > > > -	return __intel_guc_flow_control_guc(arg, true, false);
> > > > > +	return __intel_guc_flow_control_guc(arg, true, false, false);
> > > > > +}
> > > > > +
> > > > > +static int intel_guc_flow_control_guc_ids_sched_disable_delay(void *arg)
> > > > > +{
> > > > > +	return __intel_guc_flow_control_guc(arg, true, false, true);
> > > > >  }
> > > > >  
> > > > >  static int intel_guc_flow_control_lrcd_reg(void *arg)
> > > > >  {
> > > > > -	return __intel_guc_flow_control_guc(arg, false, false);
> > > > > +	return __intel_guc_flow_control_guc(arg, false, false, false);
> > > > >  }
> > > > >  
> > > > >  static int intel_guc_flow_control_hang_state_machine(void *arg)
> > > > >  {
> > > > > -	return __intel_guc_flow_control_guc(arg, true, true);
> > > > > +	return __intel_guc_flow_control_guc(arg, true, true, false);
> > > > >  }
> > > > >  
> > > > >  #define NUM_RQ_STRESS_CTBS	0x4000
> > > > > @@ -861,6 +871,7 @@ int intel_guc_flow_control(struct drm_i915_private *i915)
> > > > >  	static const struct i915_subtest tests[] = {
> > > > >  		SUBTEST(intel_guc_flow_control_stress_ctbs),
> > > > >  		SUBTEST(intel_guc_flow_control_guc_ids),
> > > > > +		SUBTEST(intel_guc_flow_control_guc_ids_sched_disable_delay),
> > > > >  		SUBTEST(intel_guc_flow_control_lrcd_reg),
> > > > >  		SUBTEST(intel_guc_flow_control_hang_state_machine),
> > > > >  		SUBTEST(intel_guc_flow_control_multi_lrc_guc_ids),
> > > > > diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
> > > > > index f54de0499be7..bf464db7affe 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_selftest.h
> > > > > +++ b/drivers/gpu/drm/i915/i915_selftest.h
> > > > > @@ -92,12 +92,14 @@ int __i915_subtests(const char *caller,
> > > > >  			T, ARRAY_SIZE(T), data)
> > > > >  #define i915_live_subtests(T, data) ({ \
> > > > >  	typecheck(struct drm_i915_private *, data); \
> > > > > +	(data)->gt.uc.guc.sched_disable_delay_ns = 0; \
> > > > >  	__i915_subtests(__func__, \
> > > > >  			__i915_live_setup, __i915_live_teardown, \
> > > > >  			T, ARRAY_SIZE(T), data); \
> > > > >  })
> > > > >  #define intel_gt_live_subtests(T, data) ({ \
> > > > >  	typecheck(struct intel_gt *, data); \
> > > > > +	(data)->uc.guc.sched_disable_delay_ns = 0; \
> > > > >  	__i915_subtests(__func__, \
> > > > >  			__intel_gt_live_setup, __intel_gt_live_teardown, \
> > > > >  			T, ARRAY_SIZE(T), data); \
> > > > > diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> > > > > index 806ad688274b..57ba7065d5ab 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_trace.h
> > > > > +++ b/drivers/gpu/drm/i915/i915_trace.h
> > > > > @@ -933,6 +933,11 @@ DEFINE_EVENT(intel_context, intel_context_reset,
> > > > >  	     TP_ARGS(ce)
> > > > >  );
> > > > >  
> > > > > +DEFINE_EVENT(intel_context, intel_context_close,
> > > > > +	     TP_PROTO(struct intel_context *ce),
> > > > > +	     TP_ARGS(ce)
> > > > > +);
> > > > > +
> > > > >  DEFINE_EVENT(intel_context, intel_context_ban,
> > > > >  	     TP_PROTO(struct intel_context *ce),
> > > > >  	     TP_ARGS(ce)
> > > > > @@ -1035,6 +1040,11 @@ trace_intel_context_reset(struct intel_context *ce)
> > > > >  {
> > > > >  }
> > > > >  
> > > > > +static inline void
> > > > > +trace_intel_context_close(struct intel_context *ce)
> > > > > +{
> > > > > +}
> > > > > +
> > > > >  static inline void
> > > > >  trace_intel_context_ban(struct intel_context *ce)
> > > > >  {
> > > > > diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > > > > index f843a5040706..d54c280217fe 100644
> > > > > --- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > > > > +++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> > > > > @@ -2112,5 +2112,5 @@ int i915_gem_gtt_live_selftests(struct drm_i915_private *i915)
> > > > >  
> > > > >  	GEM_BUG_ON(offset_in_page(i915->ggtt.vm.total));
> > > > >  
> > > > > -	return i915_subtests(tests, i915);
> > > > > +	return i915_live_subtests(tests, i915);
> > > > >  }
> > > > > diff --git a/drivers/gpu/drm/i915/selftests/i915_perf.c b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > > > > index 9e9a6cb1d9e5..86bad00cca95 100644
> > > > > --- a/drivers/gpu/drm/i915/selftests/i915_perf.c
> > > > > +++ b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > > > > @@ -431,7 +431,7 @@ int i915_perf_live_selftests(struct drm_i915_private *i915)
> > > > >  	if (err)
> > > > >  		return err;
> > > > >  
> > > > > -	err = i915_subtests(tests, i915);
> > > > > +	err = i915_live_subtests(tests, i915);
> > > > >  
> > > > >  	destroy_empty_config(&i915->perf);
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
> > > > > index d67710d10615..afbf88865a8b 100644
> > > > > --- a/drivers/gpu/drm/i915/selftests/i915_request.c
> > > > > +++ b/drivers/gpu/drm/i915/selftests/i915_request.c
> > > > > @@ -1693,7 +1693,7 @@ int i915_request_live_selftests(struct drm_i915_private *i915)
> > > > >  	if (intel_gt_is_wedged(&i915->gt))
> > > > >  		return 0;
> > > > >  
> > > > > -	return i915_subtests(tests, i915);
> > > > > +	return i915_live_subtests(tests, i915);
> > > > >  }
> > > > >  
> > > > >  static int switch_to_kernel_sync(struct intel_context *ce, int err)
> > > > > diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > > > > index dd0607254a95..f4b157451851 100644
> > > > > --- a/drivers/gpu/drm/i915/selftests/i915_vma.c
> > > > > +++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > > > > @@ -1085,5 +1085,5 @@ int i915_vma_live_selftests(struct drm_i915_private *i915)
> > > > >  		SUBTEST(igt_vma_remapped_gtt),
> > > > >  	};
> > > > >  
> > > > > -	return i915_subtests(tests, i915);
> > > > > +	return i915_live_subtests(tests, i915);
> > > > >  }
> > > > > -- 
> > > > > 2.28.0
> > > > > 
> > > > 
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  2021-08-11 18:06           ` Matthew Brost
@ 2021-08-12 14:45             ` Daniel Vetter
  2021-08-12 14:52               ` Daniel Vetter
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Vetter @ 2021-08-12 14:45 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Wed, Aug 11, 2021 at 06:06:36PM +0000, Matthew Brost wrote:
> On Tue, Aug 10, 2021 at 11:07:55AM +0200, Daniel Vetter wrote:
> > On Tue, Aug 10, 2021 at 10:53:37AM +0200, Daniel Vetter wrote:
> > > On Mon, Aug 09, 2021 at 06:58:23PM +0000, Matthew Brost wrote:
> > > > On Mon, Aug 09, 2021 at 05:17:34PM +0200, Daniel Vetter wrote:
> > > > > On Tue, Aug 03, 2021 at 03:29:13PM -0700, Matthew Brost wrote:
> > > > > > Implement GuC parent-child context pin / unpin functions in which, if any
> > > > > > context in the relationship is pinned, all the contexts are pinned. The
> > > > > > parent owns most of the pinning / unpinning process and the children
> > > > > > direct any pins / unpins to the parent.
> > > > > > 
> > > > > > Patch implements a number of unused functions that will be connected
> > > > > > later in the series.
> > > > > > 
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/gt/intel_context.c       | 187 ++++++++++++++++--
> > > > > >  drivers/gpu/drm/i915/gt/intel_context.h       |  43 +---
> > > > > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
> > > > > >  .../drm/i915/gt/intel_execlists_submission.c  |  25 ++-
> > > > > >  drivers/gpu/drm/i915/gt/intel_lrc.c           |  26 +--
> > > > > >  drivers/gpu/drm/i915/gt/intel_lrc.h           |   6 +-
> > > > > >  .../gpu/drm/i915/gt/intel_ring_submission.c   |   5 +-
> > > > > >  drivers/gpu/drm/i915/gt/mock_engine.c         |   4 +-
> > > > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 183 +++++++++++++++--
> > > > > >  9 files changed, 371 insertions(+), 112 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > > index 8cb92b10b547..bb4c14656067 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > > @@ -158,8 +158,8 @@ static void __ring_retire(struct intel_ring *ring)
> > > > > >  	intel_ring_unpin(ring);
> > > > > >  }
> > > > > >  
> > > > > > -static int intel_context_pre_pin(struct intel_context *ce,
> > > > > > -				 struct i915_gem_ww_ctx *ww)
> > > > > > +static int __intel_context_pre_pin(struct intel_context *ce,
> > > > > > +				   struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > >  	int err;
> > > > > >  
> > > > > > @@ -190,7 +190,7 @@ static int intel_context_pre_pin(struct intel_context *ce,
> > > > > >  	return err;
> > > > > >  }
> > > > > >  
> > > > > > -static void intel_context_post_unpin(struct intel_context *ce)
> > > > > > +static void __intel_context_post_unpin(struct intel_context *ce)
> > > > > >  {
> > > > > >  	if (ce->state)
> > > > > >  		__context_unpin_state(ce->state);
> > > > > > @@ -199,13 +199,85 @@ static void intel_context_post_unpin(struct intel_context *ce)
> > > > > >  	__ring_retire(ce->ring);
> > > > > >  }
> > > > > >  
> > > > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > -			      struct i915_gem_ww_ctx *ww)
> > > > > > +static int intel_context_pre_pin(struct intel_context *ce,
> > > > > > +				 struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > > -	bool handoff = false;
> > > > > > -	void *vaddr;
> > > > > > +	struct intel_context *child;
> > > > > > +	int err, i = 0;
> > > > > > +
> > > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > > +
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		err = __intel_context_pre_pin(child, ww);
> > > > > > +		if (unlikely(err))
> > > > > > +			goto unwind;
> > > > > > +		++i;
> > > > > > +	}
> > > > > > +
> > > > > > +	err = __intel_context_pre_pin(ce, ww);
> > > > > > +	if (unlikely(err))
> > > > > > +		goto unwind;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +unwind:
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		if (!i--)
> > > > > > +			break;
> > > > > > +		__intel_context_post_unpin(ce);
> > > > > > +	}
> > > > > > +
> > > > > > +	return err;
> > > > > > +}
> > > > > > +
> > > > > > +static void intel_context_post_unpin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	struct intel_context *child;
> > > > > > +
> > > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > > +
> > > > > > +	for_each_child(ce, child)
> > > > > > +		__intel_context_post_unpin(child);
> > > > > > +
> > > > > > +	__intel_context_post_unpin(ce);
> > > > > > +}
> > > > > > +
> > > > > > +static int __do_ww_lock(struct intel_context *ce,
> > > > > > +			struct i915_gem_ww_ctx *ww)
> > > > > > +{
> > > > > > +	int err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > > > +
> > > > > > +	if (!err && ce->ring->vma->obj)
> > > > > > +		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > > > +	if (!err && ce->state)
> > > > > > +		err = i915_gem_object_lock(ce->state->obj, ww);
> > > > > > +
> > > > > > +	return err;
> > > > > > +}
> > > > > > +
> > > > > > +static int do_ww_lock(struct intel_context *ce,
> > > > > > +		      struct i915_gem_ww_ctx *ww)
> > > > > > +{
> > > > > > +	struct intel_context *child;
> > > > > >  	int err = 0;
> > > > > >  
> > > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > > +
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		err = __do_ww_lock(child, ww);
> > > > > > +		if (unlikely(err))
> > > > > > +			return err;
> > > > > > +	}
> > > > > > +
> > > > > > +	return __do_ww_lock(ce, ww);
> > > > > > +}
> > > > > > +
> > > > > > +static int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > +				     struct i915_gem_ww_ctx *ww)
> > > > > > +{
> > > > > > +	bool handoff = false;
> > > > > > +	int err;
> > > > > > +
> > > > > >  	if (unlikely(!test_bit(CONTEXT_ALLOC_BIT, &ce->flags))) {
> > > > > >  		err = intel_context_alloc_state(ce);
> > > > > >  		if (err)
> > > > > > @@ -217,14 +289,11 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > >  	 * refcount for __intel_context_active(), which prevent a lock
> > > > > >  	 * inversion of ce->pin_mutex vs dma_resv_lock().
> > > > > >  	 */
> > > > > > +	err = do_ww_lock(ce, ww);
> > > > > > +	if (err)
> > > > > > +		return err;
> > > > > >  
> > > > > > -	err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > > > -	if (!err && ce->ring->vma->obj)
> > > > > > -		err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > > > -	if (!err && ce->state)
> > > > > > -		err = i915_gem_object_lock(ce->state->obj, ww);
> > > > > > -	if (!err)
> > > > > > -		err = intel_context_pre_pin(ce, ww);
> > > > > > +	err = intel_context_pre_pin(ce, ww);
> > > > > >  	if (err)
> > > > > >  		return err;
> > > > > >  
> > > > > > @@ -232,7 +301,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > >  	if (err)
> > > > > >  		goto err_ctx_unpin;
> > > > > >  
> > > > > > -	err = ce->ops->pre_pin(ce, ww, &vaddr);
> > > > > > +	err = ce->ops->pre_pin(ce, ww);
> > > > > >  	if (err)
> > > > > >  		goto err_release;
> > > > > >  
> > > > > > @@ -250,7 +319,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > >  		if (unlikely(err))
> > > > > >  			goto err_unlock;
> > > > > >  
> > > > > > -		err = ce->ops->pin(ce, vaddr);
> > > > > > +		err = ce->ops->pin(ce);
> > > > > >  		if (err) {
> > > > > >  			intel_context_active_release(ce);
> > > > > >  			goto err_unlock;
> > > > > > @@ -290,7 +359,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > >  	return err;
> > > > > >  }
> > > > > >  
> > > > > > -int __intel_context_do_pin(struct intel_context *ce)
> > > > > > +static int __intel_context_do_pin(struct intel_context *ce)
> > > > > >  {
> > > > > >  	struct i915_gem_ww_ctx ww;
> > > > > >  	int err;
> > > > > > @@ -337,7 +406,7 @@ static void __intel_context_retire(struct i915_active *active)
> > > > > >  		 intel_context_get_avg_runtime_ns(ce));
> > > > > >  
> > > > > >  	set_bit(CONTEXT_VALID_BIT, &ce->flags);
> > > > > > -	intel_context_post_unpin(ce);
> > > > > > +	__intel_context_post_unpin(ce);
> > > > > >  	intel_context_put(ce);
> > > > > >  }
> > > > > >  
> > > > > > @@ -562,6 +631,88 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> > > > > >  	child->parent = parent;
> > > > > >  }
> > > > > >  
> > > > > > +static inline int ____intel_context_pin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	if (likely(intel_context_pin_if_active(ce)))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	return __intel_context_do_pin(ce);
> > > > > > +}
> > > > > > +
> > > > > > +static inline int __intel_context_pin_ww(struct intel_context *ce,
> > > > > > +					 struct i915_gem_ww_ctx *ww)
> > > > > > +{
> > > > > > +	if (likely(intel_context_pin_if_active(ce)))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	return __intel_context_do_pin_ww(ce, ww);
> > > > > > +}
> > > > > > +
> > > > > > +static inline void __intel_context_unpin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	if (!ce->ops->sched_disable) {
> > > > > > +		__intel_context_do_unpin(ce, 1);
> > > > > > +	} else {
> > > > > > +		/*
> > > > > > +		 * Move ownership of this pin to the scheduling disable which is
> > > > > > +		 * an async operation. When that operation completes the above
> > > > > > +		 * intel_context_sched_disable_unpin is called potentially
> > > > > > +		 * unpinning the context.
> > > > > > +		 */
> > > > > > +		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > > > +			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > 
> > Just as an example of what I mean here on the code review side. This is an
> > endless loop, and you need to prove that there are no livelock or starvation
> > issues. Or explain how else you handle that if there is one.
> > 
> 
> If we pop into the while loop the pin_count = 1, so in all likelihood the
> following if evaluates to true and the loop is broken. The only way it
> evaluates to false is if the context gets pinned between the while & if, so
> on the next pass the while statement should evaluate to false, breaking
> the loop unless of course the context gets unpinned again... In
> practice this should be at most 3 atomic operations unless the loop is
> broken.

That's not really how this works ... E.g. the linux spinlocks are
ticketed/queued locks, exactly because the "this likely doesn't happen"
argument is not a very good one.

Can't we at least convert intel_context->pin_count into a normal counter
with a spinlock around it and then just make all this stuff a lot more
reasonable?
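
Very roughly I mean something like this (just a sketch, the struct and
function names are made up and it obviously doesn't map 1:1 onto the
existing pin/unpin paths):

#include <linux/spinlock.h>
#include <linux/types.h>

struct pin_state {
	spinlock_t lock;	/* protects count and the sched-disable decision */
	unsigned int count;
};

static void pin_get(struct pin_state *p)
{
	spin_lock(&p->lock);
	p->count++;
	spin_unlock(&p->lock);
}

/* Returns true when the caller should kick off the async sched disable. */
static bool pin_put(struct pin_state *p)
{
	bool last;

	spin_lock(&p->lock);
	last = !--p->count;
	spin_unlock(&p->lock);

	return last;
}

Decide under the lock, do the actual sched_disable call outside of it, and
there's no memory ordering left to reason about.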

Yes this doesn't help the situation overall much, but at least we're not
spreading dubious hand-roll locking patterns all over the place for not
much good reasons.

Hand-crafted artisanal locking is never a quality sign.
-Daniel

> 
> Matt
> 
> > Because unlike hand-rolled stuff linux kernel spinlocks are not dumb
> > spinlocks, but ticketed/queued locks and therefore starvation-proof. But
> > this stuff actually matters on today's multi-core and not-so-uniform (even
> > without full NUMA) architectures.
> > 
> > Also I've just found another lockless retry loop which does actually
> > degenerate into a full endless loop (if you're sufficiently unlucky in
> > your races), so this really isn't academic at all.
> > -Daniel
> > 
> > > > > 
> > > > > Uh man lockless algorithms.
> > > > > 
> > > > > Unless this comes:
> > > > > - with essentially an academic looking paper that describes the abstract
> > > > >   model of the lockless algorithm and proves it against the linux kernel
> > > > >   memory model.
> > > > > 
> > > > > - lockless stuff generally needs barriers, and those barriers must all be
> > > > >   documented. This means a) a comment next to each barrier in the code b)
> > > > >   pointing to its counterparty c) with the overall design also explained
> > > > >   in the kerneldoc for those data structures.
> > > > > 
> > > > >   If you don't know where your barriers are, see above point about "it
> > > > >   should look more like an academic paper in the commit message"
> > > > > 
> > > > > - hard perf data about how this is absolutely required, based on a
> > > > >   real-world use-case (which then sometimes justifies a microbenchmark
> > > > >   metric for the details, but it always needs to be real-world based). And
> > > > >   also a thorough explainer of how the perf issue isn't fixable through
> > > > >   better design. If that's not doable, just protect the state machine with
> > > > >   a big dumb lock and move on.
> > > > > 
> > > > > - Also, because the current code is in such bad shape wrt lockless
> > > > >   algorithms and premature optimizations: Overall complexity should go
> > > > >   down (it's way too high right now), so pay down your new lockless trick
> > > > >   by removing one of the existing ones that we only have because we can.
> > > > > 
> > > > > Yes this is steep, but we're way out in the woods here and need to somehow
> > > > > get back.
> > > > 
> > > > See below FIXME. At one point all of this was hidden in the backend but
> > > > the dma-resv patches that landed upstream completely broke the layering,
> > > > hence the need for the code here.
> > > > 
> > > > I guess I don't really understand what you mean when you say a lockless alg
> > > > needs barriers; if the atomic functions are not really atomic wouldn't
> > > > the world be broken?
> > > 
> > > They are unordered atomics by default. Which means they're atomic in
> > > themselves, but entirely unordered with anything else that's going on.
> > > Except when you have one of the atomic ops which already guarantee a
> > > barrier, or you manually add the barriers yourself. And yes there are
> > > enormous amounts of bugs, and with our dgpu potentially running on
> > > non-IA cpus those bugs matter.
> > > 
> > > Note that in C++ atomics the default behaviour is strongly ordered atomics
> > > with full barriers, because those are much easier to program against.
> > > The kernel isn't like that and defaults to "you need to add all the
> > > barriers yourself".
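> > >
> > > As a purely illustrative example (a made-up publish/consume pair, not code
> > > from this series), this is the level of documentation I mean, every barrier
> > > commented and pointing at its counterpart:
> > >
> > > #include <linux/atomic.h>
> > > #include <linux/types.h>
> > >
> > > struct thing {
> > > 	int payload;
> > > 	int ready;
> > > };
> > >
> > > static void publish(struct thing *t, int data)
> > > {
> > > 	t->payload = data;
> > > 	/*
> > > 	 * Release: orders the payload store before the ready store.
> > > 	 * Pairs with the smp_load_acquire() in consume() below.
> > > 	 */
> > > 	smp_store_release(&t->ready, 1);
> > > }
> > >
> > > static bool consume(struct thing *t, int *data)
> > > {
> > > 	/*
> > > 	 * Acquire: pairs with the smp_store_release() in publish() above,
> > > 	 * so seeing ready != 0 guarantees the payload store is visible too.
> > > 	 */
> > > 	if (!smp_load_acquire(&t->ready))
> > > 		return false;
> > >
> > > 	*data = t->payload;
> > > 	return true;
> > > }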
> > > 
> > > I have a full length rant in the works and will work that through all
> > > channels, but essentially locking is really hard to get right. And
> > > lockless tricks practically need an academic paper with a formal
> > > correctness proof against the linux memory model, or you do have bugs.
> > > 
> > > And I know that the current code is chock full of this stuff, so it's
> > > tempting to just add more, but we really can't. The amount of locking
> > > trickery we have in the codebase must go down substantially. My take is
> > > that any code that adds anything tricky needs to fully justify it against
> > > the above list, _and_ also clean up some of the existing nonsense so that
> > > overall complexity doesn't increase.
> > > 
> > > I'll share the full length rant with you internally, it's not yet ready
> > > for publishing (but that's planned too).
> > > 
> > > 
> > > > Also here I don't think it is really as simple as grabbing a big dumb lock
> > > > for a variety of reasons, at least with the current dynamic pin / unpin
> > > > code in place. If we move to perma-pinned contexts this could be cleaned
> > > > up then.
> > > 
> > > Yes it's a disaster, but we need to stop the bleeding. If perma-pinned
> > > contexts can fix this I think we should do this asap. I'd say for parallel
> > > contexts we should just do it outright (special-case them or whatever) so
> > > that we don't have to add even more very tricky code and tech debt.
> > > 
> > > Doable?
> > > 
> > > Cheers, Daniel
> > > 
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > > -Daniel
> > > > > 
> > > > > > +				ce->ops->sched_disable(ce);
> > > > > > +				break;
> > > > > > +			}
> > > > > > +		}
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * FIXME: This is ugly, these branches are only needed for parallel contexts in
> > > > > > + * GuC submission. Basically the idea is if any of the contexts, that are
> > > > > > + * configured for parallel submission, are pinned all the contexts need to be
> > > > > > + * pinned in order to register these contexts with the GuC. We are adding the
> > > > > > + * layer here while it should probably be pushed to the backend via a vfunc. But
> > > > > > + * since we already have ce->pin + a layer atop it is confusing. Definitely
> > > > > > + * needs a bit of rework how to properly layer / structure this code path. What
> > > > > > + * is in place works but is not ideal.
> > > > > > + */
> > > > > > +int intel_context_pin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	if (intel_context_is_child(ce)) {
> > > > > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > > > > +			return ____intel_context_pin(ce->parent);
> > > > > > +		else
> > > > > > +			return 0;
> > > > > > +	} else {
> > > > > > +		return ____intel_context_pin(ce);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > > > +			 struct i915_gem_ww_ctx *ww)
> > > > > > +{
> > > > > > +	if (intel_context_is_child(ce)) {
> > > > > > +		if (!atomic_fetch_add(1, &ce->pin_count))
> > > > > > +			return __intel_context_pin_ww(ce->parent, ww);
> > > > > > +		else
> > > > > > +			return 0;
> > > > > > +	} else {
> > > > > > +		return __intel_context_pin_ww(ce, ww);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +void intel_context_unpin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	if (intel_context_is_child(ce)) {
> > > > > > +		if (atomic_fetch_add(-1, &ce->pin_count) == 1)
> > > > > > +			__intel_context_unpin(ce->parent);
> > > > > > +	} else {
> > > > > > +		__intel_context_unpin(ce);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > > > > >  #include "selftest_context.c"
> > > > > >  #endif
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > > index ad6ce5ac4824..c208691fc87d 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > > @@ -110,31 +110,15 @@ static inline void intel_context_unlock_pinned(struct intel_context *ce)
> > > > > >  	mutex_unlock(&ce->pin_mutex);
> > > > > >  }
> > > > > >  
> > > > > > -int __intel_context_do_pin(struct intel_context *ce);
> > > > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > -			      struct i915_gem_ww_ctx *ww);
> > > > > > -
> > > > > >  static inline bool intel_context_pin_if_active(struct intel_context *ce)
> > > > > >  {
> > > > > >  	return atomic_inc_not_zero(&ce->pin_count);
> > > > > >  }
> > > > > >  
> > > > > > -static inline int intel_context_pin(struct intel_context *ce)
> > > > > > -{
> > > > > > -	if (likely(intel_context_pin_if_active(ce)))
> > > > > > -		return 0;
> > > > > > -
> > > > > > -	return __intel_context_do_pin(ce);
> > > > > > -}
> > > > > > -
> > > > > > -static inline int intel_context_pin_ww(struct intel_context *ce,
> > > > > > -				       struct i915_gem_ww_ctx *ww)
> > > > > > -{
> > > > > > -	if (likely(intel_context_pin_if_active(ce)))
> > > > > > -		return 0;
> > > > > > +int intel_context_pin(struct intel_context *ce);
> > > > > >  
> > > > > > -	return __intel_context_do_pin_ww(ce, ww);
> > > > > > -}
> > > > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > > > +			 struct i915_gem_ww_ctx *ww);
> > > > > >  
> > > > > >  static inline void __intel_context_pin(struct intel_context *ce)
> > > > > >  {
> > > > > > @@ -146,28 +130,11 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub);
> > > > > >  
> > > > > >  static inline void intel_context_sched_disable_unpin(struct intel_context *ce)
> > > > > >  {
> > > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > >  	__intel_context_do_unpin(ce, 2);
> > > > > >  }
> > > > > >  
> > > > > > -static inline void intel_context_unpin(struct intel_context *ce)
> > > > > > -{
> > > > > > -	if (!ce->ops->sched_disable) {
> > > > > > -		__intel_context_do_unpin(ce, 1);
> > > > > > -	} else {
> > > > > > -		/*
> > > > > > -		 * Move ownership of this pin to the scheduling disable which is
> > > > > > -		 * an async operation. When that operation completes the above
> > > > > > -		 * intel_context_sched_disable_unpin is called potentially
> > > > > > -		 * unpinning the context.
> > > > > > -		 */
> > > > > > -		while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > > > -			if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > > > > > -				ce->ops->sched_disable(ce);
> > > > > > -				break;
> > > > > > -			}
> > > > > > -		}
> > > > > > -	}
> > > > > > -}
> > > > > > +void intel_context_unpin(struct intel_context *ce);
> > > > > >  
> > > > > >  void intel_context_enter_engine(struct intel_context *ce);
> > > > > >  void intel_context_exit_engine(struct intel_context *ce);
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > > index 66b22b370a72..eb82be15b7a2 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > > @@ -39,8 +39,8 @@ struct intel_context_ops {
> > > > > >  
> > > > > >  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > > > > >  
> > > > > > -	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
> > > > > > -	int (*pin)(struct intel_context *ce, void *vaddr);
> > > > > > +	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > > > > > +	int (*pin)(struct intel_context *ce);
> > > > > >  	void (*unpin)(struct intel_context *ce);
> > > > > >  	void (*post_unpin)(struct intel_context *ce);
> > > > > >  
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > > index baa1797af1c8..fc74ca28f245 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > > @@ -2554,16 +2554,17 @@ static void execlists_submit_request(struct i915_request *request)
> > > > > >  static int
> > > > > >  __execlists_context_pre_pin(struct intel_context *ce,
> > > > > >  			    struct intel_engine_cs *engine,
> > > > > > -			    struct i915_gem_ww_ctx *ww, void **vaddr)
> > > > > > +			    struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > >  	int err;
> > > > > >  
> > > > > > -	err = lrc_pre_pin(ce, engine, ww, vaddr);
> > > > > > +	err = lrc_pre_pin(ce, engine, ww);
> > > > > >  	if (err)
> > > > > >  		return err;
> > > > > >  
> > > > > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags)) {
> > > > > > -		lrc_init_state(ce, engine, *vaddr);
> > > > > > +		lrc_init_state(ce, engine, ce->lrc_reg_state -
> > > > > > +			       LRC_STATE_OFFSET / sizeof(*ce->lrc_reg_state));
> > > > > >  
> > > > > >  		 __i915_gem_object_flush_map(ce->state->obj, 0, engine->context_size);
> > > > > >  	}
> > > > > > @@ -2572,15 +2573,14 @@ __execlists_context_pre_pin(struct intel_context *ce,
> > > > > >  }
> > > > > >  
> > > > > >  static int execlists_context_pre_pin(struct intel_context *ce,
> > > > > > -				     struct i915_gem_ww_ctx *ww,
> > > > > > -				     void **vaddr)
> > > > > > +				     struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > > -	return __execlists_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > > > +	return __execlists_context_pre_pin(ce, ce->engine, ww);
> > > > > >  }
> > > > > >  
> > > > > > -static int execlists_context_pin(struct intel_context *ce, void *vaddr)
> > > > > > +static int execlists_context_pin(struct intel_context *ce)
> > > > > >  {
> > > > > > -	return lrc_pin(ce, ce->engine, vaddr);
> > > > > > +	return lrc_pin(ce, ce->engine);
> > > > > >  }
> > > > > >  
> > > > > >  static int execlists_context_alloc(struct intel_context *ce)
> > > > > > @@ -3570,20 +3570,19 @@ static int virtual_context_alloc(struct intel_context *ce)
> > > > > >  }
> > > > > >  
> > > > > >  static int virtual_context_pre_pin(struct intel_context *ce,
> > > > > > -				   struct i915_gem_ww_ctx *ww,
> > > > > > -				   void **vaddr)
> > > > > > +				   struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > > > >  
> > > > > >  	 /* Note: we must use a real engine class for setting up reg state */
> > > > > > -	return __execlists_context_pre_pin(ce, ve->siblings[0], ww, vaddr);
> > > > > > +	return __execlists_context_pre_pin(ce, ve->siblings[0], ww);
> > > > > >  }
> > > > > >  
> > > > > > -static int virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > > > +static int virtual_context_pin(struct intel_context *ce)
> > > > > >  {
> > > > > >  	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > > > >  
> > > > > > -	return lrc_pin(ce, ve->siblings[0], vaddr);
> > > > > > +	return lrc_pin(ce, ve->siblings[0]);
> > > > > >  }
> > > > > >  
> > > > > >  static void virtual_context_enter(struct intel_context *ce)
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > > > index bb4af4977920..c466fc966005 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > > > @@ -947,30 +947,30 @@ void lrc_reset(struct intel_context *ce)
> > > > > >  int
> > > > > >  lrc_pre_pin(struct intel_context *ce,
> > > > > >  	    struct intel_engine_cs *engine,
> > > > > > -	    struct i915_gem_ww_ctx *ww,
> > > > > > -	    void **vaddr)
> > > > > > +	    struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > > +	void *vaddr;
> > > > > >  	GEM_BUG_ON(!ce->state);
> > > > > >  	GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
> > > > > >  
> > > > > > -	*vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > > > -					 i915_coherent_map_type(ce->engine->i915,
> > > > > > -								ce->state->obj,
> > > > > > -								false) |
> > > > > > -					 I915_MAP_OVERRIDE);
> > > > > > +	vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > > > +					i915_coherent_map_type(ce->engine->i915,
> > > > > > +							       ce->state->obj,
> > > > > > +							       false) |
> > > > > > +					I915_MAP_OVERRIDE);
> > > > > >  
> > > > > > -	return PTR_ERR_OR_ZERO(*vaddr);
> > > > > > +	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > > > +
> > > > > > +	return PTR_ERR_OR_ZERO(vaddr);
> > > > > >  }
> > > > > >  
> > > > > >  int
> > > > > >  lrc_pin(struct intel_context *ce,
> > > > > > -	struct intel_engine_cs *engine,
> > > > > > -	void *vaddr)
> > > > > > +	struct intel_engine_cs *engine)
> > > > > >  {
> > > > > > -	ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > > > -
> > > > > >  	if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags))
> > > > > > -		lrc_init_state(ce, engine, vaddr);
> > > > > > +		lrc_init_state(ce, engine,
> > > > > > +			       (void *)ce->lrc_reg_state - LRC_STATE_OFFSET);
> > > > > >  
> > > > > >  	ce->lrc.lrca = lrc_update_regs(ce, engine, ce->ring->tail);
> > > > > >  	return 0;
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > > > index 7f697845c4cf..837fcf00270d 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > > > @@ -38,12 +38,10 @@ void lrc_destroy(struct kref *kref);
> > > > > >  int
> > > > > >  lrc_pre_pin(struct intel_context *ce,
> > > > > >  	    struct intel_engine_cs *engine,
> > > > > > -	    struct i915_gem_ww_ctx *ww,
> > > > > > -	    void **vaddr);
> > > > > > +	    struct i915_gem_ww_ctx *ww);
> > > > > >  int
> > > > > >  lrc_pin(struct intel_context *ce,
> > > > > > -	struct intel_engine_cs *engine,
> > > > > > -	void *vaddr);
> > > > > > +	struct intel_engine_cs *engine);
> > > > > >  void lrc_unpin(struct intel_context *ce);
> > > > > >  void lrc_post_unpin(struct intel_context *ce);
> > > > > >  
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > > > index 2958e2fae380..f4f301bfb9f7 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > > > @@ -472,8 +472,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
> > > > > >  }
> > > > > >  
> > > > > >  static int ring_context_pre_pin(struct intel_context *ce,
> > > > > > -				struct i915_gem_ww_ctx *ww,
> > > > > > -				void **unused)
> > > > > > +				struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > >  	struct i915_address_space *vm;
> > > > > >  	int err = 0;
> > > > > > @@ -576,7 +575,7 @@ static int ring_context_alloc(struct intel_context *ce)
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  
> > > > > > -static int ring_context_pin(struct intel_context *ce, void *unused)
> > > > > > +static int ring_context_pin(struct intel_context *ce)
> > > > > >  {
> > > > > >  	return 0;
> > > > > >  }
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > > > index 2c1af030310c..826b5d7a4573 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > > > @@ -167,12 +167,12 @@ static int mock_context_alloc(struct intel_context *ce)
> > > > > >  }
> > > > > >  
> > > > > >  static int mock_context_pre_pin(struct intel_context *ce,
> > > > > > -				struct i915_gem_ww_ctx *ww, void **unused)
> > > > > > +				struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  
> > > > > > -static int mock_context_pin(struct intel_context *ce, void *unused)
> > > > > > +static int mock_context_pin(struct intel_context *ce)
> > > > > >  {
> > > > > >  	return 0;
> > > > > >  }
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > index dec757d319a2..c5c73c42bcf7 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > @@ -1905,6 +1905,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > > > >  
> > > > > >  	GEM_BUG_ON(!engine->mask);
> > > > > >  	GEM_BUG_ON(context_guc_id_invalid(ce));
> > > > > > +	GEM_BUG_ON(intel_context_is_child(ce));
> > > > > >  
> > > > > >  	/*
> > > > > >  	 * Ensure LRC + CT vmas are is same region as write barrier is done
> > > > > > @@ -2008,15 +2009,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > > > >  
> > > > > >  static int __guc_context_pre_pin(struct intel_context *ce,
> > > > > >  				 struct intel_engine_cs *engine,
> > > > > > -				 struct i915_gem_ww_ctx *ww,
> > > > > > -				 void **vaddr)
> > > > > > +				 struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > > -	return lrc_pre_pin(ce, engine, ww, vaddr);
> > > > > > +	return lrc_pre_pin(ce, engine, ww);
> > > > > >  }
> > > > > >  
> > > > > >  static int __guc_context_pin(struct intel_context *ce,
> > > > > > -			     struct intel_engine_cs *engine,
> > > > > > -			     void *vaddr)
> > > > > > +			     struct intel_engine_cs *engine)
> > > > > >  {
> > > > > >  	if (i915_ggtt_offset(ce->state) !=
> > > > > >  	    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK))
> > > > > > @@ -2027,20 +2026,33 @@ static int __guc_context_pin(struct intel_context *ce,
> > > > > >  	 * explaination of why.
> > > > > >  	 */
> > > > > >  
> > > > > > -	return lrc_pin(ce, engine, vaddr);
> > > > > > +	return lrc_pin(ce, engine);
> > > > > > +}
> > > > > > +
> > > > > > +static void __guc_context_unpin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	lrc_unpin(ce);
> > > > > > +}
> > > > > > +
> > > > > > +static void __guc_context_post_unpin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	lrc_post_unpin(ce);
> > > > > >  }
> > > > > >  
> > > > > >  static int guc_context_pre_pin(struct intel_context *ce,
> > > > > > -			       struct i915_gem_ww_ctx *ww,
> > > > > > -			       void **vaddr)
> > > > > > +			       struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > > -	return __guc_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > > > +	return __guc_context_pre_pin(ce, ce->engine, ww);
> > > > > >  }
> > > > > >  
> > > > > > -static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > > > > > +static int guc_context_pin(struct intel_context *ce)
> > > > > >  {
> > > > > > -	int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > > > > +	int ret;
> > > > > >  
> > > > > > +	GEM_BUG_ON(intel_context_is_parent(ce) ||
> > > > > > +		   intel_context_is_child(ce));
> > > > > > +
> > > > > > +	ret = __guc_context_pin(ce, ce->engine);
> > > > > >  	if (likely(!ret && !intel_context_is_barrier(ce)))
> > > > > >  		intel_engine_pm_get(ce->engine);
> > > > > >  
> > > > > > @@ -2054,7 +2066,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > > > > >  	GEM_BUG_ON(context_enabled(ce));
> > > > > >  
> > > > > >  	unpin_guc_id(guc, ce, true);
> > > > > > -	lrc_unpin(ce);
> > > > > > +	__guc_context_unpin(ce);
> > > > > >  
> > > > > >  	if (likely(!intel_context_is_barrier(ce)))
> > > > > >  		intel_engine_pm_put(ce->engine);
> > > > > > @@ -2062,7 +2074,141 @@ static void guc_context_unpin(struct intel_context *ce)
> > > > > >  
> > > > > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > > > >  {
> > > > > > -	lrc_post_unpin(ce);
> > > > > > +	__guc_context_post_unpin(ce);
> > > > > > +}
> > > > > > +
> > > > > > +/* Future patches will use this function */
> > > > > > +__maybe_unused
> > > > > > +static int guc_parent_context_pre_pin(struct intel_context *ce,
> > > > > > +				      struct i915_gem_ww_ctx *ww)
> > > > > > +{
> > > > > > +	struct intel_context *child;
> > > > > > +	int err, i = 0, j = 0;
> > > > > > +
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		err = i915_active_acquire(&child->active);
> > > > > > +		if (unlikely(err))
> > > > > > +			goto unwind_active;
> > > > > > +		++i;
> > > > > > +	}
> > > > > > +
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		err = __guc_context_pre_pin(child, child->engine, ww);
> > > > > > +		if (unlikely(err))
> > > > > > +			goto unwind_pre_pin;
> > > > > > +		++j;
> > > > > > +	}
> > > > > > +
> > > > > > +	err = __guc_context_pre_pin(ce, ce->engine, ww);
> > > > > > +	if (unlikely(err))
> > > > > > +		goto unwind_pre_pin;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +unwind_pre_pin:
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		if (!j--)
> > > > > > +			break;
> > > > > > +		__guc_context_post_unpin(child);
> > > > > > +	}
> > > > > > +
> > > > > > +unwind_active:
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		if (!i--)
> > > > > > +			break;
> > > > > > +		i915_active_release(&child->active);
> > > > > > +	}
> > > > > > +
> > > > > > +	return err;
> > > > > > +}
> > > > > > +
> > > > > > +/* Future patches will use this function */
> > > > > > +__maybe_unused
> > > > > > +static void guc_parent_context_post_unpin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	struct intel_context *child;
> > > > > > +
> > > > > > +	for_each_child(ce, child)
> > > > > > +		__guc_context_post_unpin(child);
> > > > > > +	__guc_context_post_unpin(ce);
> > > > > > +
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		intel_context_get(child);
> > > > > > +		i915_active_release(&child->active);
> > > > > > +		intel_context_put(child);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/* Future patches will use this function */
> > > > > > +__maybe_unused
> > > > > > +static int guc_parent_context_pin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	int ret, i = 0, j = 0;
> > > > > > +	struct intel_context *child;
> > > > > > +	struct intel_engine_cs *engine;
> > > > > > +	intel_engine_mask_t tmp;
> > > > > > +
> > > > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > > > +
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		ret = __guc_context_pin(child, child->engine);
> > > > > > +		if (unlikely(ret))
> > > > > > +			goto unwind_pin;
> > > > > > +		++i;
> > > > > > +	}
> > > > > > +	ret = __guc_context_pin(ce, ce->engine);
> > > > > > +	if (unlikely(ret))
> > > > > > +		goto unwind_pin;
> > > > > > +
> > > > > > +	for_each_child(ce, child)
> > > > > > +		if (test_bit(CONTEXT_LRCA_DIRTY, &child->flags)) {
> > > > > > +			set_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
> > > > > > +			break;
> > > > > > +		}
> > > > > > +
> > > > > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > > > > +			       ce->engine->mask, tmp)
> > > > > > +		intel_engine_pm_get(engine);
> > > > > > +	for_each_child(ce, child)
> > > > > > +		for_each_engine_masked(engine, child->engine->gt,
> > > > > > +				       child->engine->mask, tmp)
> > > > > > +			intel_engine_pm_get(engine);
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +unwind_pin:
> > > > > > +	for_each_child(ce, child) {
> > > > > > +		if (++j > i)
> > > > > > +			break;
> > > > > > +		__guc_context_unpin(child);
> > > > > > +	}
> > > > > > +
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > > +/* Future patches will use this function */
> > > > > > +__maybe_unused
> > > > > > +static void guc_parent_context_unpin(struct intel_context *ce)
> > > > > > +{
> > > > > > +	struct intel_context *child;
> > > > > > +	struct intel_engine_cs *engine;
> > > > > > +	intel_engine_mask_t tmp;
> > > > > > +
> > > > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > > > +	GEM_BUG_ON(context_enabled(ce));
> > > > > > +
> > > > > > +	unpin_guc_id(ce_to_guc(ce), ce, true);
> > > > > > +	for_each_child(ce, child)
> > > > > > +		__guc_context_unpin(child);
> > > > > > +	__guc_context_unpin(ce);
> > > > > > +
> > > > > > +	for_each_engine_masked(engine, ce->engine->gt,
> > > > > > +			       ce->engine->mask, tmp)
> > > > > > +		intel_engine_pm_put(engine);
> > > > > > +	for_each_child(ce, child)
> > > > > > +		for_each_engine_masked(engine, child->engine->gt,
> > > > > > +				       child->engine->mask, tmp)
> > > > > > +			intel_engine_pm_put(engine);
> > > > > >  }
> > > > > >  
> > > > > >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > > > > > @@ -2993,18 +3139,17 @@ static int guc_request_alloc(struct i915_request *rq)
> > > > > >  }
> > > > > >  
> > > > > >  static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > > > > > -				       struct i915_gem_ww_ctx *ww,
> > > > > > -				       void **vaddr)
> > > > > > +				       struct i915_gem_ww_ctx *ww)
> > > > > >  {
> > > > > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > > >  
> > > > > > -	return __guc_context_pre_pin(ce, engine, ww, vaddr);
> > > > > > +	return __guc_context_pre_pin(ce, engine, ww);
> > > > > >  }
> > > > > >  
> > > > > > -static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > > > +static int guc_virtual_context_pin(struct intel_context *ce)
> > > > > >  {
> > > > > >  	struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > > > -	int ret = __guc_context_pin(ce, engine, vaddr);
> > > > > > +	int ret = __guc_context_pin(ce, engine);
> > > > > >  	intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > > > >  
> > > > > >  	if (likely(!ret))
> > > > > > @@ -3024,7 +3169,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> > > > > >  	GEM_BUG_ON(intel_context_is_barrier(ce));
> > > > > >  
> > > > > >  	unpin_guc_id(guc, ce, true);
> > > > > > -	lrc_unpin(ce);
> > > > > > +	__guc_context_unpin(ce);
> > > > > >  
> > > > > >  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > > >  		intel_engine_pm_put(engine);
> > > > > > -- 
> > > > > > 2.28.0
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions
  2021-08-12 14:45             ` Daniel Vetter
@ 2021-08-12 14:52               ` Daniel Vetter
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-12 14:52 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Thu, Aug 12, 2021 at 4:45 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Aug 11, 2021 at 06:06:36PM +0000, Matthew Brost wrote:
> > On Tue, Aug 10, 2021 at 11:07:55AM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 10, 2021 at 10:53:37AM +0200, Daniel Vetter wrote:
> > > > On Mon, Aug 09, 2021 at 06:58:23PM +0000, Matthew Brost wrote:
> > > > > On Mon, Aug 09, 2021 at 05:17:34PM +0200, Daniel Vetter wrote:
> > > > > > On Tue, Aug 03, 2021 at 03:29:13PM -0700, Matthew Brost wrote:
> > > > > > > Implement GuC parent-child context pin / unpin functions in which, if any
> > > > > > > context in the relationship is pinned, all the contexts are pinned. The
> > > > > > > parent owns most of the pinning / unpinning process and the children
> > > > > > > direct any pins / unpins to the parent.
> > > > > > >
> > > > > > > Patch implements a number of unused functions that will be connected
> > > > > > > later in the series.
> > > > > > >
> > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/i915/gt/intel_context.c       | 187 ++++++++++++++++--
> > > > > > >  drivers/gpu/drm/i915/gt/intel_context.h       |  43 +---
> > > > > > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
> > > > > > >  .../drm/i915/gt/intel_execlists_submission.c  |  25 ++-
> > > > > > >  drivers/gpu/drm/i915/gt/intel_lrc.c           |  26 +--
> > > > > > >  drivers/gpu/drm/i915/gt/intel_lrc.h           |   6 +-
> > > > > > >  .../gpu/drm/i915/gt/intel_ring_submission.c   |   5 +-
> > > > > > >  drivers/gpu/drm/i915/gt/mock_engine.c         |   4 +-
> > > > > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 183 +++++++++++++++--
> > > > > > >  9 files changed, 371 insertions(+), 112 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > > > index 8cb92b10b547..bb4c14656067 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > > > > @@ -158,8 +158,8 @@ static void __ring_retire(struct intel_ring *ring)
> > > > > > >     intel_ring_unpin(ring);
> > > > > > >  }
> > > > > > >
> > > > > > > -static int intel_context_pre_pin(struct intel_context *ce,
> > > > > > > -                            struct i915_gem_ww_ctx *ww)
> > > > > > > +static int __intel_context_pre_pin(struct intel_context *ce,
> > > > > > > +                              struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > >     int err;
> > > > > > >
> > > > > > > @@ -190,7 +190,7 @@ static int intel_context_pre_pin(struct intel_context *ce,
> > > > > > >     return err;
> > > > > > >  }
> > > > > > >
> > > > > > > -static void intel_context_post_unpin(struct intel_context *ce)
> > > > > > > +static void __intel_context_post_unpin(struct intel_context *ce)
> > > > > > >  {
> > > > > > >     if (ce->state)
> > > > > > >             __context_unpin_state(ce->state);
> > > > > > > @@ -199,13 +199,85 @@ static void intel_context_post_unpin(struct intel_context *ce)
> > > > > > >     __ring_retire(ce->ring);
> > > > > > >  }
> > > > > > >
> > > > > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > > -                         struct i915_gem_ww_ctx *ww)
> > > > > > > +static int intel_context_pre_pin(struct intel_context *ce,
> > > > > > > +                            struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > > -   bool handoff = false;
> > > > > > > -   void *vaddr;
> > > > > > > +   struct intel_context *child;
> > > > > > > +   int err, i = 0;
> > > > > > > +
> > > > > > > +   GEM_BUG_ON(intel_context_is_child(ce));
> > > > > > > +
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           err = __intel_context_pre_pin(child, ww);
> > > > > > > +           if (unlikely(err))
> > > > > > > +                   goto unwind;
> > > > > > > +           ++i;
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   err = __intel_context_pre_pin(ce, ww);
> > > > > > > +   if (unlikely(err))
> > > > > > > +           goto unwind;
> > > > > > > +
> > > > > > > +   return 0;
> > > > > > > +
> > > > > > > +unwind:
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           if (!i--)
> > > > > > > +                   break;
> > > > > > > +           __intel_context_post_unpin(child);
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   return err;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void intel_context_post_unpin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   struct intel_context *child;
> > > > > > > +
> > > > > > > +   GEM_BUG_ON(intel_context_is_child(ce));
> > > > > > > +
> > > > > > > +   for_each_child(ce, child)
> > > > > > > +           __intel_context_post_unpin(child);
> > > > > > > +
> > > > > > > +   __intel_context_post_unpin(ce);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int __do_ww_lock(struct intel_context *ce,
> > > > > > > +                   struct i915_gem_ww_ctx *ww)
> > > > > > > +{
> > > > > > > +   int err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > > > > +
> > > > > > > +   if (!err && ce->ring->vma->obj)
> > > > > > > +           err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > > > > +   if (!err && ce->state)
> > > > > > > +           err = i915_gem_object_lock(ce->state->obj, ww);
> > > > > > > +
> > > > > > > +   return err;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int do_ww_lock(struct intel_context *ce,
> > > > > > > +                 struct i915_gem_ww_ctx *ww)
> > > > > > > +{
> > > > > > > +   struct intel_context *child;
> > > > > > >     int err = 0;
> > > > > > >
> > > > > > > +   GEM_BUG_ON(intel_context_is_child(ce));
> > > > > > > +
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           err = __do_ww_lock(child, ww);
> > > > > > > +           if (unlikely(err))
> > > > > > > +                   return err;
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   return __do_ww_lock(ce, ww);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > > +                                struct i915_gem_ww_ctx *ww)
> > > > > > > +{
> > > > > > > +   bool handoff = false;
> > > > > > > +   int err;
> > > > > > > +
> > > > > > >     if (unlikely(!test_bit(CONTEXT_ALLOC_BIT, &ce->flags))) {
> > > > > > >             err = intel_context_alloc_state(ce);
> > > > > > >             if (err)
> > > > > > > @@ -217,14 +289,11 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > >      * refcount for __intel_context_active(), which prevent a lock
> > > > > > >      * inversion of ce->pin_mutex vs dma_resv_lock().
> > > > > > >      */
> > > > > > > +   err = do_ww_lock(ce, ww);
> > > > > > > +   if (err)
> > > > > > > +           return err;
> > > > > > >
> > > > > > > -   err = i915_gem_object_lock(ce->timeline->hwsp_ggtt->obj, ww);
> > > > > > > -   if (!err && ce->ring->vma->obj)
> > > > > > > -           err = i915_gem_object_lock(ce->ring->vma->obj, ww);
> > > > > > > -   if (!err && ce->state)
> > > > > > > -           err = i915_gem_object_lock(ce->state->obj, ww);
> > > > > > > -   if (!err)
> > > > > > > -           err = intel_context_pre_pin(ce, ww);
> > > > > > > +   err = intel_context_pre_pin(ce, ww);
> > > > > > >     if (err)
> > > > > > >             return err;
> > > > > > >
> > > > > > > @@ -232,7 +301,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > >     if (err)
> > > > > > >             goto err_ctx_unpin;
> > > > > > >
> > > > > > > -   err = ce->ops->pre_pin(ce, ww, &vaddr);
> > > > > > > +   err = ce->ops->pre_pin(ce, ww);
> > > > > > >     if (err)
> > > > > > >             goto err_release;
> > > > > > >
> > > > > > > @@ -250,7 +319,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > >             if (unlikely(err))
> > > > > > >                     goto err_unlock;
> > > > > > >
> > > > > > > -           err = ce->ops->pin(ce, vaddr);
> > > > > > > +           err = ce->ops->pin(ce);
> > > > > > >             if (err) {
> > > > > > >                     intel_context_active_release(ce);
> > > > > > >                     goto err_unlock;
> > > > > > > @@ -290,7 +359,7 @@ int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > >     return err;
> > > > > > >  }
> > > > > > >
> > > > > > > -int __intel_context_do_pin(struct intel_context *ce)
> > > > > > > +static int __intel_context_do_pin(struct intel_context *ce)
> > > > > > >  {
> > > > > > >     struct i915_gem_ww_ctx ww;
> > > > > > >     int err;
> > > > > > > @@ -337,7 +406,7 @@ static void __intel_context_retire(struct i915_active *active)
> > > > > > >              intel_context_get_avg_runtime_ns(ce));
> > > > > > >
> > > > > > >     set_bit(CONTEXT_VALID_BIT, &ce->flags);
> > > > > > > -   intel_context_post_unpin(ce);
> > > > > > > +   __intel_context_post_unpin(ce);
> > > > > > >     intel_context_put(ce);
> > > > > > >  }
> > > > > > >
> > > > > > > @@ -562,6 +631,88 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> > > > > > >     child->parent = parent;
> > > > > > >  }
> > > > > > >
> > > > > > > +static inline int ____intel_context_pin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   if (likely(intel_context_pin_if_active(ce)))
> > > > > > > +           return 0;
> > > > > > > +
> > > > > > > +   return __intel_context_do_pin(ce);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static inline int __intel_context_pin_ww(struct intel_context *ce,
> > > > > > > +                                    struct i915_gem_ww_ctx *ww)
> > > > > > > +{
> > > > > > > +   if (likely(intel_context_pin_if_active(ce)))
> > > > > > > +           return 0;
> > > > > > > +
> > > > > > > +   return __intel_context_do_pin_ww(ce, ww);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static inline void __intel_context_unpin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   if (!ce->ops->sched_disable) {
> > > > > > > +           __intel_context_do_unpin(ce, 1);
> > > > > > > +   } else {
> > > > > > > +           /*
> > > > > > > +            * Move ownership of this pin to the scheduling disable which is
> > > > > > > +            * an async operation. When that operation completes the above
> > > > > > > +            * intel_context_sched_disable_unpin is called potentially
> > > > > > > +            * unpinning the context.
> > > > > > > +            */
> > > > > > > +           while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > > > > +                   if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > >
> > > Just as an example of what I mean here on the code review side. This is an
> > > endless loop, and you need to prove that there are no livelock or starvation
> > > issues. Or explain how else you handle that if there is one.
> > >
> >
> > If we pop into the while loop the pin_count = 1, so in all likelihood the
> > following if evaluates to true and the loop is broken. The only way it
> > evaluates to false is if the context gets pinned between the while & if, so
> > on the next pass the while statement should evaluate to false, breaking
> > the loop unless of course the context gets unpinned again... In
> > practice this should be at most 3 atomic operations before the loop is
> > broken.
>
> That's not really how this works ... E.g. the linux spinlocks are
> ticketed/queued locks, exactly because the "this likely doesn't happen"
> argument is not a very good one.
>
> Can't we at least convert intel_context->pin_count into a normal counter
> with a spinlock around it and then just make all this stuff a lot more
> reasonable?
>
> Yes this doesn't help the situation overall much, but at least we're not
> spreading dubious hand-rolled locking patterns all over the place for no
> good reason.
>
> Hand-crafted artisanal locking is never a quality sign.

Or the simple plan, just pin the contexts always. That still leaves
you with some fun on final unpin, because that's called in an awkward
place, but that can be fixed with a work_struct.
-Daniel
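
(For illustration, a minimal sketch of the "normal counter with a spinlock
around it" idea quoted above. The type, field and helper names here are made
up for the example and this is not the actual i915 code.)

#include <linux/spinlock.h>

struct example_ctx {
	spinlock_t pin_lock;		/* protects pin_count */
	unsigned int pin_count;
	bool has_sched_disable;		/* backend provides an async sched-disable */
};

/* Hypothetical stand-ins for the real backend hooks. */
static void example_sched_disable(struct example_ctx *ce);
static void example_do_unpin(struct example_ctx *ce);

static void example_unpin(struct example_ctx *ce)
{
	bool last = false, defer = false;

	spin_lock(&ce->pin_lock);
	if (!--ce->pin_count) {
		if (ce->has_sched_disable) {
			/* hand the final reference to the async sched-disable */
			ce->pin_count = 1;
			defer = true;
		} else {
			last = true;
		}
	}
	spin_unlock(&ce->pin_lock);

	if (defer)
		example_sched_disable(ce);
	else if (last)
		example_do_unpin(ce);
}

The point is just that the decision to hand the last reference over to the
async schedule disable happens entirely under the lock, so there is no retry
loop whose termination needs to be argued about.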

> -Daniel
>
> >
> > Matt
> >
> > > Because unlike hand-rolled stuff linux kernel spinlocks are not dumb
> > > spinlocks, but ticketed/queued locks and therefore starvation-proof. But
> > > this stuff actually matters on today's multi-core and not-so-uniform (even
> > > without fully NUMA) architectures.
> > >
> > > Also I've just found another lockless retry loop which does actually
> > > degenerate into a full endless loop (if you're sufficiently unlucky in
> > > your races), so this really isn't academic at all.
> > > -Daniel
> > >
> > > > > >
> > > > > > Uh man lockless algorithms.
> > > > > >
> > > > > > Unless this comes:
> > > > > > - with essentially an academic looking paper that describes the abstract
> > > > > >   model of the lockless algorithm and proves it against the linux kernel
> > > > > >   memory model.
> > > > > >
> > > > > > - lockless stuff generally needs barriers, and those barriers must all be
> > > > > >   documented. This means a) a comment next to each barrier in the code b)
> > > > > >   pointing to its counterparty c) with the overall design also explained
> > > > > >   in the kerneldoc for those data structures.
> > > > > >
> > > > > >   If you don't know where your barriers are, see above point about "it
> > > > > >   should look more like an academic paper in the commit message"
> > > > > >
> > > > > > - hard perf data about how this is absolutely required, based on a
> > > > > >   real-world use-case (which then sometimes justifies a microbenchmark
> > > > > >   metric for the details, but it always needs to be real-world based). And
> > > > > >   also a thorough explainer of how the perf issue isn't fixable through
> > > > > >   better design. If that's not doable, just protect the state machine with
> > > > > >   a big dumb lock and move on.
> > > > > >
> > > > > > - Also, because the current code is in such bad shape wrt lockless
> > > > > >   algorithms and premature optimizations: Overall complexity should go
> > > > > >   down (it's way too high right now), so pay down your new lockless trick
> > > > > >   by removing one of the existing ones that we only have because we can.
> > > > > >
> > > > > > Yes this is steep, but we're way out in the woods here and need to somehow
> > > > > > get back.
> > > > >
> > > > > See below FIXME. At one point all of this was hidden in the backend but
> > > > > the dma-resv patches that landed upstream completely broke the layering,
> > > > > hence the need for the code here.
> > > > >
> > > > > I guess I don't really understand what you mean when you say a lockless alg
> > > > > needs barriers; if the atomic functions are not really atomic, wouldn't
> > > > > the world be broken?
> > > >
> > > > They are unordered atomics by default. Which means they're atomic in
> > > > themselves, but entirely unordered with anything else that's going on.
> > > > Except when you use one of the atomic ops which already guarantees a
> > > > barrier, or you manually add the barriers yourself. And yes there are
> > > > enormous amounts of bugs, and with our dgpu potentially running on non-IA
> > > > cpus those bugs matter.
> > > >
> > > > Note that C++ atomics default to strongly ordered atomics with full
> > > > barriers, because those are much easier to program against. The kernel
> > > > isn't like that and defaults to "you need to add all the
> > > > barriers yourself".
> > > >
> > > > I have a full length rant in the works and will work that through all
> > > > channels, but essentially locking is really hard to get right. And
> > > > lockless tricks practically need an academic paper with a formal
> > > > correctness proof against the linux memory model, or you do have bugs.
> > > >
> > > > And I know that the current code is chock-full of this stuff, so it's
> > > > tempting to just add more, but we really can't. The amount of locking
> > > > trickery we have in the codebase must go down substantially. My take is
> > > > that any code that adds anything tricky needs to fully justify it against
> > > > the above list, _and_ also clean up some of the existing nonsense so that
> > > > overall complexity doesn't increase.
> > > >
> > > > I'll share the full length rant with you internally, it's not yet ready
> > > > for publishing (but that's planned too).
> > > >
> > > >
> > > > > Also here I don't think it is really as simple as grabbing a big dumb lock,
> > > > > for a variety of reasons, at least with the current dynamic pin / unpin code
> > > > > in place. If we move to perma-pinned contexts this could be cleaned up
> > > > > then.
> > > >
> > > > Yes it's a disaster, but we need to stop the bleeding. If perma-pinned
> > > > contexts can fix this I think we should do this asap. I'd say for parallel
> > > > contexts we should just do it outright (special-case them or whatever) so
> > > > that we don't have to add even more very tricky code and tech debt.
> > > >
> > > > Doable?
> > > >
> > > > Cheers, Daniel
> > > >
> > > >
> > > > >
> > > > > Matt
> > > > >
> > > > > > -Daniel
> > > > > >
> > > > > > > +                           ce->ops->sched_disable(ce);
> > > > > > > +                           break;
> > > > > > > +                   }
> > > > > > > +           }
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * FIXME: This is ugly, these branches are only needed for parallel contexts in
> > > > > > > + * GuC submission. Basically the idea is if any of the contexts, that are
> > > > > > > + * configured for parallel submission, are pinned all the contexts need to be
> > > > > > > + * pinned in order to register these contexts with the GuC. We are adding the
> > > > > > > + * layer here while it should probably be pushed to the backend via a vfunc. But
> > > > > > > + * since we already have ce->pin + a layer atop it is confusing. Definitely
> > > > > > > + * needs a bit of rework how to properly layer / structure this code path. What
> > > > > > > + * is in place works but is not ideal.
> > > > > > > + */
> > > > > > > +int intel_context_pin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   if (intel_context_is_child(ce)) {
> > > > > > > +           if (!atomic_fetch_add(1, &ce->pin_count))
> > > > > > > +                   return ____intel_context_pin(ce->parent);
> > > > > > > +           else
> > > > > > > +                   return 0;
> > > > > > > +   } else {
> > > > > > > +           return ____intel_context_pin(ce);
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > > > > +                    struct i915_gem_ww_ctx *ww)
> > > > > > > +{
> > > > > > > +   if (intel_context_is_child(ce)) {
> > > > > > > +           if (!atomic_fetch_add(1, &ce->pin_count))
> > > > > > > +                   return __intel_context_pin_ww(ce->parent, ww);
> > > > > > > +           else
> > > > > > > +                   return 0;
> > > > > > > +   } else {
> > > > > > > +           return __intel_context_pin_ww(ce, ww);
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +void intel_context_unpin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   if (intel_context_is_child(ce)) {
> > > > > > > +           if (atomic_fetch_add(-1, &ce->pin_count) == 1)
> > > > > > > +                   __intel_context_unpin(ce->parent);
> > > > > > > +   } else {
> > > > > > > +           __intel_context_unpin(ce);
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > >  #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > > > > > >  #include "selftest_context.c"
> > > > > > >  #endif
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > > > index ad6ce5ac4824..c208691fc87d 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> > > > > > > @@ -110,31 +110,15 @@ static inline void intel_context_unlock_pinned(struct intel_context *ce)
> > > > > > >     mutex_unlock(&ce->pin_mutex);
> > > > > > >  }
> > > > > > >
> > > > > > > -int __intel_context_do_pin(struct intel_context *ce);
> > > > > > > -int __intel_context_do_pin_ww(struct intel_context *ce,
> > > > > > > -                         struct i915_gem_ww_ctx *ww);
> > > > > > > -
> > > > > > >  static inline bool intel_context_pin_if_active(struct intel_context *ce)
> > > > > > >  {
> > > > > > >     return atomic_inc_not_zero(&ce->pin_count);
> > > > > > >  }
> > > > > > >
> > > > > > > -static inline int intel_context_pin(struct intel_context *ce)
> > > > > > > -{
> > > > > > > -   if (likely(intel_context_pin_if_active(ce)))
> > > > > > > -           return 0;
> > > > > > > -
> > > > > > > -   return __intel_context_do_pin(ce);
> > > > > > > -}
> > > > > > > -
> > > > > > > -static inline int intel_context_pin_ww(struct intel_context *ce,
> > > > > > > -                                  struct i915_gem_ww_ctx *ww)
> > > > > > > -{
> > > > > > > -   if (likely(intel_context_pin_if_active(ce)))
> > > > > > > -           return 0;
> > > > > > > +int intel_context_pin(struct intel_context *ce);
> > > > > > >
> > > > > > > -   return __intel_context_do_pin_ww(ce, ww);
> > > > > > > -}
> > > > > > > +int intel_context_pin_ww(struct intel_context *ce,
> > > > > > > +                    struct i915_gem_ww_ctx *ww);
> > > > > > >
> > > > > > >  static inline void __intel_context_pin(struct intel_context *ce)
> > > > > > >  {
> > > > > > > @@ -146,28 +130,11 @@ void __intel_context_do_unpin(struct intel_context *ce, int sub);
> > > > > > >
> > > > > > >  static inline void intel_context_sched_disable_unpin(struct intel_context *ce)
> > > > > > >  {
> > > > > > > +   GEM_BUG_ON(intel_context_is_child(ce));
> > > > > > >     __intel_context_do_unpin(ce, 2);
> > > > > > >  }
> > > > > > >
> > > > > > > -static inline void intel_context_unpin(struct intel_context *ce)
> > > > > > > -{
> > > > > > > -   if (!ce->ops->sched_disable) {
> > > > > > > -           __intel_context_do_unpin(ce, 1);
> > > > > > > -   } else {
> > > > > > > -           /*
> > > > > > > -            * Move ownership of this pin to the scheduling disable which is
> > > > > > > -            * an async operation. When that operation completes the above
> > > > > > > -            * intel_context_sched_disable_unpin is called potentially
> > > > > > > -            * unpinning the context.
> > > > > > > -            */
> > > > > > > -           while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
> > > > > > > -                   if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
> > > > > > > -                           ce->ops->sched_disable(ce);
> > > > > > > -                           break;
> > > > > > > -                   }
> > > > > > > -           }
> > > > > > > -   }
> > > > > > > -}
> > > > > > > +void intel_context_unpin(struct intel_context *ce);
> > > > > > >
> > > > > > >  void intel_context_enter_engine(struct intel_context *ce);
> > > > > > >  void intel_context_exit_engine(struct intel_context *ce);
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > > > index 66b22b370a72..eb82be15b7a2 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > > > > @@ -39,8 +39,8 @@ struct intel_context_ops {
> > > > > > >
> > > > > > >     void (*ban)(struct intel_context *ce, struct i915_request *rq);
> > > > > > >
> > > > > > > -   int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
> > > > > > > -   int (*pin)(struct intel_context *ce, void *vaddr);
> > > > > > > +   int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
> > > > > > > +   int (*pin)(struct intel_context *ce);
> > > > > > >     void (*unpin)(struct intel_context *ce);
> > > > > > >     void (*post_unpin)(struct intel_context *ce);
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > > > index baa1797af1c8..fc74ca28f245 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > > > @@ -2554,16 +2554,17 @@ static void execlists_submit_request(struct i915_request *request)
> > > > > > >  static int
> > > > > > >  __execlists_context_pre_pin(struct intel_context *ce,
> > > > > > >                         struct intel_engine_cs *engine,
> > > > > > > -                       struct i915_gem_ww_ctx *ww, void **vaddr)
> > > > > > > +                       struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > >     int err;
> > > > > > >
> > > > > > > -   err = lrc_pre_pin(ce, engine, ww, vaddr);
> > > > > > > +   err = lrc_pre_pin(ce, engine, ww);
> > > > > > >     if (err)
> > > > > > >             return err;
> > > > > > >
> > > > > > >     if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags)) {
> > > > > > > -           lrc_init_state(ce, engine, *vaddr);
> > > > > > > +           lrc_init_state(ce, engine, ce->lrc_reg_state -
> > > > > > > +                          LRC_STATE_OFFSET / sizeof(*ce->lrc_reg_state));
> > > > > > >
> > > > > > >              __i915_gem_object_flush_map(ce->state->obj, 0, engine->context_size);
> > > > > > >     }
> > > > > > > @@ -2572,15 +2573,14 @@ __execlists_context_pre_pin(struct intel_context *ce,
> > > > > > >  }
> > > > > > >
> > > > > > >  static int execlists_context_pre_pin(struct intel_context *ce,
> > > > > > > -                                struct i915_gem_ww_ctx *ww,
> > > > > > > -                                void **vaddr)
> > > > > > > +                                struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > > -   return __execlists_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > > > > +   return __execlists_context_pre_pin(ce, ce->engine, ww);
> > > > > > >  }
> > > > > > >
> > > > > > > -static int execlists_context_pin(struct intel_context *ce, void *vaddr)
> > > > > > > +static int execlists_context_pin(struct intel_context *ce)
> > > > > > >  {
> > > > > > > -   return lrc_pin(ce, ce->engine, vaddr);
> > > > > > > +   return lrc_pin(ce, ce->engine);
> > > > > > >  }
> > > > > > >
> > > > > > >  static int execlists_context_alloc(struct intel_context *ce)
> > > > > > > @@ -3570,20 +3570,19 @@ static int virtual_context_alloc(struct intel_context *ce)
> > > > > > >  }
> > > > > > >
> > > > > > >  static int virtual_context_pre_pin(struct intel_context *ce,
> > > > > > > -                              struct i915_gem_ww_ctx *ww,
> > > > > > > -                              void **vaddr)
> > > > > > > +                              struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > >     struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > > > > >
> > > > > > >      /* Note: we must use a real engine class for setting up reg state */
> > > > > > > -   return __execlists_context_pre_pin(ce, ve->siblings[0], ww, vaddr);
> > > > > > > +   return __execlists_context_pre_pin(ce, ve->siblings[0], ww);
> > > > > > >  }
> > > > > > >
> > > > > > > -static int virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > > > > +static int virtual_context_pin(struct intel_context *ce)
> > > > > > >  {
> > > > > > >     struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
> > > > > > >
> > > > > > > -   return lrc_pin(ce, ve->siblings[0], vaddr);
> > > > > > > +   return lrc_pin(ce, ve->siblings[0]);
> > > > > > >  }
> > > > > > >
> > > > > > >  static void virtual_context_enter(struct intel_context *ce)
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > > > > index bb4af4977920..c466fc966005 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > > > > > @@ -947,30 +947,30 @@ void lrc_reset(struct intel_context *ce)
> > > > > > >  int
> > > > > > >  lrc_pre_pin(struct intel_context *ce,
> > > > > > >         struct intel_engine_cs *engine,
> > > > > > > -       struct i915_gem_ww_ctx *ww,
> > > > > > > -       void **vaddr)
> > > > > > > +       struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > > +   void *vaddr;
> > > > > > >     GEM_BUG_ON(!ce->state);
> > > > > > >     GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
> > > > > > >
> > > > > > > -   *vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > > > > -                                    i915_coherent_map_type(ce->engine->i915,
> > > > > > > -                                                           ce->state->obj,
> > > > > > > -                                                           false) |
> > > > > > > -                                    I915_MAP_OVERRIDE);
> > > > > > > +   vaddr = i915_gem_object_pin_map(ce->state->obj,
> > > > > > > +                                   i915_coherent_map_type(ce->engine->i915,
> > > > > > > +                                                          ce->state->obj,
> > > > > > > +                                                          false) |
> > > > > > > +                                   I915_MAP_OVERRIDE);
> > > > > > >
> > > > > > > -   return PTR_ERR_OR_ZERO(*vaddr);
> > > > > > > +   ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > > > > +
> > > > > > > +   return PTR_ERR_OR_ZERO(vaddr);
> > > > > > >  }
> > > > > > >
> > > > > > >  int
> > > > > > >  lrc_pin(struct intel_context *ce,
> > > > > > > -   struct intel_engine_cs *engine,
> > > > > > > -   void *vaddr)
> > > > > > > +   struct intel_engine_cs *engine)
> > > > > > >  {
> > > > > > > -   ce->lrc_reg_state = vaddr + LRC_STATE_OFFSET;
> > > > > > > -
> > > > > > >     if (!__test_and_set_bit(CONTEXT_INIT_BIT, &ce->flags))
> > > > > > > -           lrc_init_state(ce, engine, vaddr);
> > > > > > > +           lrc_init_state(ce, engine,
> > > > > > > +                          (void *)ce->lrc_reg_state - LRC_STATE_OFFSET);
> > > > > > >
> > > > > > >     ce->lrc.lrca = lrc_update_regs(ce, engine, ce->ring->tail);
> > > > > > >     return 0;
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > > > > index 7f697845c4cf..837fcf00270d 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
> > > > > > > @@ -38,12 +38,10 @@ void lrc_destroy(struct kref *kref);
> > > > > > >  int
> > > > > > >  lrc_pre_pin(struct intel_context *ce,
> > > > > > >         struct intel_engine_cs *engine,
> > > > > > > -       struct i915_gem_ww_ctx *ww,
> > > > > > > -       void **vaddr);
> > > > > > > +       struct i915_gem_ww_ctx *ww);
> > > > > > >  int
> > > > > > >  lrc_pin(struct intel_context *ce,
> > > > > > > -   struct intel_engine_cs *engine,
> > > > > > > -   void *vaddr);
> > > > > > > +   struct intel_engine_cs *engine);
> > > > > > >  void lrc_unpin(struct intel_context *ce);
> > > > > > >  void lrc_post_unpin(struct intel_context *ce);
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > > > > index 2958e2fae380..f4f301bfb9f7 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > > > > > > @@ -472,8 +472,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
> > > > > > >  }
> > > > > > >
> > > > > > >  static int ring_context_pre_pin(struct intel_context *ce,
> > > > > > > -                           struct i915_gem_ww_ctx *ww,
> > > > > > > -                           void **unused)
> > > > > > > +                           struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > >     struct i915_address_space *vm;
> > > > > > >     int err = 0;
> > > > > > > @@ -576,7 +575,7 @@ static int ring_context_alloc(struct intel_context *ce)
> > > > > > >     return 0;
> > > > > > >  }
> > > > > > >
> > > > > > > -static int ring_context_pin(struct intel_context *ce, void *unused)
> > > > > > > +static int ring_context_pin(struct intel_context *ce)
> > > > > > >  {
> > > > > > >     return 0;
> > > > > > >  }
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > > > > index 2c1af030310c..826b5d7a4573 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/mock_engine.c
> > > > > > > @@ -167,12 +167,12 @@ static int mock_context_alloc(struct intel_context *ce)
> > > > > > >  }
> > > > > > >
> > > > > > >  static int mock_context_pre_pin(struct intel_context *ce,
> > > > > > > -                           struct i915_gem_ww_ctx *ww, void **unused)
> > > > > > > +                           struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > >     return 0;
> > > > > > >  }
> > > > > > >
> > > > > > > -static int mock_context_pin(struct intel_context *ce, void *unused)
> > > > > > > +static int mock_context_pin(struct intel_context *ce)
> > > > > > >  {
> > > > > > >     return 0;
> > > > > > >  }
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > index dec757d319a2..c5c73c42bcf7 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > @@ -1905,6 +1905,7 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > > > > >
> > > > > > >     GEM_BUG_ON(!engine->mask);
> > > > > > >     GEM_BUG_ON(context_guc_id_invalid(ce));
> > > > > > > +   GEM_BUG_ON(intel_context_is_child(ce));
> > > > > > >
> > > > > > >     /*
> > > > > > >      * Ensure LRC + CT vmas are in the same region as write barrier is done
> > > > > > > @@ -2008,15 +2009,13 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > > > > >
> > > > > > >  static int __guc_context_pre_pin(struct intel_context *ce,
> > > > > > >                              struct intel_engine_cs *engine,
> > > > > > > -                            struct i915_gem_ww_ctx *ww,
> > > > > > > -                            void **vaddr)
> > > > > > > +                            struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > > -   return lrc_pre_pin(ce, engine, ww, vaddr);
> > > > > > > +   return lrc_pre_pin(ce, engine, ww);
> > > > > > >  }
> > > > > > >
> > > > > > >  static int __guc_context_pin(struct intel_context *ce,
> > > > > > > -                        struct intel_engine_cs *engine,
> > > > > > > -                        void *vaddr)
> > > > > > > +                        struct intel_engine_cs *engine)
> > > > > > >  {
> > > > > > >     if (i915_ggtt_offset(ce->state) !=
> > > > > > >         (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK))
> > > > > > > @@ -2027,20 +2026,33 @@ static int __guc_context_pin(struct intel_context *ce,
> > > > > > >      * explanation of why.
> > > > > > >      */
> > > > > > >
> > > > > > > -   return lrc_pin(ce, engine, vaddr);
> > > > > > > +   return lrc_pin(ce, engine);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void __guc_context_unpin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   lrc_unpin(ce);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void __guc_context_post_unpin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   lrc_post_unpin(ce);
> > > > > > >  }
> > > > > > >
> > > > > > >  static int guc_context_pre_pin(struct intel_context *ce,
> > > > > > > -                          struct i915_gem_ww_ctx *ww,
> > > > > > > -                          void **vaddr)
> > > > > > > +                          struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > > -   return __guc_context_pre_pin(ce, ce->engine, ww, vaddr);
> > > > > > > +   return __guc_context_pre_pin(ce, ce->engine, ww);
> > > > > > >  }
> > > > > > >
> > > > > > > -static int guc_context_pin(struct intel_context *ce, void *vaddr)
> > > > > > > +static int guc_context_pin(struct intel_context *ce)
> > > > > > >  {
> > > > > > > -   int ret = __guc_context_pin(ce, ce->engine, vaddr);
> > > > > > > +   int ret;
> > > > > > >
> > > > > > > +   GEM_BUG_ON(intel_context_is_parent(ce) ||
> > > > > > > +              intel_context_is_child(ce));
> > > > > > > +
> > > > > > > +   ret = __guc_context_pin(ce, ce->engine);
> > > > > > >     if (likely(!ret && !intel_context_is_barrier(ce)))
> > > > > > >             intel_engine_pm_get(ce->engine);
> > > > > > >
> > > > > > > @@ -2054,7 +2066,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > > > > > >     GEM_BUG_ON(context_enabled(ce));
> > > > > > >
> > > > > > >     unpin_guc_id(guc, ce, true);
> > > > > > > -   lrc_unpin(ce);
> > > > > > > +   __guc_context_unpin(ce);
> > > > > > >
> > > > > > >     if (likely(!intel_context_is_barrier(ce)))
> > > > > > >             intel_engine_pm_put(ce->engine);
> > > > > > > @@ -2062,7 +2074,141 @@ static void guc_context_unpin(struct intel_context *ce)
> > > > > > >
> > > > > > >  static void guc_context_post_unpin(struct intel_context *ce)
> > > > > > >  {
> > > > > > > -   lrc_post_unpin(ce);
> > > > > > > +   __guc_context_post_unpin(ce);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Future patches will use this function */
> > > > > > > +__maybe_unused
> > > > > > > +static int guc_parent_context_pre_pin(struct intel_context *ce,
> > > > > > > +                                 struct i915_gem_ww_ctx *ww)
> > > > > > > +{
> > > > > > > +   struct intel_context *child;
> > > > > > > +   int err, i = 0, j = 0;
> > > > > > > +
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           err = i915_active_acquire(&child->active);
> > > > > > > +           if (unlikely(err))
> > > > > > > +                   goto unwind_active;
> > > > > > > +           ++i;
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           err = __guc_context_pre_pin(child, child->engine, ww);
> > > > > > > +           if (unlikely(err))
> > > > > > > +                   goto unwind_pre_pin;
> > > > > > > +           ++j;
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   err = __guc_context_pre_pin(ce, ce->engine, ww);
> > > > > > > +   if (unlikely(err))
> > > > > > > +           goto unwind_pre_pin;
> > > > > > > +
> > > > > > > +   return 0;
> > > > > > > +
> > > > > > > +unwind_pre_pin:
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           if (!j--)
> > > > > > > +                   break;
> > > > > > > +           __guc_context_post_unpin(child);
> > > > > > > +   }
> > > > > > > +
> > > > > > > +unwind_active:
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           if (!i--)
> > > > > > > +                   break;
> > > > > > > +           i915_active_release(&child->active);
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   return err;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Future patches will use this function */
> > > > > > > +__maybe_unused
> > > > > > > +static void guc_parent_context_post_unpin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   struct intel_context *child;
> > > > > > > +
> > > > > > > +   for_each_child(ce, child)
> > > > > > > +           __guc_context_post_unpin(child);
> > > > > > > +   __guc_context_post_unpin(ce);
> > > > > > > +
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           intel_context_get(child);
> > > > > > > +           i915_active_release(&child->active);
> > > > > > > +           intel_context_put(child);
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Future patches will use this function */
> > > > > > > +__maybe_unused
> > > > > > > +static int guc_parent_context_pin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   int ret, i = 0, j = 0;
> > > > > > > +   struct intel_context *child;
> > > > > > > +   struct intel_engine_cs *engine;
> > > > > > > +   intel_engine_mask_t tmp;
> > > > > > > +
> > > > > > > +   GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > > > > +
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           ret = __guc_context_pin(child, child->engine);
> > > > > > > +           if (unlikely(ret))
> > > > > > > +                   goto unwind_pin;
> > > > > > > +           ++i;
> > > > > > > +   }
> > > > > > > +   ret = __guc_context_pin(ce, ce->engine);
> > > > > > > +   if (unlikely(ret))
> > > > > > > +           goto unwind_pin;
> > > > > > > +
> > > > > > > +   for_each_child(ce, child)
> > > > > > > +           if (test_bit(CONTEXT_LRCA_DIRTY, &child->flags)) {
> > > > > > > +                   set_bit(CONTEXT_LRCA_DIRTY, &ce->flags);
> > > > > > > +                   break;
> > > > > > > +           }
> > > > > > > +
> > > > > > > +   for_each_engine_masked(engine, ce->engine->gt,
> > > > > > > +                          ce->engine->mask, tmp)
> > > > > > > +           intel_engine_pm_get(engine);
> > > > > > > +   for_each_child(ce, child)
> > > > > > > +           for_each_engine_masked(engine, child->engine->gt,
> > > > > > > +                                  child->engine->mask, tmp)
> > > > > > > +                   intel_engine_pm_get(engine);
> > > > > > > +
> > > > > > > +   return 0;
> > > > > > > +
> > > > > > > +unwind_pin:
> > > > > > > +   for_each_child(ce, child) {
> > > > > > > +           if (++j > i)
> > > > > > > +                   break;
> > > > > > > +           __guc_context_unpin(child);
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Future patches will use this function */
> > > > > > > +__maybe_unused
> > > > > > > +static void guc_parent_context_unpin(struct intel_context *ce)
> > > > > > > +{
> > > > > > > +   struct intel_context *child;
> > > > > > > +   struct intel_engine_cs *engine;
> > > > > > > +   intel_engine_mask_t tmp;
> > > > > > > +
> > > > > > > +   GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > > > > +   GEM_BUG_ON(context_enabled(ce));
> > > > > > > +
> > > > > > > +   unpin_guc_id(ce_to_guc(ce), ce, true);
> > > > > > > +   for_each_child(ce, child)
> > > > > > > +           __guc_context_unpin(child);
> > > > > > > +   __guc_context_unpin(ce);
> > > > > > > +
> > > > > > > +   for_each_engine_masked(engine, ce->engine->gt,
> > > > > > > +                          ce->engine->mask, tmp)
> > > > > > > +           intel_engine_pm_put(engine);
> > > > > > > +   for_each_child(ce, child)
> > > > > > > +           for_each_engine_masked(engine, child->engine->gt,
> > > > > > > +                                  child->engine->mask, tmp)
> > > > > > > +                   intel_engine_pm_put(engine);
> > > > > > >  }
> > > > > > >
> > > > > > >  static void __guc_context_sched_enable(struct intel_guc *guc,
> > > > > > > @@ -2993,18 +3139,17 @@ static int guc_request_alloc(struct i915_request *rq)
> > > > > > >  }
> > > > > > >
> > > > > > >  static int guc_virtual_context_pre_pin(struct intel_context *ce,
> > > > > > > -                                  struct i915_gem_ww_ctx *ww,
> > > > > > > -                                  void **vaddr)
> > > > > > > +                                  struct i915_gem_ww_ctx *ww)
> > > > > > >  {
> > > > > > >     struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > > > >
> > > > > > > -   return __guc_context_pre_pin(ce, engine, ww, vaddr);
> > > > > > > +   return __guc_context_pre_pin(ce, engine, ww);
> > > > > > >  }
> > > > > > >
> > > > > > > -static int guc_virtual_context_pin(struct intel_context *ce, void *vaddr)
> > > > > > > +static int guc_virtual_context_pin(struct intel_context *ce)
> > > > > > >  {
> > > > > > >     struct intel_engine_cs *engine = guc_virtual_get_sibling(ce->engine, 0);
> > > > > > > -   int ret = __guc_context_pin(ce, engine, vaddr);
> > > > > > > +   int ret = __guc_context_pin(ce, engine);
> > > > > > >     intel_engine_mask_t tmp, mask = ce->engine->mask;
> > > > > > >
> > > > > > >     if (likely(!ret))
> > > > > > > @@ -3024,7 +3169,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
> > > > > > >     GEM_BUG_ON(intel_context_is_barrier(ce));
> > > > > > >
> > > > > > >     unpin_guc_id(guc, ce, true);
> > > > > > > -   lrc_unpin(ce);
> > > > > > > +   __guc_context_unpin(ce);
> > > > > > >
> > > > > > >     for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> > > > > > >             intel_engine_pm_put(engine);
> > > > > > > --
> > > > > > > 2.28.0
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts
  2021-08-03 22:29 ` [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts Matthew Brost
  2021-08-09 17:17   ` Daniel Vetter
@ 2021-08-12 19:26   ` Daniel Vetter
  1 sibling, 0 replies; 111+ messages in thread
From: Daniel Vetter @ 2021-08-12 19:26 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Tue, Aug 03, 2021 at 03:29:43PM -0700, Matthew Brost wrote:
> Some workloads use lots of contexts and continually pin / unpin
> them. With GuC submission an unpin translates to a schedule disable
> H2G which puts pressure on both the i915 and GuC. A schedule disable can
> also block future requests from being submitted until the operation
> completes. None of this is ideal.
> 
> Add a delay period, configurable via debugfs, before the schedule
> disable is issued. The default delay period is 1 second. The delay period is
> skipped if more than 3/4 of the guc_ids are in use.
> 
> This patch also updates the selftests to turn off this delay period as
> this extra time would likely cause many selftests to fail. Follow up
> patches will fix all the selftests and enable the delay period.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Quick summary of what we just discussed:

- mod_delayed_work (one per context) is your friend.

- if we go with that for the sched_disable work then that can just take
  the ce->pin_mutex, and recheck there. Which takes care of all the races
  (or well, should at least), because mod_delayed_work is making sure you
  never miss an update.

I'm feeling like maybe this would be a semi-reasonable intermediate
option instead of just hard-pinning contexts completely for their entire
lifetime.

I think that would be a smaller step than perma-pinned contexts with their
guc_id, and it would allow us to clean up a lot of this code here still.
-Daniel
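
(For illustration, a rough sketch of the mod_delayed_work idea, with made-up
names rather than the actual series code: the last unpin arms a per-context
delayed work, and the worker rechecks under the pin mutex before issuing the
schedule disable.)

#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/atomic.h>
#include <linux/jiffies.h>

#define EXAMPLE_SCHED_DISABLE_DELAY_MS	1000	/* the 1 second default above */

struct example_ctx {
	struct mutex pin_mutex;
	atomic_t pin_count;
	struct delayed_work sched_disable_work;
};

/* Hypothetical stand-in for the real H2G schedule-disable helper. */
static void example_issue_sched_disable(struct example_ctx *ce);

static void example_sched_disable_worker(struct work_struct *wrk)
{
	struct delayed_work *dwork = to_delayed_work(wrk);
	struct example_ctx *ce =
		container_of(dwork, struct example_ctx, sched_disable_work);

	mutex_lock(&ce->pin_mutex);
	/* Recheck under the lock: a new pin may have arrived during the delay. */
	if (!atomic_read(&ce->pin_count))
		example_issue_sched_disable(ce);
	mutex_unlock(&ce->pin_mutex);
}

static void example_unpin(struct example_ctx *ce)
{
	if (atomic_dec_and_test(&ce->pin_count))
		mod_delayed_work(system_wq, &ce->sched_disable_work,
				 msecs_to_jiffies(EXAMPLE_SCHED_DISABLE_DELAY_MS));
}

A real version would also need the pin side to deal with a pending disable
(cancel the work or re-enable scheduling), which is exactly the kind of race
the pin_mutex recheck in the worker is meant to keep manageable.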

> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
>  .../i915/gem/selftests/i915_gem_coherency.c   |   2 +-
>  .../drm/i915/gem/selftests/i915_gem_dmabuf.c  |   2 +-
>  .../drm/i915/gem/selftests/i915_gem_mman.c    |   2 +-
>  .../drm/i915/gem/selftests/i915_gem_object.c  |   2 +-
>  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
>  drivers/gpu/drm/i915/gt/intel_context.h       |   9 +
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   8 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   7 +
>  .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  28 ++
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 322 +++++++++++++++++-
>  .../i915/gt/uc/selftest_guc_flow_control.c    |  19 +-
>  drivers/gpu/drm/i915/i915_selftest.h          |   2 +
>  drivers/gpu/drm/i915/i915_trace.h             |  10 +
>  drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   2 +-
>  drivers/gpu/drm/i915/selftests/i915_perf.c    |   2 +-
>  drivers/gpu/drm/i915/selftests/i915_request.c |   2 +-
>  drivers/gpu/drm/i915/selftests/i915_vma.c     |   2 +-
>  18 files changed, 405 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index b199d59bd2c4..1553287e5491 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -1298,7 +1298,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
>  		int err;
>  
>  		/* serialises with execbuf */
> -		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> +		intel_context_close(ce);
>  		if (!intel_context_pin_if_active(ce))
>  			continue;
>  
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> index 13b088cc787e..a666d7e610f5 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> @@ -434,5 +434,5 @@ int i915_gem_coherency_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_gem_coherency),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> index ffae7df5e4d7..2c92afa9d608 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
> @@ -474,5 +474,5 @@ int i915_gem_dmabuf_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_dmabuf_import_same_driver_lmem_smem),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> index b20f5621f62b..4745c78a48de 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> @@ -1414,5 +1414,5 @@ int i915_gem_mman_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_mmap_gpu),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> index 740ee8086a27..ae1361c7c4cf 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object.c
> @@ -95,5 +95,5 @@ int i915_gem_object_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_gem_huge),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 8e90a4a0b7b0..96643040defd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -472,6 +472,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine)
>  	ce->guc_id = GUC_INVALID_LRC_ID;
>  	INIT_LIST_HEAD(&ce->guc_id_link);
>  
> +	INIT_LIST_HEAD(&ce->guc_sched_disable_link);
> +
>  	mutex_init(&ce->parallel_submit);
>  	ce->fence_context = dma_fence_context_alloc(1);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index a302599e436a..f4c9036f7f03 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -215,6 +215,15 @@ static inline bool intel_context_is_barrier(const struct intel_context *ce)
>  	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
>  }
>  
> +static inline void intel_context_close(struct intel_context *ce)
> +{
> +	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> +
> +	trace_intel_context_close(ce);
> +	if (ce->ops->close)
> +		ce->ops->close(ce);
> +}
> +
>  static inline bool intel_context_is_closed(const struct intel_context *ce)
>  {
>  	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 8af9ace4c052..53f00657a45c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -11,6 +11,7 @@
>  #include <linux/list.h>
>  #include <linux/mutex.h>
>  #include <linux/types.h>
> +#include <linux/ktime.h>
>  
>  #include "i915_active_types.h"
>  #include "i915_sw_fence.h"
> @@ -38,6 +39,7 @@ struct intel_context_ops {
>  	int (*alloc)(struct intel_context *ce);
>  
>  	void (*ban)(struct intel_context *ce, struct i915_request *rq);
> +	void (*close)(struct intel_context *ce);
>  
>  	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww);
>  	int (*pin)(struct intel_context *ce);
> @@ -203,6 +205,12 @@ struct intel_context {
>  	 */
>  	struct list_head guc_id_link;
>  
> +	/*
> +	 * GuC schedule disable link / time
> +	 */
> +	struct list_head guc_sched_disable_link;
> +	ktime_t guc_sched_disable_time;
> +
>  	/* GuC context blocked fence */
>  	struct i915_sw_fence guc_blocked;
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 30a0f364db8f..90b5b657d411 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -60,6 +60,7 @@ struct intel_guc {
>  	struct ida guc_ids;
>  	u32 num_guc_ids;
>  	u32 max_guc_ids;
> +	u32 guc_ids_in_use[GUC_SUBMIT_ENGINE_MAX];
>  	unsigned long *guc_ids_bitmap;
>  #define MAX_GUC_ID_ORDER	(order_base_2(MAX_ENGINE_INSTANCE + 1))
>  	struct list_head guc_id_list_no_ref[MAX_GUC_ID_ORDER + 1];
> @@ -69,6 +70,12 @@ struct intel_guc {
>  	struct list_head destroyed_contexts;
>  	struct intel_gt_pm_unpark_work destroy_worker;
>  
> +	spinlock_t sched_disable_lock;	/* protects schedule disable list */
> +	struct list_head sched_disable_list;
> +	struct hrtimer sched_disable_timer;
> +#define SCHED_DISABLE_DELAY_NS	1000000000
> +	u64 sched_disable_delay_ns;
> +
>  	bool submission_supported;
>  	bool submission_selected;
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> index 7c479c5e7b3a..53a6f3da6cce 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> @@ -80,12 +80,40 @@ static int guc_num_id_set(void *data, u64 val)
>  }
>  DEFINE_SIMPLE_ATTRIBUTE(guc_num_id_fops, guc_num_id_get, guc_num_id_set, "%lld\n");
>  
> +static int guc_sched_disable_delay_ns_get(void *data, u64 *val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	*val = guc->sched_disable_delay_ns;
> +
> +	return 0;
> +}
> +
> +static int guc_sched_disable_delay_ns_set(void *data, u64 val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	guc->sched_disable_delay_ns = val;
> +
> +	return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ns_fops,
> +			guc_sched_disable_delay_ns_get,
> +			guc_sched_disable_delay_ns_set, "%lld\n");
> +
>  void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
>  {
>  	static const struct debugfs_gt_file files[] = {
>  		{ "guc_info", &guc_info_fops, NULL },
>  		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
>  		{ "guc_num_id", &guc_num_id_fops, NULL },
> +		{ "guc_sched_disable_delay_ns", &guc_sched_disable_delay_ns_fops, NULL },
>  	};
>  
>  	if (!intel_guc_is_supported(guc))
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index cd1893edf43a..dc0d6a099bee 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -654,11 +654,15 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
>  	return (timeout < 0) ? timeout : 0;
>  }
>  
> +static void sched_disable_contexts_flush(struct intel_guc *guc);
> +
>  int intel_guc_wait_for_idle(struct intel_guc *guc, long timeout)
>  {
>  	if (!intel_uc_uses_guc_submission(&guc_to_gt(guc)->uc))
>  		return 0;
>  
> +	sched_disable_contexts_flush(guc);
> +
>  	return intel_guc_wait_for_pending_msg(guc,
>  					      &guc->outstanding_submission_g2h,
>  					      true, timeout);
> @@ -1135,6 +1139,7 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce);
>  static void guc_signal_context_fence(struct intel_context *ce);
>  static void guc_cancel_context_requests(struct intel_context *ce);
>  static void guc_blocked_fence_complete(struct intel_context *ce);
> +static void sched_disable_context_delete(struct intel_context *ce);
>  
>  static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>  {
> @@ -1160,6 +1165,7 @@ static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
>  		deregister = context_wait_for_deregister_to_register(ce);
>  		banned = context_banned(ce);
>  		init_sched_state(ce);
> +		sched_disable_context_delete(ce);
>  
>  		if (pending_enable || destroyed || deregister) {
>  			atomic_dec(&guc->outstanding_submission_g2h);
> @@ -1299,6 +1305,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>  
>  	intel_gt_park_heartbeats(guc_to_gt(guc));
>  	disable_submission(guc);
> +	hrtimer_cancel(&guc->sched_disable_timer);
>  	guc->interrupts.disable(guc);
>  
>  	/* Flush IRQ handler */
> @@ -1656,6 +1663,8 @@ static void guc_lrcd_reg_fini(struct intel_guc *guc);
>  
>  static void destroy_worker_func(struct work_struct *w);
>  
> +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer);
> +
>  /*
>   * Set up the memory resources to be shared with the GuC (via the GGTT)
>   * at firmware loading time.
> @@ -1687,6 +1696,13 @@ int intel_guc_submission_init(struct intel_guc *guc)
>  	INIT_LIST_HEAD(&guc->destroyed_contexts);
>  	intel_gt_pm_unpark_work_init(&guc->destroy_worker, destroy_worker_func);
>  
> +	spin_lock_init(&guc->sched_disable_lock);
> +	INIT_LIST_HEAD(&guc->sched_disable_list);
> +	hrtimer_init(&guc->sched_disable_timer, CLOCK_MONOTONIC,
> +		     HRTIMER_MODE_REL);
> +	guc->sched_disable_timer.function = sched_disable_timer_func;
> +	guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS;
> +
>  	return 0;
>  }
>  
> @@ -1852,6 +1868,12 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  	if (unlikely(ret < 0))
>  		return ret;
>  
> +	if (intel_context_is_parent(ce))
> +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> +			order_base_2(ce->guc_number_children + 1);
> +	else
> +		guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]++;
> +
>  	ce->guc_id = ret;
>  	return 0;
>  }
> @@ -1860,13 +1882,18 @@ static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>  {
>  	GEM_BUG_ON(intel_context_is_child(ce));
>  	if (!context_guc_id_invalid(ce)) {
> -		if (intel_context_is_parent(ce))
> +		if (intel_context_is_parent(ce)) {
> +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> +				order_base_2(ce->guc_number_children + 1);
>  			bitmap_release_region(guc->guc_ids_bitmap, ce->guc_id,
>  					      order_base_2(ce->guc_number_children
>  							   + 1));
> -		else
> +		} else {
> +			guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]--;
>  			ida_simple_remove(&guc->guc_ids, ce->guc_id);
> +		}
>  		clr_lrc_desc_registered(guc, ce->guc_id);
> +
>  		set_context_guc_id_invalid(ce);
>  	}
>  	if (!list_empty(&ce->guc_id_link))
> @@ -1931,9 +1958,13 @@ static int steal_guc_id(struct intel_guc *guc, struct intel_context *ce,
>  			 * from another context that has more guc_id that itself.
>  			 */
>  			if (cn_o2 != ce_o2) {
> +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] -=
> +					order_base_2(cn->guc_number_children + 1);
>  				bitmap_release_region(guc->guc_ids_bitmap,
>  						      cn->guc_id,
>  						      cn_o2);
> +				guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC] +=
> +					order_base_2(ce->guc_number_children + 1);
>  				bitmap_allocate_region(guc->guc_ids_bitmap,
>  						       ce->guc_id,
>  						       ce_o2);
> @@ -2538,7 +2569,7 @@ static void guc_context_unpin(struct intel_context *ce)
>  	__guc_context_unpin(ce);
>  
>  	if (likely(!intel_context_is_barrier(ce)))
> -		intel_engine_pm_put(ce->engine);
> +		intel_engine_pm_put_async(ce->engine);
>  }
>  
>  static void guc_context_post_unpin(struct intel_context *ce)
> @@ -2665,11 +2696,11 @@ static void guc_parent_context_unpin(struct intel_context *ce)
>  
>  	for_each_engine_masked(engine, ce->engine->gt,
>  			       ce->engine->mask, tmp)
> -		intel_engine_pm_put(engine);
> +		intel_engine_pm_put_async(engine);
>  	for_each_child(ce, child)
>  		for_each_engine_masked(engine, child->engine->gt,
>  				       child->engine->mask, tmp)
> -			intel_engine_pm_put(engine);
> +			intel_engine_pm_put_async(engine);
>  }
>  
>  static void __guc_context_sched_enable(struct intel_guc *guc,
> @@ -2788,6 +2819,8 @@ static struct i915_sw_fence *guc_context_block(struct intel_context *ce)
>  
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  
> +	sched_disable_context_delete(ce);
> +
>  	with_intel_runtime_pm(runtime_pm, wakeref)
>  		__guc_context_sched_disable(guc, ce, guc_id);
>  
> @@ -2914,8 +2947,202 @@ static void guc_context_ban(struct intel_context *ce, struct i915_request *rq)
>  								     1);
>  		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  	}
> +
> +	sched_disable_context_delete(ce);
> +}
> +
> +#define next_sched_disable_time(guc, now, ce) \
> +	(guc->sched_disable_delay_ns - \
> +	 (ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)))
> +static void ____sched_disable_context_delete(struct intel_guc *guc,
> +					     struct intel_context *ce)
> +{
> +	bool is_first;
> +
> +	lockdep_assert_held(&guc->sched_disable_lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(list_empty(&ce->guc_sched_disable_link));
> +
> +	is_first = list_is_first(&ce->guc_sched_disable_link,
> +				 &guc->sched_disable_list);
> +	list_del_init(&ce->guc_sched_disable_link);
> +	if (list_empty(&guc->sched_disable_list)) {
> +		hrtimer_try_to_cancel(&guc->sched_disable_timer);
> +	} else if (is_first) {
> +		struct intel_context *first =
> +			list_first_entry(&guc->sched_disable_list,
> +					 typeof(*first),
> +					 guc_sched_disable_link);
> +		u64 next_time = next_sched_disable_time(guc, ktime_get(),
> +							first);
> +
> +		hrtimer_start(&guc->sched_disable_timer,
> +			      ns_to_ktime(next_time),
> +			      HRTIMER_MODE_REL_PINNED);
> +	}
> +}
> +
> +static void __sched_disable_context_delete(struct intel_guc *guc,
> +					   struct intel_context *ce)
> +{
> +	lockdep_assert_held(&guc->sched_disable_lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (!list_empty(&ce->guc_sched_disable_link)) {
> +		intel_context_sched_disable_unpin(ce);
> +		____sched_disable_context_delete(guc, ce);
> +	}
> +}
> +
> +static void sched_disable_context_delete(struct intel_context *ce)
> +{
> +	struct intel_guc *guc = ce_to_guc(ce);
> +	unsigned long flags;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (!list_empty(&ce->guc_sched_disable_link)) {
> +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +		__sched_disable_context_delete(guc, ce);
> +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +	}
> +}
> +
> +static void sched_disable_context_add(struct intel_guc *guc,
> +				      struct intel_context *ce)
> +{
> +	unsigned long flags;
> +
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
> +
> +	ce->guc_sched_disable_time = ktime_get();
> +
> +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +	if (list_empty(&guc->sched_disable_list))
> +		hrtimer_start(&guc->sched_disable_timer,
> +			      ns_to_ktime(guc->sched_disable_delay_ns),
> +			      HRTIMER_MODE_REL_PINNED);
> +	list_add_tail(&ce->guc_sched_disable_link, &guc->sched_disable_list);
> +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +}
> +
> +static void sched_disable_contexts_flush(struct intel_guc *guc)
> +{
> +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +
> +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> +					 guc_sched_disable_link) {
> +		intel_wakeref_t wakeref;
> +		bool enabled;
> +		u16 guc_id;
> +
> +		list_del_init(&ce->guc_sched_disable_link);
> +
> +		spin_lock(&ce->guc_state.lock);
> +		enabled = context_enabled(ce);
> +		if (unlikely(!enabled || submission_disabled(guc))) {
> +			if (enabled)
> +				clr_context_enabled(ce);
> +			spin_unlock(&ce->guc_state.lock);
> +			intel_context_sched_disable_unpin(ce);
> +			continue;
> +		}
> +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> +			spin_unlock(&ce->guc_state.lock);
> +			continue;
> +		}
> +		guc_id = prep_context_pending_disable(ce);
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		with_intel_runtime_pm(runtime_pm, wakeref)
> +			__guc_context_sched_disable(guc, ce, guc_id);
> +	}
> +
> +	hrtimer_try_to_cancel(&guc->sched_disable_timer);
> +
> +	spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
>  }
>  
> +#define should_sched_be_disabled(guc, now, ce) \
> +	((ktime_to_ns(now) - ktime_to_ns(ce->guc_sched_disable_time)) > \
> +	(guc->sched_disable_delay_ns / 4) * 3)
> +static enum hrtimer_restart sched_disable_timer_func(struct hrtimer *hrtimer)
> +{
> +	struct intel_guc *guc = container_of(hrtimer, struct intel_guc,
> +					     sched_disable_timer);
> +	struct intel_runtime_pm *runtime_pm = &guc_to_gt(guc)->i915->runtime_pm;
> +	struct intel_context *ce, *cn;
> +	unsigned long flags;
> +	ktime_t now;
> +
> +	if (list_empty(&guc->sched_disable_list))
> +		return HRTIMER_NORESTART;
> +
> +	now = ktime_get();
> +
> +	spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +
> +	list_for_each_entry_safe_reverse(ce, cn, &guc->sched_disable_list,
> +					 guc_sched_disable_link) {
> +		intel_wakeref_t wakeref;
> +		bool enabled;
> +		u16 guc_id;
> +
> +		/*
> +		 * If a context has been waiting for 3/4 of its delay or more,
> +		 * issue the schedule disable. Using this heuristic allows more
> +		 * than 1 context to have its scheduling disabled when this
> +		 * timer is run.
> +		 */
> +		if (!should_sched_be_disabled(guc, now, ce))
> +			break;
> +
> +		list_del_init(&ce->guc_sched_disable_link);
> +
> +		spin_lock(&ce->guc_state.lock);
> +		enabled = context_enabled(ce);
> +		if (unlikely(!enabled || submission_disabled(guc))) {
> +			if (enabled)
> +				clr_context_enabled(ce);
> +			spin_unlock(&ce->guc_state.lock);
> +			intel_context_sched_disable_unpin(ce);
> +			continue;
> +		}
> +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> +			spin_unlock(&ce->guc_state.lock);
> +			continue;
> +		}
> +		guc_id = prep_context_pending_disable(ce);
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		with_intel_runtime_pm(runtime_pm, wakeref)
> +			__guc_context_sched_disable(guc, ce, guc_id);
> +	}
> +
> +	if (!list_empty(&guc->sched_disable_list)) {
> +		struct intel_context *first =
> +			list_first_entry(&guc->sched_disable_list,
> +					 typeof(*first),
> +					 guc_sched_disable_link);
> +		u64 next_time = next_sched_disable_time(guc, now, first);
> +
> +		hrtimer_forward(hrtimer, now, ns_to_ktime(next_time));
> +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +
> +		return HRTIMER_RESTART;
> +	} else {
> +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +
> +		return HRTIMER_NORESTART;
> +	}
> +}
> +
> +#define guc_id_pressure(max, in_use)	(in_use > (max / 4) * 3)
>  static void guc_context_sched_disable(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
> @@ -2924,8 +3151,14 @@ static void guc_context_sched_disable(struct intel_context *ce)
>  	intel_wakeref_t wakeref;
>  	u16 guc_id;
>  	bool enabled;
> +	int guc_id_index = intel_context_is_parent(ce) ?
> +		GUC_SUBMIT_ENGINE_MULTI_LRC : GUC_SUBMIT_ENGINE_SINGLE_LRC;
> +	int max_guc_ids = intel_context_is_parent(ce) ?
> +	       NUMBER_MULTI_LRC_GUC_ID(guc) :
> +	       guc->num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
>  
>  	GEM_BUG_ON(intel_context_is_child(ce));
> +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link));
>  
>  	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
>  	    !lrc_desc_registered(guc, ce->guc_id)) {
> @@ -2936,6 +3169,18 @@ static void guc_context_sched_disable(struct intel_context *ce)
>  	if (!context_enabled(ce))
>  		goto unpin;
>  
> +	/*
> +	 * If there is no guc_id pressure and the context isn't closed, delay
> +	 * the schedule disable so we don't continuously disable / enable
> +	 * scheduling, putting pressure on both the i915 and the GuC. The
> +	 * delay is configurable via debugfs, default 1s.
> +	 */
> +	if (!guc_id_pressure(max_guc_ids, guc->guc_ids_in_use[guc_id_index]) &&
> +	    !intel_context_is_closed(ce) && guc->sched_disable_delay_ns) {
> +		sched_disable_context_add(guc, ce);
> +		return;
> +	}
> +
>  	spin_lock_irqsave(&ce->guc_state.lock, flags);
>  
>  	/*
> @@ -3294,6 +3539,58 @@ static void remove_from_context(struct i915_request *rq)
>  	i915_request_notify_execute_cb_imm(rq);
>  }
>  
> +static void __guc_context_close(struct intel_guc *guc,
> +				struct intel_context *ce)
> +{
> +	lockdep_assert_held(&guc->sched_disable_lock);
> +	GEM_BUG_ON(intel_context_is_child(ce));
> +
> +	if (!list_empty(&ce->guc_sched_disable_link)) {
> +		struct intel_runtime_pm *runtime_pm =
> +			ce->engine->uncore->rpm;
> +		intel_wakeref_t wakeref;
> +		bool enabled;
> +		u16 guc_id;
> +
> +		spin_lock(&ce->guc_state.lock);
> +		enabled = context_enabled(ce);
> +		if (unlikely(!enabled || submission_disabled(guc))) {
> +			if (enabled)
> +				clr_context_enabled(ce);
> +			spin_unlock(&ce->guc_state.lock);
> +			intel_context_sched_disable_unpin(ce);
> +			goto update_list;
> +		}
> +		if (unlikely(atomic_add_unless(&ce->pin_count, -2, 2))) {
> +			spin_unlock(&ce->guc_state.lock);
> +			goto update_list;
> +		}
> +		guc_id = prep_context_pending_disable(ce);
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		with_intel_runtime_pm(runtime_pm, wakeref)
> +			__guc_context_sched_disable(guc, ce, guc_id);
> +update_list:
> +		____sched_disable_context_delete(guc, ce);
> +	}
> +}
> +
> +static void guc_context_close(struct intel_context *ce)
> +{
> +	struct intel_guc *guc = ce_to_guc(ce);
> +	unsigned long flags;
> +
> +	/*
> +	 * If we close the context while a delayed schedule disable is still
> +	 * pending, issue it immediately.
> +	 */
> +	if (!list_empty(&ce->guc_sched_disable_link)) {
> +		spin_lock_irqsave(&guc->sched_disable_lock, flags);
> +		__guc_context_close(guc, ce);
> +		spin_unlock_irqrestore(&guc->sched_disable_lock, flags);
> +	}
> +}
> +
>  static struct intel_context *
>  guc_create_parallel(struct intel_engine_cs **engines,
>  		    unsigned int num_siblings,
> @@ -3308,6 +3605,7 @@ static const struct intel_context_ops guc_context_ops = {
>  	.post_unpin = guc_context_post_unpin,
>  
>  	.ban = guc_context_ban,
> +	.close = guc_context_close,
>  
>  	.cancel_request = guc_context_cancel_request,
>  
> @@ -3538,6 +3836,10 @@ static int guc_request_alloc(struct i915_request *rq)
>  
>  	rq->reserved_space -= GUC_REQUEST_SIZE;
>  
> +	GEM_BUG_ON(!list_empty(&ce->guc_sched_disable_link) &&
> +		   atomic_read(&ce->pin_count) < 3);
> +	sched_disable_context_delete(ce);
> +
>  	/*
>  	 * guc_ids are exhausted or a heuristic is met indicating too many
>  	 * guc_ids are waiting on requests with submission dependencies (not
> @@ -3667,7 +3969,7 @@ static void guc_virtual_context_unpin(struct intel_context *ce)
>  	__guc_context_unpin(ce);
>  
>  	for_each_engine_masked(engine, ce->engine->gt, mask, tmp)
> -		intel_engine_pm_put(engine);
> +		intel_engine_pm_put_async(engine);
>  }
>  
>  static void guc_virtual_context_enter(struct intel_context *ce)
> @@ -3708,6 +4010,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
>  	.post_unpin = guc_context_post_unpin,
>  
>  	.ban = guc_context_ban,
> +	.close = guc_context_close,
>  
>  	.cancel_request = guc_context_cancel_request,
>  
> @@ -3819,6 +4122,7 @@ static const struct intel_context_ops virtual_parent_context_ops = {
>  	.post_unpin = guc_parent_context_post_unpin,
>  
>  	.ban = guc_context_ban,
> +	.close = guc_context_close,
>  
>  	.enter = guc_virtual_context_enter,
>  	.exit = guc_virtual_context_exit,
> @@ -4924,7 +5228,11 @@ void intel_guc_submission_print_info(struct intel_guc *guc,
>  	drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
>  		   atomic_read(&guc->outstanding_submission_g2h));
>  	drm_printf(p, "GuC Number GuC IDs: %d\n", guc->num_guc_ids);
> -	drm_printf(p, "GuC Max Number GuC IDs: %d\n\n", guc->max_guc_ids);
> +	drm_printf(p, "GuC Max Number GuC IDs: %d\n", guc->max_guc_ids);
> +	drm_printf(p, "GuC single-lrc GuC IDs in use: %d\n",
> +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_SINGLE_LRC]);
> +	drm_printf(p, "GuC multi-lrc GuC IDs in use: %d\n",
> +		   guc->guc_ids_in_use[GUC_SUBMIT_ENGINE_MULTI_LRC]);
>  	drm_printf(p, "GuC max context registered: %u\n\n",
>  		   guc->lrcd_reg.max_idx);
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> index 9cfecf9d368e..ad70b3159ce4 100644
> --- a/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> +++ b/drivers/gpu/drm/i915/gt/uc/selftest_guc_flow_control.c
> @@ -174,7 +174,8 @@ static int multi_lrc_not_blocked(struct intel_gt *gt, bool flow_control)
>  #define NUM_RQ_PER_CONTEXT	2
>  #define HEARTBEAT_INTERVAL	1500
>  
> -static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang)
> +static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids,
> +					bool hang, bool sched_disable_delay)
>  {
>  	struct intel_gt *gt = arg;
>  	struct intel_guc *guc = &gt->uc.guc;
> @@ -203,6 +204,9 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
>  	if (limit_guc_ids)
>  		guc->num_guc_ids = NUM_GUC_ID;
>  
> +	if (sched_disable_delay)
> +		guc->sched_disable_delay_ns = SCHED_DISABLE_DELAY_NS / 5;
> +
>  	ce = intel_context_create(intel_selftest_find_any_engine(gt));
>  	if (IS_ERR(ce)) {
>  		ret = PTR_ERR(ce);
> @@ -391,6 +395,7 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
>  	guc->num_guc_ids = guc->max_guc_ids;
>  	guc->gse_hang_expected = false;
>  	guc->inject_bad_sched_disable = false;
> +	guc->sched_disable_delay_ns = 0;
>  	kfree(contexts);
>  
>  	return ret;
> @@ -398,17 +403,22 @@ static int __intel_guc_flow_control_guc(void *arg, bool limit_guc_ids, bool hang
>  
>  static int intel_guc_flow_control_guc_ids(void *arg)
>  {
> -	return __intel_guc_flow_control_guc(arg, true, false);
> +	return __intel_guc_flow_control_guc(arg, true, false, false);
> +}
> +
> +static int intel_guc_flow_control_guc_ids_sched_disable_delay(void *arg)
> +{
> +	return __intel_guc_flow_control_guc(arg, true, false, true);
>  }
>  
>  static int intel_guc_flow_control_lrcd_reg(void *arg)
>  {
> -	return __intel_guc_flow_control_guc(arg, false, false);
> +	return __intel_guc_flow_control_guc(arg, false, false, false);
>  }
>  
>  static int intel_guc_flow_control_hang_state_machine(void *arg)
>  {
> -	return __intel_guc_flow_control_guc(arg, true, true);
> +	return __intel_guc_flow_control_guc(arg, true, true, false);
>  }
>  
>  #define NUM_RQ_STRESS_CTBS	0x4000
> @@ -861,6 +871,7 @@ int intel_guc_flow_control(struct drm_i915_private *i915)
>  	static const struct i915_subtest tests[] = {
>  		SUBTEST(intel_guc_flow_control_stress_ctbs),
>  		SUBTEST(intel_guc_flow_control_guc_ids),
> +		SUBTEST(intel_guc_flow_control_guc_ids_sched_disable_delay),
>  		SUBTEST(intel_guc_flow_control_lrcd_reg),
>  		SUBTEST(intel_guc_flow_control_hang_state_machine),
>  		SUBTEST(intel_guc_flow_control_multi_lrc_guc_ids),
> diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
> index f54de0499be7..bf464db7affe 100644
> --- a/drivers/gpu/drm/i915/i915_selftest.h
> +++ b/drivers/gpu/drm/i915/i915_selftest.h
> @@ -92,12 +92,14 @@ int __i915_subtests(const char *caller,
>  			T, ARRAY_SIZE(T), data)
>  #define i915_live_subtests(T, data) ({ \
>  	typecheck(struct drm_i915_private *, data); \
> +	(data)->gt.uc.guc.sched_disable_delay_ns = 0; \
>  	__i915_subtests(__func__, \
>  			__i915_live_setup, __i915_live_teardown, \
>  			T, ARRAY_SIZE(T), data); \
>  })
>  #define intel_gt_live_subtests(T, data) ({ \
>  	typecheck(struct intel_gt *, data); \
> +	(data)->uc.guc.sched_disable_delay_ns = 0; \
>  	__i915_subtests(__func__, \
>  			__intel_gt_live_setup, __intel_gt_live_teardown, \
>  			T, ARRAY_SIZE(T), data); \
> diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
> index 806ad688274b..57ba7065d5ab 100644
> --- a/drivers/gpu/drm/i915/i915_trace.h
> +++ b/drivers/gpu/drm/i915/i915_trace.h
> @@ -933,6 +933,11 @@ DEFINE_EVENT(intel_context, intel_context_reset,
>  	     TP_ARGS(ce)
>  );
>  
> +DEFINE_EVENT(intel_context, intel_context_close,
> +	     TP_PROTO(struct intel_context *ce),
> +	     TP_ARGS(ce)
> +);
> +
>  DEFINE_EVENT(intel_context, intel_context_ban,
>  	     TP_PROTO(struct intel_context *ce),
>  	     TP_ARGS(ce)
> @@ -1035,6 +1040,11 @@ trace_intel_context_reset(struct intel_context *ce)
>  {
>  }
>  
> +static inline void
> +trace_intel_context_close(struct intel_context *ce)
> +{
> +}
> +
>  static inline void
>  trace_intel_context_ban(struct intel_context *ce)
>  {
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> index f843a5040706..d54c280217fe 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> @@ -2112,5 +2112,5 @@ int i915_gem_gtt_live_selftests(struct drm_i915_private *i915)
>  
>  	GEM_BUG_ON(offset_in_page(i915->ggtt.vm.total));
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> diff --git a/drivers/gpu/drm/i915/selftests/i915_perf.c b/drivers/gpu/drm/i915/selftests/i915_perf.c
> index 9e9a6cb1d9e5..86bad00cca95 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_perf.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_perf.c
> @@ -431,7 +431,7 @@ int i915_perf_live_selftests(struct drm_i915_private *i915)
>  	if (err)
>  		return err;
>  
> -	err = i915_subtests(tests, i915);
> +	err = i915_live_subtests(tests, i915);
>  
>  	destroy_empty_config(&i915->perf);
>  
> diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
> index d67710d10615..afbf88865a8b 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_request.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_request.c
> @@ -1693,7 +1693,7 @@ int i915_request_live_selftests(struct drm_i915_private *i915)
>  	if (intel_gt_is_wedged(&i915->gt))
>  		return 0;
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
>  
>  static int switch_to_kernel_sync(struct intel_context *ce, int err)
> diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
> index dd0607254a95..f4b157451851 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_vma.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
> @@ -1085,5 +1085,5 @@ int i915_vma_live_selftests(struct drm_i915_private *i915)
>  		SUBTEST(igt_vma_remapped_gtt),
>  	};
>  
> -	return i915_subtests(tests, i915);
> +	return i915_live_subtests(tests, i915);
>  }
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 111+ messages in thread

end of thread, other threads:[~2021-08-12 19:27 UTC | newest]

Thread overview: 111+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-03 22:28 [Intel-gfx] [PATCH 00/46] Parallel submission aka multi-bb execbuf Matthew Brost
2021-08-03 22:28 ` [Intel-gfx] [PATCH 01/46] drm/i915/guc: Allow flexible number of context ids Matthew Brost
2021-08-03 22:28 ` [Intel-gfx] [PATCH 02/46] drm/i915/guc: Connect the number of guc_ids to debugfs Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 03/46] drm/i915/guc: Don't return -EAGAIN to user when guc_ids exhausted Matthew Brost
2021-08-05  8:27   ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 04/46] drm/i915/guc: Don't allow requests not ready to consume all guc_ids Matthew Brost
2021-08-05  8:29   ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 05/46] drm/i915/guc: Introduce guc_submit_engine object Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 06/46] drm/i915/guc: Check return of __xa_store when registering a context Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 07/46] drm/i915/guc: Non-static lrc descriptor registration buffer Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 08/46] drm/i915/guc: Take GT PM ref when deregistering context Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 09/46] drm/i915: Add GT PM unpark worker Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 10/46] drm/i915/guc: Take engine PM when a context is pinned with GuC submission Matthew Brost
2021-08-09 14:23   ` Daniel Vetter
2021-08-09 18:11     ` Matthew Brost
2021-08-10  6:43       ` Daniel Vetter
2021-08-10 21:29         ` Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 11/46] drm/i915/guc: Don't call switch_to_kernel_context " Matthew Brost
2021-08-09 14:27   ` Daniel Vetter
2021-08-09 18:20     ` Matthew Brost
2021-08-10  6:47       ` Daniel Vetter
2021-08-11 17:47         ` Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 12/46] drm/i915/guc: Selftest for GuC flow control Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 13/46] drm/i915: Add logical engine mapping Matthew Brost
2021-08-09 14:28   ` Daniel Vetter
2021-08-09 18:28     ` Matthew Brost
2021-08-10  6:49       ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 14/46] drm/i915: Expose logical engine instance to user Matthew Brost
2021-08-09 14:30   ` Daniel Vetter
2021-08-09 18:37     ` Matthew Brost
2021-08-10  6:53       ` Daniel Vetter
2021-08-11 17:55         ` Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 15/46] drm/i915/guc: Introduce context parent-child relationship Matthew Brost
2021-08-09 14:37   ` Daniel Vetter
2021-08-09 14:40     ` Daniel Vetter
2021-08-09 18:45       ` Matthew Brost
2021-08-09 18:44     ` Matthew Brost
2021-08-10  8:45       ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 16/46] drm/i915/guc: Implement GuC parent-child context pin / unpin functions Matthew Brost
2021-08-09 15:17   ` Daniel Vetter
2021-08-09 18:58     ` Matthew Brost
2021-08-10  8:53       ` Daniel Vetter
2021-08-10  9:07         ` Daniel Vetter
2021-08-11 18:06           ` Matthew Brost
2021-08-12 14:45             ` Daniel Vetter
2021-08-12 14:52               ` Daniel Vetter
2021-08-11 18:23         ` Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 17/46] drm/i915/guc: Add multi-lrc context registration Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 18/46] drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 19/46] drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids Matthew Brost
2021-08-09 15:31   ` Daniel Vetter
2021-08-09 19:03     ` Matthew Brost
2021-08-10  9:12       ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine Matthew Brost
2021-08-09 15:35   ` Daniel Vetter
2021-08-09 19:05     ` Matthew Brost
2021-08-10  9:18       ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 21/46] drm/i915/guc: Add guc_child_context_destroy Matthew Brost
2021-08-09 15:36   ` Daniel Vetter
2021-08-09 19:06     ` Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 22/46] drm/i915/guc: Implement multi-lrc submission Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 23/46] drm/i915/guc: Insert submit fences between requests in parent-child relationship Matthew Brost
2021-08-09 16:32   ` Daniel Vetter
2021-08-09 16:39     ` Matthew Brost
2021-08-09 17:03       ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 24/46] drm/i915/guc: Implement multi-lrc reset Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 25/46] drm/i915/guc: Update debugfs for GuC multi-lrc Matthew Brost
2021-08-09 16:36   ` Daniel Vetter
2021-08-09 19:13     ` Matthew Brost
2021-08-10  9:23       ` Daniel Vetter
2021-08-10  9:27         ` Daniel Vetter
2021-08-10 17:29           ` Matthew Brost
2021-08-11 10:04             ` Daniel Vetter
2021-08-11 17:35               ` Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 26/46] drm/i915: Connect UAPI to GuC multi-lrc interface Matthew Brost
2021-08-09 16:37   ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 27/46] drm/i915/doc: Update parallel submit doc to point to i915_drm.h Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 28/46] drm/i915/guc: Add basic GuC multi-lrc selftest Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 29/46] drm/i915/guc: Extend GuC flow control selftest for multi-lrc Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 30/46] drm/i915/guc: Implement no mid batch preemption " Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 31/46] drm/i915: Move secure execbuf check to execbuf2 Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 32/46] drm/i915: Move input/exec fence handling to i915_gem_execbuffer2 Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 33/46] drm/i915: Move output " Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 34/46] drm/i915: Return output fence from i915_gem_do_execbuffer Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 35/46] drm/i915: Store batch index in struct i915_execbuffer Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 36/46] drm/i915: Allow callers of i915_gem_do_execbuffer to override the batch index Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 37/46] drm/i915: Teach execbuf there can be more than one batch in the objects list Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 38/46] drm/i915: Only track object dependencies on first request Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 39/46] drm/i915: Force parallel contexts to use copy engine for reloc Matthew Brost
2021-08-09 16:39   ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 40/46] drm/i915: Multi-batch execbuffer2 Matthew Brost
2021-08-09 17:02   ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 41/46] drm/i915: Eliminate unnecessary VMA calls for multi-BB submission Matthew Brost
2021-08-09 17:07   ` Daniel Vetter
2021-08-09 17:12     ` Daniel Vetter
2021-08-03 22:29 ` [Intel-gfx] [PATCH 42/46] drm/i915: Hold all parallel requests until last request, properly handle error Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 43/46] drm/i915/guc: Handle errors in multi-lrc requests Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 44/46] drm/i915: Enable multi-bb execbuf Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 45/46] drm/i915/execlists: Weak parallel submission support for execlists Matthew Brost
2021-08-03 22:29 ` [Intel-gfx] [PATCH 46/46] drm/i915/guc: Add delay before disabling scheduling on contexts Matthew Brost
2021-08-09 17:17   ` Daniel Vetter
2021-08-09 19:32     ` Matthew Brost
2021-08-11  9:55       ` Daniel Vetter
2021-08-11 17:43         ` Matthew Brost
2021-08-12 14:04           ` Daniel Vetter
2021-08-12 19:26   ` Daniel Vetter
2021-08-03 22:51 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Parallel submission aka multi-bb execbuf (rev2) Patchwork
2021-08-03 22:53 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2021-08-03 22:57 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
2021-08-03 23:19 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-08-05  3:53 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
